NGS foundation - Interpretation and conversion of GTF/GFF file format In this article, readers leave a message to extract the sequences of exon, intron, promoter, genome, non coding region, coding region, 1500 upstream of TSS and 500 downstream of TSS. Let's demonstrate how to extract these sequences.
NGS foundation - reference genome and gene annotation documents It mentioned how to download the corresponding genome sequence and gene annotation file.
If we have got the genome sequence file grch38 FA and gene annotation file grch38 GTF, which can also be obtained from the link at the end of the text.
View the file content and format
The genome sequence file is in FASTA format. The viewing commands and contents are as follows (test file, only 1 chromosome):
# View the first 10 lines, each line with the first 40 characters # FASTA sequences are generally long, and it is a common way to view the first part of the characters head GRCh38.fa | cut -c 1-40 >chr20 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
The gene annotation file is in GTF format. Only the first six columns of information are available (the third column contains different component annotations)
cut -f 1-6 GRCh38.gtf | head chr20 ensembl_havana gene 87250 97094 . chr20 havana transcript 87250 97094 . chr20 havana exon 87250 87359 . chr20 havana exon 96005 97094 . chr20 ensembl_havana transcript 87710 96533 . chr20 ensembl_havana exon 87710 87767 . chr20 ensembl_havana CDS 87710 87767 . chr20 ensembl_havana start_codon 87710 87712 . chr20 ensembl_havana exon 96005 96533 . chr20 ensembl_havana CDS 96005 96414 .
Install extraction tool gffread
Gffread is used here( https://github.com/gpertea/gffread ), the installation method is as follows (if not understood, see This open source Linux tutorial for Shengxin learning is really delicious Software installation part of:
git clone https://github.com/gpertea/gffread cd gffread make release
Transcripts, CDS and protein sequences were extracted
gffread -h can refer to all available parameters. If there are special circumstances to be considered, it needs to be used in conjunction with other parameters.
1. Obtain transcript sequence
gffread GRCh38.gtf -g GRCh38.fa -w GRCh38.transcripts.fa
The contents are as follows:
head GRCh38.transcripts.fa >ENST00000608838 ACAGGAATTCATATCGGGGTGATCACTCAGAAGAAAAGGTGAATACCGGATGTTGTAAGCTATTGAACTG CCACAAGTGATATCTTTACACACCATTCTGCTGTCATTGGGTAGCTTTGAACCCCAAAAATGTTGGAAGA ATAATGTAGGACATTGCAGAAGACGATGTTTAGATACTGAAAGGTACATACTTCTTTGTAGGAACAAGCT ATCATGCTGCATTTCTATAATATCACATGAATATACTCGACGACCAGCATTTCCTGTGATTCACCTAGAG
2. Obtain CDS sequence
# Get CDS sequence gffread GRCh38.gtf -g GRCh38.fa -x GRCh38.cds.fa
The contents are as follows
head GRCh38.cds.fa >ENST00000382410 ATGAATATCCTGATGCTGACCTTCATTATCTGTGGGTTGCTAACTCGGGTGACCAAAGGTAGCTTTGAAC CCCAAAAATGTTGGAAGAATAATGTAGGACATTGCAGAAGACGATGTTTAGATACTGAAAGGTACATACT TCTTTGTAGGAACAAGCTATCATGCTGCATTTCTATAATATCACATGAATATACTCGACGACCAGCATTT CCTGTGATTCACCTAGAGGATATAACATTGGATTATAGTGATGTGGACTCTTTTACTGGTTCCCCAGTAT CTATGTTGAATGATCTGATAACATTTGACACAACTAAATTTGGAGAAACCATGACACCTGAGACCAATAC TCCTGAGACTACTATGCCACCATCTGAGGCCACTACTCCCGAGACTACTATGCCACCATCTGAGACTGCT ACTTCCGAGACTATGCCACCACCTTCTCAGACAGCTCTTACTCATAATTAA >ENST00000382398 ATGAAGTCCCTACTGTTCACCCTTGCAGTTTTTATGCTCCTGGCCCAATTGGTCTCAGGTAATTGGTATG
3. Obtain protein sequence
# Obtain protein sequence gffread GRCh38.gtf -g GRCh38.fa -y GRCh38.protein.fa
The contents are as follows
head GRCh38.protein.fa >ENST00000382410 MNILMLTFIICGLLTRVTKGSFEPQKCWKNNVGHCRRRCLDTERYILLCRNKLSCCISIISHEYTRRPAF PVIHLEDITLDYSDVDSFTGSPVSMLNDLITFDTTKFGETMTPETNTPETTMPPSEATTPETTMPPSETA TSETMPPPSQTALTHN >ENST00000382398 MKSLLFTLAVFMLLAQLVSGNWYVKKCLNDVGICKKKCKPEEMHVKNGWAMCGKQRDCCVPADRRANYPV FCVQTKTTRISTVTATTATTTLMMTTASMSSMAPTPVSPTG >ENST00000382388 MGLFMIIAILLFQKPTVTEQLKKCWNNYVQGHCRKICRVNEVPEALCENGRYCCLNIKELEACKKITKPP RPKPATLALTLQDYVTIIENFPSLKTQST
Parsing the structure of GTF files
For this GTF, for gene elements, the gene symbol is in column 14.
head -n 1 GRCh38.gtf | sed 's/"/\t/g' | tr '\t' '\n' | sed = | sed 'N;s/\n/\t/' 1 chr20 2 ensembl_havana 3 gene 4 87250 5 97094 6 . 7 + 8 . 9 gene_id 10 ENSG00000178591 11 ; gene_version 12 6 13 ; gene_name 14 DEFB125 15 ; gene_source 16 ensembl_havana 17 ; gene_biotype 18 protein_coding 19 ;
For this GTF, for the transcript ion element, the gene symbol is in column 18.
sed -n '2p' GRCh38.gtf | sed 's/"/\t/g' | tr '\t' '\n' | sed = | sed 'N;s/\n/\t/' 1 chr20 2 havana 3 transcript 4 87250 5 97094 6 . 7 + 8 . 9 gene_id 10 ENSG00000178591 11 ; gene_version 12 6 13 ; transcript_id 14 ENST00000608838 15 ; transcript_version 16 1 17 ; gene_name 18 DEFB125 19 ; gene_source 20 ensembl_havana 21 ; gene_biotype 22 protein_coding 23 ; transcript_name 24 DEFB125-202 25 ; transcript_source 26 havana 27 ; transcript_biotype 28 processed_transcript 29 ; transcript_support_level 30 2 31 ;
This is a very common way to check the file structure and extract the corresponding information. It is simplified into a script checkcol sh
Check the specified line of a file (the default is the first line)
checkCol.sh -f GRCh38.gtf 1 chr20 2 ensembl_havana 3 gene 4 87250 5 97094 6 . 7 + 8 . 9 gene_id "ENSG00000178591"; gene_version "6"; gene_name "DEFB125"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
Check the first line of standard input
sed 's/"/\t/g' GRCh38.gtf | checkCol.sh -f - 1 chr20 2 ensembl_havana 3 gene 4 87250 5 97094 6 . 7 + 8 . 9 gene_id 10 ENSG00000178591 11 ; gene_version 12 6 13 ; gene_name 14 DEFB125 15 ; gene_source 16 ensembl_havana 17 ; gene_biotype 18 protein_coding 19 ;
Extraction of gene promoter sequence
Firstly, the promoter region was determined. Here, 1000 bp upstream and 500 bp downstream of the transcription start site were defined as the promoter region.
sed 's/"/\t/g' GRCh38.gtf | awk 'BEGIN{OFS=FS="\t"}{if($3=="gene") {if($7=="+") {start=$4-1000; end=$4+500;} else {if($7=="-") start=$5-500; end=$5+1000; } if(start<0) start=0; print $1,start,end,$14,$10,$7;}}' >GRCh38.promoter.bed
The promoter region is as follows (this bed file can also be used for chip SEQ data analysis to determine whether peak is in the promoter region)
head GRCh38.promoter.bed chr20 86250 87750 DEFB125 ENSG00000178591 + chr20 141369 142869 DEFB126 ENSG00000125788 + chr20 156470 157970 DEFB127 ENSG00000088782 + chr20 189181 190681 DEFB128 ENSG00000185982 - chr20 226258 227758 DEFB129 ENSG00000125903 + chr20 256736 258236 DEFB132 ENSG00000186458 + chr20 266186 267686 AL034548.1 ENSG00000272874 + chr20 290278 291778 C20orf96 ENSG00000196476 - chr20 295968 297468 ZCCHC3 ENSG00000247315 + chr20 347724 349224 NRSN2-AS1 ENSG00000225377 -
Then extract the sequence. The bedtools tool is used here. The official provides compiled binary files, which can be downloaded and used.
# -Name: output gene name (the fourth column of bed file) # -s: Considering the positive and negative chain (for the promoter region, whether the information relationship of the chain is considered is not too large) bedtools getfasta -name -s -fi GRCh38.fa -bed GRCh38.promoter.bed >GRCh38.promoter.fa
The sequence information is as follows:
head GRCh38.promoter.fa | cut -c 1-60 >DEFB125::chr20:86250-87750(+) ATAATTTGAAGTGAGGTAATGTGATTCCTCTAGTTTTGTTCTTTTTGCTTAGGATGGCTT >DEFB126::chr20:141369-142869(+) AATATTCAAGAGAATGCCAAGAAAGCTACAAGAACAAATAGCAGGTCAGTCGTTGCCTGG >DEFB127::chr20:156470-157970(+) ATATCCGTCACCTCAAACATTTATCATTTGTATTGGGAACATTCAAAATCCTCTCTTCTA >DEFB128::chr20:189181-190681(-) AAAAAAGAAAAAGAACTCCAAGTCTAATAAGACCAGAGACCTGCCCTTTATGGGTCTGCA >DEFB129::chr20:226258-227758(+) GAGTGGAAGGTGGGAGGAGGGAGAGGATGAGGAAAAATAACTAATGGACACTAGGCTTAA
If you do not want coordinate information, you can simplify the sequence name
cut -d ':' -f 1 GRCh38.promoter.fa >GRCh38.promoter.simplename.fa head GRCh38.promoter.simplename.fa | cut -c 1-60 >DEFB125 ATAATTTGAAGTGAGGTAATGTGATTCCTCTAGTTTTGTTCTTTTTGCTTAGGATGGCTT >DEFB126 AATATTCAAGAGAATGCCAAGAAAGCTACAAGAACAAATAGCAGGTCAGTCGTTGCCTGG >DEFB127 ATATCCGTCACCTCAAACATTTATCATTTGTATTGGGAACATTCAAAATCCTCTCTTCTA >DEFB128 AAAAAAGAAAAAGAACTCCAAGTCTAATAAGACCAGAGACCTGCCCTTTATGGGTCTGCA >DEFB129 GAGTGGAAGGTGGGAGGAGGGAGAGGATGAGGAAAAATAACTAATGGACACTAGGCTTAA
Extract gene sequence
The operation of extracting gene sequence is also similar to extracting promoter sequence. Note that the sequence position of the GFF file starts from 1, while the position of the bed file starts from 0, which is closed before and opened after. Therefore, the starting position of the sequence should be operated with - 1.
type="gene" sed 's/"/\t/g' GRCh38.gtf | awk -v type="${type}" 'BEGIN{OFS=FS="\t"}{if($3==type) {print $1,$4-1,$5,$14,".",$7}}' >GRCh38.gene.bed head GRCh38.gene.bed chr20 87249 97094 DEFB125 . + chr20 142368 145751 DEFB126 . + chr20 157469 159163 DEFB127 . + chr20 187852 189681 DEFB128 . - chr20 227257 229886 DEFB129 . + chr20 257735 261096 DEFB132 . +
Extract gene sequence
bedtools getfasta -name -s -fi GRCh38.fa -bed GRCh38.gene.bed >GRCh38.gene.fa # View sequence head GRCh38.gene.fa | cut -c 1-60 >DEFB125::chr20:87249-97094(+) ACAGGAATTCATATCGGGGTGATCACTCAGAAGAAAAGGTGAATACCGGATGTTGTAAGC >DEFB126::chr20:142368-145751(+) GCCATACACTTCAGCAGAGTTTGCAACTTCTCTTCTAAGTCTTTATCCTTCCCCCAAGGC >DEFB127::chr20:157469-159163(+) CTCTGAGGAAGGTAGCATAGTGTGCAGTTCACTGGACCAAAAGCTTTGGCTGCACCTCTT >DEFB128::chr20:187852-189681(-) GGCACACAGACCACTGGACAAAGTTCTGCTGCCTCTTTCTCTTGGGAAGTCTGTAAATAT
Extract the sequence of noncoding RNA
There are notes of transcript type in GTF file, including the following note types
ntisense_RNA lincRNA miRNA misc_RNA processed_pseudogene processed_transcript protein_coding rRNA scaRNA sense_intronic sense_overlapping snoRNA snRNA TEC transcribed_processed_pseudogene transcribed_unitary_pseudogene transcribed_unprocessed_pseudogene unitary_pseudogene unprocessed_pseudogene
We only screened lincRNA
grep 'transcript_biotype "lincRNA"' GRCh38.gtf >GRCh38.lincRNA.gtf gffread GRCh38.lincRNA.gtf -g GRCh38.fa -w GRCh38.lincRNA.fa head GRCh38.lincRNA.fa | cut -c 1-60 >ENST00000608495 GTCGCACGCGCTGGCCAAACGGGCGCACCAGACACTTTTCAGGGCCCTGCCAAAGACCTC CTGGCGTCCCAGACACAAGAGATCCAGGCCAAGACTCACACTTCACAAGATACACAGACA GGAACAGGAAATTCCATGAAACTTCCATTTACCCAATTAGCCGGACTCACTGAGCCCCAG TCAACCAACTCCTACTAAAATTAAAAAGTAATGTGTGGTATAGATTGGAATAATAGACAT AAACGATGGGAGGCGGAGAGGGGTGAGGGTTGAAAAATTACCTATTGGGTGCAACATTCA AATGGGGCACTAGAAGCCCACTCCACCACTATGCAATATATGTATTTGTACCCCGTAAAT
Extract exon sequences
Gets the coordinates of the exon
type="exon" sed 's/"/\t/g' GRCh38.gtf | awk -v type="${type}" 'BEGIN{OFS=FS="\t"}{if($3==type) {print $1,$4-1,$5,$14,$20,$7}}' >GRCh38.exon.bed # view file contents head GRCh38.exon.bed chr20 87249 87359 ENST00000608838 DEFB125 + chr20 96004 97094 ENST00000608838 DEFB125 + chr20 87709 87767 ENST00000382410 DEFB125 + chr20 96004 96533 ENST00000382410 DEFB125 + chr20 142368 142686 ENST00000382398 DEFB126 + chr20 145414 145751 ENST00000382398 DEFB126 + chr20 142633 142686 ENST00000542572 DEFB126 + chr20 145414 145488 ENST00000542572 DEFB126 + chr20 145578 145749 ENST00000542572 DEFB126 + chr20 157469 157593 ENST00000382388 DEFB127 +
Extraction sequence
# -Name: output gene name (the fourth column of bed file) # -s: Considering the positive and negative chains (for the promoter region, whether the information relationship of the chain is considered is not too large) bedtools getfasta -name -s -fi GRCh38.fa -bed GRCh38.exon.bed >GRCh38.exon.fa # View sequence information head GRCh38.exon.fa | cut -c 1-60 >ENST00000608838::chr20:87249-87359(+) ACAGGAATTCATATCGGGGTGATCACTCAGAAGAAAAGGTGAATACCGGATGTTGTAAGC >ENST00000608838::chr20:96004-97094(+) GTAGCTTTGAACCCCAAAAATGTTGGAAGAATAATGTAGGACATTGCAGAAGACGATGTT >ENST00000382410::chr20:87709-87767(+) ATGAATATCCTGATGCTGACCTTCATTATCTGTGGGTTGCTAACTCGGGTGACCAAAG >ENST00000382410::chr20:96004-96533(+) GTAGCTTTGAACCCCAAAAATGTTGGAAGAATAATGTAGGACATTGCAGAAGACGATGTT
Extract intron sequences
Determination of intron region
sed 's/"/\t/g' GRCh38.gtf | awk 'BEGIN{OFS=FS="\t";oldtr="";}{if($3=="exon") {tr=$14; if(oldtr!=tr) {start=$5; oldtr=tr;} else {print $1,start,$4-1,tr,$20,$7; start=$5;} } }' >GRCh38.intron.bed # view file contents head GRCh38.intron.bed chr20 87359 96004 ENST00000608838 DEFB125 + chr20 87767 96004 ENST00000382410 DEFB125 + chr20 142686 145414 ENST00000382398 DEFB126 + chr20 142686 145414 ENST00000542572 DEFB126 + chr20 145488 145578 ENST00000542572 DEFB126 + chr20 157593 158773 ENST00000382388 DEFB127 + chr20 189681 187852 ENST00000334391 DEFB128 - chr20 227346 229277 ENST00000246105 DEFB129 +
The extraction sequence is the same as above.