RNA 9. GO annotation of gene expression in SCI articles

Posted by Sean_J on Tue, 01 Mar 2022 06:04:51 +0100

The most complete GO annotation result drawing in the whole network, directly click the SCI drawing annotation, pay attention to me, your best choice!


1. GO principle

Genome sequencing has shown that most genes that specify core biological functions are common to all eukaryotes. This knowledge can often be transferred from one organism to other organisms. The goal of gene ontology alliance is to produce a dynamic and controllable vocabulary that can be applied to all eukaryotes, even if the knowledge of the role of genes and proteins in cells is accumulating and changing. To this end, three independent ontologies can be accessed on the Web all over the world.

When we get a project of non model organisms or no reference genome, we often need to annotate the function of genes before we can analyze the data of bioinformatics. Although, ah, the functions of genes with similar sequences are often conservative in different species, in the past, there were often different descriptions of the functions of the same genes in different laboratories because of the fuzziness of our natural language. Obviously, we need a unified term to describe the functions of these cross species homologous genes and their gene products, otherwise this ambiguity will greatly limit the academic communication between different researchers.

With the accumulation of bioinformatics data, there are different classification databases used to describe gene function. The goal of these classification systems is to elaborate the biological functions of these cross species homologous genes. However, because the results of gene function annotation between classification systems may be different in natural language description, and they are all their own dialects, most classification results can hardly be translated directly between classification systems. The Gene Ontology (GO) project is the result of efforts to make the functional description of gene products obtained by genes in various databases consistent.

In the GO classification system, there are three basic classifications (officially called the namespace of entries):

  • cellular component: each part of the cell and the extracellular environment.

  • molecular function: it can be described as activity at the molecular level, such as catalytic or binding activity.

  • Biological process: biological process refers to a series of events generated by the orderly combination of one or more molecular functions.

Similar to GO classification, another attempt to clarify the essence of biological function is the relatively early birth of the very classic NCBI lineal homologous protein cluster, namely cog classification (COG is also divided into four versions: prokaryotic bacterial branch is cog, eukaryotic branch is KOG, archaeal branch is arCOG and phage branch is POG). The cog classification system has the following four basic classifications:



  • Metabolically related (METABOLISM)

  • Difficult to classify

Another very important KEGG lineal homology for gene function classification in the system field is further subdivided into seven basic functional classifications compared with COG classification:

  • Metabolism related

  • Genetic Information Processing

  • Environmental Information Processing

  • Intracellular processes

  • Organic systems

  • Human Diseases

  • Not Included in Pathway or Brite

Gene Ontology is only from the full name of go Gene Ontology. GO annotation system is a classification system established to elaborate the essential attributes of biological macromolecules, where essential attributes refer to biological functions. Go classification uses four elements to describe the essential information of gene units:

  • concepts: an essential concept is our description of the functional annotation results of a gene and its products. In GO annotation, a concept is a Term entry specific to a GO id number.

  • Relationship: in the Go annotation system, there can be an is between two entries_ a,part_of,has_part and regulations. If we regard the entry as a node and the relationship between entries as an edge connection between two nodes, we can build a directed acyclic graph based on many nodes or a relational network.

  • Instances: instances are biological macromolecules that play specific biological functions in life activities, such as proteins with GO annotation information. In UniProt protein database, all protein sequences with UniProt number are the target instance object of GO annotation.

  • axioms: Axiomatic inference rules are the most important rules in the GO annotation system to explain the essential concept. axioms refer to the rules for inferring the relationship between entries, such as a is_ A, B, and B is_a C, then we can infer A is_a C, the inference rule information used in the process of inferring a to C, is an axiom. Because we infer a is from the relationship_ A C, so C is the essence of A. similarly, C is also the essence of B.

2. Example analysis

1. Data reading

We still use the data set of TCGA-COAD for data reading. We mentioned the reading of expression data and the acquisition of clinical information grouping in the previous period. We use the differential expression results calculated by edgeR software package to extract the ENSEMBL number of up-regulated gene 2832,

###########Gene list
DEG=read.table("DEG-resdata.xls",sep="\t",check.names=F,header = T)

## Down   Up 
## 1296 2832

Read the sample grouping information, 41 normal tissues and 478 cancer tissues, as follows:

######Sample grouping information
group<-read.table("DEG-group.xls",sep="\t",check.names=F,header = T)

##  NT  TP 
##  41 478

2. GO annotation results

First, we also need to install the software package and load it. The main program is the clusterProfiler software package, as follows:


According to the data comparison, we found some genes in 1726 GO data. At the same time, we obtained three naming methods of a gene, namely "ENTREZID", "ENSEMBL", "SYMBOL", as follows:

eg <- bitr(geneList, 

## 'select()' returned 1:many mapping between keys and columns

## Warning in bitr(geneList, fromType = "ENSEMBL", toType = c("ENTREZID",
## "ENSEMBL", : 39.51% of input gene IDs are fail to map...


## [1] 1726    3


## 1 ENSG00000062038     1001    CDH3
## 2 ENSG00000175832     2118    ETV4
## 3 ENSG00000167767   144501   KRT80
## 4 ENSG00000164283    11082    ESM1
## 5 ENSG00000120254    25902 MTHFD1L
## 6 ENSG00000129474    84962   AJUBA

GO annotation is carried out for the obtained genes. The ont parameters have four selection modes, ALL, MF, CC and BP, among which (MF, CC and BP) are the three basic classifications we mentioned above. In the go classification system, there are three basic classifications, and ALL annotates these three classifications at the same time. We select "ALL" here, and the thresholds are pvvaluecutoff = 0.01 and qvalueCutoff = 0.05 respectively, During annotation, we can select three different gene names in the parameter keyType, as follows:

# Run GO enrichment analysis 
go <- enrichGO(eg$SYMBOL, 
               OrgDb = org.Hs.eg.db, 
               keyType = "SYMBOL",
               pAdjustMethod = 'BH',
               pvalueCutoff = 0.01, 
               qvalueCutoff = 0.05,
               #keyType = 'ENTREZID'

## [1] 232  10

Look at the final results and keep the results. We can see that BP, CC and MF are 193, 10 and 29 respectively, as follows:


##            ONTOLOGY         ID
## GO:0007389       BP GO:0007389
## GO:0019730       BP GO:0019730
## GO:0061844       BP GO:0061844
## GO:0003002       BP GO:0003002
## GO:0048568       BP GO:0048568
## GO:0045165       BP GO:0045165
##                                                                        Description
## GO:0007389                                           pattern specification process
## GO:0019730                                          antimicrobial humoral response
## GO:0061844 antimicrobial humoral immune response mediated by antimicrobial peptide
## GO:0003002                                                         regionalization
## GO:0048568                                             embryonic organ development
## GO:0045165                                                    cell fate commitment
##            GeneRatio   BgRatio       pvalue     p.adjust       qvalue
## GO:0007389    64/927 436/18722 4.639425e-15 2.023253e-11 1.787888e-11
## GO:0019730    31/927 122/18722 2.490775e-14 5.431135e-11 4.799330e-11
## GO:0061844    24/927  79/18722 2.963461e-13 4.307885e-10 3.806748e-10
## GO:0003002    51/927 331/18722 4.507505e-13 4.914307e-10 4.342625e-10
## GO:0048568    57/927 427/18722 8.233615e-12 6.338388e-09 5.601042e-09
## GO:0045165    42/927 258/18722 8.720552e-12 6.338388e-09 5.601042e-09
##                                                                                                                                                                                                                                                                                                                                                                        geneID
## GO:0019730                                                                                                                                                                     KLK7/CXCL3/CXCL1/SLC11A1/CXCL2/PGC/CXCL8/CXCL5/PPBP/WFDC10A/F2/WFDC10B/CXCL6/KRT6A/CXCL11/PGLYRP3/H2BC10/H2BC6/PRSS2/H2BC7/S100A7/IL36RN/FGB/PGLYRP4/IL17A/H2BC11/SEMG1/KLK5/SEMG2/REG3A/IL17F
## GO:0061844                                                                                                                                                                                                                    KLK7/CXCL3/CXCL1/CXCL2/PGC/CXCL8/CXCL5/PPBP/F2/CXCL6/KRT6A/CXCL11/PGLYRP3/H2BC10/H2BC6/H2BC7/S100A7/PGLYRP4/IL17A/H2BC11/SEMG1/KLK5/REG3A/IL17F
## GO:0003002                                                                              WNT2/OTX1/MDFI/XRCC2/WNT3/MSX1/FEZF1/SHH/WNT7B/LEF1/NKD1/AXIN2/GBX2/TDGF1/SIX1/WT1/HOXB8/DMRTA2/TDGF1P3/DMRT3/WNT8B/IRX3/SP8/RIPPLY1/HES7/WNT11/SOX1/ROBO2/DLL3/FGF8/DKK1/FOXD1/SIX2/GSC/LRP2/TBX20/NODAL/NKX2-1/SIX3/GATA4/PAX2/WNT3A/FEZF2/LHX1/IRX2/EN1/LHX3/WNT7A/NTF4/FOXG1/TBR1
## GO:0045165                                                                                                                        WNT2/EPOP/WNT3/SHH/WNT7B/IL23A/SIX1/WT1/TBX15/ONECUT2/DMRTA2/DMRT3/WNT8B/POU3F2/WNT11/SOX1/FGF8/DKK1/POU4F1/SIX2/EYA1/GSC/TBX4/TBX20/NODAL/PTF1A/NKX2-1/GATA4/GCM1/PAX2/WNT3A/GSX1/FEZF2/POU6F2/TLX3/LHX3/TBX5/WNT7A/NTF4/FOXG1/TBR1/NKX6-3
##            Count
## GO:0007389    64
## GO:0019730    31
## GO:0061844    24
## GO:0003002    51
## GO:0048568    57
## GO:0045165    42

##  BP  CC  MF 
## 193  10  29

3. Result display

For the GO annotation results, we give a variety of display methods here. According to our own needs and the design of the article, we can choose the one suitable for ourselves.


Draw a bubble chart. The interpretation of the bubble chart needs to explain that the enrichment degree of GO is measured by Gene ratio, Pvalue and the number of genes enriched on this GO term.

  • The abscissa is Gene ratio. The greater the value, the greater the enrichment degree. Count the number of differentially expressed genes under the GO term

  • The ordinate is the GO term with high enrichment degree (generally, 20 items with the most significant enrichment are selected for display, and less than 20 items are listed).

  • The value range of Pvalue [0, 1] is expressed in color. The redder it is, the smaller Pvalue is, indicating that the enrichment is more obvious.

  • The size of the dot indicates the number of differential genes under the term. The larger the dot, the more genes.

        color ="pvalue")

Because the description fonts of GO coincide, we use scale_ y_ Set discrete to prevent fonts from overlapping each other, as follows:

        color ="pvalue")+
  scale_y_discrete(labels=function(x) stringr::str_wrap(x, width=60))

Draw a histogram as follows:

        color ="pvalue")+
  scale_y_discrete(labels=function(x) stringr::str_wrap(x, width=60))

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

We can use split = "ONTOLOGY" method to draw three biological classifications separately on a map, as follows:

        color ="pvalue")+ 
  scale_y_discrete(labels=function(x) str_wrap(x, width=60))

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

We can use split = "ONTOLOGY" method to draw three biological classifications separately on a map, as follows:

        color ="pvalue")+ 
  scale_y_discrete(labels=function(x) str_wrap(x, width=60))

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

Draw the network diagram of gene concept and the network diagram of the relationship between GO term and differential gene gene gene concept network to show and explain the corresponding relationship between genes and enriched GO terms: the gray points in the figure represent genes and the Yellow points represent enriched GO terms; If a gene is located under a go term, connect the gene with go; The size of the Yellow node corresponds to the number of genes enriched, and the GO terms enriched by top2 are as follows:


## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The heat map shows which genes are enriched to which go terms, as follows:

 scale_y_discrete(labels=function(x) stringr::str_wrap(x, width=60))

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

The software is installed and loaded as follows:


You can choose a more beautiful network diagram as follows:

ed = enrichDO(eg$ENTREZID, pvalueCutoff=0.05)
met <- pairwise_termsim(ed)
emapplot(met,showCategory = 10)

Let's look at several ways to draw the tree diagram of GO term. First, we annotate go with BP type, as follows:

###########BP classification display
BP <- enrichGO(eg$SYMBOL, 
               OrgDb = org.Hs.eg.db, 
               keyType = "SYMBOL",
               pAdjustMethod = 'BH',
               pvalueCutoff = 0.01, 
               qvalueCutoff = 0.05,
               #keyType = 'ENTREZID'

## [1] 193   9

GO DAG graph, a directed acyclic graph, calls the topGO software package. We find that the font on this graph is very small, and we need to modify it ourselves later, as follows:

plotGOgraph(BP, sigForAll =FALSE,firstSigNodes = 3)

Use the enrichPlot software package to draw the diagram, as follows:

goplot(BP, showCategory = 5)

The results of GO annotation are displayed in various forms, and I feel that the summary has been very comprehensive. If there is anything I can't take into account, you can point out that I'll modify it and select the model based on my own article.

Pay attention to the official account, update every day, sweep the code into groups and communicate with each other without stopping, immediately release the video version, pay attention to me, your best choice!

Huanfeng gene

Bioinformatics analysis, SCI article writing and bioinformatics basic knowledge learning: R language learning, perl basic programming, linux system commands, Python meet you better

46 original content

official account


  1. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25-29.

  2. Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

Topics: Go Back-end