In our previous pan cancer series tutorials, we directly use the count matrix for analysis. At most, we just make some transformations. But many readers insist that we use TPM expression matrix because it is more authoritative or more popular!
First, you need to download all the data of 33 cancers of TCGA, especially the expression matrix and clinical phenotype information. Here we recommend downloading it in xena of ucsc: https://xenabrowser.net/datapages/ , you can see that the TPM expression matrix is not provided, but you can convert it yourself! No matter how criticized RPKM, FPKM or TPM format is, its real demand still exists, so how can we reasonably define the length of genes?
At this time, we need to understand the concepts of genes, transcripts (isoform, mRNA sequence), EXON region, cDNA sequence, UTR region, ORF sequence and CDS sequence. A gene can be transcribed into multiple transcripts. Each transcript in eukaryotes is usually composed of one or more exons, and the EXON region that can be translated into protein is the CDS region, The beginning and end of exons that cannot be translated are UTR regions, the translation regions together are ORF sequences, and the reverse transcription of transcripts is cDNA sequences.
Discussion on gene length
Five years ago, I shared several ways of defining gene length in the former mainstream.
- Select the longest transcript of the gene
- Select the average length of multiple transcripts
- Sum of non redundant exon (EXON) lengths
- Sum of length of non redundant CDS (Coding DNA Sequence)
Refer to my blog: define gene length as the sum of non redundant CDS http://www.bio-info-trainee.com/3298.html
Other reference examples:
- https://www.uniprot.org/uniprot/A2CE44
- http://www.informatics.jax.org/marker/MGI:3694898
- http://www.informatics.jax.org/sequence/marker/MGI:3694898?provider=RefSeq
- https://asia.ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000073062
- https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&DATA=CCDS41065.1
CCDS file:
X NC_000086.7 Zxdb 668166 CCDS41065.1 Public + 94724674 94727295 [94724674-94727295] Identical
It can be seen that the definitions of gene length are very different.
I have a code here to obtain the gene length of mice, based on the sum of non redundant exon lengths:
# BiocManager::install('TxDb.Hsapiens.UCSC.hg38.knownGene') library("TxDb.Hsapiens.UCSC.hg38.knownGene") txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene ## The following is the definition of gene length as the sum of non redundant exon lengths if(!file.exists('gene_length.Rdata')){ exon_txdb=exons(txdb) genes_txdb=genes(txdb) o = findOverlaps(exon_txdb,genes_txdb) o t1=exon_txdb[queryHits(o)] t2=genes_txdb[subjectHits(o)] t1=as.data.frame(t1) t1$geneid=mcols(t2)[,1] # If you think the speed is not enough, you can refer to R language to realize parallel computing # http://www.bio-info-trainee.com/956.html g_l = lapply(split(t1,t1$geneid),function(x){ # x=split(t1,t1$geneid)[[1]] head(x) tmp=apply(x,1,function(y){ y[2]:y[3] }) length(unique(unlist(tmp))) }) head(g_l) g_l=data.frame(gene_id=names(g_l),length=as.numeric(g_l)) save(g_l,file = 'gene_length.Rdata') } load(file = 'gene_length.Rdata') head(g_l) # If you need other species, just change a bag! if(F){ # BiocManager::install('TxDb.Mmusculus.UCSC.mm10.knownGene') library("TxDb.Mmusculus.UCSC.mm10.knownGene") txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene }
With gene length information, it's easy to convert!
First, check the expression matrix frontal gene length information:
> load(file = 'gene_length.Rdata') > head(g_l) gene_id length 1 1 9488 2 10 1285 3 100 2809 4 1000 9054 5 10000 13977 6 100009613 1004 > load('../expression/Rdata/TCGA-ACC.htseq_counts.Rdata') # If you don't know tcga-acc.htseq_ Count.rdata file, go to the previous series of tutorials! > pd_mat[1:4,1:4] TCGA-OR-A5JP-01A TCGA-OR-A5JG-01A TCGA-OR-A5K1-01A TCGA-OR-A5JR-01A ZZZ3 1687 1583 784 1555 ZZEF1 1955 2036 697 2191 ZYX 2528 4567 1924 4639 ZYG11B 1541 1225 996 1944 >
You can see that at this time, one is the id of the gene and the other is the name of the gene. It needs to be matched after continuous conversion!
If you don't know tcga-acc.htseq_ Count.rdata file, go to the previous series of tutorials! The previous pan cancer directory is:
- The two scoring values of estimate are essentially ssGSEA analysis of two gene sets
- Batch run estimate for the expression matrix of all cancers in TCGA database
- Different cancers are grouped according to the two score values of estimate to see the difference of protein coding gene expression
Next is the core conversion:
library(org.Hs.eg.db) g2s=toTable(org.Hs.egSYMBOL) genelength <- merge(g_l,g2s,by='gene_id') head(genelength) genelength=genelength[,c(3,2)] colnames(genelength) <- c("gene","length") #sort counts <- pd_mat[rownames(pd_mat) %in% genelength$gene,] #Convert to matrix count<- as.matrix(counts) #Storing gene length information genelength[match(rownames(counts), genelength$gene),"length"] # eff_length<- genelength$length eff_length <- genelength[match(rownames(counts),genelength$gene),"length"]/1000 x <- count / eff_length mat_tpm <- t( t(x) / colSums(x) ) * 1e6 rownames(mat_tpm)<- rownames(count) mat_tpm[1:4,1:4]
In the future, such TPM matrix can also be filtered, normalized and visualized, such as:
data:image/s3,"s3://crabby-images/6035f/6035fee4b75c16334c0b35e469676d02e5b54312" alt=""
The code is as follows:
# Removing genes that are not expressed table(rowSums(mat_tpm==0)== dim(mat_tpm)[2]) # Which values in mat_tpm are greater than 0 keep.exprs <- mat_tpm > 0 # This produces a logical matrix with TRUEs and FALSEs head(keep.exprs) # Summary of how many TRUEs there are in each row table(rowSums(keep.exprs)) # we would like to keep genes that have at least 2 TRUES in each row of thresh # keep <- rowSums(keep.exprs) >= round(dim(mat_tpm)[2]*0.1) # keep <- rowSums(keep.exprs) >= 0 keep <- rowSums(keep.exprs) > 0 summary(keep) # Subset the rows of countdata to keep the more highly expressed genes mat_tpm.filtered <- mat_tpm[keep,] rm(keep.exprs) rm(eff_length) mat_tpm.filtered[1:4,1:4] #--------------------TPM boxplot---------------# ################################################ ############## normalize.quantiles ############# ################################################ par(mfrow=c(2,1)); # unnormalised boxplot(log2(mat_tpm.filtered+1),xlab="",ylab="Log2 transcript per million unnormalised",las=2,outline = F) # Let's add a blue horizontal line that corresponds to the median log2TPM abline(h=median(log2(mat_tpm.filtered+1)),col="blue") title("Boxplots of log2TPM (unnormalised)") # normalised library(preprocessCore) mat_tpm.norm<- normalize.quantiles(log2(mat_tpm.filtered+1)); colnames(mat_tpm.norm)<- colnames(mat_tpm.filtered); rownames(mat_tpm.norm)<- rownames(mat_tpm.filtered); boxplot(mat_tpm.norm,xlab="",ylab="Log2 transcript per million normalised",las=2,outline = F) abline(h=median(mat_tpm.norm),col="blue") title("Boxplots of log2TPM (normalised)") write.table(mat_tpm.filtered,file = "mat_tpm_filtered.txt",quote = FALSE,sep = "\t") write.table(mat_tpm.norm,file = "mat_tpm_norm.txt",quote = FALSE,sep = "\t") save(mat_tpm.norm,file = "mat_tpm_norm.Rdata")