If you must the TPM expression matrix of transcriptome sequencing in TCGA database, you might as well convert it yourself!

Posted by Casalen on Fri, 26 Nov 2021 01:03:33 +0100

In our previous pan cancer series tutorials, we directly use the count matrix for analysis. At most, we just make some transformations. But many readers insist that we use TPM expression matrix because it is more authoritative or more popular!

First, you need to download all the data of 33 cancers of TCGA, especially the expression matrix and clinical phenotype information. Here we recommend downloading it in xena of ucsc: https://xenabrowser.net/datapages/ , you can see that the TPM expression matrix is not provided, but you can convert it yourself! No matter how criticized RPKM, FPKM or TPM format is, its real demand still exists, so how can we reasonably define the length of genes?

At this time, we need to understand the concepts of genes, transcripts (isoform, mRNA sequence), EXON region, cDNA sequence, UTR region, ORF sequence and CDS sequence. A gene can be transcribed into multiple transcripts. Each transcript in eukaryotes is usually composed of one or more exons, and the EXON region that can be translated into protein is the CDS region, The beginning and end of exons that cannot be translated are UTR regions, the translation regions together are ORF sequences, and the reverse transcription of transcripts is cDNA sequences.

Discussion on gene length

Five years ago, I shared several ways of defining gene length in the former mainstream.

  • Select the longest transcript of the gene
  • Select the average length of multiple transcripts
  • Sum of non redundant exon (EXON) lengths
  • Sum of length of non redundant CDS (Coding DNA Sequence)

Refer to my blog: define gene length as the sum of non redundant CDS http://www.bio-info-trainee.com/3298.html

Other reference examples:

  • https://www.uniprot.org/uniprot/A2CE44
  • http://www.informatics.jax.org/marker/MGI:3694898
  • http://www.informatics.jax.org/sequence/marker/MGI:3694898?provider=RefSeq
  • https://asia.ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000073062
  • https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&DATA=CCDS41065.1

CCDS file:

X NC_000086.7 Zxdb 668166 CCDS41065.1 Public + 94724674 94727295 [94724674-94727295] Identical

It can be seen that the definitions of gene length are very different.

I have a code here to obtain the gene length of mice, based on the sum of non redundant exon lengths:

# BiocManager::install('TxDb.Hsapiens.UCSC.hg38.knownGene')
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

## The following is the definition of gene length as the sum of non redundant exon lengths
if(!file.exists('gene_length.Rdata')){
  exon_txdb=exons(txdb)
  genes_txdb=genes(txdb)
  
  o = findOverlaps(exon_txdb,genes_txdb)
  o
  t1=exon_txdb[queryHits(o)]
  t2=genes_txdb[subjectHits(o)]
  t1=as.data.frame(t1)
  t1$geneid=mcols(t2)[,1]
  
  # If you think the speed is not enough, you can refer to R language to realize parallel computing
  # http://www.bio-info-trainee.com/956.html
  g_l = lapply(split(t1,t1$geneid),function(x){
    # x=split(t1,t1$geneid)[[1]]
    head(x)
    tmp=apply(x,1,function(y){
      y[2]:y[3]
    })
    length(unique(unlist(tmp)))
  })
  head(g_l)
  g_l=data.frame(gene_id=names(g_l),length=as.numeric(g_l))
  
  save(g_l,file = 'gene_length.Rdata')
} 

load(file = 'gene_length.Rdata')
head(g_l)
# If you need other species, just change a bag!
if(F){
  # BiocManager::install('TxDb.Mmusculus.UCSC.mm10.knownGene')
  library("TxDb.Mmusculus.UCSC.mm10.knownGene")
  txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
  
  
}

With gene length information, it's easy to convert!

First, check the expression matrix frontal gene length information:

> load(file = 'gene_length.Rdata')
> head(g_l)
    gene_id length
1         1   9488
2        10   1285
3       100   2809
4      1000   9054
5     10000  13977
6 100009613   1004
> load('../expression/Rdata/TCGA-ACC.htseq_counts.Rdata')
# If you don't know tcga-acc.htseq_ Count.rdata file, go to the previous series of tutorials!
> pd_mat[1:4,1:4]
       TCGA-OR-A5JP-01A TCGA-OR-A5JG-01A TCGA-OR-A5K1-01A TCGA-OR-A5JR-01A
ZZZ3               1687             1583              784             1555
ZZEF1              1955             2036              697             2191
ZYX                2528             4567             1924             4639
ZYG11B             1541             1225              996             1944
> 

You can see that at this time, one is the id of the gene and the other is the name of the gene. It needs to be matched after continuous conversion!

If you don't know tcga-acc.htseq_ Count.rdata file, go to the previous series of tutorials! The previous pan cancer directory is:

Next is the core conversion:

library(org.Hs.eg.db)
g2s=toTable(org.Hs.egSYMBOL)
genelength <-  merge(g_l,g2s,by='gene_id')
head(genelength)
genelength=genelength[,c(3,2)]
colnames(genelength) <- c("gene","length")
#sort
counts <- pd_mat[rownames(pd_mat) %in% genelength$gene,]
#Convert to matrix
count<- as.matrix(counts)
#Storing gene length information
genelength[match(rownames(counts),
                 genelength$gene),"length"]
# eff_length<- genelength$length
eff_length <- genelength[match(rownames(counts),genelength$gene),"length"]/1000
x <- count / eff_length
mat_tpm <- t( t(x) / colSums(x) ) * 1e6 
rownames(mat_tpm)<- rownames(count) 
mat_tpm[1:4,1:4] 

In the future, such TPM matrix can also be filtered, normalized and visualized, such as:

The code is as follows:

# Removing genes that are not expressed
table(rowSums(mat_tpm==0)== dim(mat_tpm)[2])
# Which values in mat_tpm are greater than 0
keep.exprs <- mat_tpm > 0
# This produces a logical matrix with TRUEs and FALSEs
head(keep.exprs)
# Summary of how many TRUEs there are in each row
table(rowSums(keep.exprs))
# we would like to keep genes that have at least 2 TRUES in each row of thresh
# keep <- rowSums(keep.exprs) >= round(dim(mat_tpm)[2]*0.1)
# keep <- rowSums(keep.exprs) >= 0
keep <- rowSums(keep.exprs) > 0
summary(keep)
# Subset the rows of countdata to keep the more highly expressed genes
mat_tpm.filtered <- mat_tpm[keep,]
rm(keep.exprs)
rm(eff_length)
mat_tpm.filtered[1:4,1:4]

#--------------------TPM boxplot---------------#
################################################
############## normalize.quantiles #############
################################################
par(mfrow=c(2,1));
# unnormalised
boxplot(log2(mat_tpm.filtered+1),xlab="",ylab="Log2 transcript per million unnormalised",las=2,outline = F)
# Let's add a blue horizontal line that corresponds to the median log2TPM
abline(h=median(log2(mat_tpm.filtered+1)),col="blue")
title("Boxplots of log2TPM (unnormalised)")
# normalised
library(preprocessCore)
mat_tpm.norm<- normalize.quantiles(log2(mat_tpm.filtered+1));
colnames(mat_tpm.norm)<- colnames(mat_tpm.filtered);
rownames(mat_tpm.norm)<- rownames(mat_tpm.filtered);
boxplot(mat_tpm.norm,xlab="",ylab="Log2 transcript per million normalised",las=2,outline = F)
abline(h=median(mat_tpm.norm),col="blue")
title("Boxplots of log2TPM (normalised)")

write.table(mat_tpm.filtered,file = "mat_tpm_filtered.txt",quote = FALSE,sep = "\t")
write.table(mat_tpm.norm,file = "mat_tpm_norm.txt",quote = FALSE,sep = "\t")
save(mat_tpm.norm,file = "mat_tpm_norm.Rdata")