A 10x single cell transcriptome project from fastq to cell subsets

Posted by scott_ttocs46 on Thu, 03 Mar 2022 07:43:46 +0100

Recently, when arranging apprentice single-cell sharing, an apprentice mentioned GSE168522, which is a very standard six 10x single-cell transcriptome samples, as shown below:

GSM5145401 Sample 16_Normal-1
GSM5145402 Sample 17_Normal-2
GSM5145403 Sample 23_Early AD-1
GSM5145404 Sample 28_Early AD-2
GSM5145405 Sample 24_Late AD-1
GSM5145406 Sample 27_Late AD-2

However, the expression matrix provided by the author is not pure, but divided into different single-cell subsets:

GSE168522_B_cell_data.txt.gz 318.7 Kb 
GSE168522_CD4_T_cell_data.txt.gz 317.3 Kb 
GSE168522_CD8_T_cell_data.txt.gz 320.2 Kb 
GSE168522_Gene_expression_of_six_samples_data.txt.gz 277.5 Kb 
GSE168522_HSC_data.txt.gz 271.0 Kb 
GSE168522_Monocytes_data.txt.gz 322.1 Kb 
GSE168522_NK_cell_data.txt.gz 319.8 Kb 

It's obvious that such a document can't give us a single-cell transcriptome process. Read the original: "single cell RNA sequencing requests B cell – related molecular biomarkers for Alzheimer's disease". In fact, it is introduced in single cell world: Single cell sequencing reveals B cell related markers of Alzheimer's disease

It can be seen that the number of cells in each 10x single-cell transcriptome sample mentioned in the original text is quite reasonable (5-8000), as shown below:

The number of cells is quite reasonable

The author's dimensionality reduction clustering is also super simple, which is just the first level. The immune cell subsets are subdivided, including two categories of lymphatic system (T,B,NK cells) and myeloid system (mononuclear, dendritic, macrophagy, granulocyte) as the second subdivision subsets:

First level classification

Refer to the previous example: Single cell clustering and clustering annotation that everyone can learn It is also a simple look at the change of proportion:

Proportional change

Since the author did not provide an expression matrix that can be analyzed directly, we need to start with the fastq sequencing data provided by him. Next, we will demonstrate this process:

Build Linux server operating environment

Install your own conda in the blank users of your server. Each user operates independently. The installation method code is as follows:

# First, download the file. It takes a few seconds for 20M/S
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Next, use the bash command to run the file we downloaded. Remember to yes all the way
bash Miniconda3-latest-Linux-x86_64.sh 
#  After successful installation, the system environment variable file needs to be updated
source ~/.bashrc

Next, use conda to install aspera. The code is as follows:

conda create -n download 
conda activate download 
conda install -y -c hcc aspera-cli
conda install -y -c bioconda sra-tools
which ascp 
## Be sure to find out where your software is installed by conda
ls -lh ~/miniconda3/envs/download/etc/asperaweb_id_dsa.openssh

We have introduced the details of conda many times, so we won't repeat it here.

Next, configure the database file and software of cellranger:

Download the fastq file from the ENA database using aspera

First of all, you need the following address information. Go to the ENA database and follow the diagram to download it and build it into FQ Txt file:

fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/009/SRR13911909/SRR13911909_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/009/SRR13911909/SRR13911909_2.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/010/SRR13911910/SRR13911910_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/010/SRR13911910/SRR13911910_2.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/011/SRR13911911/SRR13911911_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/011/SRR13911911/SRR13911911_2.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/012/SRR13911912/SRR13911912_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/012/SRR13911912/SRR13911912_2.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/013/SRR13911913/SRR13911913_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/013/SRR13911913/SRR13911913_2.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/014/SRR13911914/SRR13911914_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR139/014/SRR13911914/SRR13911914_2.fastq.gz

Then write scripts and download them in batches;

cat fq.txt |while read id
do
ascp -QT -l 300m -P33001  \
-i ~/miniconda3/envs/download/etc/asperaweb_id_dsa.openssh   \
era-fasp@$id  .
done

This script will call aspera installed by conda to read FQ The paths in the txt file can be downloaded one by one, as shown below. From 8:30 p.m. to 1 a.m., all the fastq data files of our six 10x single cell transcriptome samples were downloaded successfully.

~$ ls -lh *gz|cut -d" " -f 5-
7.5G 1 September 29-20:25 SRR13911909_1.fastq.gz
 31G 1 September 29-20:53 SRR13911909_2.fastq.gz
7.5G 1 September 29-20:58 SRR13911910_1.fastq.gz
 31G 1 September 29-21:20 SRR13911910_2.fastq.gz
7.3G 1 September 29-21:28 SRR13911911_1.fastq.gz
 31G 1 September 29-21:53 SRR13911911_2.fastq.gz
7.2G 1 September 29-21:58 SRR13911912_1.fastq.gz
 30G 1 September 29-22:58 SRR13911912_2.fastq.gz
7.5G 1 September 29-23:15 SRR13911913_1.fastq.gz
 31G 1 June 30 00:12 SRR13911913_2.fastq.gz
7.2G 1 June 30 00:21 SRR13911914_1.fastq.gz
 30G 1 June 30 00:49 SRR13911914_2.fastq.gz 

Run cellranger

At present, the single cell transcriptome is the mainstream of 10X company. We also introduced the details and processes of cellranger in detail in the single cell world official account.

However, this series of notes two years ago is based on V2 and V3 versions of cellranger. In July 2020, I saw that it was updated to V4, and a summary was also written in it. See: Cell Ranger updated to 4 (new tutorial)

The use method is super simple, that is to build a simple script. The code is as follows:

$ cat run-cellranger.sh
bin=/home/data/bylin/cellranger-6.1.2/bin/cellranger
db=/home/data/bylin/refdata-gex-GRCh38-2020-A
ls $bin; ls $db

fq_dir=/home/x9/raw
$bin count --id=$1 \
--localcores=4 \
--transcriptome=$db \
--fastqs=$fq_dir \
--sample=$1   \
--expect-cells=5000

The next step is to use this script to batch process our previous six 10x single cell transcriptome data.

The fastq file name starting with SRR in the front is modified uniformly:

Early_AD-1_S1_L001_R1_001.fastq.gz
Early_AD-1_S1_L001_R2_001.fastq.gz
Early_AD-2_S1_L001_R1_001.fastq.gz
Early_AD-2_S1_L001_R2_001.fastq.gz
Late_AD-1_S1_L001_R1_001.fastq.gz
Late_AD-1_S1_L001_R2_001.fastq.gz
Late_AD-2_S1_L001_R1_001.fastq.gz
Late_AD-2_S1_L001_R2_001.fastq.gz
Normal-1_Homo_S1_L001_R1_001.fastq.gz
Normal-1_Homo_S1_L001_R2_001.fastq.gz
Normal-2_Homo_S1_L001_R1_001.fastq.gz
Normal-2_Homo_S1_L001_R2_001.fastq.gz

You can see the file name of the comparison rule. Each sample is_ S1_L001_R1_001.fastq.gz and_ S1_ L001_ R2_ 001.fastq. At the end of GZ, the prefix in front is different for each sample, which is also the prefix used to distinguish different samples when we run the process later.

Therefore, you need to build a text file id.txt, which is as follows:

x9@bio1-2288H-V5:/home/data/x9$ cat id.txt
Early_AD-1
Early_AD-2
Late_AD-1
Late_AD-2
Normal-1_Homo
Normal-2_Homo 

It has only six simple lines, which is the prefix of our fastq file.

A simple script can process all six 10x single cell transcriptome data files:

 cat id.txt |while read id;do (nohup  bash run-cellranger.sh $id 1>log-$id.txt 2>&1 & );done

You can see:

drwxrwxr-x 4 x9 x9 4.0K 1 February 30:55 Early_AD-1
drwxrwxr-x 4 x9 x9 4.0K 1 February 30:54 Early_AD-2
-rw-rw-r-- 1 x9 x9  140 1 September 29-20:41 id
-rw-rw-r-- 1 x9 x9   70 1 September 29-20:42 id.txt
drwxrwxr-x 4 x9 x9 4.0K 1 March 30:00 Late_AD-1
drwxrwxr-x 4 x9 x9 4.0K 1 February 30:51 Late_AD-2
-rw-rw-r-- 1 x9 x9  95K 1 February 30:55 log-Early_AD-1.txt
-rw-rw-r-- 1 x9 x9  96K 1 February 30:54 log-Early_AD-2.txt
-rw-rw-r-- 1 x9 x9  96K 1 March 30:00 log-Late_AD-1.txt
-rw-rw-r-- 1 x9 x9  95K 1 February 30:51 log-Late_AD-2.txt
-rw-rw-r-- 1 x9 x9  88K 1 February 30:44 log-Normal-1_Homo.txt
-rw-rw-r-- 1 x9 x9 9.6K 1 September 29-21:09 log-Normal-2_Homo.txt
drwxrwxr-x 4 x9 x9 4.0K 1 February 30:44 Normal-1_Homo
drwxrwxr-x 5 x9 x9 4.0K 1 September 29-21:09 Normal-2_Homo

Among them, 4.0K is the folder, 6 samples output 6 folders, and then 95K is the txt log. The log of each run of cellranger is kept and can be checked later. I'm running the command with x9 this user.

As you can see, it's basically a task that ends at 3 a.m. Next, you need to sort out the necessary contents in all successful folders. By default, multiple items are run under the same folder. You can see that the format in the folder corresponding to each sample is similar

The format in the folder corresponding to each sample is similar

The most important one is filtered_ feature_ bc_ Contents in matrix folder and web_summary.html this report:

Still need to write a code:

pro=tmp
pro_folder=$HOME/$pro
mkdir -p $pro_folder
ls -lh  $pro_folder
 
 
mkdir -p $pro_folder/html 
ls */outs/web_summary.html |while read id;do (cp $id $pro_folder/html/${id%%/*}.html );done

mkdir -p $pro_folder/matrix 
ls -d */outs/filtered_feature_bc_matrix |while read id;do (cp -r  $id $pro_folder/matrix/${id%%/*} );done

In this way, our report and expression matrix are sorted out:

The report and expression matrix are sorted out

This is the end of the upstream process. The next step is to read the expression matrix of each sample and run the seurat process in R language. Each sample is an expression matrix composed of three files:

Each sample is an expression matrix composed of three documents

With such an expression matrix, it is easy to read it and follow the standard dimensionality reduction clustering process.

The first is to read six 10x single-cell transcriptome samples in batch

rm(list=ls())
options(stringsAsFactors = F) 
library(Seurat)
library(ggplot2)
library(clustree)
library(cowplot)
library(dplyr)

###### step1:Import data ######     
library(data.table)
dir='matrix/' 
samples=list.files( dir )
samples 

library(data.table)
sceList = lapply(samples,function(pro){ 
  # pro=samples[1] 
  print(pro) 
  sce=CreateSeuratObject( Read10X(file.path(dir,pro)), 
                          project = pro,
                         min.cells = 5,
                         min.features = 300 ) 
  return(sce)
})
names(sceList)  
 
samples
sce.all=merge(x=sceList[[1]],
              y=sceList[ -1 ],
              add.cell.ids = samples)

as.data.frame(sce.all@assays$RNA@counts[1:10, 1:2])
head(sce.all@meta.data, 10)
table(sce.all$orig.ident)
 

Then routine quality control and harmony integration are required:

sce = sce.all 
sce <- NormalizeData(sce, 
                         normalization.method = "LogNormalize",
                         scale.factor = 1e4) 
sce <- FindVariableFeatures(sce)
sce <- ScaleData(sce)
sce <- RunPCA(sce, features = VariableFeatures(object = sce))

library(harmony)
seuratObj <- RunHarmony(sce, "orig.ident")
names(seuratObj@reductions)
seuratObj <- RunUMAP(seuratObj,  dims = 1:15, 
                     reduction = "harmony")
DimPlot(seuratObj,reduction = "umap",label=T ) 

sce=seuratObj
sce <- FindNeighbors(sce, reduction = "harmony",
                     dims = 1:15) 
sce.all=sce
#Set different resolutions and observe the clustering effect (which one to choose?)
for (res in c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5,0.8,1)) {
  sce.all=FindClusters(sce.all, #graph.name = "CCA_snn", 
                       resolution = res, algorithm = 1)
}

The next step is to visualize the marker gene we gave you and give it a reasonable name:

sce.all
sce=sce.all 
library(ggplot2) 
genes_to_check = c('PTPRC', 'CD3D', 'CD3E', 'CD4','CD8A',
                   'CD19', 'CD79A', 'MS4A1' ,
                   'IGHG1', 'MZB1', 'SDC1',
                   'CD68', 'CD163', 'CD14', 
                   'TPSAB1' , 'TPSB2',  # mast cells,
                   'RCVRN','FPR1' , 'ITGAM' ,
                   'C1QA',  'C1QB',  # mac
                   'S100A9', 'S100A8', 'MMP19',# monocyte
                   'FCGR3A',
                   'LAMP3', 'IDO1','IDO2',## DC3 
                   'CD1E','CD1C', # DC2
                   'KLRB1','NCR1', # NK 
                   'FGF7','MME', 'ACTA2', ## fibo 
                   'DCN', 'LUM',  'GSN' , ## mouse PDAC fibo 
                   'MKI67' , 'TOP2A', 
                   'PECAM1', 'VWF',  ## endo 
                   'EPCAM' , 'KRT19', 'PROM1', 'ALDH1A1' )

library(stringr)  
genes_to_check=str_to_upper(genes_to_check)
genes_to_check
p <- DotPlot(sce.all, features = unique(genes_to_check),
             assay='RNA'  )  + coord_flip()

p 
ggsave('check_last_markers.pdf',height = 11,width = 11)

As follows:

Dimension Reduction Clustering

It can be seen that immune cell subsets are subdivided, including two categories: lymphatic system (T,B,NK cells) and myeloid system (mononuclear, dendritic, macrophagy, granulocyte).

We give the name as follows:

Biological nomenclature

Among them, I can't give the name for the first time on 18, 20 and 21. I need to follow up and explore slowly, but the results are consistent with the original text on the whole. What I didn't give a name may be that its marker genes are ncor2, NKX3-1, Hlx and PRNP, but they are not in the gene list I recited before.

However, knowledge is slowly increasing in this way!

If you really think my tutorial is helpful to your scientific research project and makes you enlightened, or your project uses a lot of my skills, please add a short thank you when publishing your achievements in the future, as shown below:

We thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.

If I have traveled around the world in universities and research institutes (including Chinese mainland) in ten years, I will give priority to seeing you if you have such a friendship.