1. Read data
library(dplyr) library(Seurat) library(patchwork) # Load the PBMC dataset pbmc.data <- Read10X(data.dir ="../data/pbmc3k/filtered_gene_bc_matrices/hg19/") # Initialize the Seurat object with the raw (non-normalized data). pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) pbmc
After reading the data, the Read10X function returns the UMI count matrix. Next, the count matrix is used to create the seurat object. In this step, quality control can be done. min.cell = n means that a gene is expressed in at least n cells, and min.features=m means that a cell expresses at least m genes.
2. Standard process
This step includes QC, data standardization, identification of high variation genes, scaling (normalization).
2.1 QC
Indicators of low cell quality
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
Used to calculate the proportion of a pattern gene in each cell
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
You can also see the linear relationship of different characteristics
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
#Make a scatter diagram of two features FeatureScatter( object,#seurat object feature1,#Feature 1 feature2, cells = NULL,#Which cells are used group.by = NULL, cols = NULL, pt.size = 1, shape.by = NULL, span = NULL, smooth = FALSE, combine = TRUE, slot = "data",#Data source 'counts',' data 'or' scale data' plot.cor = TRUE,#Show relevance in title raster = NULL )
2.2 standardization
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
2.3 identification of highly mutated genes
Identify genes with high expression level variability between cells and use these genes for downstream dimensionality reduction analysis. The function returns 2000 feature s by default. I don't know how this parameter affects the downstream?
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) # Identify the 10 most highly variable genes top10 <- head(VariableFeatures(pbmc), 10) # plot variable features with and without labels plot1 <- VariableFeaturePlot(pbmc) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) plot1 + plot2
For seurat objects:
FindVariableFeatures( object, assay = NULL, selection.method = "vst", loess.span = 0.3, clip.max = "auto", mean.function = FastExpMean, dispersion.function = FastLogVMR, num.bin = 20, binning.method = "equal_width", nfeatures = 2000, mean.cutoff = c(0.1, 8), dispersion.cutoff = c(1, Inf), verbose = TRUE, ... )
2.4 zoom
The data is linearly transformed before PCA dimensionality reduction. ScaleData() has the following functions: 1 Convert the expression value of each gene so that the average expression value between cells is 0. 2. Scale the expression value of each gene so that the variance between cells is 1. This step allows all genes to have the same weight in the downstream analysis to avoid the dominance of highly mutated genes. 3. The results are stored in PBMC [["RNA"]] @ scale Data.
The default ScaleData function only uses the high variation genes returned in the previous step, but you can choose the genes participating in ScaleData by yourself. DoHeatmap() uses scale Data, in order not to affect hetamap, all genes can be scaled.
all.genes <- rownames(pbmc) pbmc <- ScaleData(pbmc, features = all.genes)
This step of ScaleData() can eliminate the influence of the heterogeneity of cell cycle or mitochondrial pollution sources on cell clustering. Use vars to. Register parameter, such as PBMC < - scaledata (PBMC, vars. To. Register = "percent. MT").
# S3 method for Seurat ScaleData( object, features = NULL,#The default is a highly mutated gene assay = NULL, vars.to.regress = NULL,#For example, uUMI or percent mito split.by = NULL, model.use = "linear", use.umi = FALSE, do.scale = TRUE, do.center = TRUE, scale.max = 10, block.size = 1000, min.cells.to.block = 3000, verbose = TRUE, ... )
3. Linear dimensionality reduction
By default, only the previously determined high variation genes are used as input, but you can use the features parameter to define the gene set you want to use.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
seurat provides several visualization methods, such as vizdimeducation(), DimPlot(), and DimHeatmap(). Among these visualizations, dimheatmap () has the greatest reference value. It allows us to quickly view the source of heterogeneity. The cells and genes in the figure will be sorted according to the PCA score. By setting the cells parameter, we can quickly view the extreme cells at both ends of the expression profile.
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca") DimPlot(pbmc, reduction = "pca")
DimHeatmap( object, dims = 1,#The number of dimensions used for drawing, a vector nfeatures = 30, cells = NULL,#The number of cells used for drawing. If a number is set, the extreme cells at both ends will be displayed reduction = "pca", disp.min = -2.5, disp.max = NULL, balanced = TRUE,#Genes showing positive and negative extremes projected = FALSE, ncol = NULL, fast = TRUE, raster = TRUE, slot = "scale.data", assays = NULL, combine = TRUE )
About dimension selection
In addition to the above DimHeatmap, there are jackdraw and Elbow methods. For large data, it is recommended to use Eblow with high speed,
ElbowPlot(pbmc)
ElbowPlot(object, ndims = 20, reduction = "pca")