Seurat PBMC tutorial understanding and sharing

Posted by fazzfarrell on Wed, 09 Feb 2022 02:03:41 +0100

1. Read data

library(dplyr)
library(Seurat)
library(patchwork)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir ="../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc

After reading the data, the Read10X function returns the UMI count matrix. Next, the count matrix is used to create the seurat object. In this step, quality control can be done. min.cell = n means that a gene is expressed in at least n cells, and min.features=m means that a cell expresses at least m genes.

2. Standard process

This step includes QC, data standardization, identification of high variation genes, scaling (normalization).

2.1 QC

Indicators of low cell quality

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

Used to calculate the proportion of a pattern gene in each cell

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

You can also see the linear relationship of different characteristics

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
#Make a scatter diagram of two features
FeatureScatter(
  object,#seurat object
  feature1,#Feature 1
  feature2,
  cells = NULL,#Which cells are used
  group.by = NULL,
  cols = NULL,
  pt.size = 1,
  shape.by = NULL,
  span = NULL,
  smooth = FALSE,
  combine = TRUE,
  slot = "data",#Data source 'counts',' data 'or' scale data'
  plot.cor = TRUE,#Show relevance in title
  raster = NULL
)

2.2 standardization

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

2.3 identification of highly mutated genes

Identify genes with high expression level variability between cells and use these genes for downstream dimensionality reduction analysis. The function returns 2000 feature s by default. I don't know how this parameter affects the downstream?

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2

For seurat objects:

FindVariableFeatures(
  object,
  assay = NULL,
  selection.method = "vst",
  loess.span = 0.3,
  clip.max = "auto",
  mean.function = FastExpMean,
  dispersion.function = FastLogVMR,
  num.bin = 20,
  binning.method = "equal_width",
  nfeatures = 2000,
  mean.cutoff = c(0.1, 8),
  dispersion.cutoff = c(1, Inf),
  verbose = TRUE,
  ...
)

2.4 zoom

The data is linearly transformed before PCA dimensionality reduction. ScaleData() has the following functions: 1 Convert the expression value of each gene so that the average expression value between cells is 0. 2. Scale the expression value of each gene so that the variance between cells is 1. This step allows all genes to have the same weight in the downstream analysis to avoid the dominance of highly mutated genes. 3. The results are stored in PBMC [["RNA"]] @ scale Data.
The default ScaleData function only uses the high variation genes returned in the previous step, but you can choose the genes participating in ScaleData by yourself. DoHeatmap() uses scale Data, in order not to affect hetamap, all genes can be scaled.

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

This step of ScaleData() can eliminate the influence of the heterogeneity of cell cycle or mitochondrial pollution sources on cell clustering. Use vars to. Register parameter, such as PBMC < - scaledata (PBMC, vars. To. Register = "percent. MT").

# S3 method for Seurat
ScaleData(
  object,
  features = NULL,#The default is a highly mutated gene
  assay = NULL,
  vars.to.regress = NULL,#For example, uUMI or percent mito
  split.by = NULL,
  model.use = "linear",
  use.umi = FALSE,
  do.scale = TRUE,
  do.center = TRUE,
  scale.max = 10,
  block.size = 1000,
  min.cells.to.block = 3000,
  verbose = TRUE,
  ...
)

3. Linear dimensionality reduction

By default, only the previously determined high variation genes are used as input, but you can use the features parameter to define the gene set you want to use.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

seurat provides several visualization methods, such as vizdimeducation(), DimPlot(), and DimHeatmap(). Among these visualizations, dimheatmap () has the greatest reference value. It allows us to quickly view the source of heterogeneity. The cells and genes in the figure will be sorted according to the PCA score. By setting the cells parameter, we can quickly view the extreme cells at both ends of the expression profile.

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
DimHeatmap(
  object,
  dims = 1,#The number of dimensions used for drawing, a vector
  nfeatures = 30,
  cells = NULL,#The number of cells used for drawing. If a number is set, the extreme cells at both ends will be displayed
  reduction = "pca",
  disp.min = -2.5,
  disp.max = NULL,
  balanced = TRUE,#Genes showing positive and negative extremes
  projected = FALSE,
  ncol = NULL,
  fast = TRUE,
  raster = TRUE,
  slot = "scale.data",
  assays = NULL,
  combine = TRUE
)

About dimension selection
In addition to the above DimHeatmap, there are jackdraw and Elbow methods. For large data, it is recommended to use Eblow with high speed,

ElbowPlot(pbmc)
ElbowPlot(object, ndims = 20, reduction = "pca")

4. Cell clustering