Introduction to ScanpyAnnData data structure and some API usage

Posted by jweissig on Fri, 08 Oct 2021 01:18:28 +0200

Scanpy introduction and installation

Scanpy is an extensible toolkit for analyzing single cell analysis data constructed in conjunction with Ann data (a data structure).

For Windows systems, it is best to use the whl file for Scanpy installation. First download the whl file: scanpy-1.7.2-py3-none-any.whl

Through conda, use the command cd to enter the directory where the whl file is located, and then install through pip:

pip install scanpy-1.7.2-py3-none-any.whl

AnnData

Structure of AnnData

In scanpy, our most common data structure is AnnData, which is an object used to store data. Its data structure can be described as follows:

We record the above object as data. We need to understand the following parts:

	function	type
adata.X	matrix	numpy matrix
adata.obs	Observation measurement	pandas Dataframe
adata.var	Characteristic quantity	pandas Dataframe
adata.uns	Unstructured data	Dictionary dict

To further understand this data structure, we manually build an AnnData object:

import numpy as np
import pandas as pd
import anndata as ad
from string import ascii_uppercase

# Set the number of observation samples
n_obs=1000
# obs is used to save observational information
obs=pd.DataFrame()

# numpy.random.choice(a, size=None, p=None)
# Randomly extract elements from a(ndarray, but must be one-dimensional) and form an array of the specified size
# Array p: corresponds to array a, indicating the probability of taking each element in array A. by default, the probability of selecting each element is the same
obs['time']=np.random.choice(['day1','day2','day4','day8'],size=n_obs)

# Set feature name var_names
print(ascii_uppercase) # ABCDEFGHIJKLMNOPQRSTUVWXYZ
var_names=[i*letter for i in range(1,10) for letter in ascii_uppercase]
print(var_names)
# ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', ......, 'X', 'Y', 'Z',
# ......
# 'AAAAAAAAA', 'BBBBBBBBB', 'CCCCCCCCC', ......, 'YYYYYYYYY', 'ZZZZZZZZZ']

# Number of features
n_vars=len(var_names) # 234

# Defining a feature to a Dataframe
var=pd.DataFrame(index=var_names)
print(var.head()) # Now var has no columns (column index), only index (row index)

# Create data matrix data. X
X=np.arange(n_obs*n_vars).reshape(n_obs,n_vars)

Then initialize the andata object. The data type of the andata object is float32 by default. In order to facilitate later observation of the print results, we set the data type to int32:

adata=ad.AnnData(X,obs=obs,var=var,dtype='int32')

# View data
print(adata)
"""
AnnData object with n_obs × n_vars = 1000 × 234
    obs: 'time'
"""

# View the X matrix of data
print(adata.X)
"""
[[     0      1      2 ...    231    232    233]
 [   234    235    236 ...    465    466    467]
 [   468    469    470 ...    699    700    701]
 ...
 [233298 233299 233300 ... 233529 233530 233531]
 [233532 233533 233534 ... 233763 233764 233765]
 [233766 233767 233768 ... 233997 233998 233999]]
"""

Generally, for data. X, rows correspond to observations (i.e., cells) and columns correspond to features (i.e., genes);

Each time we operate AnnData, instead of creating another AnnData to store data, we directly find the memory address of the previously initialized AnnData and directly change the value of AnnData through the memory address. The benefits are:

No need to allocate extra memory;
You can directly modify the initialized AnnoData object;

For example:

# View the first three elements of the 'A' column
print(adata[:3, 'A'].X)
"""
[[  0]
 [234]
 [468]]
"""

# Set the first three elements of the 'A' column
adata[:3, 'A'].X = [0, 0, 0]

# Looking at the first five elements of column 'A', we found that the value was modified
print(adata[:5, 'A'].X)
"""
[[  0]
 [  0]
 [  0]
 [702]
 [936]]
"""

However, if a part of the AnnData object is assigned to a new object, the object will get a new memory for storing the actual data instead of referring to the memory address of the original data object, such as:

adata_subset = adata[:5, ['A', 'B']]
print(adata_subset)
"""
View of AnnData object with n_obs × n_vars = 5 × 2
    obs: 'time'
"""

# Create data for the new object_ Subset add observation 'foo'
adata_subset.obs['foo'] = range(5)
print(adata_subset)
"""
AnnData object with n_obs × n_vars = 5 × 2
    obs: 'time', 'foo'
"""

h5ad: writing and reading of AnnData

We can save the AnnData object to disk through h5ad file. The saving process is as follows:

# Calculate the size of the object
def print_size_in_MB(x):
    print('{:.3} MB'.format(x.__sizeof__()/1e6))

# View object size
print_size_in_MB(adata) # 0.187 MB

# Check whether to back up
print(adata.isbacked) # False

# Set backup address
adata.filename = './test.h5ad'

# Check whether the backup was successful
print(adata.isbacked) # True

When the status of data.isbacked is True, the object has been written to the disk;

On the contrary, we can use scanpy to easily read the file and obtain the AnnData object:

import scanpy as sc

Myadata=sc.read('./test.h5ad')
print(Myadata)
"""
AnnData object with n_obs × n_vars = 1000 × 234
    obs: 'time'
"""

Usage of some common APIs in Scanpy

Import Scanpy first:

import scanpy as sc

sc.pp.filter_cells

sc.pp.filter_cells(data, min_genes=None, max_genes=None)

It is often used in pretreatment to do some cell screening, and the function is retained for at least min_genes are cells with more than one gene, or retain at most max_genes: cells with multiple genes;

Also note that the parameter min_genes and parameter max_genes cannot be passed at the same time;

example:

# Import data
adata=sc.datasets.krumsiek11() # Five types of cells, 640 cell samples, and a total of 11 genes were measured
print(adata)
"""
AnnData object with n_obs × n_vars = 640 × 11
    obs: 'cell_type'
    uns: 'iroot', 'highlights'
"""

print(adata.n_obs) # 640 cells

# 11 genes (i.e. characteristics)
print(adata.var_names)
"""
Index(['Gata2', 'Gata1', 'Fog1', 'EKLF', 'Fli1', 'SCL', 'Cebpa', 'Pu.1',
       'cJun', 'EgrNab', 'Gfi1'],
      dtype='object')
"""

### Observe the change of cell number ###
sc.pp.filter_cells(adata,min_genes=0) # Equivalent to no filtering

print(adata.n_obs) # 640

print(adata.obs)
"""
      cell_type  n_genes
0    progenitor        9
..          ...      ...
159         Neu        8
"""
print(set(adata.obs['cell_type'].values)) # Class 5 cells {'Neu', 'progenitor', 'Ery', 'Mo', 'Mk'}
print(adata.obs['n_genes'].min()) # 4. At least four genes were measured in each cell

sc.pp.filter_cells(adata,min_genes=6) # Cells with more than 6 genes were selected and measured
print(adata.n_obs) # 630
print(adata.obs['n_genes'].min()) # 6

sc.pp.filter_genes

sc.pp.filter_genes(data, min_cells=None, max_cells=None)

This function is used to keep at least min_cells are genes that appear in more than one cell or remain at most_ Cells are genes that appear in cells;

Parameter min_cells and parameter max_cells cannot be passed at the same time;

Compare sc.pp.filter_cells can be found, sc.pp.filter_genes is used to select genes (screening column), sc.pp.filter_cells are used to select cells (filter rows);

sc.pp.highly_variable_genes

sc.pp.highly_variable_genes(
							data, 
							n_top_genes=None, 
							min_disp=0.5, 
							max_disp=inf, 
							min_mean=0.0125, 
							max_mean=3)

This function is used to determine hypervariable genes;

Description of common parameters:

data: AnnData Matrix, the row corresponds to the cell, and the column corresponds to the gene
n_top_genes: the number of hypervariable genes to retain

sc.pp.normalize_total

sc.pp.normalize_total(adata, target_sum=None, inplace=True)

The function can standardize each cell so that each cell can sum along the gene direction after standardization and have the same total target_sum；