How to use CNVkit -- column of tumor genome sequencing data analysis

Posted by leon_zilber on Mon, 03 Jan 2022 22:02:06 +0100

brief introduction

In the analysis of tumor genome data, copy number variation analysis is one of the common analysis points. CNVkit is one of the tools used for genome sequencing WGS and WES to call CNV. It is written based on python. The CNV results are relatively reliable and easy to operate.

Download and install

conda is recommended to solve the problem of environment dependency:

conda activate wes
conda install cnvkit --help

usage method

The official website tutorial of the tool has been written in detail. Please see: github: You can directly use the following code to call the somatic CNV of tumor samples: batch 5.gatk/Tumor*_bqsr.bam    \
    --normal  5.gatk/Normal*_bqsr.bam     \
    --output-reference ./8.cnv/cnvkit/my_reference.cnn \
    --output-dir ./8.cnv/cnvkit \
    --targets  ${bed}           \
    --fasta ${GENOME}           \
    -p ${nthread}               \
    --method hybrid             \
    --annotate ${refFlat}       \
    --drop-low-coverage         \
    --scatter --diagram         \
    1>./8.cnv/cnvkit/cnvkit.log 2>&1

Where -- targets {bed} specifies the bed file. For WES data, that is, the capture interval-- fasta {GENOME} specifies the reference genome fasta file, and - p {nthread} specifies the number of threads. The -- annotate {refFlat} parameter is optional and has no effect. It specifies the annotation file and can be downloaded with the following code:

wget -c
wget -c

The seg file can be extracted from the cns results. Here are two methods: one is provided by cnvkit, and the other is the awk command written by yourself.

# Method 1 export seg *bqsr.cns -o gistic.seg
sed 's/_bqsr//' gistic.seg
# Method 2
awk '{print FILENAME"\t"$0}' *bqsr.cns  | grep -v chromosome |sed 's/_bqsr.cns//g' | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$8"\t"$6}' >final.seg

The obtained SEG file can be loaded into IGV for visualization, and IGV will recognize the file suffix seg.

However, it should be noted that the last step of cnvkit is call.

CNVkit subcommand call

The last step of cnvkit is call, which can correct the number of copies based on B allele frequency, tumor ploidy, tumor purity, patient gender and other indicators (not necessary). The newly generated cns file is listed as cn, that is, the absolute copy number copy number:

Based on B allele frequency

If it is based on the B allele frequency, it is necessary to provide the VCF file. The tumor and normal pairing are provided. In the document, the author gives the VCF obtained by mutect, and then carry out certain processing: , after processing, run with the following command: call Sample.cns -y -v Sample.vcf -o

However, the problem is that the vcf of mutect is somatic mutation, which is not in line with conventional cognition. It should be germline mutation. This point has been raised on GitHub, but the reply given by the author remains unchanged:


In fact, when running mutect2, you can choose to keep the germline site. Just add these two parameters -- genotype germline sites true -- genotype PON sites true when running mutet2. Example code:

$GATK --java-options "-Xmx20G" Mutect2 \
    -R ${GENOME}                   \
    -I ./5.gatk/${tumor}_bqsr.bam  \
    -I ./5.gatk/${normal}_bqsr.bam \
    -normal ${normal}              \
    -L ${interval}                 \
    --panel-of-normals ${pon}      \
    --germline-resource ${gnomad}  \
    --genotype-germline-sites true \
    --genotype-pon-sites true      \
    --f1r2-tar-gz ./6.mutect/${id}_f1r2.tar.gz \
    -O ./6.mutect/${id}_mutect2.vcf.gz \
    1>./6.mutect/${id}_mutect.log 2>&1

$GATK  --java-options "-Xmx20G" FilterMutectCalls \
    -R ${GENOME}                                               \
    -V ./6.mutect/${tumor}_mutect2.vcf.gz                         \
    -O ./6.mutect/${tumor}_somatic.vcf.gz                         \
    1>>./6.mutect/${tumor}_mutect.log 2>&1

Process the output vcf file. A code processing sentence is provided here, where config is normal and tumor pairing:

$ cat  config
Normal1    Tumor1
Normal2    Tumor2
Normal3    Tumor3
Normal4    Tumor4

What is needed is the vcf files marked with FilterMutectCalls. Note that only the filtering sites are marked, but there is no real filtering. These vcf files need to be processed. The following processing methods are summarized according to the methods given in the cmvkit document and personal understanding. They may not be very mature. Please read the cnvkit document carefully:

cat config  | while read line 

  zcat 6.mutect/${id}_somatic.vcf.gz | \
  awk 'BEGIN{FS="\t";OFS="\t"} {if($0~"^#") {print $0} else {if($7!="PASS"){$7="REJECT"} print $0} }'  > 8.cnv/cnvkit/${id}_cnvkit.vcf
  sed -i '2 i ##FILTER=<ID=REJECT,Description="Not somatic due to normal call frequency or phred likelihoods: tumor: 35, normal 35.">' 8.cnv/cnvkit/${tumor}_cnvkit.vcf
  sed -i "2 i ##PEDIGREE=<Derived=${tumor},Original=${normal}>" 8.cnv/cnvkit/${tumor}_cnvkit.vcf call 8.cnv/cnvkit/${tumor}_bqsr.cns \
  -v 8.cnv/cnvkit/${tumor}_cnvkit.vcf      \
  --drop-low-coverage                   \
  -o 8.cnv/cnvkit/${tumor}_call.cns 

Based on tumor purity and tumor ploidy

The tumor purity evaluated by Clinicopathology can be used for correction: call Sample.cns -y -m clonal --purity 0.65 --ploidy 2 -o

You can also use other tools to calculate the results, such as PureCN, THetA2, PyClone or BubbleTree. The tutorial of THetA2 is given in the document: If purecn is used, you can refer to: Of course, in addition to the tools given by the author, the results of Absolute or Sequenza can also be used. It should be noted that the - M / - method parameter in the above code specifies the call method. There are three kinds: threshold, clone and none

  • Threshold: calculate the absolute copy number cn column based on the threshold. Is the default parameter, and the default threshold is: - t=-1.1,-0.4,0.3,0.7

If log2 value is up to

Copy number











  • clonal: using the given tumor purity and tumor ploidy, convert the log2 value to the absolute copy number, and then simply round the absolute copy value to an integer. The method is applicable to germ line samples and high-purity tumor samples (such as cell lines)
  • none: do not calculate cn

In addition, sex chromosomes can also be corrected based on gender, but the code is not shown here because it is troublesome to batch process this parameter. Finally, the calibration is correct, and the drawing visualization can be carried out based on the final results:

cat config  | while read line 
  tumor=${arr[1]} scatter                 \
  8.cnv/cnvkit/${tumor}_bqsr.cnr       \
  -s 8.cnv/cnvkit/${tumor}_call.cns    \
  --title ${tumor} --segment-color red \
  --y-max 2  --y-min -2  -m 3       \
  -v 8.cnv/cnvkit/${tumor}_cnvkit.vcf  \
  -o 8.cnv/cnvkit/${tumor}_scatter.png \ diagram                 \
  8.cnv/cnvkit/${tumor}_bqsr.cnr       \
  -s 8.cnv/cnvkit/${tumor}_call.cns    \
  --title ${tumor}                     \
  -o 8.cnv/cnvkit/${tumor}_diagram.pdf \