python general class for reading vcf files (you can copy and paste directly)

Posted by NerdConcepts on Sat, 29 Feb 2020 07:00:04 +0100


                          . So many people will choose some python vcf libraries, but first you have to install this library, and there are some libraries that fix the content that can be read. If your vcf information is not in its fixed content, it cannot be read. For example, recently I want to read a sample AF, but it is placed in the GT column of the last sample, not in the INFO column, and some libraries are unable to help.


  • First, copy the code of the class, and then you can use it directly
import sys
import os
import subprocess

class Record(object):
    One line information in vcf file
    def __init__(self, line):
        info = line.split("\t")
        self.line = line
        self.CHROM =  info[0] 
        self.POS = info[1]
        self.ID = info[2]
        self.REF = info[3]
        self.ALT = info[4]
        self.QUAL = info[5]
        self.FILTER = info[6]
        self.INFO = [{pair_lst[0]: pair_lst[1] if len(pair_lst)> 1 else ""} for pair_lst in [pair.split("=") for pair in info[7].split(";")]]
        self.FORMAT = info[8].split(":")
        self.sample_num = len(info) -7
        self.GT = []
        for i in range(2):
           GT_value = info[8 + i +1].split(":") 
           GT_dict = {}
           for g in range(len(GT_value)):
               GT_dict[self.FORMAT[g]] = GT_value[g] 

class VCF(object):
    VCF class, read VCF, write VCF, get VCF information
    def __init__(self, uncompress_vcf):
        self.header = []
        self.reader = open(uncompress_vcf, 'r')
        self.line = self.reader.readline().strip()
        while self.line.startswith('#'):
            self.line = self.reader.readline().strip()
        self.record = Record(self.line) 
    def __iter__(self): 
        return self 
    def __next__(self): 
        self.line = self.reader.readline().strip()
        if self.line != "":
            self.record = Record(self.line) 
            return self.record
            raise StopIteration()
    def reader_close(self):
  • There are two main classes, one is VCF, which stores VCF information and VCF file operations, the other is Record, which includes all the information stored in a row of VCF
  • Read in vcf file
gatk_result = "realignment.vcf"
gatk = VCF(gatk_result)
  • View the header of vcf
  • View the information stored in the current VCF row, starting with the first row. It is saved as a Record. Note that the VCF class is an iterator class. You can use the next and for loops to read in the information of each line
record = gatk.record  #Here record stores the address of the record class
  • Check the properties of the record, including line (the content of the row, which is convenient to write out a row), CHROM, POS, ID, ref, alt, qual, filter, info (the form storage of Dictionary), format, sample "num (how many samples), GT (the genotype information of the sample, here in vcf is generally the column represented by the sample name)
record.ID #Other attributes are the same
  • Read INFO
    This is the original representation of INFO in vcf


Its storage form in record


[{'CONTQ': '28'}, {'DP': '38'}, {'ECNT': '1'}, {'GERMQ': '76'}, {'MBQ': '20,37'}, {'MFRL': '171,229'}, {'MMQ': '60,60'}, {'MPOS': '26'}, {'NALOD': '1.16'}, {'NLOD': '3.91'}, {'POPAF': '6.00'}, {'RCNTS': '0,0'}, {'ROQ': '14'}, {'SEQQ': '1'}, {'STRANDQ': '11'}, {'TLOD': '4.56'}]

  • Read GT
    This is the storage form of GT in vcf. FORMAT corresponds to the value of GT

GT:AD:AF:DP:F1R2:F2R1:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SB 0/1:21,2:0.120:23:7,1:13,1:false:true:0.500:0.078:100.00:41.80:12,9,1,1 0/0:13,0:0.065:13:7,0:6,0:false:false:.:.:.:.:10,3,0,0
They are the values corresponding to FORMAT, tumor sample GT and normal sample GT respectively

This is the storage form in record


[{'GT': '0/1', 'OBQRC': '41.80', 'SB': '12,9,1,1', 'DP': '23', 'OBF': '0.500', 'OBAM': 'false', 'OBP': '0.078', 'AD': '21,2', 'F2R1': '13,1', 'F1R2': '7,1', 'AF': '0.120', 'OBQ': '100.00', 'OBAMRC': 'true'}, {'GT': '0/0', 'OBQRC': '.', 'SB': '10,3,0,0', 'DP': '13', 'OBF': '.', 'OBAM': 'false', 'OBP': '.', 'AD': '13,0', 'F2R1': '6,0', 'F1R2': '7,0', 'AF': '0.065', 'OBQ': '.', 'OBAMRC': 'false'}]
The first dictionary is the GT of tumor, and the second dictionary is the GT of normal. Of course, there will be multiple dictionaries according to the number of samples. Here, you can retrieve them by index. For example, if you want to retrieve the first sample, you only need record.GT[0]

  • Print out the line with more than 0.5 tumor AF
for record in gatk:
    # compare GATK tumor AF to 0.05
    if float(record.GT[0]['AF']) > 0.05:
  • Write FILTER as PASS and more AF > 0.05 to the list and write the final VCF file
snv = "filter.vcf"

result = gatk.header
for record in gatk:
    if record.FILTER == "PASS" and float(record.GT[0]['AF']) > 0.05:

# write out result
with open(snv, 'w+') as snvf:
    for line in result:
        print(line, file = snvf)
  • Check the next record of the gatk. Because the VCF class is iterative, next is also supported in addition to for
record = next(gatk)

Topics: Python