Question

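I have several tab-delimited files that I want to read into dicts using csv.DictReader. Each file contains several comment lines starting with '#' or '\t' before the actual data begins. The number of comment lines varies from file to file. I have been trying the approach outlined in this post, but I can't seem to get it to work.

This is my current code: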

def load_database_snps(inputFile):
    '''This function takes a txt tab delimited input file (in house database) and returns a list of dictionaries for each variant'''
    idStore = [] #empty list for storing variant records
    with open(inputFile, 'r+') as varin:
        idStoreDictgroup = csv.DictReader((row for row in  varin if row.startswith('hr', 1, 2)),delimiter='\t') #create a generator; dictionary per snp (row) in the file
        idStoreDictgroup.fieldnames = [field.strip() for field in idStoreDictgroup.fieldnames] #strip whitespace from field names
        print(type(idStoreDictgroup))
        for d in idStoreDictgroup: #iterate over dictionaries in varin_dictgroup
            print(d)
            idStore.append(d) #attach to var_list
    return idStore

Here is an example of the input file:

## SM=Sample,AD=Total Allele Depth, DP=Total Depth
## het;;; and homo;;; are breakdowns of variant read counts per sample - chr1:10002921 T>G AD=34 het:4;11;7;12 (sum=34)


        Hetereozygous                                       Homozygous                                      
    Chr     Start      End            ref           |A|     |C|     |G|     |T|     HetCount        |A|     |C|     |G|     |T|     HomCount        TotalCount      SampleCount
    chr1    10001102        10001102        T       0       0       SM=1;AD=22;DP=38        0       1       0       0       0       0       0       1       138     het:22; homo:-  
    chr1    10002921        10002921        T       0       0       SM=4;AD=34;DP=63        0       4       0       0       0       0       0       4       138     het:4;11;7;12;  homo:-

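All of the lines I want to read start with 'Chr' or 'chr'. I think it isn't working because I need to iterate over the generator to reformat the field names, which exhausts it before the rows are read into dictionaries.

The error message I get is: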

Traceback (most recent call last):
  File "snp_freq_V1-1_export.py", line 99, in <module>
    snp_check_wrapper(inputargs.snpstocheck, inputargs.snp_database_location)
  File "snp_freq_V1-1_export.py", line 92, in snp_check_wrapper
    snpDatabase = load_database_snps(databaseInputFile) #store database variants in snp_database (a dictionary)
  File "snp_freq_V1-1_export.py", line 53, in load_database_snps
    idStoreDictgroup.fieldnames = [field.strip() for field in idStoreDictgroup.fieldnames] #strip whitespace from field names
TypeError: 'NoneType' object is not iterable

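I have also tried inverting the current code so that it explicitly excludes lines starting with '#' and '\t'. That didn't work either; it just gave me an empty dictionary.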

Answer 1

What you should do is skip all the leading lines until you reach the one that starts with chr, for example:

import csv
from itertools import dropwhile

with open('somefile') as fin:
    start = dropwhile(lambda L: not L.lower().lstrip().startswith('chr'), fin)
    for row in csv.DictReader(start, delimiter='\t'):
        print(row)  # do something with each row
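
As a fuller illustration, here is a minimal sketch of how the question's loader might look with this approach. The helper `is_preamble` and the shortened `SAMPLE` data are made up for the example, and the function takes any iterable of lines rather than a file name so it can be tried without a file on disk. The two assertions at the top show what appears to be the direct cause of the `TypeError`: `startswith('hr', 1, 2)` only looks at a single character, so the filter passes no rows at all and `fieldnames` stays `None`.

```python
import csv
import io
from itertools import dropwhile

# startswith('hr', 1, 2) only examines row[1:2], a single character,
# so a two-character prefix can never match and every row is filtered out.
assert 'chr1\t100'.startswith('hr', 1, 2) is False
assert 'chr1\t100'.startswith('hr', 1) is True


def is_preamble(line):
    """Hypothetical helper: True for lines that precede the 'Chr' header row."""
    return not line.lstrip().lower().startswith('chr')


def load_database_snps(lines):
    """Return a list with one dict per variant row from an iterable of lines."""
    # Drop comments, blank lines and the group-label row before the header.
    start = dropwhile(is_preamble, lines)
    reader = csv.DictReader(start, delimiter='\t')
    if reader.fieldnames is None:
        # No header row was found, so there is nothing to read.
        return []
    # Strip stray whitespace from the column names.
    reader.fieldnames = [name.strip() for name in reader.fieldnames]
    return list(reader)


# Shortened, made-up sample in the same shape as the file in the question.
SAMPLE = (
    "## SM=Sample,AD=Total Allele Depth, DP=Total Depth\n"
    "\n"
    "\tHeterozygous\tHomozygous\n"
    "Chr \tStart\tEnd\tref\n"
    "chr1\t10001102\t10001102\tT\n"
    "chr1\t10002921\t10002921\tT\n"
)

records = load_database_snps(io.StringIO(SAMPLE))
print(records[0])
```

With a real file, the opened file object can be passed in directly, e.g. inside `with open(inputFile, newline='') as varin:`. One caveat worth checking against the real data: the header in the question repeats `|A|`, `|C|`, `|G|` and `|T|`, and `csv.DictReader` keeps only the last value for a repeated column name, so those columns may need to be renamed before the rows are read.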

Skipping different types of comment lines in csv.DictReader
