IndexError: list index out of range in CSV parser

Asked: 2017-02-19 21:43:47

Tags: python csv parsing

I am trying to parse a csv file with this code, but I can't find a way to fix this error:

  File "(file location)", line 438, in parser_42
    position = tmp2[1]
IndexError: list index out of range

My csv file is structured as follows:

mutation  coefficient  score
Q41V      -0.19        0.05
Q41L      -0.08        0.26
Q41T      -0.21        0.43
I23V      -0.02        0.45
I61V      0.01         1.12

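I want to take each mutation and split it into 'Q', '41' and 'V'. Then I want to build lists of the positions and the wild-type (wt) residues and put them in numerical order.

The goal is to write the string "seq" to a new csv file.

Clearly, I'm a beginner at Python and data manipulation. I think I'm just overlooking something silly... Can anyone point me in the right direction?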

def parser_42(csv_in, fasta_in, *args):

    with open(csv_in, 'r') as tsv_in:
        tsv_in = csv.reader(tsv_in, delimiter='\t')
        next(tsv_in) # data starts on line 7
        next(tsv_in)
        next(tsv_in)
        next(tsv_in)
        next(tsv_in)
        next(tsv_in)

        for row in tsv_in:
            tmp = row[0].split(',')
            tmp2 = re.split('(\d+)', tmp[0])
            wt = tmp2[0]
            position = tmp2[1]
            substitution = tmp[2]

            seq = ""
            current_positions = []


            if position not in current_positions:
                current_positions += [position]
                print(current_positions)
                seq += wt
            else:
                continue

        print(seq)
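
For readers who hit the same traceback: `re.split` with a capturing group returns a single-element list when the pattern never matches, so `tmp2[1]` does not exist for any row whose first cell contains no digits (for example an empty cell or a leftover header line). The sketch below illustrates this; `split_mutation` is a hypothetical helper and the sample strings are illustrative, not taken from the asker's file.

```python
import re


def split_mutation(cell):
    """Hypothetical helper: split 'Q41V' into ('Q', 41, 'V'), or return None."""
    match = re.fullmatch(r'([A-Za-z])(\d+)([A-Za-z*])', cell.strip())
    if match is None:
        return None
    wt, position, substitution = match.groups()
    return wt, int(position), substitution


# With digits present, re.split yields three parts ...
print(re.split(r'(\d+)', 'Q41V'))      # ['Q', '41', 'V']
# ... without digits it yields one element, so index 1 raises IndexError
print(re.split(r'(\d+)', 'mutation'))  # ['mutation']

print(split_mutation('Q41V'))      # ('Q', 41, 'V')
print(split_mutation('mutation'))  # None
```

Checking for `None` before indexing lets the loop skip or report such rows instead of crashing.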

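1 Answer:

Answer 0 (score: 1)

For anyone who might be interested, this is how I solved my problem... If anyone has suggestions for making this more concise, they would be appreciated. I know this looks like a roundabout way to solve a small problem, but I learned quite a bit along the way, so I'm not too worried :). I basically replaced .split() with a regular expression, which seems cleaner.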

import csv
import os
import re

import pandas as pd


def parser_42(csv_in, fasta_in, *args):
    # get_column_names() is not defined in this snippet; dataset is not used below
    dataset = pd.DataFrame(columns=get_column_names())
    save_path = '(directory path)'
    complete_fasta_filename = os.path.join(save_path, 'dataset_42_seq.fasta.txt')

    seq = []                # wild-type residues, in the order first seen
    current_positions = []  # positions already recorded

    with open(csv_in) as tsv_file:
        tsv_in = csv.reader(tsv_file, delimiter='\t')
        for _ in range(6):  # data starts on row 7
            next(tsv_in)

        for row in tsv_in:
            # regular expression splits the letters and the number held in a single cell
            regex_match = re.match(r'([A-Z])([0-9]+)([A-Z,*])', row[0], re.M | re.I)
            wt = regex_match.group(1)
            position = int(regex_match.group(2))
            substitution = regex_match.group(3)

            # record the wild-type residue only the first time a position appears
            if position not in current_positions:
                current_positions.append(position)
                seq.append(wt)

    # sort the residues by their position
    sorted_y_idx_list = sorted(range(len(current_positions)), key=lambda x: current_positions[x])
    Xs = [seq[i] for i in sorted_y_idx_list]

    seq1 = '>dataset_42 fasta\n' + ''.join(Xs)  # FASTA header plus joined sequence

    with open(complete_fasta_filename, 'w') as output_fasta_file:
        output_fasta_file.write(seq1)
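
Since the answer asks for ways to make this more concise, here is one possible sketch rather than a definitive rewrite. It keeps the first wild-type residue seen at each position in a dict and joins the residues in position order, skipping rows that do not look like a mutation. The name `wild_type_sequence`, the `header_rows` parameter and the sample rows are assumptions made for illustration.

```python
import csv
import io
import re

MUTATION = re.compile(r'([A-Za-z])(\d+)([A-Za-z*])')


def wild_type_sequence(lines, header_rows=0):
    """Hypothetical helper: build the wild-type sequence from tab-separated rows."""
    reader = csv.reader(lines, delimiter='\t')
    for _ in range(header_rows):
        next(reader, None)  # skip header lines without failing on short input
    residues = {}  # position -> wild-type residue
    for row in reader:
        match = MUTATION.fullmatch(row[0].strip()) if row else None
        if match is None:
            continue  # blank or malformed row: skip instead of raising
        wt, position, _substitution = match.groups()
        residues.setdefault(int(position), wt)  # keep the first residue seen
    return ''.join(residues[p] for p in sorted(residues))


# Illustrative sample data (assumed layout: tab-separated, one header line)
sample = io.StringIO(
    "mutation\tcoefficient\tscore\n"
    "Q41V\t-0.19\t0.05\n"
    "Q41L\t-0.08\t0.26\n"
    "I23V\t-0.02\t0.45\n"
    "I61V\t0.01\t1.12\n"
)
print(wild_type_sequence(sample, header_rows=1))  # IQI
```

The returned string can then be written after the `>dataset_42 fasta` header line inside a `with open(...)` block, as in the answer above.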