Question

I have a CSV (about 1.5 million rows) in the following format:

id, tag1, tag2, name1, value1

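Several rows share the same id. Rows with the same id always have the same tag1 and tag2 as well. What I want to do is append the name1, value1 pairs, which differ from row to row, to the end of a single row for that id.

Example: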

Original:
    id,tag1,tag2,name1,value1
    12,orange,car,john,32
    13,green,bike,george,23
    12,orange,car,elen,21
Final:
    id,tag1,tag2,name1,value1
    12,orange,car,john,32,elen,21
    13,green,bike,george,23

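The only way I have managed to do this is with a brute-force script in Python. I create a dictionary keyed by id, whose value is a list of all the other fields. Whenever I find an id that is already in the dictionary, I just append the last two fields of that row to the list stored under that id.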
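For reference, the dictionary approach described above might look roughly like the sketch below. The function name `merge_by_id` and the in-memory sample data are made up for illustration; a real script would read from and write to files instead of `io.StringIO`.

```python
import csv
import io

def merge_by_id(rows):
    """Merge rows that share an id, appending each extra name/value pair."""
    merged = {}  # id -> [id, tag1, tag2, name1, value1, name2, value2, ...]
    for row in rows:
        if not row:
            continue  # skip blank lines
        key = row[0]
        if key in merged:
            # Seen this id before: keep only the trailing name/value fields
            merged[key].extend(row[3:])
        else:
            merged[key] = list(row)
    # Dicts keep insertion order (Python 3.7+), so ids come out in
    # first-seen order
    return list(merged.values())

# Hypothetical sample input, taken from the example in the question
sample = io.StringIO(
    "id,tag1,tag2,name1,value1\n"
    "12,orange,car,john,32\n"
    "13,green,bike,george,23\n"
    "12,orange,car,elen,21\n"
)
reader = csv.reader(sample)
header = next(reader)          # keep the header out of the merge
result = merge_by_id(reader)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(result)
print(out.getvalue())
```

This makes a single pass over the file but holds every merged row in memory at once.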

However, I doubt this is the most efficient way to do it on a file this large. Is there another way to do it, perhaps with a library?

Answer 1

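Kay's suggestion of working from sorted input data could look like the following. It relies on rows with the same id being adjacent, which sorting by id guarantees: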

with open('in.txt') as infile, open('out.txt', mode='w') as outfile:
    # When collating lines, running_line will look like:
    # ['id,tag1,tag2', 'name1', 'value1', 'name2', 'value2', ...]
    # It stays None until the first line has been read, so an empty
    # input file produces an empty output file.
    running_line = None
    for line in infile:
        # Strip only the newline, so a final line without one is not truncated
        curr_it12, name, value = line.rstrip('\n').rsplit(',', 2)
        if running_line is not None and running_line[0] == curr_it12:
            # Current line's id/tag1/tag2 matches previous line's.
            running_line.extend([name, value])
        else:
            # Current line's id/tag1/tag2 doesn't match. Output the previous...
            if running_line is not None:
                outfile.write(','.join(running_line) + '\n')
            # ...and start a new running_line
            running_line = [curr_it12, name, value]
    # Flush the last line
    if running_line is not None:
        outfile.write(','.join(running_line) + '\n')
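The same collation can also be sketched with `csv` and `itertools.groupby` from the standard library, in case the input is not sorted yet. This is an illustrative variant rather than the code above: the helper name `collate_sorted` and the sample data are made up, and it assumes the whole file fits in memory for the call to `sorted()`. For a file too large for that, the sort would have to happen outside Python first.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

def collate_sorted(rows):
    """Yield one merged row per run of adjacent rows sharing id, tag1, tag2."""
    for key, group in groupby(rows, key=lambda r: tuple(r[:3])):
        merged = list(key)
        for row in group:
            # Append everything after id/tag1/tag2 (the name/value pair)
            merged.extend(row[3:])
        yield merged

# Hypothetical sample input, taken from the example in the question
sample = io.StringIO(
    "id,tag1,tag2,name1,value1\n"
    "12,orange,car,john,32\n"
    "13,green,bike,george,23\n"
    "12,orange,car,elen,21\n"
)
reader = csv.reader(sample)
header = next(reader)  # set the header aside so it is not sorted with the data

# sorted() is stable, so rows with the same id keep their original order.
# Note that ids are compared as strings here, not as numbers.
rows = sorted(reader, key=itemgetter(0))
collated = list(collate_sorted(rows))

for row in [header] + collated:
    print(','.join(row))
```

`groupby` only groups adjacent rows, which is why the sort comes first. If the file is already sorted by id, the `sorted()` call can be dropped and the reader passed to `collate_sorted` directly, so the file is streamed rather than loaded.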

Joining rows in a CSV using Python
