Question

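I'm trying to write a Python (2.7) script that merges several CSV lists (a simple append), but skips any row in file X that shares an element (other than the first column) with a row in file Y. Here is my trial script: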

import csv
import glob

with open('merged.csv','wb') as out:
    seen = set()
    output = []
    out_writer = csv.writer(out)
    csv_files = glob.glob('*.csv')
    for filename in csv_files:
        with open(filename, 'rb') as ifile:
            read = csv.reader(ifile)
            for row in read:
                if {row[1] not in seen} & {row[2] not in seen} & {row[3] not in seen}:
                    seen.add(row[1])
                    seen.add(row[2])
                    seen.add(row[3])
                    output.append(row)
    out_writer.writerows(output)

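I'm sure this could be cleaned up a bit, but this is a trial run. Why doesn't it correctly add the elements in columns 2, 3, and 4 to the seen set, and then skip a row when any of its elements is already there? Apart from not checking for duplicates correctly, it does output the merged file successfully. (Will this also work if the merged file already exists in the directory, or will I run into trouble?)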

Thanks a lot! :)

Answer 1

I suspect this line doesn't do what you want:

if {row[1] not in seen} & {row[2] not in seen} & {row[3] not in seen}:

This is a set intersection, not a logical AND. Demonstration:

>>> {False} & {True}
set([])
>>> {True} & {True}
set([True])
>>> {False} & {False}
set([False])
>>> bool(set([False]))
True    #non-empty set is True in boolean context
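
To see exactly when the set-intersection test disagrees with the intended check, here is a small sketch that enumerates all eight combinations of the three "not in seen" results. The helper names original_condition and intended_condition are made up for this illustration and are not part of the question's script.

```python
from itertools import product

def original_condition(a, b, c):
    # Mirrors `{a} & {b} & {c}` from the question: the intersection is
    # non-empty (truthy) only when all three booleans are equal.
    return bool({a} & {b} & {c})

def intended_condition(a, b, c):
    # What was presumably meant: all three values are unseen.
    return a and b and c

# Collect every combination where the two conditions disagree.
mismatches = [
    flags
    for flags in product([True, False], repeat=3)
    if original_condition(*flags) != intended_condition(*flags)
]
print(mismatches)  # [(False, False, False)]
```

So the original test goes wrong in one case: when all three values have already been seen, the intersection is {False}, which is truthy, and the row is appended anyway.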

Perhaps you meant

if row[1] not in seen and row[2] not in seen and row[3] not in seen:

or the (almost*) equivalent

if all(value not in seen for value in row[1:4]):

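(*) This version doesn't raise an exception if the row has fewer values, because slicing past the end of a list returns a shorter list instead of raising IndexError.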
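
Putting the fix together, a minimal sketch of the deduplicating merge could look like the following. It is only an illustration under a few assumptions: the function name merge_unique is hypothetical, the example uses Python 3 in-memory files (io.StringIO) in place of files on disk so it can run on its own, and skipping blank rows is an added choice, not something the question asked for.

```python
import csv
import io

def merge_unique(sources):
    """Append rows from each source, skipping any row whose columns 2-4
    contain a value that an earlier kept row already used."""
    seen = set()
    output = []
    for source in sources:
        for row in csv.reader(source):
            if not row:
                continue  # assumption: ignore blank lines
            keys = row[1:4]  # slicing never raises on short rows
            if all(value not in seen for value in keys):
                seen.update(keys)
                output.append(row)
    return output

# Two small in-memory "files"; row "c" reuses the value "1" and is dropped.
first = io.StringIO("a,1,2,3\nb,4,5,6\n")
second = io.StringIO("c,7,8,1\nd,9,10,11\n")
merged = merge_unique([first, second])
print(merged)
# [['a', '1', '2', '3'], ['b', '4', '5', '6'], ['d', '9', '10', '11']]
```

With real files under Python 2.7, the sources would be opened with 'rb' as in the question. Note also that glob.glob('*.csv') will match merged.csv itself once it exists in the directory, so it is probably safer to filter the output filename out of the input list.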

Combining CSV files with Python without duplicate elements
