Selectively deleting rows from a csv file (Python)

Time: 2017-07-21 10:09:49

Tags: python csv parsing

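I am reading a .csv file of space-separated data that contains some unwanted words. I need to check whether an unwanted word appears in any column of a given row and, if it does, delete that row.

For example, suppose unwanted_list = ['one', 'on'] and the input .csv file has the columns name class label test.

Input: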

ne two 1 five,
on one 2 we.
as we 20 on
cast as none vote

Expected output:

ne two 1 five,
cast as none vote

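2 answers:

Answer 0 (score: 1)

A simple script using Python set objects should do the trick. It keeps a line only when the set of unwanted words and the set of words on that line have no word in common: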

def filter_unwanted_words():
    unwanted_words = {'one', 'on'}
    with open('input.csv', 'r') as f:
        for line in f:
            if set(line.split()).isdisjoint(unwanted_words):
                yield line


def write_output():
    with open('output.csv', 'w') as f:
        f.writelines(filter_unwanted_words())

if __name__ == '__main__':
    write_output()

The output in output.csv is:

ne two 1 five,
cast as none vote
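To illustrate why the set comparison works on whole words rather than substrings, here is a minimal sketch. The helper name keep_line and the in-memory list of lines are assumptions made for illustration; they are not part of the answer above. Note that 'none' contains both 'on' and 'one' as substrings, yet the line survives because only complete whitespace-separated tokens are compared.

```python
unwanted_words = {'one', 'on'}


def keep_line(line, unwanted=unwanted_words):
    """Return True when no whitespace-separated token of line is unwanted."""
    return set(line.split()).isdisjoint(unwanted)


# Sample rows taken from the question
lines = ['ne two 1 five,', 'on one 2 we.', 'as we 20 on', 'cast as none vote']

# Whole-word check: keeps 'cast as none vote'
kept = [line for line in lines if keep_line(line)]

# Naive substring check, shown only for contrast: it would also drop
# 'cast as none vote', because 'on' occurs inside 'none'
kept_substring = [line for line in lines
                  if not any(word in line for word in unwanted_words)]

print(kept)
print(kept_substring)
```

One caveat with this approach: a token with punctuation attached, such as 'one,', is a different string from 'one' and would not be matched.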

Answer 1 (score: 0)

You can have a look at the csv module documentation: https://docs.python.org/2/library/csv.html

Here is some sample code from an ipython session.

In [1]: import csv

In [2]: f = open('plop.csv')

In [3]: exclude = set(('on', 'one'))

In [4]: reader = csv.reader(f, delimiter=' ')

In [5]: for row in reader:
   ...:     if any(val in exclude for val in row):
   ...:         continue
   ...:     else:
   ...:         print row
   ...:         
['name', 'class', 'label', 'test']
['ne', 'two', '1', 'five,']
['cast', 'as', 'none', 'vote']

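Feel free to adapt the script to your needs.

Note that I did not give the header any special treatment; it can be handled as shown below. This is not how you should process very large files, though, because the whole file is read into RAM.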

In [9]: f=open('plop.csv')

In [10]: reader = csv.reader(f.readlines()[1:], delimiter=' ') #skip headers

In [11]: for row in reader:
    ...:     if any(val in exclude for val in row):
    ...:         continue
    ...:     else:
    ...:         print row
    ...:         
['ne', 'two', '1', 'five,']
['cast', 'as', 'none', 'vote']
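
For large files, one possible alternative is to skip the header by advancing the reader once instead of calling readlines(), so rows are still processed one at a time. The following is a sketch under a few assumptions: it targets Python 3 (print is a function there), the names filter_rows and skip_header are hypothetical, and an io.StringIO object stands in for the opened plop.csv so the example is self-contained.

```python
import csv
import io


def filter_rows(lines, exclude, skip_header=True):
    """Yield rows that contain none of the excluded words.

    lines is any iterable of text lines, for example an open file object.
    """
    reader = csv.reader(lines, delimiter=' ')
    if skip_header:
        # Consume the header row; the default None avoids an error on empty input
        next(reader, None)
    for row in reader:
        if not any(val in exclude for val in row):
            yield row


# Stand-in for open('plop.csv'), using the data from the question
sample = io.StringIO(
    "name class label test\n"
    "ne two 1 five,\n"
    "on one 2 we.\n"
    "as we 20 on\n"
    "cast as none vote\n"
)

rows = list(filter_rows(sample, {'on', 'one'}))
for row in rows:
    print(row)
```

With a real file, the same function could be called as filter_rows(f, exclude) inside a with open('plop.csv', newline='') as f: block, which also closes the file when done.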