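I am reading a .csv file of space-separated data that contains some unwanted words. I need to check whether an unwanted word appears in any column of a given row and, if it does, delete that row.
For example, with unwanted_list = ['one', 'on'] and an input .csv file whose columns are name class label test:
Input: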
ne two 1 five,
on one 2 we.
as we 20 on
cast as none vote
Expected output:
ne two 1 five,
cast as none vote
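Answer 0 (score: 1)
A simple script using Python set objects should do the trick. It checks that the set of unwanted words and the set of words on each line of the input file have no words in common: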
def filter_unwanted_words():
    unwanted_words = {'one', 'on'}
    with open('input.csv', 'r') as f:
        for line in f:
            if set(line.split()).isdisjoint(unwanted_words):
                yield line

def write_output():
    with open('output.csv', 'w') as f:
        f.writelines((line for line in filter_unwanted_words()))

if __name__ == '__main__':
    write_output()
The output in output.csv is:
ne two 1 five,
cast as none vote
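The set-based check compares whole whitespace-separated tokens, which is why the row containing none is kept even though it contains the substring one. A minimal sketch of that behaviour on the sample rows from the question follows; the helper name keep_line and the in-memory rows list are made up for illustration and are not part of the answer above.

```python
UNWANTED = {'one', 'on'}

def keep_line(line, unwanted=UNWANTED):
    """Return True when no whitespace-separated token of line is unwanted."""
    # hypothetical helper, named here only for illustration
    return set(line.split()).isdisjoint(unwanted)

# Sample rows copied from the question.
rows = [
    'ne two 1 five,',
    'on one 2 we.',
    'as we 20 on',
    'cast as none vote',
]

kept = [row for row in rows if keep_line(row)]
print(kept)
```

Because the comparison is on exact tokens, a field with trailing punctuation such as one, would not match one; if that matters for your data, strip punctuation from each token before the comparison.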
Answer 1 (score: 0)
You can look at the csv module documentation: https://docs.python.org/2/library/csv.html
Here is some sample code in ipython.
In [1]: import csv
In [2]: f = open('plop.csv')
In [3]: exclude = set(('on', 'one'))
In [4]: reader = csv.reader(f, delimiter=' ')
In [5]: for row in reader:
   ...:     if any(val in exclude for val in row):
   ...:         continue
   ...:     else:
   ...:         print(row)
   ...:
['name', 'class', 'label', 'test']
['ne', 'two', '1', 'five,']
['cast', 'as', 'none', 'vote']
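Feel free to adapt the script to your needs.
Note that I did not give the header row any special handling; it can be skipped as follows. This is not how you should process very large files, though, because readlines() reads the whole file into RAM.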
In [9]: f = open('plop.csv')
In [10]: reader = csv.reader(f.readlines()[1:], delimiter=' ')  # skip headers
In [11]: for row in reader:
    ...:     if any(val in exclude for val in row):
    ...:         continue
    ...:     else:
    ...:         print(row)
    ...:
['ne', 'two', '1', 'five,']
['cast', 'as', 'none', 'vote']
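For large files, one way to skip the header without loading everything into memory is to advance the reader by one row with next() and then iterate as before. The following is a rough sketch in Python 3, not the answerer's code: the function name filter_rows and the in-memory sample are assumptions made for illustration, and io.StringIO stands in for the real file.

```python
import csv
import io

def filter_rows(lines, exclude, skip_header=True):
    """Yield rows with no excluded field, reading one line at a time."""
    # hypothetical helper; `lines` is any iterable of text lines (e.g. an open file)
    reader = csv.reader(lines, delimiter=' ')
    if skip_header:
        # Consume the header row; the None default avoids StopIteration on empty input.
        next(reader, None)
    for row in reader:
        if not any(val in exclude for val in row):
            yield row

# Stand-in for the real file, using the rows from the question.
sample = io.StringIO(
    'name class label test\n'
    'ne two 1 five,\n'
    'on one 2 we.\n'
    'as we 20 on\n'
    'cast as none vote\n'
)

for row in filter_rows(sample, {'on', 'one'}):
    print(row)
```

With a real file, the same function could be called as filter_rows(open('plop.csv', newline=''), exclude); since the file object is iterated lazily, only one line is held in memory at a time.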