python - Extract rows from a CSV file that contain none of the elements in a list

Asked: 2016-05-08 23:07:41

Tags: python regex list csv tokenize

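I have a list of substrings that I need to compare against one column of a CSV file, to check whether any substring in the list occurs in that column. I want to write out only the rows whose string column contains none of these substrings. The file has many columns, and I am only looking at one of them.

For example, the my_string column has the values {"This is just a comparison of likely tokens", "what a tough thing?"}

de = ["just","not","really","hat"]

I only want to write the row that has "what a tough thing?"

The code below works when the column contains nothing but a word from the list. For example, if the my_string column has "really", the row is not written to the new file. However, if the list item appears together with other text, the row does not get through.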

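import csv

# 'rb'/'wb' follow the Python 2 csv convention; on Python 3 use 'r'/'w' with newline=''
with open(infile, 'rb') as inFile, open(outfile, 'wb') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')

    for row in reader:
        # skip the row if any list item occurs anywhere in the second column
        if any(d in row[1] for d in de):
            pass
        else:
            writer.writerow(row)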
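
One likely reading of the problem can be reproduced without any files. The following is a minimal sketch, assuming the two sample values and the list given above; the helper name `contains_any` is hypothetical and only used for illustration.

```python
de = ["just", "not", "really", "hat"]

def contains_any(text, substrings):
    """Return True if any item of substrings occurs anywhere in text."""
    return any(s in text for s in substrings)

samples = [
    "This is just a comparison of likely tokens",
    "what a tough thing?",
]

# Plain substring test: "hat" is found inside "what", so both rows are dropped.
kept = [s for s in samples if not contains_any(s, de)]
print(kept)
```

With a plain `in` test the second value is rejected as well, because "hat" is a substring of "what".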

3 Answers:

Answer 0 (score: 1)

You can compile the words into a single regular expression, and even make the match case-insensitive, as follows:

import re

# \b restricts each alternative to whole words
r = re.compile(r'\b(' + '|'.join(de) + r')\b', re.IGNORECASE)

Then your code could simply be:

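# 'rb'/'wb' follow the Python 2 csv convention; on Python 3 use 'r'/'w' with newline=''
with open(infile, 'rb') as inFile, open(outfile, 'wb') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')

    for row in reader:
        # keep the row only if no listed word occurs in the second column
        if not r.search(row[1]):
            writer.writerow(row)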
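
To try the word-boundary approach without touching the file system, here is a self-contained sketch. It assumes two sample rows modelled on the question, uses `io.StringIO` objects as stand-ins for the real files, and adds `re.escape` as a precaution in case a list item contains regex metacharacters; none of these details come from the answer itself.

```python
import csv
import io
import re

de = ["just", "not", "really", "hat"]

# \b anchors each alternative at word boundaries; re.escape guards against
# list items that contain regex metacharacters.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, de)) + r")\b", re.IGNORECASE)

# In-memory stand-ins for the input and output files (assumed sample data).
source = io.StringIO(
    "exclude,This is just a comparison of likely tokens\n"
    "include,what a tough thing?\n"
)
target = io.StringIO()

writer = csv.writer(target, lineterminator="\n")
for row in csv.reader(source):
    # Keep the row only if no listed word appears as a whole word in column 1.
    if not pattern.search(row[1]):
        writer.writerow(row)

result = target.getvalue()
print(result)
```

Only the row containing "what a tough thing?" is written, since "hat" no longer matches inside "what".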

Answer 1 (score: 0)

To check whether a string is present in a list of substrings, I usually use sets.

list1 = ['a','b','c']
list2 = ['c','d','e']

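Now, to find the difference:

list3 = list(set(list1) - set(list2))

This gives you ['a', 'b'] (the items in list1 that are not in list2), which are the strings you are interested in. Doing

list(set(list2) - set(list1))

gives you the items in list2 that are not in list1, i.e. ['e', 'd']. Sets are unordered, so the order of the resulting list may vary.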
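
A runnable version of the set difference, as a small sketch: `sorted` is used here instead of `list` only to make the printed order deterministic, and the variable names `only_in_list1` and `only_in_list2` are illustrative. Note that a set difference compares whole elements, so it does not detect a substring inside a longer string.

```python
list1 = ['a', 'b', 'c']
list2 = ['c', 'd', 'e']

# Elements that occur in one list but not in the other.
only_in_list1 = sorted(set(list1) - set(list2))
only_in_list2 = sorted(set(list2) - set(list1))

print(only_in_list1)  # ['a', 'b']
print(only_in_list2)  # ['d', 'e']
```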

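Answer 2 (score: 0)

It sounds like you want to search for words rather than just substrings, so that, for example, "hat" does not match "What". Word searches get complicated once you want to match plurals, different cases, hyphenated strings and so on. But if you don't mind ignoring those complications, you can use a regular expression to split the column into a list of words, lowercase them, and then check them with set operations.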

import re
import csv

# TEST: write a sample csv file. using col0 to indicate what should be
# in the outfile
open('infile.csv', 'w').write(
"""exclude,This is just a comparison of likely tokens,col02,col03
include,what a tough thing?,col12,col13""")

# the words to find
de = ["just","not","really", "hat"]

# the files
infile = 'infile.csv'
outfile = 'outfile.csv'

# a "normalized set" of words to search
de = set(word.lower() for word in de)

def normalize_text(text):
    """Return a set of all the words in lowercased text"""
    return set(re.findall(r'\w+', text.lower()))

with open(infile, 'r', newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')
    for row in reader:
        mycol = normalize_text(row[1])
        if not mycol & de:
            writer.writerow(row)

print("---- output file ----")
print(open(outfile).read())
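
The complications mentioned above can be seen by calling `normalize_text` directly. This is a small sketch that repeats the function so it runs on its own; the test strings "Two hats" and "not-really sure" are made up for illustration.

```python
import re

def normalize_text(text):
    """Return a set of all the words in lowercased text"""
    return set(re.findall(r'\w+', text.lower()))

de = {"just", "not", "really", "hat"}

# "What" is lowercased to the whole word "what", which is not in de.
print(bool(normalize_text("What a tough thing?") & de))   # False
# A plural is a different word, so "hats" is not matched by "hat".
print(bool(normalize_text("Two hats") & de))              # False
# A hyphen is not a word character, so "not-really" splits into two words.
print(sorted(normalize_text("not-really sure")))          # ['not', 'really', 'sure']
```

Whether a plural such as "hats" should count as a match depends on the data; handling it would need stemming or a similar step that this approach deliberately leaves out.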