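I have a list of substrings that I need to compare against one column of a CSV file: if any substring in the list occurs in that column, the row should be skipped. I want to write out only the rows whose string column contains none of these substrings. The file has many columns, but I am only looking at one.
For example, the my_string column has the values {"This is just a comparison of likely tokens", "what a tough thing?"}
de = ["just","not","really","hat"]
I would like to write only the row that has "what a tough thing?"
This works fine if the column contains just a word from the list. For example, if the my_string column has "really", the row is not written to the new file. But a row does not get through when an item in the list occurs attached to other characters (for example "hat" inside "what").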
with open(infile, newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')
    for row in reader:
        # substring test: "hat" also matches inside "what"
        if any(d in row[1] for d in de):
            pass
        else:
            writer.writerow(row)
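Answer 0 (score: 1)
You can compile the words into a single regular expression, and even make the match case-insensitive, as follows:
r = re.compile('\\b('+"|".join(de)+')\\b', re.IGNORECASE)
Then your code can simply be: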
with open(infile, newline='') as inFile, open(outfile, 'w', newline='') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')
    for row in reader:
        # keep the row only if column 1 contains none of the words
        if not r.search(row[1]):
            writer.writerow(row)
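To see how the word-boundary pattern differs from the plain substring test, here is a minimal sketch using the sample values from the question. The helper names `has_substring` and `has_word` are made up for illustration, and the use of `re.escape` is an added precaution that is not part of the answer above; it only matters if an entry in `de` contains regex metacharacters.

```python
import re

de = ["just", "not", "really", "hat"]

# Word-boundary pattern; re.escape is an assumption added for safety.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in de) + r')\b',
    re.IGNORECASE,
)

def has_substring(text):
    """Plain substring test, as in the question's code."""
    return any(d in text for d in de)

def has_word(text):
    """Whole-word, case-insensitive test using the compiled pattern."""
    return pattern.search(text) is not None

samples = [
    "This is just a comparison of likely tokens",
    "what a tough thing?",
]
for text in samples:
    print(repr(text), has_substring(text), has_word(text))
```

For the second sample the substring test reports a hit, because "hat" occurs inside "what", while the word-boundary pattern does not, so that row would be written.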
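Answer 1 (score: 0)
To check whether a string is present in a list of substrings, I usually use sets.
list1 = ['a','b','c']
list2 = ['c','d','e']
Now, to find the difference,
list3 = list(set(list1) - set(list2))
This gives you ['a', 'b'] (what is in list1 but not in list2), which are the strings you are interested in. Doing
list(set(list2) - set(list1))
gives you the strings that are in list2 but not in list1, i.e. ['e', 'd']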
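Because sets are unordered, the order of the elements in the results above is not guaranteed. A minimal sketch, assuming you want reproducible output, wraps each difference in `sorted`. It also shows `isdisjoint`, which answers the yes/no question "do these two collections share any element?" directly. The names `only_in_1` and `only_in_2` are chosen just for illustration.

```python
list1 = ['a', 'b', 'c']
list2 = ['c', 'd', 'e']

# sorted() turns the unordered set difference into a stable list
only_in_1 = sorted(set(list1) - set(list2))
only_in_2 = sorted(set(list2) - set(list1))
print(only_in_1)  # ['a', 'b']
print(only_in_2)  # ['d', 'e']

# True only when the two collections have no element in common
print(set(list1).isdisjoint(list2))  # False, both contain 'c'
```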
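Answer 2 (score: 0)
It sounds like you want to search for words rather than just substrings so that, for example, "hat" doesn't match "What". Word searches get complicated when you want to match plurals, different cases, hyphenated strings, and so on. But if you don't mind ignoring these complications, you can use a regular expression to split the column into a list of words, lowercase them, and then check using set operations.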
import re
import csv
# TEST: write a sample csv file. using col0 to indicate what should be
# in the outfile
open('infile.csv', 'w').write(
"""exclude,This is just a comparison of likely tokens,col02,col03
include,what a tough thing?,col12,col13""")
# the words to find
de = ["just","not","really", "hat"]
# the files
infile = 'infile.csv'
outfile = 'outfile.csv'
# a "normalized set" of words to search
de = set(word.lower() for word in de)
def normalize_text(text):
    """Return a set of all the words in lowercased text"""
    return set(re.findall(r'\w+', text.lower()))
with open(infile, 'r') as inFile, open(outfile, 'w') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')
    for row in reader:
        mycol = normalize_text(row[1])
        # write the row only when it shares no word with de
        if not mycol & de:
            writer.writerow(row)
print("---- output file ----")
print(open(outfile).read())
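The caveats mentioned at the top of this answer can be seen with a few made-up strings. This sketch re-declares `normalize_text` so that it runs on its own; the sample sentences and the `is_excluded` helper are hypothetical and not part of the original data.

```python
import re

de = {"just", "not", "really", "hat"}

def normalize_text(text):
    """Return a set of all the words in lowercased text"""
    return set(re.findall(r'\w+', text.lower()))

def is_excluded(text):
    """True when the text shares at least one whole word with de."""
    return bool(normalize_text(text) & de)

print(is_excluded("what a tough thing?"))  # False: "hat" is only part of "what"
print(is_excluded("Really tough"))         # True: lowercasing handles the case
print(is_excluded("two hats"))             # False: plural "hats" is a different word
print(is_excluded("a hard-hat zone"))      # True: the hyphen splits off "hat"
```

Whether the last two results are what you want depends on your data; handling plurals or hyphenated terms differently would need extra normalization beyond this simple `\w+` split.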