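I have a very large CSV file containing project descriptions; let's call it CSV A. The description text sits under column headers named "L0200_0", "L0240_0", "L0242_0", and so on. I also have a list of keywords stored in another CSV file, which we'll call CSV B. CSV B looks like this: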
artificial intelligence, natural language processing, research & development, machine learning, ...
I want to search the relevant columns of CSV A and get a count for each string in CSV B.
I know I can get string counts by doing something like this:
import csv
search_for = ['artificial intelligence', 'natural language processing', 'research & development', 'machine learning']
with open('in.csv') as inf, open('out.csv','w') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        if row[0] in search_for:
            print('Found: {}'.format(row))
            writer.writerow(row)
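However, I have a lot of keywords. Rather than listing them individually in my code, I would prefer to keep them in a CSV file (B) and search the large CSV file (A) directly from that file.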
Answer 0 (score: 0)
This really sounds like a job for a pandas DataFrame. But first, it sounds as if CSV A might be laid out like this:
'L02_A', 'L02_B', 'L02_C'
description for L02_A artificial intelligence, description for L02_B natural language processing, description for L02_C research & development machine learning research & development
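If so, you will need to flip it the other way (transpose it) so that the descriptions sit in a single column, and then give that column a name. If that is not the case, skip the transpose and rename steps.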
import pandas as pd
import re
df = pd.read_csv("path/to/my.csv")
df = df.transpose()
df = df.rename({0:"description"}, axis=1)
output:
description
'L02_A' description for L02_A artificial intelligence
'L02_B' description for L02_B natural language processing
'L02_C' description for L02_C research & development machine learning research & development
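You can certainly read the search terms from a one-line CSV, but I prefer to store the search terms on separate lines so that they can be loaded with the following code.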
with open("path/to/search_terms.txt", 'r') as fh:  # close the file when done
    search_terms = [term.strip() for term in fh if term.strip()]  # skip blank lines
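If the keywords stay on a single comma-separated line, as in the CSV B sample from the question, they can also be loaded with the standard csv module. The following is only a minimal sketch: the helper name load_terms and the in-memory StringIO stand-in for the real file are assumptions made for illustration.

```python
import csv
import io

def load_terms(fileobj):
    """Collect keywords from every cell of a comma-separated file object."""
    # skipinitialspace drops the blank that follows each comma in CSV B
    reader = csv.reader(fileobj, skipinitialspace=True)
    terms = []
    for row in reader:
        for cell in row:
            cell = cell.strip()
            if cell and cell not in terms:  # drop blanks and duplicates, keep order
                terms.append(cell)
    return terms

# Stand-in for something like open("path/to/csv_b.csv", newline="")
csv_b = io.StringIO(
    "artificial intelligence, natural language processing, "
    "research & development, machine learning\n"
)
search_terms = load_terms(csv_b)
print(search_terms)
```

The resulting list can be used wherever search_terms appears in the rest of this answer.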
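The simplest way to get a count is to find all keyword matches first and then take the length of that list. Note that this gives the number of matches per description, not a separate count for each keyword.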
re_pattern = "|".join([re.escape(term) for term in search_terms])
df["search_terms_found"] = df["description"].str.findall(re_pattern)
df["num_terms_found"] = df["search_terms_found"].str.len() # in pandas str.len works on lists as well as strings
df
output:
description search_terms_found num_terms_found
'L02_A' description for L02_A artificial intelligence [artificial intelligence] 1
'L02_B' description for L02_B natural language processing [natural language processing] 1
'L02_C' description for L02_C research & development r... [research & development, research & developmen... 3
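The question asks for a count per keyword, whereas the column above counts matches per description. One way to get per-keyword totals is sketched below with only the standard library; the sample descriptions are taken from the example rows, and the variable names are illustrative. Sorting the terms longest-first is an added precaution (not part of the code above) so that a longer keyword wins over a shorter keyword it contains.

```python
import re
from collections import Counter

search_terms = ["artificial intelligence", "research & development", "machine learning"]
descriptions = [
    "description for L02_A artificial intelligence",
    "description for L02_C research & development machine learning research & development",
]

# Regex alternation takes the first alternative that matches at a position,
# so put longer terms first to prefer the longest keyword.
ordered = sorted(search_terms, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(term) for term in ordered))

# Start every keyword at zero so terms that never occur still show up.
counts = Counter({term: 0 for term in search_terms})
for text in descriptions:
    for hit in pattern.findall(text):
        counts[hit] += 1

for term in search_terms:
    print(term, counts[term])
```

With the DataFrame from this answer, something like df["search_terms_found"].explode().value_counts() should give similar totals, although keywords with no matches would not appear in that result.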
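One caveat: if you have a very long list of search terms, an Aho-Corasick trie will be faster than a regular expression.
I use the noaho package (pip install noaho), which makes it easy to find all non-overlapping keywords.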
from noaho import NoAho
trie = NoAho()
for term in search_terms:
    trie.add(term, term)
def noaho_find(text):
    return [xx for xx in trie.findall_long(text)]
df["search_terms_found"] = df.apply(lambda xx: noaho_find(xx["description"]), axis=1)