I have two DataFrames. The first holds the contents of PDFs plus metadata (5,000 rows in total):
PDF_title  Content                   Author  Year
a          this is cleaned which...  Pete    2009
b          a pdf about some topi...  John    2006
c          here is another artic...  Tom     1997
etc.
The second holds labels (5,100 rows in total):
Item    Label
9308    hello
837_c   pdf some
2982    another article
2_hic   this cleaned which
2829_d  another label
etc.
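I want to count how many times each label occurs in Content. A label does not always match the content word for word, because other words can sit between the label's words, so I came up with the following regex pattern to stand in for each space in a label (the 5 is an arbitrary choice):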
pat = r"\W+(?:\w+\W+){0,5}?"
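To make the intent of the pattern concrete, here is a minimal sketch of how it is meant to bridge the words of one label. The helper name `label_to_regex` and the sample sentence are made up for illustration and are not part of my real data.

```python
import re

# Gap pattern: one run of non-word characters, then lazily up to 5 extra words.
pat = r"\W+(?:\w+\W+){0,5}?"

def label_to_regex(label):
    """Join the words of a label with the gap pattern (hypothetical helper)."""
    return pat.join(label.split())

label_regex = label_to_regex("this cleaned which")

# Made-up sample text: the label words appear twice, with extra words in between.
text = "this is cleaned which and this cleaned, again, which"
matches = re.findall(label_regex, text)

print(len(matches))
for m in matches:
    print(m)
```

Because the pattern contains only non-capturing groups, `re.findall` returns the full matched substrings, so `len(matches)` is the count I am after.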
For example, if 'this cleaned which' (2_hic) occurs 4 times in PDF a, and so on, I would like my output to look like the following, printing only counts > 0:
a
    2_hic: 4
    9308: 7
b
    837_c: 2
c
    2982: 6
I tried the following, but I get a "multiple repeat" error. I know it has something to do with the regex, but I'm not sure how to fix it!
import re

# Replace each space in a label with the gap pattern
df2['Label_regex'] = df2['Label'].str.replace(" ", pat)

# Convert both DataFrames to lists of string tuples
tup_df1 = [tuple(x) for x in df1.values]
tup_df1 = [tuple(map(str, eachTuple)) for eachTuple in tup_df1]
tup_df2 = [tuple(x) for x in df2.values]
tup_df2 = [tuple(map(str, eachTuple)) for eachTuple in tup_df2]

result = []
for doc in tup_df1:
    doc_result = []
    for con in tup_df2:
        # con[2] is Label_regex, doc[1] is Content
        length = len(re.findall(con[2], doc[1]))
        if 0 < length:
            doc_result.append((con[1], length))
    if doc_result:
        result.append((doc[0], doc_result))

for i in result:
    print(i[0])
    for j in i[1]:
        print('\t{}: {}'.format(j[0], j[1]))
As I said, running this gives me a 'multiple repeat' error. I hope I've explained myself clearly. I'm new to Python, so any feedback is welcome :-)
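One possible cause, stated as an assumption rather than something confirmed against the data: if a label contains regex metacharacters such as `*`, `+` or `?`, that text ends up being read as regex syntax, and a quantifier placed right after the `{0,5}?` of the gap pattern is rejected as a "multiple repeat". The sketch below reproduces that with a made-up label and shows `re.escape` as one way around it. The names `naive_regex` and `escaped_regex` are hypothetical helpers.

```python
import re

pat = r"\W+(?:\w+\W+){0,5}?"

def naive_regex(label):
    # Same idea as replacing each space with pat: label text is left unescaped.
    return label.replace(" ", pat)

def escaped_regex(label):
    # Escape every word first, so characters like * or + match literally.
    return pat.join(re.escape(word) for word in label.split())

label = "another **label"  # hypothetical label, not taken from the real data

try:
    re.compile(naive_regex(label))
    naive_error = None
except re.error as exc:
    naive_error = str(exc)

print(naive_error)

# With escaping, the same label compiles and can be counted as usual.
count = len(re.findall(escaped_regex(label), "another quite odd **label here"))
print(count)
```

If this is indeed the cause, building the Label_regex column with something like `escaped_regex` instead of a plain space replacement would leave the rest of the loop unchanged.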