Question

I have two dataframes, one containing the content of PDFs plus metadata (5000 rows in total):

PDF_title    Content                  Author    Year
a            this is cleaned which... Pete      2009
b            a pdf about some topi... John      2006
c            here is another artic... Tom       1997
etc.

and one with labels (5100 rows in total):

Item                     Label
9308                     hello
837_c                    pdf some
2982                     another article
2_hic                    this cleaned which
2829_d                   another label
etc.

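I want to find the number of times each label appears in the content. Since a label sometimes does not match part of the content exactly, but has other words in between, I came up with the following regex pattern to allow for those gaps (the 5 is subjective):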

pat = "\W+(?:\w+\W+){0,5}?"
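To make the intent of this pattern concrete, here is a minimal sketch of what it does once it replaces the spaces in a label. The sample text is made up for illustration, and `build_label_regex` is a hypothetical helper name, not something from my actual code.

```python
import re

# Gap pattern from the question: a separator, then lazily up to 5 extra words.
GAP = r"\W+(?:\w+\W+){0,5}?"

def build_label_regex(label):
    """Join the words of a label with the gap pattern (hypothetical helper)."""
    return GAP.join(label.split())

# Made-up sample text, not taken from the real data.
text = "this is cleaned text which was cleaned; this cleaned which"
pattern = build_label_regex("this cleaned which")

# Counts both the gapped match and the exact match.
count = len(re.findall(pattern, text))
print(count)
```

Here the label matches once with words in between ("this is cleaned text which") and once exactly ("this cleaned which").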

Suppose 'this cleaned which' (2_hic) appears 4 times in PDF a, and so on. I would like my result to look like the following, printing only counts > 0:

a
2_hic: 4
9308: 7

b
837_c: 2

c
2982: 6

I tried the following, but I get a "multiple repeat" error. I know it has something to do with the regex, but I am not sure how to fix it!

df2['Label_regex'] = df2['Label'].str.replace(" ", pat)

tup_df1 = [tuple(x) for x in df1.values]
tup_df1 = [tuple(map(str,eachTuple)) for eachTuple in tup_df1]

tup_df2 = [tuple(x) for x in df2.values]
tup_df2 = [tuple(map(str,eachTuple)) for eachTuple in tup_df2] 

result = []
for doc in tup_df1:
    doc_result = []
    for con in tup_df2:
        length = len(re.findall(con[2], doc[1]))
        if 0<length:          
            doc_result.append((con[1], length))
    if doc_result:
        result.append((doc[0],doc_result))

for i in result:
    print i[0]
    for j in i[1]:
        print '\t{}: {}'.format(j[0],j[1])

As I said, running this gives me a "multiple repeat" error. I hope I have explained myself clearly. I am new to Python, so any feedback is welcome :-)
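One possible cause, stated as an assumption because the real labels are not shown here: `re` raises "multiple repeat" when two quantifiers end up next to each other, which can happen if a label itself contains regex metacharacters such as `*`, `+` or `?`. The sketch below uses a made-up label (`2**n growth`) to reproduce the error, then avoids it by passing each word through `re.escape` before joining. `label_to_regex` is a hypothetical helper name.

```python
import re

GAP = r"\W+(?:\w+\W+){0,5}?"

def label_to_regex(label):
    """Escape each word, then join with the gap pattern (hypothetical helper)."""
    return GAP.join(re.escape(word) for word in label.split())

label = "2**n growth"  # made-up label containing regex metacharacters

# Unescaped: "2**" puts two quantifiers side by side, so compilation fails.
try:
    re.compile(GAP.join(label.split()))
    unescaped_error = None
except re.error as exc:
    unescaped_error = str(exc)

# Escaped: the asterisks are matched literally.
safe = re.compile(label_to_regex(label))
count = len(safe.findall("cost is 2**n in growth, and 2**n growth again"))

print(unescaped_error)
print(count)
```

If this is indeed the cause, building `Label_regex` from escaped words rather than with a plain `str.replace(" ", pat)` on the raw label should keep the rest of the loop unchanged.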
