In a data frame

Time: 2018-02-27 14:44:15

Tags: python regex pandas

I have two data frames. One contains the content and metadata of PDFs (5000 rows in total):

PDF_title    Content                  Author    Year
a            this is cleaned which... Pete      2009
b            a pdf about some topi... John      2006
c            here is another artic... Tom       1997
etc.

The other contains labels (5100 rows in total):

Item                     Label
9308                     hello
837_c                    pdf some
2982                     another article
2_hic                    this cleaned which
2829_d                   another label
etc.

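I want to count how many times each label occurs in the content. A label sometimes does not match a stretch of content exactly because other words sit between the label's words, so I came up with the following regex pattern to allow for such gaps (the 5 is arbitrary):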

pat = r"\W+(?:\w+\W+){0,5}?"
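
To illustrate what this pattern does once it replaces the spaces in a label, here is a minimal sketch using only the `re` module. The sample sentence is made up for illustration and is not taken from the actual PDFs.

```python
import re

pat = r"\W+(?:\w+\W+){0,5}?"

# Replace each space in the label with the gap pattern (plain string replace)
label = "this cleaned which"
label_regex = label.replace(" ", pat)

# Hypothetical content: the first occurrence has "is" between "this" and
# "cleaned", the second occurrence is an exact match
content = "this is cleaned which means this cleaned which"

matches = re.findall(label_regex, content)
print(len(matches))  # 2
print(matches[0])    # this is cleaned which
```

The gap group is lazy (`{0,5}?`), so it skips as few intervening words as possible before the next label word. Note that the pattern has no word boundaries, so a label word can also match inside a longer word.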

Suppose 'this cleaned which' (2_hic) occurs 4 times in PDF a, and so on. I would like my result to look like this, printing only counts > 0:

a
2_hic: 4
9308: 7

b
837_c: 2

c
2982: 6

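I have tried the following, but I get a "multiple repeat" error. I know it has something to do with the regular expression, but I am not sure how to fix it!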

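import re

# Turn each label into a regex by replacing every space with the gap pattern
df2['Label_regex'] = df2['Label'].str.replace(" ", pat)

# Rows of df1 as tuples of strings: (PDF_title, Content, Author, Year)
tup_df1 = [tuple(map(str, row)) for row in df1.values]

# Rows of df2 as tuples of strings: (Item, Label, Label_regex)
tup_df2 = [tuple(map(str, row)) for row in df2.values]

result = []
for doc in tup_df1:
    doc_result = []
    for con in tup_df2:
        # Count occurrences of the label regex (con[2]) in the content (doc[1])
        length = len(re.findall(con[2], doc[1]))
        if length > 0:
            # Record the Item id (con[0]), which is what the desired output lists
            doc_result.append((con[0], length))
    if doc_result:
        result.append((doc[0], doc_result))

# Print each PDF title followed by its non-zero label counts
for i in result:
    print(i[0])
    for j in i[1]:
        print('\t{}: {}'.format(j[0], j[1]))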

As I said, running this gives me a "multiple repeat" error. I hope I have explained myself clearly. I am new to Python, so any feedback is welcome :-)
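
One plausible cause, offered as an assumption since the full label list is not shown: some labels contain characters that are regex metacharacters (for example `*`, `+` or `?`), and because the label text is used as a pattern unescaped, the regex compiler sees two quantifiers in a row. The sketch below escapes each word with `re.escape` before joining the words with the gap pattern. The helper names `label_to_regex` and `count_labels`, the label `x**2 term`, and the sample content are hypothetical and only serve the illustration.

```python
import re

pat = r"\W+(?:\w+\W+){0,5}?"

def label_to_regex(label):
    """Escape each word of the label, then join the words with the gap pattern."""
    return pat.join(re.escape(word) for word in str(label).split())

def count_labels(docs, labels):
    """docs: list of (title, content); labels: list of (item, label).
    Returns {title: [(item, count), ...]}, keeping only counts > 0."""
    compiled = []
    for item, label in labels:
        pattern = label_to_regex(label)
        if pattern:  # skip empty labels, which would match everywhere
            compiled.append((item, re.compile(pattern)))
    result = {}
    for title, content in docs:
        hits = []
        for item, rx in compiled:
            n = len(rx.findall(content))
            if n > 0:
                hits.append((item, n))
        if hits:
            result[title] = hits
    return result

# Hypothetical label with metacharacters: unescaped, it does not compile
try:
    re.compile("x**2 term".replace(" ", pat))
except re.error as exc:
    print("unescaped label fails:", exc)

# Made-up sample data in the shape of the two data frames
docs = [("a", "this is cleaned which uses the x**2 quadratic term")]
labels = [("2_hic", "this cleaned which"), ("77_x", "x**2 term"), ("9308", "hello")]

for title, hits in count_labels(docs, labels).items():
    print(title)
    for item, n in hits:
        print('\t{}: {}'.format(item, n))
```

With the data frames from the question, the same helper could be applied as `df2['Label'].apply(label_to_regex)` in place of the `str.replace` call. Compiling each pattern once up front also avoids re-parsing 5100 patterns for every one of the 5000 documents.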

0 answers:

No answers