Python loop: need to keep the original DataFrame values, but can't

Time: 2019-05-30 14:22:05

Tags: python pandas loops dataframe

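I am trying to loop over groups of phrases so that every member of a group is matched and scored against every other member of the same group. Even when some phrases are identical, they can carry different codes. I strip those codes from the input to the loop, but I need them preserved in the final df2. The comparison has to run on the values without the codes; the problem is tying the results back to the original df, which contains the codes, so that I can identify the rows that need flagging.

The code below works, but I need the original DESCR values in df2. Appending a and b only gives me the trimmed values.

I have tried df.at[], with mixed results. Thanks.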

import pandas as pd
from fuzzywuzzy import fuzz as fz
import itertools

data = [[1,'Oneab'],[1,'Onebc'],[1,'Twode'],[2,'Threegh'],[2,'Threehi'],[2,'Fourjk'],[3,'Fivekl'],[3,'Fivelm'],[3,'Fiveyz']]
df = pd.DataFrame(data,columns=['Ids','DESCR'])

n_list = []
a_list = []
b_list = []
pr_list = []
tsr_list = []

groups = df.groupby('Ids')
for n,g in groups:
    for a, b in itertools.product(g['DESCR'].str[:-2],g['DESCR'].str[:-2]):
        if str(a) < str(b):
            try:
                n_list.append(n)
                a_list.append(a)
                b_list.append(b)
                pr_list.append(fz.partial_ratio(a,b))
                tsr_list.append(fz.token_set_ratio(a,b))
            except:
                pass
df2 = pd.DataFrame({'Group': n_list, 'First Comparator': a_list, 'Second Comparator': b_list, 'Partial Ratio': pr_list, 'Token Set Ratio': tsr_list})

Instead of:

ab bc 50 50
ab de 0 0
bc de 0 0
gh hi 50 50
gh jk 0 0
hi jk 50 50
...

I would like to see:

Oneab Onebc 50 50
Oneab Twode 0 0
Onebc Twode 0 0
Threegh Threehi 50 50
Threegh Fourjk 0 0
Threehi Fourjk 50 50
...

1 Answer:

Answer 0: (score: 0)

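In case anyone else runs into a similar problem: I solved this by no longer filtering the input at the start of the second-level loop. Instead, I pass the full values into the inner loop and strip them there: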

a2 = a[6:]
b2 = b[6:]

So, with the sample data from the question (where the slice is [:-2]):

import pandas as pd
from fuzzywuzzy import fuzz as fz
import itertools

data = [[1,'Oneab'],[1,'Onebc'],[1,'Twode'],[2,'Threegh'],[2,'Threehi'],[2,'Fourjk'],[3,'Fivekl'],[3,'Fivelm'],[3,'Fiveyz']]
df = pd.DataFrame(data,columns=['Ids','DESCR'])

n_list = []
a_list = []
b_list = []
pr_list = []
tsr_list = []

groups = df.groupby('Ids')
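for n,g in groups:
    # Iterate over the untrimmed DESCR values so the originals stay available.
    for a, b in itertools.product(g['DESCR'],g['DESCR']):
        # Keep each unordered pair once (this also skips identical strings).
        if str(a) < str(b):
            try:
                # Trim only the copies that are passed to the scorers.
                a2 = a[:-2]
                b2 = b[:-2]
                # Score before appending, so a failure cannot leave the lists with unequal lengths.
                pr = fz.partial_ratio(a2,b2)
                tsr = fz.token_set_ratio(a2,b2)
            except Exception:
                continue
            n_list.append(n)
            a_list.append(a)
            b_list.append(b)
            pr_list.append(pr)
            tsr_list.append(tsr)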
df2 = pd.DataFrame({'Group': n_list, 'First Comparator': a_list, 'Second Comparator': b_list, 'Partial Ratio': pr_list, 'Token Set Ratio': tsr_list})
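
For anyone who also needs to point back at the exact source rows, here is a minimal sketch of the same idea (carry the original value through the loop and trim only at scoring time) that additionally records each member's row position. It is a sketch under assumptions, not the code I ran: it uses only the standard library, `trim`, `similarity` and `score_pairs` are hypothetical helper names, and `difflib.SequenceMatcher` stands in for the fuzzywuzzy scorers, so the numbers will not match `partial_ratio` or `token_set_ratio`.

```python
import itertools
from difflib import SequenceMatcher

# Same sample data as above, as plain (group id, description) tuples.
data = [(1, 'Oneab'), (1, 'Onebc'), (1, 'Twode'),
        (2, 'Threegh'), (2, 'Threehi'), (2, 'Fourjk'),
        (3, 'Fivekl'), (3, 'Fivelm'), (3, 'Fiveyz')]


def trim(descr):
    """Hypothetical helper: drop the trailing two-character code."""
    return descr[:-2]


def similarity(x, y):
    """Stand-in scorer (assumption): difflib ratio scaled to 0-100.

    This is not the same metric as fuzzywuzzy's partial_ratio or
    token_set_ratio; swap in the real scorer if it is available.
    """
    return round(100 * SequenceMatcher(None, x, y).ratio())


def score_pairs(records):
    # Collect (row position, original description) per group, in input order.
    by_group = {}
    for pos, (group, descr) in enumerate(records):
        by_group.setdefault(group, []).append((pos, descr))

    rows = []
    for group, members in by_group.items():
        # combinations() yields each unordered pair of members exactly once.
        for (i, a), (j, b) in itertools.combinations(members, 2):
            rows.append({
                'Group': group,
                'First Row': i,            # position of a in the input
                'Second Row': j,           # position of b in the input
                'First Comparator': a,     # original value, code included
                'Second Comparator': b,
                'Score': similarity(trim(a), trim(b)),  # scored without codes
            })
    return rows


rows = score_pairs(data)
for row in rows[:3]:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` gives a frame shaped like df2 with two extra row-position columns. One difference to be aware of: `itertools.combinations` pairs members by position, so two rows with identical descriptions are still compared with each other, whereas the `str(a) < str(b)` filter above drops them.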