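I'm not sure whether this can be done easily.
I have two DataFrames. The first (df1) has a column of text ('Texts'). The second has two columns: one with some short texts ('SubString') and one with a score ('Score').
What I want is to sum all the scores in the second DataFrame whose SubString value is a substring of the Texts column in the first DataFrame.
For example, if I have a DataFrame like this: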
import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Texts': ['this is a string',
              'here we have another string',
              'this one is completly different',
              'one more',
              'this is one more',
              'and the last one'],
    'c': ['C', 'C', 'C', 'C', 'C', 'C'],
    'd': ['D', 'D', 'D', 'D', 'NaN', 'NaN']
}, columns=['ID', 'Texts', 'c', 'd'])
df1
Out[2]:
   ID                            Texts  c    d
0   1                 this is a string  C    D
1   2      here we have another string  C    D
2   3  this one is completly different  C    D
3   4                         one more  C    D
4   5                 this is one more  C  NaN
5   6                 and the last one  C  NaN
And another DataFrame like this:
df2 = pd.DataFrame({
    'SubString': ['This', 'one', 'this is', 'is one'],
    'Score': [0.5, 0.2, 0.75, -0.5]
}, columns=['SubString', 'Score'])
df2
Out[3]:
   SubString  Score
0       This   0.50
1        one   0.20
2    this is   0.75
3     is one  -0.50
I would like to get something like this:
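df1['Score'] = 0.0
for index1, row1 in df1.iterrows():
    score = 0
    for index2, row2 in df2.iterrows():
        # `in` is case-sensitive, so 'This' matches none of these texts
        if row2['SubString'] in row1['Texts']:
            score += row2['Score']
    # set_value was removed in pandas 1.0; .at sets a single cell
    df1.at[index1, 'Score'] = score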
df1
Out[4]:
   ID                            Texts  c    d  Score
0   1                 this is a string  C    D   0.75
1   2      here we have another string  C    D   0.00
2   3  this one is completly different  C    D  -0.30
3   4                         one more  C    D   0.20
4   5                 this is one more  C  NaN   0.45
5   6                 and the last one  C  NaN   0.20
Is there a less messy and faster way to do this?
Thanks!
Answer 0 (score: 1)
Option 1
In [691]: np.array([np.where(df1.Texts.str.contains(x.SubString), x.Score, 0)
                    for _, x in df2.iterrows()]
                   ).sum(axis=0)
Out[691]: array([ 0.75, 0. , -0.3 , 0.2 , 0.45, 0.2 ])
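One caveat with Option 1: `str.contains` treats its pattern as a regular expression by default, so a substring containing characters such as `.` or `(` would not be matched literally. Below is a minimal sketch of the same idea with `regex=False`, assuming the `df1` and `df2` from the question (rebuilt here without the unused columns so the snippet runs on its own). The name `mask` is only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Texts': ['this is a string',
              'here we have another string',
              'this one is completly different',
              'one more',
              'this is one more',
              'and the last one'],
})
df2 = pd.DataFrame({
    'SubString': ['This', 'one', 'this is', 'is one'],
    'Score': [0.5, 0.2, 0.75, -0.5],
})

# One boolean column per substring: mask[i, j] is True when
# df2.SubString[j] occurs literally (case-sensitively) in df1.Texts[i].
mask = np.column_stack([
    df1['Texts'].str.contains(sub, regex=False).to_numpy(dtype=bool)
    for sub in df2['SubString']
])

# Weight every match by its score and sum over the substrings.
df1['Score'] = mask.astype(float) @ df2['Score'].to_numpy()
```

This still loops over the rows of `df2`, so it should pay off mainly when `df2` is much shorter than `df1`.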
Option 2
In [674]: df1.Texts.apply(lambda x: df2.Score[df2.SubString.apply(lambda y: y in x)].sum())
Out[674]:
0    0.75
1    0.00
2   -0.30
3    0.20
4    0.45
5    0.20
Name: Texts, dtype: float64
Note: apply does not get rid of the loops; it just hides them.
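Because the loops are still there either way, they can also be written out in plain Python. This is a minimal sketch, assuming the same `df1` and `df2` as in the question (rebuilt here without the unused columns so the snippet runs on its own). `pairs` and `total_score` are hypothetical names, and whether this is faster than the options above would need to be timed on real data.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Texts': ['this is a string',
              'here we have another string',
              'this one is completly different',
              'one more',
              'this is one more',
              'and the last one'],
})
df2 = pd.DataFrame({
    'SubString': ['This', 'one', 'this is', 'is one'],
    'Score': [0.5, 0.2, 0.75, -0.5],
})

# Pair each substring with its score once, outside the per-row loop.
pairs = list(zip(df2['SubString'], df2['Score']))


def total_score(text):
    """Sum the scores of all substrings found in `text` (case-sensitive)."""
    return sum(score for sub, score in pairs if sub in text)


# One pass over the texts; the inner loop over `pairs` is explicit.
df1['Score'] = [total_score(text) for text in df1['Texts']]
```

This avoids building a pandas Series per row, which is the main overhead of `iterrows` and nested `apply`.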