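I need to match all rows of one sentence, held in dataframe 1, against dataframe 2 (which contains the tokens of every sentence) and return the matching rows from dataframe 2.
I tried a groupby operation, but it returns a match for every individual row that matches. I want all of the tokens in df1 to match, with their order preserved.
The following df contains the tokens of a single sentence only.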
pdt1 = pd.DataFrame({'Word':['Obesity','in','Low-','and','Middle-Income','Countries'],
'tag':['O','O','O','O','O','O']})
print(pdt1)
Word tag
0 Obesity O
1 in O
2 Low- O
3 and O
4 Middle-Income O
5 Countries O
The other dataframe contains the tokens of all sentences.
pdt2 = pd.DataFrame([[1, 1, 1, 'Obesity', 'O'],
[2, 1, 1, 'in', 'O'],
[3, 1, 1, 'Low-', 'O'],
[4, 1, 1, 'and', 'O'],
[5, 1, 1, 'Middle-Income', 'O'],
[6, 1, 1, 'Countries', 'O'],
[7, 1, 2, 'We', 'O'],
[8, 1, 2, 'have', 'O'],
[9, 1, 2, 'reviewed', 'O'],
[10, 1, 2, 'the', 'O'],
[11, 1, 2, 'distinctive', 'O'],
[12, 1, 2, 'features', 'O'],
[13, 1, 2, 'of', 'O'],
[14, 1, 2, 'excess', 'O'],
[15, 1, 2, 'weight', 'O'],
[16, 1, 2, ',', 'O'],
[17, 1, 2, 'its', 'O'],
[18, 1, 2, 'causes', 'O'],
[19, 1, 2, ',', 'O'],
[20, 1, 2, 'and', 'O'],
[21, 1, 2, 'related', 'O'],
[22, 1, 2, 'prevention', 'O'],
[23, 1, 2, 'and', 'O'],
[24, 1, 2, 'management', 'O'],
[25, 1, 2, 'efforts', 'O']])
pdt2.columns = ['id','Doc_ID','Sent_ID','Word','tag']
print(pdt2)
id Doc_ID Sent_ID Word tag
0 1 1 1 Obesity O
1 2 1 1 in O
2 3 1 1 Low- O
3 4 1 1 and O
4 5 1 1 Middle-Income O
5 6 1 1 Countries O
6 7 1 2 We O
7 8 1 2 have O
8 9 1 2 reviewed O
9 10 1 2 the O
10 11 1 2 distinctive O
11 12 1 2 features O
12 13 1 2 of O
13 14 1 2 excess O
14 15 1 2 weight O
15 16 1 2 , O
16 17 1 2 its O
17 18 1 2 causes O
18 19 1 2 , O
19 20 1 2 and O
20 21 1 2 related O
21 22 1 2 prevention O
22 23 1 2 and O
23 24 1 2 management O
24 25 1 2 efforts O
The expected output looks like this:
id Doc_ID Sent_ID Word tag
0 1 1 1 Obesity O
1 2 1 1 in O
2 3 1 1 Low- O
3 4 1 1 and O
4 5 1 1 Middle-Income O
5 6 1 1 Countries O
Answer 0 (score: 0)
Do you mean:
print(pdt2[pdt2['Sent_ID'] == 1])
Output:
id Doc_ID Sent_ID Word tag
0 1 1 1 Obesity O
1 2 1 1 in O
2 3 1 1 Low- O
3 4 1 1 and O
4 5 1 1 Middle-Income O
5 6 1 1 Countries O
Edit:
print(pdt1.merge(pdt2[pdt2['Sent_ID'] == 1],on=['Word','tag']))
Output:
Word tag id Doc_ID Sent_ID
0 Obesity O 1 1 1
1 in O 2 1 1
2 Low- O 3 1 1
3 and O 4 1 1
4 Middle-Income O 5 1 1
5 Countries O 6 1 1
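If the sentence ID is not known in advance, the hard-coded Sent_ID == 1 filter can be replaced by a search over sentences. The following is only a minimal sketch. It assumes that a sentence is identified by the (Doc_ID, Sent_ID) pair and that a match means the Word column is identical token for token, in the same order. The shortened pdt2 and the names target, matching and result are illustrative, not part of the original question.

```python
import pandas as pd

pdt1 = pd.DataFrame({'Word': ['Obesity', 'in', 'Low-', 'and', 'Middle-Income', 'Countries'],
                     'tag': ['O'] * 6})

# Shortened stand-in for pdt2: sentence 1 in full, plus a few tokens of sentence 2.
pdt2 = pd.DataFrame(
    [[1, 1, 1, 'Obesity', 'O'],
     [2, 1, 1, 'in', 'O'],
     [3, 1, 1, 'Low-', 'O'],
     [4, 1, 1, 'and', 'O'],
     [5, 1, 1, 'Middle-Income', 'O'],
     [6, 1, 1, 'Countries', 'O'],
     [7, 1, 2, 'We', 'O'],
     [8, 1, 2, 'have', 'O'],
     [9, 1, 2, 'and', 'O']],
    columns=['id', 'Doc_ID', 'Sent_ID', 'Word', 'tag'])

# The token sequence we are looking for, in order.
target = pdt1['Word'].tolist()

# Keep every sentence whose words equal the target sequence exactly.
matching = [grp for _, grp in pdt2.groupby(['Doc_ID', 'Sent_ID'], sort=False)
            if grp['Word'].tolist() == target]

# Fall back to an empty frame with the same columns when nothing matches.
result = pd.concat(matching) if matching else pdt2.iloc[0:0]
print(result)
```

Comparing whole lists keeps the order requirement explicit: the stray 'and' in sentence 2 is not returned, because that sentence as a whole differs from the target.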
Answer 1 (score: 0)
This should work:
pdt2[pdt2[['Word', 'tag']].isin(pdt1).all(axis=1)]
id Doc_ID Sent_ID Word tag
0 1 1 1 Obesity O
1 2 1 1 in O
2 3 1 1 Low- O
3 4 1 1 and O
4 5 1 1 Middle-Income O
5 6 1 1 Countries O
Answer 2 (score: 0)
Also:
df = pdt2.merge(pdt1, how='inner', on=['Word','tag']).drop_duplicates('Word')