Question

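I need to match all the rows of one sentence in dataframe 1 against dataframe 2, which contains the tokens of every sentence, and return the matching rows from dataframe 2.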

I tried a groupby operation, but it returns matches for each row individually. I want all the tokens in df1 to match, with their order preserved.

The following df contains the tokens of a single sentence only.

import pandas as pd

pdt1 = pd.DataFrame({'Word':['Obesity','in','Low-','and','Middle-Income','Countries'], 
             'tag':['O','O','O','O','O','O']})

print(pdt1)

            Word tag
0        Obesity   O
1             in   O
2           Low-   O
3            and   O
4  Middle-Income   O
5      Countries   O

The other dataframe contains the tokens of all sentences.

pdt2 = pd.DataFrame([[1, 1, 1, 'Obesity', 'O'],
       [2, 1, 1, 'in', 'O'],
       [3, 1, 1, 'Low-', 'O'],
       [4, 1, 1, 'and', 'O'],
       [5, 1, 1, 'Middle-Income', 'O'],
       [6, 1, 1, 'Countries', 'O'],
       [7, 1, 2, 'We', 'O'],
       [8, 1, 2, 'have', 'O'],
       [9, 1, 2, 'reviewed', 'O'],
       [10, 1, 2, 'the', 'O'],
       [11, 1, 2, 'distinctive', 'O'],
       [12, 1, 2, 'features', 'O'],
       [13, 1, 2, 'of', 'O'],
       [14, 1, 2, 'excess', 'O'],
       [15, 1, 2, 'weight', 'O'],
       [16, 1, 2, ',', 'O'],
       [17, 1, 2, 'its', 'O'],
       [18, 1, 2, 'causes', 'O'],
       [19, 1, 2, ',', 'O'],
       [20, 1, 2, 'and', 'O'],
       [21, 1, 2, 'related', 'O'],
       [22, 1, 2, 'prevention', 'O'],
       [23, 1, 2, 'and', 'O'],
       [24, 1, 2, 'management', 'O'],
       [25, 1, 2, 'efforts', 'O']])

pdt2.columns = ['id','Doc_ID','Sent_ID','Word','tag']
print(pdt2)


     id  Doc_ID  Sent_ID           Word tag
0    1       1        1        Obesity   O
1    2       1        1             in   O
2    3       1        1           Low-   O
3    4       1        1            and   O
4    5       1        1  Middle-Income   O
5    6       1        1      Countries   O
6    7       1        2             We   O
7    8       1        2           have   O
8    9       1        2       reviewed   O
9   10       1        2            the   O
10  11       1        2    distinctive   O
11  12       1        2       features   O
12  13       1        2             of   O
13  14       1        2         excess   O
14  15       1        2         weight   O
15  16       1        2              ,   O
16  17       1        2            its   O
17  18       1        2         causes   O
18  19       1        2              ,   O
19  20       1        2            and   O
20  21       1        2        related   O
21  22       1        2     prevention   O
22  23       1        2            and   O
23  24       1        2     management   O
24  25       1        2        efforts   O

The output should look like:

    id  Doc_ID  Sent_ID           Word tag
0    1       1        1        Obesity   O
1    2       1        1             in   O
2    3       1        1           Low-   O
3    4       1        1            and   O
4    5       1        1  Middle-Income   O
5    6       1        1      Countries   O

Answer 1

Do you mean:

print(pdt2[pdt2['Sent_ID'] == 1])

Output:

    id  Doc_ID  Sent_ID           Word tag
0    1       1        1        Obesity   O
1    2       1        1             in   O
2    3       1        1           Low-   O
3    4       1        1            and   O
4    5       1        1  Middle-Income   O
5    6       1        1      Countries   O

Edit:

print(pdt1.merge(pdt2[pdt2['Sent_ID'] == 1],on=['Word','tag']))

Output:

            Word tag  id  Doc_ID  Sent_ID
0        Obesity   O   1       1        1
1             in   O   2       1        1
2           Low-   O   3       1        1
3            and   O   4       1        1
4  Middle-Income   O   5       1        1
5      Countries   O   6       1        1
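Both snippets above hard-code `Sent_ID == 1`. If the sentence ID is not known in advance, one option is to compare the whole token sequence of each sentence against `pdt1`. The following is only a minimal sketch under that assumption: the helper name `find_sentence` is made up for illustration, and the corpus is a shortened copy of the question's data.

```python
import pandas as pd

def find_sentence(tokens: pd.DataFrame, corpus: pd.DataFrame) -> pd.DataFrame:
    """Return the corpus rows of the first sentence whose (Word, tag)
    sequence equals that of `tokens`, in the same order."""
    target = list(zip(tokens['Word'], tokens['tag']))
    # Each group is one sentence; sort=False keeps the corpus order.
    for _, sent in corpus.groupby(['Doc_ID', 'Sent_ID'], sort=False):
        if list(zip(sent['Word'], sent['tag'])) == target:
            return sent
    return corpus.iloc[0:0]  # no match: empty frame with the same columns

pdt1 = pd.DataFrame({'Word': ['Obesity', 'in', 'Low-', 'and', 'Middle-Income', 'Countries'],
                     'tag': ['O'] * 6})

# Shortened corpus: sentence 1 in full, plus the first three tokens of sentence 2.
pdt2 = pd.DataFrame({'id': list(range(1, 10)),
                     'Doc_ID': [1] * 9,
                     'Sent_ID': [1] * 6 + [2] * 3,
                     'Word': ['Obesity', 'in', 'Low-', 'and', 'Middle-Income',
                              'Countries', 'We', 'have', 'reviewed'],
                     'tag': ['O'] * 9})

matched = find_sentence(pdt1, pdt2)
print(matched)
```

Because the comparison is done on ordered lists, a sentence only matches when every token agrees and the order is the same, which is what the question asks for.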

Answer 2

This should work:

pdt2[pdt2[['Word', 'tag']].isin(pdt1).all(axis=1)]

    id  Doc_ID  Sent_ID           Word tag
0    1       1        1        Obesity   O
1    2       1        1             in   O
2    3       1        1           Low-   O
3    4       1        1            and   O
4    5       1        1  Middle-Income   O
5    6       1        1      Countries   O
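One caveat that may matter here: when `isin` receives a DataFrame, pandas requires the index and column labels to match as well as the values. The `isin` line above works on the question's data because the sentence sits at rows 0 to 5 of both frames. A small sketch with made-up data (`query` and `corpus` are hypothetical names) shows what happens when the sentence starts later, and one way to line the labels up if the start position is known:

```python
import pandas as pd

# Hypothetical data: the sentence to find starts at row 2 of the corpus.
query = pd.DataFrame({'Word': ['We', 'have'], 'tag': ['O', 'O']})
corpus = pd.DataFrame({'Word': ['Obesity', 'in', 'We', 'have'],
                       'tag': ['O', 'O', 'O', 'O']})

# isin(DataFrame) compares cells at matching index/column labels only,
# so rows 0-1 of `query` are compared with rows 0-1 of `corpus`.
mask = corpus[['Word', 'tag']].isin(query).all(axis=1)
print(mask.tolist())          # [False, False, False, False]

# Assumption: the start row is known. Re-label the query with that slice
# of the corpus index so the comparison lines up.
start = 2
shifted = query.copy()
shifted.index = corpus.index[start:start + len(query)]
mask_shifted = corpus[['Word', 'tag']].isin(shifted).all(axis=1)
print(mask_shifted.tolist())  # [False, False, True, True]
```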

Answer 3

Also:

df = pdt2.merge(pdt1, how = 'inner',on=['Word','tag']).drop_duplicates('Word')
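Note that `drop_duplicates('Word')` keeps only the first row for each word, so a token that legitimately occurs twice in the sentence would lose its second row. A possible variation, sketched here with made-up data, numbers repeated tokens with `groupby(...).cumcount()` and merges on that occurrence number as well. The helper column `occ` and the frame names are assumptions for illustration, and this pairs tokens by occurrence count rather than checking strict sequence order.

```python
import pandas as pd

# Hypothetical data: 'and' occurs twice in the sentence to match,
# and a third time later in the corpus.
tokens = pd.DataFrame({'Word': ['causes', 'and', 'related',
                                'prevention', 'and', 'management'],
                       'tag': ['O'] * 6})
corpus = pd.DataFrame({'id': list(range(1, 9)),
                       'Word': ['causes', 'and', 'related', 'prevention',
                                'and', 'management', 'and', 'efforts'],
                       'tag': ['O'] * 8})

keys = ['Word', 'tag']
# occ = 0 for the first occurrence of a (Word, tag) pair, 1 for the second, ...
left = corpus.assign(occ=corpus.groupby(keys).cumcount())
right = tokens.assign(occ=tokens.groupby(keys).cumcount())

# Each token pairs with at most one corpus row, so no duplicates need dropping.
matched = left.merge(right, how='inner', on=keys + ['occ']).drop(columns='occ')
print(matched)
```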

Compare multiple rows of two dataframes
