Question

I have a DataFrame that holds the results of an NER classifier, like this:

df =

s        token        pred       tokenID
17     hakawati       B-Loc         3
17     theatre        L-Loc         3
17     jerusalem      U-Loc         7
56     university     B-Org         5
56     of             I-Org         5
56     texas          I-Org         5
56     here           L-Org         6
...
5402   dwight         B-Peop        1    
5402   d.             I-Peop        1
5402   eisenhower     L-Peop        1

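The DataFrame has many other columns that are not relevant here. Now I want to group the tokens by sentence ID (= s) and predicted label, combining them into single entities: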

df2 =


s        token                        pred               
17     hakawati  theatre           Location
17     jerusalem                   Location
56     university of texas here    Organisation
...
5402   dwight d. eisenhower        People

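Normally I would do this with a line like data_map = df.groupby(["s"], as_index=False, sort=False).agg(" ".join) and then rename the columns. But since the labels carry different prefixes and types (B, I, L - Loc / Org ...), I don't know how to get it done.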

Any ideas are appreciated.


Answer 1

One solution uses a helper column.

df['pred_cat'] = df['pred'].str.split('-').str[-1]

res = df.groupby(['s', 'pred_cat'])['token']\
        .apply(' '.join).reset_index()

print(res)

      s pred_cat                       token
0    17      Loc  hakawati theatre jerusalem
1    56      Org    university of texas here
2  5402     Peop        dwight d. eisenhower

Note that this does not exactly match your desired output; getting there seems to require some data-specific handling.
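
If jerusalem should stay separate from hakawati theatre, as in the desired output, one option is to use the tag prefixes themselves. This is only a sketch under a few assumptions: the tags follow a BILOU-style scheme in which every entity begins with B- or U-, the rows are in reading order, and there are no untagged rows to skip. The entity_id helper column and the label_names mapping are names I made up for illustration; the full label names are taken from the desired output in the question.

```python
import pandas as pd

# Sample rows from the question (the rows elided by "..." are left out).
df = pd.DataFrame({
    "s": [17, 17, 17, 56, 56, 56, 56, 5402, 5402, 5402],
    "token": ["hakawati", "theatre", "jerusalem", "university", "of",
              "texas", "here", "dwight", "d.", "eisenhower"],
    "pred": ["B-Loc", "L-Loc", "U-Loc", "B-Org", "I-Org", "I-Org",
             "L-Org", "B-Peop", "I-Peop", "L-Peop"],
})

# Assumption: a new entity starts at every B- (begin) or U- (unit) tag,
# so a running count of those tags gives each entity its own id.
starts = df["pred"].str[0].isin(["B", "U"])
df["entity_id"] = starts.cumsum()

# Hypothetical mapping from short labels to the names in the desired output.
label_names = {"Loc": "Location", "Org": "Organisation", "Peop": "People"}

res = (
    df.groupby(["s", "entity_id"], sort=False)
      .agg(token=("token", " ".join), pred=("pred", "first"))
      .reset_index()
      .drop(columns="entity_id")
)

# Strip the B-/U- prefix and expand the label name.
res["pred"] = res["pred"].str.split("-").str[-1].map(label_names)

print(res)
```

Malformed tag sequences (for example an I- tag with no preceding B-) would be attached to whatever entity came before them in the same sentence, so real classifier output may need an extra check.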

Answer 2

You can group by s and tokenID and aggregate, like this:

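import pandas as pd

def aggregate(group):
    # Join the tokens of one (s, tokenID) group into a single string.
    token = " ".join(group.token)
    # Take the label of the first row and drop the B-/I-/L-/U- prefix.
    pred = group.iloc[0].pred.split("-", 1)[1]
    return pd.Series({"token": token, "pred": pred})

df.groupby(["s", "tokenID"]).apply(aggregate)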

# Output
                             token  pred
s    tokenID                            
17   3            hakawati theatre   Loc
     7                   jerusalem   Loc
56   5         university of texas   Org
     6                        here   Org
5402 1        dwight d. eisenhower  Peop
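
The same grouping can probably also be written with named aggregation instead of a custom function passed to apply. A minimal sketch, assuming the sample rows from the question and that the first label in each (s, tokenID) group is representative of the whole group:

```python
import pandas as pd

# Sample rows from the question (the rows elided by "..." are left out).
df = pd.DataFrame({
    "s": [17, 17, 17, 56, 56, 56, 56, 5402, 5402, 5402],
    "token": ["hakawati", "theatre", "jerusalem", "university", "of",
              "texas", "here", "dwight", "d.", "eisenhower"],
    "pred": ["B-Loc", "L-Loc", "U-Loc", "B-Org", "I-Org", "I-Org",
             "L-Org", "B-Peop", "I-Peop", "L-Peop"],
    "tokenID": [3, 3, 7, 5, 5, 5, 6, 1, 1, 1],
})

# Join tokens and keep the first label per (s, tokenID) group.
res = (
    df.groupby(["s", "tokenID"], sort=False)
      .agg(token=("token", " ".join), pred=("pred", "first"))
)

# Drop the B-/I-/L-/U- prefix, keeping only the entity type.
res["pred"] = res["pred"].str.split("-", n=1).str[1]

print(res)
```

As in the output above, here ends up in its own row because its tokenID (6) differs from the rest of university of texas (5), so the result depends on how tokenID was assigned.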

Combining rows in a DataFrame
