I have a DF with many entries. An excerpt of the DF looks like this:
DF_OLD =
...
sID tID NER token Prediction
274 79 U-Peop khrushchev Live_In-ARG2+B
274 79 O 's Live_IN-ARG2+L
807 53 U-Loc louisiana Live_IN-ARG2+U
807 56 B-Peop earl Live_IN-ARG1+B
807 57 L-Peop long Live_IN-ARG1+L
807 13 B-Peop dwight Live_IN-ARG1+B
807 13 I-Peop d. Live_IN-ARG1+I
807 13 L-Peop eisenhower Live_IN-ARG1+L
...
The sID column separates the different sentences. The Prediction column shows the output of a machine-learning classifier; these predictions can be nonsensical. My goal is to group all predicted labels according to the following scheme:
DF_Expected =
...
sID entity1 tID1 entity2 tID2 Relation
274 NaN NaN khrushchev 's 79 Live_In
807 earl long 56 57 louisiana 53 Live_In
807 dwight d. eisenhower 13 louisiana 53 Live_In
...
The "-ARGX" part indicates the entity's position in the table, while the part before the first "-" gives the relation. If one of the argument parts is missing, the corresponding cell should stay empty.
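For reference, a prediction label such as `Live_IN-ARG2+B` can be taken apart with plain string splits (a minimal sketch; the variable names are illustrative only):

```python
import pandas as pd

preds = pd.Series(["Live_In-ARG2+B", "Live_IN-ARG2+U", "Live_IN-ARG1+L"])

# relation: everything before the first "-"
relation = preds.str.split("-").str[0]
# argument slot: between the first "-" and the "+"
arg = preds.str.split("-").str[1].str.split("+").str[0]
# BILOU tag: after the "+"
tag = preds.str.split("+").str[1]

print(list(relation))  # ['Live_In', 'Live_IN', 'Live_IN']
print(list(arg))       # ['ARG2', 'ARG2', 'ARG1']
print(list(tag))       # ['B', 'U', 'L']
```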
Here is what I have tried:
DF["Live_In_Predict_Split"] = DF["Prediction"].str.split("+").str[0]
DF["token2"] = DF["token"]
DF["tID2"] = DF["tID"]
DF["Live_In_Predict2"] = DF["Live_In_Predict_Split"]
data_tokeni_map = DF.groupby(["Live_In_Predict_Split", "sID"], as_index=True, sort=False).agg(" ".join).reset_index()
s = data_tokeni_map.loc[:, ['sID', 'token2', "tID2", "Live_In_Predict2"]].merge(data_tokeni_map.loc[:, ['sID', 'token', "tID", "Live_In_Predict_Split"]], on='sID')
s = s.loc[s.token2 != s.token].drop_duplicates()
What I am missing is some kind of counter that separates the different "-ARGX-" groups, plus a suitable GroupBy (grouping by tID alone is unwise, since it would produce wrong results). As a result, my new DF is wrong:
DF_EDITED =
...
sID entity1 tID1 entity2 tID2 ...
807 dwight d eisenhower earl long 13 56 57 louisiana 53
807 louisiana 13 56 57 dwight d eisenhower earl long 53
EDIT:
I changed my code slightly. All useless predictions are now removed, but all similar predictions get grouped together. I need some kind of pre-processing step to bring the data into the form below, meaning I need to count and order all predictions for each sID.
DF_OLD_Edit =
...
sID tID NER token Prediction
274 79 U-Peop khrushchev Live_In-ARG2+B_1
274 79 O 's Live_IN-ARG2+L_1
807 53 U-Loc louisiana Live_IN-ARG2+U_1
807 56 B-Peop earl Live_IN-ARG1+B_1
807 57 L-Peop long Live_IN-ARG1+L_1
807 13 B-Peop dwight Live_IN-ARG1+B_2
807 13 I-Peop d. Live_IN-ARG1+I_2
807 13 L-Peop eisenhower Live_IN-ARG1+L_2
...
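Applied per sID (and per argument slot), a counter like the one below could produce that `_1`/`_2` numbering from the BILOU suffixes. This is only a sketch with illustrative names, not part of my original attempt; it assumes the rows of a sentence arrive in order and that each `B` or `U` tag opens a fresh span:

```python
def number_spans(tags):
    """Append _1, _2, ... to BILOU-tagged labels, starting a new
    counter group at each B/U tag that opens a fresh span."""
    group, open_span = 0, False
    out = []
    for t in tags:
        bilou = t.split("+")[1]
        if bilou in ("B", "U") and not open_span:
            group += 1
            open_span = True
        out.append(f"{t}_{group}")
        if bilou in ("L", "U"):
            open_span = False
    return out

tags = ["Live_IN-ARG1+B", "Live_IN-ARG1+L", "Live_IN-ARG1+B",
        "Live_IN-ARG1+I", "Live_IN-ARG1+L"]
print(number_spans(tags))
# ['Live_IN-ARG1+B_1', 'Live_IN-ARG1+L_1', 'Live_IN-ARG1+B_2',
#  'Live_IN-ARG1+I_2', 'Live_IN-ARG1+L_2']
```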
Answer 0 (score: 1)
Data:
df
sID tID NER token Prediction
0 274 79 U-Peop khrushchev Live_IN-ARG2+B_1
1 274 79 O 's Live_IN-ARG2+L_1
2 807 53 U-Loc louisiana Live_IN-ARG2+U_1
3 807 56 B-Peop earl Live_IN-ARG1+B_1
4 807 57 L-Peop long Live_IN-ARG1+L_1
5 807 13 B-Peop dwight Live_IN-ARG1+B_2
6 807 13 I-Peop d. Live_IN-ARG1+I_2
7 807 13 L-Peop eisenhower Live_IN-ARG1+L_2
Code:
import numpy as np
import pandas as pd
import typing
# setting up some columns for groupby
df['arg'] = df.Prediction.apply(lambda x: x.split('-')[1].split('+')[0])  # e.g. 'ARG2'
df['Relation'] = df.Prediction.apply(lambda x: x.split('-')[0])           # e.g. 'Live_IN'
df['ingroup_id'] = df.Prediction.apply(lambda x: x.split('_')[-1])        # e.g. '1'
# groupby and collect relevant tID and token
df1 = df.groupby(['sID', 'arg', 'ingroup_id']).tID.apply(list)
df2 = df.groupby(['sID', 'arg', 'ingroup_id']).token.apply(list)
df3 = pd.concat([df1, df2], axis=1).reset_index()
df3.tID = df3.tID.apply(lambda x: list(dict.fromkeys(x)))  # de-duplicate while keeping order (set() would not guarantee it)
# setting up columns that we finally use
df3.loc[df3.arg == 'ARG1', 'tID1'] = df3.tID
df3.loc[df3.arg == 'ARG2', 'tID2'] = df3.tID
df3.loc[df3.arg == 'ARG1', 'entity1'] = df3.token
df3.loc[df3.arg == 'ARG2', 'entity2'] = df3.token
# sort values and then ffill/bfill within the group
df3 = df3.sort_values(['sID', 'arg']).reset_index(drop=True)
df3.tID1 = df3.groupby(['sID']).tID1.ffill()
df3.entity1 = df3.groupby(['sID']).entity1.ffill()
df3.tID2 = df3.groupby(['sID']).tID2.bfill()
df3.entity2 = df3.groupby(['sID']).entity2.bfill()
df3 = df3[['sID', 'entity1', 'tID1', 'entity2', 'tID2']].set_index('sID')
# converting lists in cells into strings (maybe someone can make this a one-liner)
df3.entity1 = df3.entity1.apply(lambda x: ' '.join(x) if isinstance(x, list) else np.nan)
df3.entity2 = df3.entity2.apply(lambda x: ' '.join(x) if isinstance(x, list) else np.nan)
df3.tID1 = df3.tID1.apply(lambda x: ' '.join(str(y) for y in x) if isinstance(x, list) else np.nan)
df3.tID2 = df3.tID2.apply(lambda x: ' '.join(str(y) for y in x) if isinstance(x, list) else np.nan)
df3 = df3.drop_duplicates().reset_index()
df3 = df3.merge(df[['sID', 'Relation']].drop_duplicates(), on='sID', how='left')
Output:
sID entity1 tID1 entity2 tID2 Relation
0 274 NaN NaN khrushchev 's 79 Live_IN
1 807 earl long 56 57 louisiana 53 Live_IN
2 807 dwight d. eisenhower 13 louisiana 53 Live_IN
The code is verbose for lack of skill, but what it basically does is the groupby and merge suggested in the title. Hope this helps.
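As an aside, the three `apply`-based parses at the start of that answer could also be done in a single pass with `str.extract` and named capture groups. This is an equivalent sketch, not a change to the answer's logic:

```python
import pandas as pd

df = pd.DataFrame({"Prediction": ["Live_IN-ARG2+B_1", "Live_IN-ARG1+L_2"]})

# one regex yields all three grouping columns at once:
# Relation (before "-"), arg (before "+"), tag (before "_"), ingroup_id (digits)
parts = df["Prediction"].str.extract(
    r"(?P<Relation>[^-]+)-(?P<arg>[^+]+)\+(?P<tag>[^_]+)_(?P<ingroup_id>\d+)"
)
print(parts)
```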
Answer 1 (score: 0)
I had to mix plain functions and DF operations. This is not efficient at all, but it does the trick.
import ast
import re

def combine(some_list):
    # prefix each prediction with a running counter that advances
    # whenever a BILOU span is closed (L/U) or a stray U opens one
    current_group = 0
    g_size = 0
    for elem in some_list:
        g_size += 1
        if elem.endswith('U'):
            if g_size > 1:
                g_size = 1
                current_group += 1
        yield '{}{}'.format(current_group, elem)
        if elem.endswith(('L', 'U')):
            g_size = 0
            current_group += 1

def splitter(pred):
    # grab the 1-3 leading digits (the group counter) from the prefixed string
    return re.findall(r'^\d{1,3}', pred)
# Not very efficient
DF["entity2"] = DF["entity"]
DF["tID2"] = DF["tID"]
DF["Prediction2"] = DF["Prediction"]
DF["Pred_Group"] = list(combine(DF["Prediction"].tolist()))
DF["Jojo"] = DF["Pred_Group"].apply(splitter).apply(lambda x: " ".join(x))
dmap = DF.groupby(["Jojo", "sID"], as_index=True, sort=False).agg(" ".join).reset_index()
s = dmap.loc[:, ['sID', 'entity2', "tID2", "Prediction2"]].merge(dmap.loc[:, ['sID', 'entity', "tID", "Prediction"]], on='sID')
s = s.loc[s.entity2 != s.entity].drop_duplicates()
# escape the "+" so it is matched literally, not as a regex quantifier
s = s[s["Prediction"].str.contains(r"-ARG2\+")]
DF = s[s["Prediction2"].str.contains(r"-ARG1\+")]
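For what it's worth, the `combine` generator can be exercised on its own before wiring it into the DF; on a flat list of predictions it prefixes each label with the running group counter (the sample input below is my own, chosen to mirror sentence 807):

```python
def combine(some_list):
    # same generator as above: prefix each prediction with a running
    # entity-span counter derived from the trailing BILOU letter
    current_group = 0
    g_size = 0
    for elem in some_list:
        g_size += 1
        if elem.endswith('U'):
            if g_size > 1:
                g_size = 1
                current_group += 1
        yield '{}{}'.format(current_group, elem)
        if elem.endswith(('L', 'U')):
            g_size = 0
            current_group += 1

preds = ["Live_IN-ARG2+U", "Live_IN-ARG1+B", "Live_IN-ARG1+L"]
print(list(combine(preds)))
# ['0Live_IN-ARG2+U', '1Live_IN-ARG1+B', '1Live_IN-ARG1+L']
```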