I have a DataFrame that looks like this:
DF_Old =
ID NER tID POS token R
1 B-ORG 1 NNP univesity "OrgBased_In+university of washington seismology lab.*wash"
1 I-ORG 1 IN of "OrgBased_In+university of washington seismology lab.*wash"
1 I-ORG 1 NNP washington "OrgBased_In+university of washington seismology lab.*wash"
1 I-ORG 1 NNP seismology "OrgBased_In+university of washington seismology lab.*wash"
1 L-ORG 1 NNP lab "OrgBased_In+university of washington seismology lab.*wash"
1 U-LOC 22 NNP wash "OrgBased_In+university of washington seismology lab.*wash"
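All columns except R should be self-explanatory. Column R holds the relation label for the row (OrgBased_In) and its direction: the characters after the "+" and before the "*" belong to the first argument, and the characters after the "*" belong to the second argument. I now want to extract this information (together with the prefix of the NER tag) into a new column, Relations.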
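As a minimal sketch of the format just described (parse_relation is a hypothetical helper used only for illustration, and it assumes a single entry with exactly one "+" and one "*"), one R value can be taken apart like this:

```python
def parse_relation(entry):
    """Split 'Label+arg1*arg2' into (label, arg1, arg2)."""
    # Everything before the first "+" is the relation label.
    label, _, rest = entry.partition('+')
    # The first "*" separates argument 1 from argument 2.
    arg1, _, arg2 = rest.partition('*')
    return label, arg1, arg2


entry = 'OrgBased_In+university of washington seismology lab.*wash'
parsed = parse_relation(entry)
print(parsed)
# ('OrgBased_In', 'university of washington seismology lab.', 'wash')
```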
I went through a number of steps to get the DataFrame I want:
import numpy as np
import pandas as pd

# Relation label(s) found in R, de-duplicated.
DF["Re"] = DF.R.str.findall(r"(Kill|Live_In|Located_In|OrgBased_In|Work_For)\+").str.join(',')
DF["Re"] = DF["Re"].str.split(',').apply(set).str.join(',')
# Argument 1 sits between "+" and "*"; argument 2 follows "*".
DF["Argument1"] = DF["R"].str.split('+').str[1]
DF["Argument1"] = DF["Argument1"].str.split('*').str[0]
DF["Argument2"] = DF["R"].str.split('*').str[-1]
DF["Argument2"] = DF["Argument2"].str.split(',').str[0]
DF["Argument1"] = DF["Argument1"].fillna("N")
DF["Argument2"] = DF["Argument2"].fillna("N")
# Mark each token as ARG1 or ARG2 depending on which argument contains it.
tokens = DF['token'].replace(r"-\d[\d]*", "", regex=True)
conditions = [[tok in arg for tok, arg in zip(tokens, DF['Argument1'])],
              [tok in arg for tok, arg in zip(tokens, DF['Argument2'])]]
choices = ["ARG1", "ARG2"]
DF["ARG"] = np.select(conditions, choices, default="O")
DF["Re"] = DF["Re"].str.split(',').str[0]
DF["Relations"] = DF["Re"] + "-" + DF["ARG"] + "-" + DF["NER"].str.split("-").str[0]
After dropping all the columns I no longer need, I get the following (correct) result:
DF_New =
ID NER tID POS token Re ARG Relations
1 B-ORG 1 NNP univesity OrgBased_In ARG1 OrgBased_In-ARG1-B
1 I-ORG 1 IN of OrgBased_In ARG1 OrgBased_In-ARG1-I
1 I-ORG 1 NNP washington OrgBased_In ARG1 OrgBased_In-ARG1-I
1 I-ORG 1 NNP seismology OrgBased_In ARG1 OrgBased_In-ARG1-I
1 L-ORG 1 NNP lab OrgBased_In ARG1 OrgBased_In-ARG1-L
1 U-LOC 22 NNP wash OrgBased_In ARG2 OrgBased_In-ARG2-U
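But when I load new data into the DataFrame, some rows take part in more than one relation, so the R column contains more than one label.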
DF_2 =
ID NER tID POS token R
1 B-ORG 1 NNP univesity "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1 I-ORG 1 IN of "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1 I-ORG 1 NNP washington "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1 I-ORG 1 NNP seismology "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1 L-ORG 1 NNP lab "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1 U-LOC 22 NNP wash "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1 B-Peop 25 NNP chris ",Work_For+chris jonientz-trisler*university of washington seismology lab."
1 L-Peop 25 NNP jonientz-trisler ",Work_For+chris jonientz-trisler*university of washington seismology lab."
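As you can see, the structure is the same, with ",," separating the two parts. The R column may also contain more than two entries. My code does not recognize that these are two different relations, so the result is wrong.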
DF_2_Expected =
ID NER tID POS token Re ARG Relations
1 B-ORG 1 NNP univesity OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-B, Work_For-ARG2-B
1 I-ORG 1 IN of OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-I, Work_For-ARG2-I
1 I-ORG 1 NNP washington OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-I, Work_For-ARG2-I
1 I-ORG 1 NNP seismology OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-I, Work_For-ARG2-I
1 L-ORG 1 NNP lab OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-L, Work_For-ARG2-L
1 U-LOC 22 NNP wash OrgBased_In ARG2 OrgBased_In-ARG2-U
1 B-Peop 25 NNP chris Work_For ARG1 Work_For-ARG1-B
1 L-Peop 25 NNP jonientz-trisler Work_For ARG1 Work_For-ARG1-L
What I get instead:
DF_2_Got =
ID NER tID POS token Re ARG Relations
1 B-ORG 1 NNP univesity OrgBased_In ARG1 OrgBased_In-ARG1-B
1 I-ORG 1 IN of OrgBased_In ARG1 OrgBased_In-ARG1-I
1 I-ORG 1 NNP washington OrgBased_In ARG1 OrgBased_In-ARG1-I
1 I-ORG 1 NNP seismology OrgBased_In ARG1 OrgBased_In-ARG1-I
1 L-ORG 1 NNP lab OrgBased_In ARG1 OrgBased_In-ARG1-L
1 U-LOC 22 NNP wash OrgBased_In ARG2 OrgBased_In-ARG2-U
1 B-Peop 25 NNP chris Work_For ARG1 Work_For-ARG1-B
1 L-Peop 25 NNP jonientz-trisler Work_For ARG1 Work_For-ARG1-L
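I can't work out how to change my code to get the expected output. What do I need to do? Any ideas?
Edit: Would it be sensible to split the rows on the ",," separator?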
Answer 0 (score: 1)
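With problems like this, it is a good idea to start from an input string and write a pure-Python function that applies your transformation. The pandas string methods are not particularly efficient either, so you may choose not to "pandorize" your algorithm at all.
So let's take a couple of examples: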
a = 'OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab.'
b = ',Work_For+chris jonientz-trisler*university of washington seismology lab.'
You can define a function that splits them using nothing more than str.strip and str.split.
def splitter(x):
    return [i.split('+')[0] for i in x.strip(',').split(',,')]
print(splitter(a))
['OrgBased_In', 'Work_For']
print(splitter(b))
['Work_For']
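The question mentions that R may hold more than two entries. As a quick check, assuming further entries are joined by the same ",," separator, splitter needs no changes. The string c below is made up for illustration and uses placeholder arguments rather than real data:

```python
def splitter(x):
    # Same function as above: drop stray leading/trailing commas,
    # split on ",,", and keep the label before each "+".
    return [i.split('+')[0] for i in x.strip(',').split(',,')]


# Hypothetical R value with three relation entries.
c = 'OrgBased_In+x*y,,Work_For+p*x,,Live_In+p*y'
labels = splitter(c)
print(labels)
# ['OrgBased_In', 'Work_For', 'Live_In']
```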
You can then use the splitter function with pd.Series.apply, followed by a list comprehension. Formatted string literals (f-strings), available in Python 3.6 and later, are handy here.
import pandas as pd

df = pd.DataFrame({'NER': ['B-ORG', 'B-Peop25'],
                   'Relations': [a, b]})
df['Relations'] = df['Relations'].apply(splitter)
df['Relations'] = [', '.join([f'{k}-ARG{idx}-{j.split("-")[0]}'
                              for idx, k in enumerate(i, 1)])
                   for i, j in zip(df['Relations'], df['NER'])]
print(df)
NER Relations
0 B-ORG OrgBased_In-ARG1-B, Work_For-ARG2-B
1 B-Peop25 Work_For-ARG1-B
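Note that we skip creating a separate series that records how many arguments exist. Instead, enumerate inside the inner list comprehension supplies the argument number.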
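To make the role of enumerate concrete, here is a small standalone sketch for a single row. The variable names labels, ner, tags and result are chosen for illustration only:

```python
labels = ['OrgBased_In', 'Work_For']  # what splitter returns for the first example
ner = 'B-ORG'

# enumerate(labels, 1) yields (1, 'OrgBased_In'), (2, 'Work_For').
tags = [f'{label}-ARG{idx}-{ner.split("-")[0]}'
        for idx, label in enumerate(labels, 1)]
result = ', '.join(tags)
print(result)
# OrgBased_In-ARG1-B, Work_For-ARG2-B
```

In this sketch idx is simply the position of the label in the list. It does not look at which side of the "*" the token appears on, so it matches the expected ARG1/ARG2 values only when the relations happen to be listed in that order.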
If you are not on Python 3.6+, you can replace the f-string with str.format. That is, instead of f'{k}-ARG{idx}-{j.split("-")[0]}', use:
'{0}-ARG{1}-{2}'.format(k, idx, j.split('-')[0])
Answer 1 (score: 1)
If you want to stay within the pandas workflow, you can do the following:
import re
import pandas as pd

a = 'OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab.'
b = ',Work_For+chris jonientz-trisler*university of washington seismology lab.'
c = ['university', 'of', 'washington', 'seismology', 'lab', 'wash', 'chris']
df = pd.DataFrame({'NER': ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'U-LOC', 'B-PEOP'],
                   'R': [a, a, a, a, a, a, b], 'token': c})

def function(row):
    # Split R on commas and drop the empty strings left by the ",," separator.
    temp = list(filter(None, re.split(',', row['R'])))
    # Keep only the entries that contain this row's token as a whole word.
    # Building a new list avoids deleting from a list while indexing into it.
    temp = [x for x in temp
            if row['token'] in re.split(r'[ `\=~!@#$%^&*()_+\[\]{};\'\\:"|<,./<>?]', x)]
    relations = [x.split('+')[0] for x in temp]
    # The token is ARG2 if it occurs after the "*", otherwise ARG1.
    temp2 = ['-ARG2' if row['token'] in x.split('*')[1] else '-ARG1' for x in temp]
    # Append the first character of the NER tag (B, I, L or U).
    return ", ".join(rel + arg + '-' + row['NER'][0] for rel, arg in zip(relations, temp2))

df['Relations'] = df.apply(function, axis=1)
Explanation
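The first filtering step removes the entries of column R that do not match the token, as is needed for the "wash" token in the expected DataFrame. The regular expression splits each entry into words; you can simplify it by keeping only the delimiters you actually need. For example, I left '-' out of it because one of the tokens contains a hyphen.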
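The following sketch illustrates why the entry is split into words before the membership test. It uses a deliberately reduced delimiter set, which is an assumption that only covers the characters in this sample string:

```python
import re

entry = 'Work_For+chris jonientz-trisler*university of washington seismology lab.'

# Substring test: 'wash' is found inside 'washington'.
substring_hit = 'wash' in entry

# Whole-word test: split on a reduced delimiter set that leaves '-' out,
# so hyphenated tokens stay in one piece.
words = re.split(r'[ +*.",]', entry)
word_hit = 'wash' in words
hyphen_hit = 'jonientz-trisler' in words

print(substring_hit, word_hit, hyphen_hit)
# True False True
```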
The function could also be tightened further by dropping intermediate variables and using more comprehensions.
Output
df.Relations
0 OrgBased_In-ARG1-B, Work_For-ARG2-B
1 OrgBased_In-ARG1-I, Work_For-ARG2-I
2 OrgBased_In-ARG1-I, Work_For-ARG2-I
3 OrgBased_In-ARG1-I, Work_For-ARG2-I
4 OrgBased_In-ARG1-I, Work_For-ARG2-I
5 OrgBased_In-ARG2-U
6 Work_For-ARG1-B
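The edit to the question asks whether splitting on ",," would be sensible. As a rough pure-Python sketch of that idea (an alternative to the function above, not the answerer's code), each R value can be split into entries first and each entry checked on both sides of the "*". The helper name relations_for and the "O" fallback for tokens that match nothing are assumptions made for illustration:

```python
def relations_for(token, ner, r):
    """Build the Relations string for one row from its token, NER tag and R value."""
    tags = []
    for entry in r.strip(',').split(',,'):  # one entry per relation
        label, _, rest = entry.strip('"').partition('+')
        arg1, _, arg2 = rest.partition('*')
        for name, arg in (('ARG1', arg1), ('ARG2', arg2)):
            # Whole-word match, ignoring a trailing '.' or '"' on each word.
            if token in [w.strip('."') for w in arg.split()]:
                tags.append(f'{label}-{name}-{ner[0]}')
                break
    return ', '.join(tags) if tags else 'O'


a = 'OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab.'
b = ',Work_For+chris jonientz-trisler*university of washington seismology lab.'

print(relations_for('lab', 'L-ORG', a))     # OrgBased_In-ARG1-L, Work_For-ARG2-L
print(relations_for('wash', 'U-LOC', a))    # OrgBased_In-ARG2-U
print(relations_for('chris', 'B-Peop', b))  # Work_For-ARG1-B
```

With a DataFrame, the same helper could be applied row by row, for example with a list comprehension over zip(df['token'], df['NER'], df['R']).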