Multiple entries in one column change the output of a pandas DataFrame

Time: 2018-08-15 14:45:45

Tags: python string pandas dataframe

I have a DataFrame (DF) that looks like this.

DF_Old =
ID NER   tID POS  token      R
1  B-ORG 1   NNP  univesity  "OrgBased_In+university of washington seismology lab.*wash"
1  I-ORG 1   IN   of         "OrgBased_In+university of washington seismology lab.*wash"
1  I-ORG 1   NNP  washington "OrgBased_In+university of washington seismology lab.*wash"
1  I-ORG 1   NNP  seismology "OrgBased_In+university of washington seismology lab.*wash"
1  L-ORG 1   NNP  lab        "OrgBased_In+university of washington seismology lab.*wash"
1  U-LOC 22  NNP  wash       "OrgBased_In+university of washington seismology lab.*wash"

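All columns except R should be self-explanatory. Column R contains the relation label of the row (OrgBased_In) and its direction: the characters after "+" and before "*" belong to the first argument, and the characters after "*" belong to the second argument. I now want to filter this information (together with the NER tag) into a new column, Relations.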
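To make the format of column R concrete, here is a minimal sketch that parses a single entry of the form Label+arg1*arg2. The helper name parse_entry is hypothetical (it is not part of the question's code), and the sketch assumes that each entry contains exactly one "+" and one "*".

```python
def parse_entry(entry):
    """Split 'Label+arg1*arg2' into (label, arg1, arg2).

    Hypothetical helper; assumes one '+' and one '*' per entry.
    """
    # Everything before the first '+' is the relation label.
    label, _, arguments = entry.partition('+')
    # The first argument precedes '*', the second argument follows it.
    arg1, _, arg2 = arguments.partition('*')
    return label, arg1, arg2


entry = 'OrgBased_In+university of washington seismology lab.*wash'
label, arg1, arg2 = parse_entry(entry)
print(label)  # OrgBased_In
print(arg1)   # university of washington seismology lab.
print(arg2)   # wash
```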

I went through a number of steps to get the desired DF:

import numpy as np

# Relation labels found in R, de-duplicated and joined with ','.
DF["Re"] = DF.R.str.findall(r"(Kill|Live_In|Located_In|OrgBased_In|Work_For)\+").str.join(',')
DF["Re"] = DF["Re"].str.split(',').apply(set).str.join(',')
# Argument 1 is the text between '+' and '*'; argument 2 follows '*'.
DF["Argument1"] = DF["R"].str.split('+').str[1]
DF["Argument1"] = DF["Argument1"].str.split('*').str[0]
DF["Argument2"] = DF["R"].str.split('*').str[-1]
DF["Argument2"] = DF["Argument2"].str.split(',').str[0]
DF["Argument1"] = DF["Argument1"].fillna("N")
DF["Argument2"] = DF["Argument2"].fillna("N")

# Strip numeric suffixes such as "-12" from the tokens, then test membership.
tokens = DF['token'].replace(r"-\d+", "", regex=True)
conditions = [[t in arg for t, arg in zip(tokens, DF['Argument1'])],
              [t in arg for t, arg in zip(tokens, DF['Argument2'])]]
choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default="O")
DF["Re"] = DF["Re"].str.split(',').str[0]
DF["Relations"] = DF["Re"] + "-" + DF["ARG"] + "-" + DF["NER"].str.split("-").str[0]

After dropping all the unnecessary columns, I get the following (correct) result:

DF_New =
ID NER   tID POS  token      Re           ARG      Relations
1  B-ORG 1   NNP  univesity  OrgBased_In  ARG1     OrgBased_In-ARG1-B
1  I-ORG 1   IN   of         OrgBased_In  ARG1     OrgBased_In-ARG1-I
1  I-ORG 1   NNP  washington OrgBased_In  ARG1     OrgBased_In-ARG1-I
1  I-ORG 1   NNP  seismology OrgBased_In  ARG1     OrgBased_In-ARG1-I
1  L-ORG 1   NNP  lab        OrgBased_In  ARG1     OrgBased_In-ARG1-L
1  U-LOC 22  NNP  wash       OrgBased_In  ARG2     OrgBased_In-ARG2-U

But when I put new data into the DF, some rows have multiple entries, and therefore more labels, in column R.

DF_2 =
ID NER    tID POS  token            R
1  B-ORG  1   NNP  univesity        "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1  I-ORG  1   IN   of               "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1  I-ORG  1   NNP  washington       "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1  I-ORG  1   NNP  seismology       "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1  L-ORG  1   NNP  lab              "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1  U-LOC  22  NNP  wash             "OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab."
1  B-Peop 25  NNP  chris            ",Work_For+chris jonientz-trisler*university of washington seismology lab."
1  L-Peop 25  NNP  jonientz-trisler ",Work_For+chris jonientz-trisler*university of washington seismology lab."

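As you can see, the structure is the same, with ",," as the separator between the two parts. The data may also contain more than two entries in column R. My code does not recognize that these are two different relations, so the result is wrong.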
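As a rough illustration of that structure, the sketch below splits one R value on ",," into its individual entries. The helper name split_entries is hypothetical, and the sketch assumes that the stray quotes and leading commas seen in the sample data are only wrappers and never part of an argument.

```python
def split_entries(r):
    """Return the individual 'Label+arg1*arg2' entries of one R value.

    Hypothetical helper; assumes ',,' never occurs inside an argument.
    """
    entries = []
    for part in r.split(',,'):
        # Drop stray commas and quotes wrapped around an entry.
        part = part.strip(',"')
        if part:
            entries.append(part)
    return entries


r_multi = ('"OrgBased_In+university of washington seismology lab.*wash",,'
           'Work_For+chris jonientz-trisler*university of washington seismology lab."')
r_single = '",Work_For+chris jonientz-trisler*university of washington seismology lab."'

print(split_entries(r_multi))
print(split_entries(r_single))
```

Because the function returns a list, it also covers R values with more than two entries.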

DF_2_Expected =
ID NER    tID POS  token            Re                   ARG       Relations
1  B-ORG  1   NNP  univesity        OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-B, Work_For-ARG2-B
1  I-ORG  1   IN   of               OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-I, Work_For-ARG2-I
1  I-ORG  1   NNP  washington       OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-I, Work_For-ARG2-I
1  I-ORG  1   NNP  seismology       OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-I, Work_For-ARG2-I
1  L-ORG  1   NNP  lab              OrgBased_In,Work_For ARG1,ARG2 OrgBased_In-ARG1-L, Work_For-ARG2-L
1  U-LOC  22  NNP  wash             OrgBased_In          ARG2      OrgBased_In-ARG2-U
1  B-Peop 25  NNP  chris            Work_For             ARG1      Work_For-ARG1-B
1  L-Peop 25  NNP  jonientz-trisler Work_For             ARG1      Work_For-ARG1-L

What I get:

DF_2_Got =
ID NER    tID POS  token            Re                   ARG       Relations
1  B-ORG  1   NNP  univesity        OrgBased_In          ARG1      OrgBased_In-ARG1-B
1  I-ORG  1   IN   of               OrgBased_In          ARG1      OrgBased_In-ARG1-I
1  I-ORG  1   NNP  washington       OrgBased_In          ARG1      OrgBased_In-ARG1-I
1  I-ORG  1   NNP  seismology       OrgBased_In          ARG1      OrgBased_In-ARG1-I
1  L-ORG  1   NNP  lab              OrgBased_In          ARG1      OrgBased_In-ARG1-L
1  U-LOC  22  NNP  wash             OrgBased_In          ARG2      OrgBased_In-ARG2-U
1  B-Peop 25  NNP  chris            Work_For             ARG1      Work_For-ARG1-B
1  L-Peop 25  NNP  jonientz-trisler Work_For             ARG1      Work_For-ARG1-L

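I can't work out how to change my code to get the expected output. What do I need to do? Any ideas?

Edit: Would it be wise to split the rows on the separator ",,"?

2 Answers:

Answer 0 (score: 1)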

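With problems like this, it is a good idea to start from your input strings and write a function in pure Python to apply your transformation. Pandas string-based methods are not particularly efficient either, so you may choose never to "Pandorize" your algorithm.

So let's take a couple of examples: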

a = 'OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab.'
b = ',Work_For+chris jonientz-trisler*university of washington seismology lab.'

You can define a function that splits them using only str.strip and str.split.

def splitter(x):
    return [i.split('+')[0] for i in x.strip(',').split(',,')]

print(splitter(a))
['OrgBased_In', 'Work_For']

print(splitter(b))
['Work_For']

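You can then use the splitter function with pd.Series.apply, followed by a list comprehension. Formatted string literals (f-strings), available in Python 3.6 and later, are useful here.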

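import pandas as pd

df = pd.DataFrame({'NER': ['B-ORG', 'B-Peop25'],
                   'Relations': [a, b]})

# Replace each R string with its list of relation labels.
df['Relations'] = df['Relations'].apply(splitter)

# Number the labels from 1 and append the NER prefix (the part before '-').
df['Relations'] = [', '.join([f'{k}-ARG{idx}-{j.split("-")[0]}'
                              for idx, k in enumerate(i, 1)])
                   for i, j in zip(df['Relations'], df['NER'])]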

print(df)

        NER                            Relations
0     B-ORG  OrgBased_In-ARG1-B, Work_For-ARG2-B
1  B-Peop25                      Work_For-ARG1-B

Note that we skip creating a separate series that records how many arguments there are. Instead, enumerate is used inside the inner list comprehension.
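As a small standalone illustration of that enumerate step (the variable names labels, ner_prefix and parts are assumptions made for this sketch), starting the count at 1 gives each label its ARG index:

```python
labels = ['OrgBased_In', 'Work_For']
ner_prefix = 'B'  # the part of an NER tag such as 'B-ORG' before the '-'

# enumerate(labels, 1) yields (1, 'OrgBased_In'), (2, 'Work_For').
parts = [f'{label}-ARG{idx}-{ner_prefix}' for idx, label in enumerate(labels, 1)]
print(', '.join(parts))  # OrgBased_In-ARG1-B, Work_For-ARG2-B
```

Note that in this scheme the index reflects the position of the label in the list; it does not check which argument the row's token belongs to.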


If you are not using Python 3.6+, you can replace the f-string with str.format, i.e. instead of f'{k}-ARG{idx}-{j.split("-")[0]}' use:

'{0}-ARG{1}-{2}'.format(k, idx, j.split('-')[0])

Answer 1 (score: 1)

If you want to stay within the pandas workflow, you can do the following:

import re
import pandas as pd

a = 'OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab.'
b = ',Work_For+chris jonientz-trisler*university of washington seismology lab.'
c = ['university', 'of', 'washington', 'seismology', 'lab', 'wash', 'chris']

df = pd.DataFrame({'NER': ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'U-LOC', 'B-PEOP'],
                   'R': [a, a, a, a, a, a, b], 'token': c})

def function(row):
    # Split R into its individual entries, dropping the empty strings.
    entries = list(filter(None, re.split(',', row['R'])))
    kept = []
    for x in entries:
        # Keep only the entries in which the token occurs as a whole word.
        if row['token'] in re.split(r'[ `\=~!@#$%^&*()_+\[\]{};\'\\:"|<,./<>?]', x):
            kept.append(x)
    relations = [x.split('+')[0] for x in kept]
    # The token is ARG2 if it appears after '*', otherwise ARG1.
    args = ['-ARG2' if row['token'] in x.split('*')[1] else '-ARG1' for x in kept]
    output = []
    for i in range(len(relations)):
        # Append the first character of the NER tag (B, I, L or U).
        output.append(relations[i] + args[i] + '-' + row['NER'][0])

    return ", ".join(output)

df['Relations'] = df.apply(function, axis=1)

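Explanation

The first for loop removes the entries in column R that do not match the token, as with the "wash" token in the expected DataFrame. The regex splits each entry into words; you can simplify it by keeping only the delimiters you need. For example, I removed '-' from it because one of the tokens contains it.

The code could also be optimized further by removing variables and using more comprehensions.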

Output

df.Relations
0    OrgBased_In-ARG1-B, Work_For-ARG2-B
1    OrgBased_In-ARG1-I, Work_For-ARG2-I
2    OrgBased_In-ARG1-I, Work_For-ARG2-I
3    OrgBased_In-ARG1-I, Work_For-ARG2-I
4    OrgBased_In-ARG1-I, Work_For-ARG2-I
5                     OrgBased_In-ARG2-U
6                        Work_For-ARG1-B
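
The remark above about removing variables and using more comprehensions could look roughly like the following sketch. It is a hypothetical compact variant, not the answer's own code: the names WORD_SPLIT and relations_for are made up here, the delimiter set is reduced to the characters that occur in the sample data, and every entry is assumed to contain a "*".

```python
import re

# Delimiters used to break an entry into words. '-' is deliberately absent
# so that hyphenated tokens such as 'jonientz-trisler' stay whole.
WORD_SPLIT = re.compile(r'[ +*_.,"]')


def relations_for(ner, r, token):
    """Hypothetical compact variant: build the Relations string for one row."""
    # Keep the non-empty entries in which the token occurs as a whole word.
    kept = [e for e in r.split(',') if e and token in WORD_SPLIT.split(e)]
    return ', '.join(
        '{}-{}-{}'.format(
            e.split('+')[0],                                 # relation label
            'ARG2' if token in e.split('*')[1] else 'ARG1',  # argument side
            ner[0],                                          # NER prefix letter
        )
        for e in kept
    )


a = 'OrgBased_In+university of washington seismology lab.*wash",,Work_For+chris jonientz-trisler*university of washington seismology lab.'
b = ',Work_For+chris jonientz-trisler*university of washington seismology lab.'

print(relations_for('B-ORG', a, 'university'))
print(relations_for('U-LOC', a, 'wash'))
print(relations_for('B-PEOP', b, 'chris'))
```

With a DataFrame that has the columns NER, R and token, this could be applied row by row, for example with df.apply(lambda row: relations_for(row['NER'], row['R'], row['token']), axis=1).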