String to table in Python

Time: 2019-03-05 15:54:27

Tags: python pandas

I have an NLP-tagged string in the following format:

ABC [B-ORG] Funding [I-ORG] Angela [I-PER] Ham [I-PER] Stockholm [S-LOC] Chief Executive Officer \n Head of XYZ [E-ORG]

I need a DataFrame (df) output for it:

 Text                     Label
 ABC Funding              ORG
 Angela Ham               PER
 Stockholm                LOC
 Chief Executive Officer
 Head of
 XYZ                      ORG

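Please consider:

  1. There are also untagged strings, such as "Chief Executive Officer" and "Head of" above; they should be preserved.
  2. The string contains newlines (\n), each of which should start a new row in the df.
  3. Consecutive text with the same label should be grouped together, e.g. "ABC Funding", unless there is a \n between the strings.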

1 answer:

Answer 0 (score: 0)

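  1. Use a regular expression to extract the parts (there are other ways to do this) and load them into a DataFrame
    import re
    import pandas as pd

    # `string` holds the tagged text from the question
    ner_parts = re.findall(r'([\w ]+)\s(?:\[\w-([\w]+)]|\n)', string)
    df = pd.DataFrame(ner_parts, columns=['text', 'label'])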

                           text label
    0                       ABC   ORG
    1                   Funding   ORG
    2                    Angela   PER
    3                       Ham   PER
    4                 Stockholm   LOC
    5   Chief Executive Officer      
    6               Head of XYZ   ORG
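
The snippet in step 1 relies on a variable `string` that the answer never defines. A minimal, self-contained sketch of that step might look like the following; it assumes the `\n` in the question is a real newline character rather than a backslash followed by `n`, and the value assigned to `string` is simply the sample from the question.

```python
import re

# Assumption: the "\n" in the question is a real newline character,
# not a backslash followed by the letter n.
string = ("ABC [B-ORG] Funding [I-ORG] Angela [I-PER] Ham [I-PER] "
          "Stockholm [S-LOC] Chief Executive Officer \n Head of XYZ [E-ORG]")

# Same pattern as in step 1, written as a raw string.
ner_parts = re.findall(r'([\w ]+)\s(?:\[\w-([\w]+)]|\n)', string)

for text, label in ner_parts:
    # repr() makes the leading space kept by the pattern visible.
    print(repr(text), repr(label))
```

Every captured text after the first keeps its leading space, which is why the merge step further down can concatenate the pieces without adding a separator.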

  2. Group consecutive labels
    # A new group starts wherever the label differs from the one in the row above
    groups = (df.label != df.label.shift()).cumsum()
    groups.name = 'group'  # only so the final result looks nicer
    groups

    0    1
    1    1
    2    2
    3    2
    4    3
    5    4
    6    5

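
To see how the group ids come about, the shift-and-cumsum idiom can be unpacked into its intermediate steps. This is only an illustrative sketch: the label values are copied from the step 1 output, and the names `previous`, `starts_run`, and `steps` are made up for the example.

```python
import pandas as pd

# Labels as produced by step 1 (the empty string is the untagged row).
label = pd.Series(["ORG", "ORG", "PER", "PER", "LOC", "", "ORG"], name="label")

previous = label.shift()        # label of the row above (NaN for the first row)
starts_run = label != previous  # True wherever a new run of labels begins
groups = starts_run.cumsum()    # running count of run starts = group id

# Put the intermediate columns side by side for inspection.
steps = pd.DataFrame({"label": label, "previous": previous,
                      "starts_run": starts_run, "group": groups})
print(steps)
```

Because the first row is compared against NaN, it always counts as the start of a run, so the ids begin at 1.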
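  3. Merge the text within each group
    def merge_text(group):
        # Concatenate the texts of one run (the regex keeps the space before
        # each word, so no separator is needed) and keep the run's label.
        return pd.Series([group['text'].str.cat(), group['label'].iat[0]],
                         index=['text', 'label'])

    df.groupby(groups).apply(merge_text)

                               text label
    group                                
    1                   ABC Funding   ORG
    2                    Angela Ham   PER
    3                     Stockholm   LOC
    4       Chief Executive Officer      
    5                   Head of XYZ   ORG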
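
The answer's pattern attaches a tag to everything captured before it, so "Head of XYZ" ends up as a single ORG row, whereas the question's expected table lists "Head of" (unlabeled) and "XYZ" (ORG) separately. One possible variation is to tokenize word by word and start a new run whenever the label or the line changes. It is a sketch, not the answer's method, and it assumes that each tag applies only to the single word directly before it and that `\n` is a real newline. The names `tokens`, `run`, and `result` are illustrative.

```python
import re
import pandas as pd

# Assumption: "\n" is a real newline character.
text = ("ABC [B-ORG] Funding [I-ORG] Angela [I-PER] Ham [I-PER] "
        "Stockholm [S-LOC] Chief Executive Officer \n Head of XYZ [E-ORG]")

rows = []
for line_no, line in enumerate(text.split("\n")):
    # Each match is one word, optionally followed by a tag such as [B-ORG];
    # the label is the empty string when no tag follows the word.
    for word, label in re.findall(r"(\w+)(?:\s*\[\w-(\w+)\])?", line):
        rows.append({"line": line_no, "text": word, "label": label})

tokens = pd.DataFrame(rows)

# A new run starts whenever the label or the line differs from the row above.
changed = ((tokens["label"] != tokens["label"].shift())
           | (tokens["line"] != tokens["line"].shift()))
run = changed.cumsum()

# Join the words of each run with a space and keep the run's label.
result = (
    tokens.groupby(run)
    .agg(text=("text", " ".join), label=("label", "first"))
    .reset_index(drop=True)
)
print(result)
```

Tracking the line number alongside the label is what keeps "Chief Executive Officer" and "Head of" apart even though both are unlabeled.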