Pandas: split sentences and labeled phrases into words for BIO tagging

Time: 2019-04-17 07:27:48

Tags: python string pandas

I have labeled data like this:

    import pandas as pd

    Data = {'text': ['when can I decrease the contribution to my health savings?',
                     'I love my guinea pig',
                     'I love my dog'],
            'start': [43, 10, 10],
            'end': [57, 19, 12],
            'entity': ['hsa', 'pet', 'pet'],
            'value': ['health savings', 'guinea pig', 'dog']}
    df = pd.DataFrame(Data)

       text               start  end         entity     value
0   .. health savings      43    57          hsa      health savings
1   I love my guinea pig   10    19          pet      guinea pig
2   I love my dog          10    12          pet         dog

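I want to split each sentence into words and tag every word. If a word belongs to an entity, it should be tagged with that entity.

I have tried the approach from this question: Split sentences in pandas into sentence number and words

However, that approach only works when the value is a single word such as "dog". It does not work when the value is a phrase such as "guinea pig".

I want to do BIO tagging: B marks the beginning of a phrase, I marks a word inside a phrase, and O marks a word outside any phrase.

So the desired output would be: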

    Sentence #  Word         Entity
0   Sentence: 0 when            O
1   Sentence: 0 can             O
2   Sentence: 0 I               O
3   Sentence: 0 decrease        O
4   Sentence: 0 the             O
5   Sentence: 0 contribution    O
6   Sentence: 0 to              O
7   Sentence: 0 my              O
8   Sentence: 0 health          B-hsa
9   Sentence: 0 savings?        I-hsa
10  Sentence: 1 I               O
11  Sentence: 1 love            O
12  Sentence: 1 my              O
13  Sentence: 1 guinea          B-pet
14  Sentence: 1 pig             I-pet
15  Sentence: 2 I               O
16  Sentence: 2 love            O
17  Sentence: 2 my              O
18  Sentence: 2 dog             B-pet

2 Answers:

Answer 0: (score: 1)

Use:

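import numpy as np

# One row per word; keep 'value' and 'entity' next to the sentence number
df1 = (df.set_index(['value','entity'], append=True)
         .text.str.split(expand=True)
         .stack()
         .reset_index(level=3, drop=True)
         .reset_index(name='Word')
         .rename(columns={'level_0':'Sentence'}))

df1['Sentence'] = 'Sentence: ' + df1['Sentence'].astype(str)
# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
w = df1['Word'].str.replace(r'[^\w\s]+', '', regex=True)
splitted = df1.pop('value').str.split()
e = df1.pop('entity')

# m1: the word is the first word of the phrase
# m2: the word occurs anywhere in the phrase
m1 = splitted.str[0].eq(w)
m2 = np.array([b in a for a, b in zip(splitted, w)])

df1['Entity'] = np.select([m1, m2 & ~m1], ['B-' + e, 'I-' + e], default='O')

print(df1)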

       Sentence          Word Entity
0   Sentence: 0          when      O
1   Sentence: 0           can      O
2   Sentence: 0             I      O
3   Sentence: 0      decrease      O
4   Sentence: 0           the      O
5   Sentence: 0  contribution      O
6   Sentence: 0            to      O
7   Sentence: 0            my      O
8   Sentence: 0        health  B-hsa
9   Sentence: 0      savings?  I-hsa
10  Sentence: 1             I      O
11  Sentence: 1          love      O
12  Sentence: 1            my      O
13  Sentence: 1        guinea  B-pet
14  Sentence: 1           pig  I-pet
15  Sentence: 2             I      O
16  Sentence: 2          love      O
17  Sentence: 2            my      O
18  Sentence: 2           dog  B-pet

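Explanation:

  1. First create a new DataFrame with DataFrame.set_index, Series.str.split and DataFrame.stack
  2. Clean up the index with DataFrame.reset_index and DataFrame.rename
  3. Prepend the string 'Sentence: ' to the Sentence column
  4. Remove punctuation with Series.str.replace
  5. Use DataFrame.pop to extract the columns, and Series.str.split to turn value into lists
  6. Create mask m1 by comparing each word with the first value of the split list
  7. Create mask m2 by checking whether each word appears anywhere in the split list
  8. Create the new column with numpy.select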
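To make steps 4 to 8 easier to follow, here is a minimal plain-Python sketch of the same comparison logic for a single sentence. The function name `tag_words` is made up for this illustration and is not part of the answer's code. It also shows a limitation that follows from the masks: words are compared by value, not by position, so a phrase word that also appears elsewhere in the sentence gets tagged as well.

```python
import re

def tag_words(text, value, entity):
    """Tag each whitespace-separated word of `text` by comparing it with `value`.

    Hypothetical helper that mirrors masks m1 and m2 for one sentence.
    """
    phrase = value.split()
    rows = []
    for word in text.split():
        bare = re.sub(r'[^\w\s]+', '', word)   # step 4: drop punctuation
        if bare == phrase[0]:                   # m1: first word of the phrase
            tag = 'B-' + entity
        elif bare in phrase:                    # m2 & ~m1: any other phrase word
            tag = 'I-' + entity
        else:
            tag = 'O'
        rows.append((word, tag))
    return rows

print(tag_words('I love my guinea pig', 'guinea pig', 'pet'))
# Matching is by word, not by position, so the first "pig" is tagged too
print(tag_words('my pig likes my guinea pig', 'guinea pig', 'pet'))
```

If repeated words are a concern in your data, the start offsets from the question are a more reliable signal than word equality.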

Answer 1: (score: 1)

Step 1: Split the column values on spaces with the following code:

# Split each phrase on spaces and stack the words into one Series
s = df['value'].str.split(' ', expand=True).stack()
s.index = s.index.droplevel(-1)  # to line up with df's index
s.name = 'value'  # needs a name to join
del df['value']
df1 = df.join(s)
df1 = df1.reset_index()

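The step above breaks your phrases into individual words.

Step 2: df1 now has a new value column with one word per row. All that remains is to update the entity column to match the new value column.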

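prev_text = None
tags = []
for _, row in df1.iterrows():
    # A row with the same text as the previous row continues the same phrase
    prefix = 'I-' if row['text'] == prev_text else 'B-'
    # Use the row's own entity instead of a hard-coded label
    tags.append(prefix + row['entity'])
    prev_text = row['text']
df1['entity'] = tags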

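The code above rewrites the entity field with this logic: a row whose text matches the previous row continues a phrase and gets an I- tag; otherwise it starts a phrase and gets a B- tag.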
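Steps 1 and 2 can also be written without an explicit loop. The following is only a sketch, assuming a pandas version that has DataFrame.explode (0.25 or later) and at most one labeled phrase per row. It compares the original row number (the index column created by reset_index) rather than the text, so two identical sentences in a row are not merged by mistake.

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['I love my guinea pig', 'I love my dog'],
    'entity': ['pet', 'pet'],
    'value': ['guinea pig', 'dog'],
})

# One row per word of the labeled phrase; reset_index() keeps the original
# row number in a column named 'index'
df1 = df.assign(value=df['value'].str.split()).explode('value').reset_index()

# A row continues a phrase when it comes from the same original row as the
# row directly above it
is_continuation = df1['index'].eq(df1['index'].shift())
prefix = is_continuation.map({True: 'I-', False: 'B-'})
df1['entity'] = prefix + df1['entity']

print(df1[['index', 'value', 'entity']])
```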

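Step 3: After this, your DataFrame looks like the one in the link you posted, so you can simply apply the same solution.

The answer above addresses the phrase problem mentioned in your question.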
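Neither answer uses the start column from the question. As an alternative sketch (not taken from either answer), words can be tagged by their character position instead of by word equality. The function name `bio_from_offsets` is made up for this illustration. It assumes start is a 0-based character offset and that each sentence has one labeled phrase. Because the end values in the sample data do not follow a single convention (57 is exclusive, while 19 and 12 are inclusive), the sketch derives the end of the span from start plus the length of value.

```python
import re

def bio_from_offsets(text, start, value, entity):
    """Tag words whose character span overlaps [start, start + len(value)).

    Hypothetical helper; assumes one labeled phrase per sentence.
    """
    end = start + len(value)   # exclusive end, derived from the value itself
    rows, inside = [], False
    for m in re.finditer(r'\S+', text):
        if m.start() < end and m.end() > start:   # word overlaps the phrase
            tag = ('I-' if inside else 'B-') + entity
            inside = True
        else:
            tag = 'O'
        rows.append((m.group(), tag))
    return rows

sentence = 'when can I decrease the contribution to my health savings?'
for word, tag in bio_from_offsets(sentence, 43, 'health savings', 'hsa'):
    print(word, tag)
```

Applied to each row of df, for example with a list comprehension over zip(df.text, df.start, df.value, df.entity), this produces the word and tag pairs for the desired output, and punctuation attached to a word (as in "savings?") needs no special handling.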