I have labeled data like this:
import pandas as pd

Data = {'text': ['when can I decrease the contribution to my health savings?', 'I love my guinea pig', 'I love my dog'],
        'start': [43, 10, 10],
        'end': [57, 19, 12],
        'entity': ['hsa', 'pet', 'pet'],
        'value': ['health savings', 'guinea pig', 'dog']
        }
df = pd.DataFrame(Data)
                   text  start  end  entity           value
0     .. health savings     43   57     hsa  health savings
1  I love my guinea pig     10   19     pet      guinea pig
2         I love my dog     10   12     pet             dog
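I want to split each sentence into words and tag every word. If a word belongs to an entity, it should be tagged with that entity.
I have tried the approach from this question: Split sentences in pandas into sentence number and words
But that approach only works when the value is a single word such as "dog", not when the value is a phrase such as "guinea pig".
I want to perform BIO tagging. B marks the beginning of a phrase, I marks a word inside a phrase, and O marks a word outside any phrase.
So the desired output would be: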
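To make the tagging rule concrete, here is a minimal sketch for a single sentence. The function name `bio_tags` is hypothetical, and the sketch assumes that `start` is the character offset where `value` begins. It uses `start + len(value)` instead of `end`, because the `end` values in the sample do not follow one convention (57 is exclusive, 19 and 12 are inclusive).

```python
def bio_tags(text, start, value, entity):
    """Tag each whitespace-separated token of `text` with B-, I- or O."""
    span_end = start + len(value)      # exclusive end of the entity span
    tags = []
    pos = 0                            # where to resume searching in `text`
    inside = False                     # True once the first entity token is seen
    for token in text.split():
        tok_start = text.index(token, pos)   # character offset of this token
        tok_end = tok_start + len(token)
        pos = tok_end
        if tok_start < span_end and tok_end > start:   # token overlaps the span
            tags.append(('I-' if inside else 'B-') + entity)
            inside = True
        else:
            tags.append('O')
    return tags


print(bio_tags('I love my guinea pig', 10, 'guinea pig', 'pet'))
# ['O', 'O', 'O', 'B-pet', 'I-pet']
```

Because the comparison is done on character offsets, trailing punctuation such as the "?" in "savings?" does not prevent the token from being tagged.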
Sentence # Word Entity
0 Sentence: 0 when O
1 Sentence: 0 can O
2 Sentence: 0 I O
3 Sentence: 0 decrease O
4 Sentence: 0 the O
5 Sentence: 0 contribution O
6 Sentence: 0 to O
7 Sentence: 0 my O
8 Sentence: 0 health B-hsa
9 Sentence: 0 savings? I-hsa
10 Sentence: 1 I O
11 Sentence: 1 love O
12 Sentence: 1 my O
13 Sentence: 1 guinea B-pet
14 Sentence: 1 pig I-pet
15 Sentence: 2 I O
16 Sentence: 2 love O
17 Sentence: 2 my O
18 Sentence: 2 dog B-pet
Answer 0 (score: 1)
Use:
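import numpy as np

# One row per word; 'value' and 'entity' stay in the index so they repeat for every word.
df1 = (df.set_index(['value','entity'], append=True)
         .text.str.split(expand=True)
         .stack()
         .dropna()                           # drop padding left by shorter sentences
         .reset_index(level=3, drop=True)
         .reset_index(name='Word')
         .rename(columns={'level_0':'Sentence'}))
df1['Sentence'] = 'Sentence: ' + df1['Sentence'].astype(str)
# Strip punctuation so that 'savings?' compares equal to 'savings'
# (regex=True is required in recent pandas versions).
w = df1['Word'].str.replace(r'[^\w\s]+', '', regex=True)
splitted = df1.pop('value').str.split()
e = df1.pop('entity')
m1 = splitted.str[0].eq(w)                              # word is the first word of the entity
m2 = np.array([b in a for a, b in zip(splitted, w)])    # word is any word of the entity
df1['Entity'] = np.select([m1, m2 & ~m1], ['B-' + e, 'I-' + e], default='O')
print (df1)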
Sentence Word Entity
0 Sentence: 0 when O
1 Sentence: 0 can O
2 Sentence: 0 I O
3 Sentence: 0 decrease O
4 Sentence: 0 the O
5 Sentence: 0 contribution O
6 Sentence: 0 to O
7 Sentence: 0 my O
8 Sentence: 0 health B-hsa
9 Sentence: 0 savings? I-hsa
10 Sentence: 1 I O
11 Sentence: 1 love O
12 Sentence: 1 my O
13 Sentence: 1 guinea B-pet
14 Sentence: 1 pig I-pet
15 Sentence: 2 I O
16 Sentence: 2 love O
17 Sentence: 2 my O
18 Sentence: 2 dog B-pet
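Explanation:
First create a new DataFrame with DataFrame.set_index, Series.str.split and DataFrame.stack.
Tidy up the index and column names with DataFrame.reset_index and rename (DataFrame.rename_axis is another way to name index levels).
Prepend the 'Sentence: ' label to the Sentence column.
Remove punctuation with Series.str.replace.
Use DataFrame.pop to extract the columns, and split to turn each value into a list of words.
Build the mask m1 by comparing the first word of each list with w, and m2 by checking whether w occurs anywhere in the list.
Create the new column with numpy.select.
Answer 1 (score: 1)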
Step 1: split the column values on whitespace with the following code:
# expand=True gives one column per word; stack() then turns them into one row per word
s = df['value'].str.split(' ', expand=True).stack().dropna()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'value' # needs a name to join
del df['value']
df1 = df.join(s)
df1 = df1.reset_index()
The step above breaks your phrases into single words.
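A similar result can be obtained with DataFrame.explode (available in pandas 0.25 and later). This is a minimal sketch on a cut-down copy of the sample data, not a replacement for the code above; the small `df` below is only for illustration.

```python
import pandas as pd

# Cut-down illustration data: two sentences, one entity phrase each.
df = pd.DataFrame({
    'text': ['I love my guinea pig', 'I love my dog'],
    'entity': ['pet', 'pet'],
    'value': ['guinea pig', 'dog'],
})

# Turn each phrase into a list of words, then give every word its own row.
df1 = (df.assign(value=df['value'].str.split())
         .explode('value')
         .reset_index())   # the original row number ends up in an 'index' column

print(df1)
```

Unlike `del df['value']`, this version leaves the original `df` unchanged.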
Step 2: df1 now has a new value column. All you need to do is update the entity column to match the new value column.
prev_id = 'x'
for idx, ser in df1.iterrows():
    # A row that repeats the previous row's text continues the same phrase.
    if ser.text == prev_id:
        df1.loc[idx, 'entity'] = 'I-' + ser.entity
    else:
        df1.loc[idx, 'entity'] = 'B-' + ser.entity
    prev_id = ser.text
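The code above updates the entity field: the first row for each text gets a B- tag, and consecutive rows with the same text get an I- tag.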
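The same rule can also be written without iterrows. The sketch below assumes a hypothetical `df1` in the shape Step 1 produces (one row per entity word), and it prefixes each row's own entity name rather than a fixed label.

```python
import pandas as pd

# Hypothetical df1 in the shape Step 1 produces: one row per entity word.
df1 = pd.DataFrame({
    'text': ['I love my guinea pig', 'I love my guinea pig', 'I love my dog'],
    'entity': ['pet', 'pet', 'pet'],
    'value': ['guinea', 'pig', 'dog'],
})

# True where a row has the same text as the row directly above it.
same_as_prev = df1['text'].eq(df1['text'].shift())

# First word of a phrase gets 'B-', following words get 'I-'.
df1['entity'] = same_as_prev.map({True: 'I-', False: 'B-'}) + df1['entity']

print(df1)
```

Like the loop, this only compares each row with the one before it, so two adjacent rows of the original data with identical text would be treated as one phrase.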
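Step 3: after this, your dataframe resembles the one in the link you posted, so you can simply apply the same solution.
The answer above addresses the phrase problem mentioned in your question.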