I'm using some code to extract every complete sentence that contains a given keyword from a long text. Here is a sample of the dataset:
import re
import pandas as pd

temp = [[1, 'some other words. The door was closed. some other words'],
        [2, 'The door was painted. The door is not opened. some other words'],
        [3, 'words and letters and numbers . No door is seen. some other letters words'],
        [4, 'other words .Door is green. maybe other words']]
Data = pd.DataFrame(temp, columns=['ID', 'Report'])
The result I need looks like this:
0 The door was closed.
1 The door was painted., The door is not opened.
2 No door is seen.
3 Door is green.
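The following code extracts all the complete sentences:
x = Data['Report'].str.extractall(r"([^.]*?door[^.]*\.)", flags=re.IGNORECASE)
As shown above, the second row contains two instances of "door", so the code extracts them as separate matches: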
         0
  match
0 0      The door was closed.
1 0      The door was painted.
  1      The door is not opened.
2 0      No door is seen.
3 0      Door is green.
For example, I need to add something that joins the two sentences at index 1 into a single row/string.
The following code does solve the problem, using findall:
y = Data['Report'].str.findall(r"([^.]*?door[^.]*\.)", flags=re.I).apply(','.join)
0 The door was closed.
1 The door was painted., The door is not opened.
2 No door is seen.
3 Door is green.
But I'm trying to get the same result with str.extractall. Any suggestions? The end goal is to merge the result with the original DataFrame to get this output:
ID Report Final
0 1 some other word... The door was closed.
1 2 The door was... The door was painted., The door is not opened.
2 3 words and ... No door is seen.
3 4 other words... Door is green.
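
For reference, here is a rough sketch of one way the extractall result might be collapsed back to one row per report. It relies on extractall returning a DataFrame whose first index level is the original row index and whose second level is the match number, so grouping on level 0 and joining should glue each row's sentences together. The names matches and joined are just illustrative, and the strip() call and the ', ' separator are my own choices to reproduce the spacing shown above; this is not necessarily the best approach.

```python
import re
import pandas as pd

temp = [
    [1, 'some other words. The door was closed. some other words'],
    [2, 'The door was painted. The door is not opened. some other words'],
    [3, 'words and letters and numbers . No door is seen. some other letters words'],
    [4, 'other words .Door is green. maybe other words'],
]
Data = pd.DataFrame(temp, columns=['ID', 'Report'])

# extractall returns a DataFrame indexed by (original row, match number);
# column 0 holds the single capture group.
matches = Data['Report'].str.extractall(r"([^.]*?door[^.]*\.)", flags=re.IGNORECASE)

# Strip surrounding whitespace from each match, then group on the original
# row index (level 0) and join that row's sentences into one string.
joined = matches[0].str.strip().groupby(level=0).agg(', '.join)

# Assignment aligns on the index; a row with no match would end up as NaN.
Data['Final'] = joined
print(Data)
```

Because the assignment aligns on the index, no explicit merge is needed as long as Data keeps its original index.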