I want to extract sentences from a large piece of text. My text looks like this -
<ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>
I want to extract the proper sentences from the text above, so the expected output is a list:
['Registered Nurse in Missouri, License number xxxxxxxx, 2017',
'AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018',
'AHA PALS - Pediatric Advanced Life Support 2017-2019',
'AHA Basic Life Support 2016-2018']
I am using Python's built-in HTMLParser module to strip the HTML from the text above. Here is my code.
import re
from html.parser import HTMLParser

class HTMLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []

    def handle_data(self, chunk):
        # Called for each run of text between two tags
        self.fed.append(chunk.strip())

    def get_data(self):
        # Drop the empty chunks
        return [x for x in self.fed if x]

def strip_html_tags(html):
    try:
        s = HTMLStripper()
        s.feed(html)
        return s.get_data()
    except Exception as e:
        # Remove html strings from the given string
        p = re.compile(r'<.*?>')
        return p.sub('', html)
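Calling the strip_html_tags function on the text above gives the following result (which is in fact what the current implementation is expected to produce):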
['Registered Nurse in', 'Missouri', ', License number', 'xxxxxxxx', ',', '2017', 'AHA Advanced Cardiac Life Support (ACLS) Certification', '2016-2018', 'AHA PALS - Pediatric Advanced Life Support 2017-2019', 'AHA Basic Life Support 2016-2018']
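I can't do a strict check on the <ul> or <li> tags, because different texts may contain different HTML tags. Is there a way to split the text above only on the outer HTML tags, rather than on every HTML tag encountered?
Thanks in advance.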
Answer 0 (score: 1)
Why not use a tool that already parses HTML well? For example BeautifulSoup:
from bs4 import BeautifulSoup
demo = '<ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>'
soup = BeautifulSoup(demo, 'lxml')
sentences = [item.text for item in soup.findAll('li')]
The variable sentences now holds exactly what you asked for; try it yourself.
Based on your comment, I would use this code:
text_without_tags = soup.text
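So now you have no other tags to worry about, just a plain string, which you can turn into a list with split(',') on the commas (but if the text is not always delimited by commas or periods, I wouldn't bother and would just use the string itself).
Note: without some known structure in the text, you cannot always parse it the same way and get a predictable result. That known structure may be certain HTML tags, or it may be some feature of the text that you know in advance.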
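To illustrate the comma-splitting idea and its limits, here is a minimal sketch. The string below is hand-written for illustration (it stands in for the tag-free text of a single list item, not for the actual value of soup.text on the whole document), and the variable name parts is made up here.

```python
# Assumption: the tag-free text of ONE list item, written by hand for
# illustration; it is not the real output of soup.text on the whole input.
text_without_tags = 'Registered Nurse in Missouri, License number xxxxxxxx, 2017'

# Split on commas and drop surrounding whitespace and empty pieces.
parts = [part.strip() for part in text_without_tags.split(',') if part.strip()]

print(parts)
```

Note that this breaks one logical sentence into three pieces, which is why splitting on punctuation only helps when the text itself has a known structure.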
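Answer 1 (score: 1)
After a lot of thought, I am posting my own solution here. It works well on my various examples. The soup.findAll(specific_tag) approach would work if I knew in advance which tags I need to extract text from (so that I could apply BeautifulSoup), but that is not the case. There can also be several different tags whose text I need to extract. For example -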
<p>Science</p><div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. <span style=\"text-decoration: underline;\">Nature Methods</span> 2017,</div>
In the example above I want to extract the text from the <p> tag and from the <div> tags.
I modified the code above to handle this case -
import re
import copy
from html.parser import HTMLParser
# Local module with the sample HTML strings (presumably where mystr comes from)
from sample_htmls import *

class HTMLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.feeds = []         # finished sentences
        self.sentence = ''      # text collected for the current sentence
        self.current_path = []  # tracked tags seen since the last flush
        self.tree = []          # history of flushed paths, for debugging
        self.lookup_tags = ['div', 'span', 'p', 'ul', 'li']

    def update_feed(self):
        # Flush the current sentence and reset the path
        self.tree.append(copy.deepcopy(self.current_path))
        self.current_path[:] = []
        self.feeds.append(re.sub(' +', ' ', self.sentence).strip())
        self.sentence = ''

    def handle_starttag(self, tag, attrs):
        if tag in self.lookup_tags:
            # A new <li> always starts a new sentence
            if tag == 'li' and len(self.current_path) > 0:
                self.update_feed()
            self.current_path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.lookup_tags:
            self.current_path.append(tag)
            # Flush when the tag that opened the current path is closed
            if tag == self.current_path[0]:
                self.update_feed()

    def handle_data(self, data):
        self.sentence += ' ' + data

    def get_tree(self):
        return self.tree

    def get_data(self):
        return [x for x in self.feeds if x]
Running the code on the example from the question:
parser = HTMLStripper()
parser.feed(mystr)
l1 = parser.get_tree()
feed = parser.get_data()
print(l1)
print("\n", mystr)
print("\n", feed)
print("\n\n")
And the output -
[['ul'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['ul']]
<ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>
['Registered Nurse in Missouri , License number xxxxxxxx , 2017', 'AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018', 'AHA PALS - Pediatric Advanced Life Support 2017-2019', 'AHA Basic Life Support 2016-2018']
The same works for the HTML string with mixed tags -
[['p', 'p'], ['div', 'div'], ['div', 'span', 'span', 'div']]
<p>Science</p><div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. <span style="text-decoration: underline;">Nature Methods</span> 2017,</div>
['Science', 'Biology', 'Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. Nature Methods 2017,']
I would love to see a corner case so that I can improve the text extraction logic.
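One corner case worth checking is nesting. As far as I can tell, with the code above a string such as <div><p>a</p><p>b</p></div> comes out as the single entry 'a b', because only the end tag that matches the first tag on current_path triggers update_feed. Below is a minimal sketch of an alternative that flips the logic: instead of listing the tags to split on, it lists the tags that should not split a sentence. The set INLINE_TAGS is my own assumption and would need tuning for real data, and the names BlockTextExtractor and extract_sentences are made up for this illustration.

```python
from html.parser import HTMLParser

# Assumption: tags treated as "inline", i.e. they never end a sentence.
# Every other tag is treated as a sentence boundary.
INLINE_TAGS = {'a', 'b', 'em', 'font', 'i', 'span', 'strong', 'u'}


class BlockTextExtractor(HTMLParser):
    """Collects text and flushes a sentence at every non-inline tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sentences = []
        self._chunks = []

    def _flush(self):
        # Join the collected chunks and collapse runs of whitespace.
        text = ' '.join(''.join(self._chunks).split())
        if text:
            self.sentences.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        self._chunks.append(data)

    def close(self):
        # Flush any trailing text that is not followed by a tag.
        super().close()
        self._flush()


def extract_sentences(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.sentences


print(extract_sentences('<div><p>a</p><p>b</p></div>'))  # ['a', 'b']
```

Because the chunks are joined without an extra space, text such as 'Missouri</font>, License' keeps the comma attached to the preceding word. The trade-off is that any inline tag missing from INLINE_TAGS will split a sentence, so the set has to be maintained by hand.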