Remove HTML tags from free-flowing text to form separate sentences

Time: 2017-07-19 10:17:22

Tags: python beautifulsoup html-parsing

I want to extract sentences from a large block of text. My text looks like this:

<ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>

I want to extract proper sentences from the text above, so the expected output is a list:

['Registered Nurse in Missouri, License number xxxxxxxx, 2017',
'AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018',
'AHA PALS - Pediatric Advanced Life Support 2017-2019',
'AHA Basic Life Support 2016-2018']

I am using Python's built-in HTMLParser module to strip the HTML from the text above. Here is my code.

import re
from html.parser import HTMLParser


class HTMLStripper(HTMLParser):

    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []

    def handle_data(self, chunk):
        #import pdb; pdb.set_trace()
        self.fed.append(chunk.strip())

    def get_data(self):
        return [x for x in self.fed if x]


def strip_html_tags(html):
    try:
        s = HTMLStripper()
        s.feed(html)
        return s.get_data()
    except Exception as e:
        # Remove html strings from the given string
        p = re.compile(r'<.*?>')
        return p.sub('', html)

Calling the strip_html_tags function on the text above gives the following result (which is, in fact, what the current implementation should produce):

['Registered Nurse in', 'Missouri', ', License number', 'xxxxxxxx', ',', '2017', 'AHA Advanced Cardiac Life Support (ACLS) Certification', '2016-2018', 'AHA PALS - Pediatric Advanced Life Support 2017-2019', 'AHA Basic Life Support 2016-2018']
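
The per-fragment output follows from how HTMLParser reports text: handle_data fires once for each run of text between two tags, whatever those tags are. A minimal sketch that logs the parser events for one list item (the EventLogger class and the shortened sample string are illustrative, not part of the original code):

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records every start tag, end tag, and text chunk the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
# Shortened, hand-written sample modelled on the first <li> of the question.
logger.feed("<li>Registered Nurse in <font>Missouri</font>, 2017</li>")
logger.close()

for event in logger.events:
    print(event)
```

Three separate data events arrive for a single <li>, so appending each chunk on its own splits the sentence at every inline tag.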

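I can't do a strict check on <ul> or <li> tags, because different texts may contain different HTML tags. Is there a way to split the text above only on the outer HTML tags, rather than on every HTML tag encountered?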

Thanks in advance.

2 Answers:

Answer 0: (score: 1)

Why not use a tool that already parses HTML well, such as BeautifulSoup?

from bs4 import BeautifulSoup

demo = '<ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>'
soup = BeautifulSoup(demo, 'lxml')
sentences = [item.text for item in soup.findAll('li')]

The variable sentences now holds exactly what you asked for; test it yourself.

Based on your comment, I would use this code:

text_without_tags = soup.text

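Now you have no other tags to worry about, just a plain string, which you can turn into a list with split(',') on the commas (but if the text is not always separated by commas or periods, I wouldn't bother and would just use the string itself).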
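
To make the caveat about split(',') concrete, here is a small sketch on one of the expected sentences from the question (the variable names are illustrative). Splitting on commas cuts inside the sentence, which is probably not what the question is after:

```python
# One of the expected sentences from the question, already free of tags.
sentence = "Registered Nurse in Missouri, License number xxxxxxxx, 2017"

# Splitting on commas breaks the sentence into fragments.
parts = [part.strip() for part in sentence.split(",")]
print(parts)
```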

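Note: without some known structure in the text, you cannot always parse it the same way and get a predictable result. That known structure may be certain HTML tags, or certain features of the text that you know in advance.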

Answer 1: (score: 1)

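After a lot of thought, I'm posting my own solution here. It works well on my various examples. Using soup.findAll(specific_tag) would work if I knew in advance which tags I had to extract text from (so that I could apply BeautifulSoup), but that is not the case. There can also be several different tags that I need to extract text from. For example: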

<p>Science</p><div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. <span style="text-decoration: underline;">Nature Methods</span> 2017,</div>

In the example above, I want to extract text from both the <p> tag and the <div> tags.

I modified the code above to handle this case:

import re
import copy
from html.parser import HTMLParser
from sample_htmls import *

class HTMLStripper(HTMLParser):

    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.feeds = []
        self.sentence = ''
        self.current_path = []
        self.tree = []
        self.lookup_tags = ['div', 'span', 'p', 'ul', 'li']

    def update_feed(self):
        self.tree.append(copy.deepcopy(self.current_path))
        self.current_path[:] = []
        self.feeds.append(re.sub(' +', ' ', self.sentence).strip())
        self.sentence = ''

    def handle_starttag(self, tag, attrs):
        if tag in self.lookup_tags:
            if tag == 'li' and len(self.current_path) > 0:
                self.update_feed()
            self.current_path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.lookup_tags:
            self.current_path.append(tag)
            if tag == self.current_path[0]:
                self.update_feed()

    def handle_data(self, data):
        self.sentence += ' ' + data

    def get_tree(self):
        return self.tree

    def get_data(self):
        return [x for x in self.feeds if x]

Running the code on the example above:

parser = HTMLStripper()
parser.feed(mystr)
l1 = parser.get_tree()
feed = parser.get_data()
print(l1)
print("\n", mystr)
print("\n", feed)
print("\n\n")

And the output:

[['ul'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['ul']]

<ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>

['Registered Nurse in Missouri , License number xxxxxxxx , 2017', 'AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018', 'AHA PALS - Pediatric Advanced Life Support 2017-2019', 'AHA Basic Life Support 2016-2018']
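
The space before each comma in the first item comes from handle_data prepending ' ' to every chunk. A possible tweak, sketched here on the chunks the parser would deliver for that item (the chunks list is written out by hand as an assumption), is to join the chunks as they are and collapse whitespace afterwards:

```python
import re

# Text chunks for the first <li>, in the order handle_data would receive them.
chunks = ["Registered Nurse in ", "Missouri", ", License number ",
          "xxxxxxxx", ", ", "2017"]

# Current behaviour: a space is prepended to every chunk before collapsing.
spaced = re.sub(" +", " ", "".join(" " + chunk for chunk in chunks)).strip()

# Alternative: concatenate the chunks unchanged, then collapse whitespace.
joined = re.sub(r"\s+", " ", "".join(chunks)).strip()

print(spaced)
print(joined)
```

The second form keeps the punctuation attached to the preceding word, because the source text already carries its own spacing around the inline tags.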

It also works for an HTML string with mixed tags:

[['p', 'p'], ['div', 'div'], ['div', 'span', 'span', 'div']]

<p>Science</p><div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. <span style="text-decoration: underline;">Nature Methods</span> 2017,</div>

['Science', 'Biology', 'Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. Nature Methods 2017,']

I'd love to see a corner case so that I can improve the text extraction logic.
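
Two inputs that may be worth testing are list items without closing tags, such as <ul><li>A<li>B</ul>, and block tags nested inside each other, such as <div><p>A</p><p>B</p></div>. Tracing the parser above by hand suggests that the first drops the text B and the second merges both paragraphs into one entry. One possible alternative, sketched below under the assumption that a fixed set of block-level tags marks sentence boundaries (BLOCK_TAGS, BlockTextExtractor, and extract_sentences are hypothetical names), flushes the buffered text at every block-level start or end tag and once more on close():

```python
import re
from html.parser import HTMLParser

# Assumption: these tags act as sentence boundaries; extend the set as needed.
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "br", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextExtractor(HTMLParser):
    """Collects text, starting a new entry at every block-level tag."""

    def __init__(self, block_tags=BLOCK_TAGS):
        super().__init__(convert_charrefs=True)
        self.block_tags = set(block_tags)
        self.sentences = []
        self._buffer = []

    def _flush(self):
        # Join chunks unchanged, collapse whitespace, keep non-empty results.
        text = re.sub(r"\s+", " ", "".join(self._buffer)).strip()
        if text:
            self.sentences.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.block_tags:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.block_tags:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <font> or <span> never flush, so their text
        # stays in the same entry as the surrounding text.
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # pick up trailing text that follows the last tag


def extract_sentences(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.sentences


print(extract_sentences("<ul><li>A<li>B</ul>"))
print(extract_sentences("<div><p>A</p><p>B</p></div>"))
```

Because the sketch never tracks nesting, it does not depend on matching open and close tags. The trade-off is that any tag missing from BLOCK_TAGS is treated as inline.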