How do I count the number of words between two predefined words?

Time: 2017-10-13 09:27:16

Tags: python html python-2.7 beautifulsoup

<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>

How can I count the exact number of words between <replace-add> and </replace-add> in the text?

2 answers:

Answer 0: (score: 0)

Without using any libraries:

def get_tag_indexes(text, tag, start_tag):
    tag_indexes = []
    start_index = -1

    while True:
        start_index = text.find(tag, start_index + 1)

        if start_index != -1:
            if start_tag:
                tag_indexes.append(start_index + len(tag))
            else:
                tag_indexes.append(start_index)
        else:
            return tag_indexes

text = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""

tag_starts = get_tag_indexes(text, "<replace-add>", True)
tag_ends = get_tag_indexes(text, "</replace-add>", False)

for start, end in zip(tag_starts, tag_ends):
    words = text[start:end].split()
    print("{} words - {}".format(len(words), words))

This gives you:

7 words - ['that', 'i', 'dont', 'know', 'you', 'know', 'cause']
1 words - ['us']
1 words - ['from']
2 words - ['clear', 'dire']

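The function returns a list of the positions at which a given tag occurs in the text. Pairing the positions just after each opening tag with the positions of each closing tag lets you slice out the text between the two tags and split it into words.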
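The same slicing idea can be wrapped in a single helper that works for any pair of predefined markers, not only tags. The following is a minimal sketch: the name `count_words_between` and its parameters are illustrative and not part of the original answer. It assumes the markers are not nested, and it ignores a start marker that has no matching end marker.

```python
def count_words_between(text, start_marker, end_marker):
    """Return one word count per start_marker ... end_marker span in text."""
    counts = []
    position = 0
    while True:
        # Find the next start marker at or after the current position.
        start = text.find(start_marker, position)
        if start == -1:
            break
        start += len(start_marker)
        # Find the end marker that closes this span.
        end = text.find(end_marker, start)
        if end == -1:
            break  # unmatched start marker: stop without counting it
        counts.append(len(text[start:end].split()))
        position = end + len(end_marker)
    return counts


sample = "<replace-add>that i dont know</replace-add> you <replace-add>us</replace-add>"
counts = count_words_between(sample, "<replace-add>", "</replace-add>")
print(counts)       # [4, 1]
print(sum(counts))  # 5
```

Searching for the end marker only after each start marker keeps the pairs aligned even if a stray closing marker appears earlier in the text, which the `zip`-based pairing would not tolerate.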

Alternatively, this can also be done with BeautifulSoup:

from bs4 import BeautifulSoup

text = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""
soup = BeautifulSoup(text, "lxml")

for block in soup.find_all('replace-add'):
    words = block.text.split()
    print("{} words - {}".format(len(words), words))

Answer 1: (score: 0)

Depending on how much you trust the source, there are two things you can do. Given this

source = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""

you can use a regular expression like this:

import re

from itertools import chain

word_pattern = re.compile(r"(?<=<replace-add>).*?(?=</replace-add>)")
re_words = list(chain.from_iterable(map(str.split, word_pattern.findall(source))))

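This only works if the source matches these tags exactly, with no attributes and so on.

Another option from the standard library is HTML parsing: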
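If the tags may carry attributes, differ in case, or wrap text that spans several lines, the pattern can be loosened somewhat. This is only a sketch under stated assumptions: attribute values never contain a literal `>`, the tags are not nested, and the names `ADD_BLOCK` and `words_in_add_blocks` are made up for illustration.

```python
import re

# Assumptions: attributes never contain ">" and <replace-add> tags are not nested.
# Requiring whitespace before any attributes avoids matching <replace-added>.
ADD_BLOCK = re.compile(
    r"<replace-add(?:\s[^>]*)?>(.*?)</replace-add\s*>",
    re.DOTALL | re.IGNORECASE,
)


def words_in_add_blocks(markup):
    """Collect the words found inside every <replace-add> block."""
    words = []
    for block in ADD_BLOCK.findall(markup):
        words.extend(block.split())
    return words


sample = '<replace-add id="1">clear\ndire</replace-add> x <REPLACE-ADD>us</REPLACE-ADD>'
print(words_in_add_blocks(sample))  # ['clear', 'dire', 'us']
```

Even this looser pattern is still a regular expression applied to markup, so it remains suitable only for input whose shape you control.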

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def get_words(self, html):
        self.read_words = False
        self.words = []
        self.feed(html)
        return self.words

    def handle_starttag(self, tag, attrs):
        if tag == "replace-add":
            self.read_words = True

    def handle_data(self, data):
        if self.read_words:
            self.words.extend(data.split())

    def handle_endtag(self, tag):
        if tag == "replace-add":
            self.read_words = False


parser = MyParser()
html_words = parser.get_words(source)

This approach is more robust, and possibly more efficient, because it uses a tool built specifically for this task.

Now, running

print(re_words)
print(html_words)

we get

['that', 'i', 'dont', 'know', 'you', 'know', 'cause', 'us', 'from', 'clear', 'dire']
['that', 'i', 'dont', 'know', 'you', 'know', 'cause', 'us', 'from', 'clear', 'dire']

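(Of course, the len of this list is the number of words.)

If you strictly need only the word count, you can keep a running total instead and add the length of data.split() for each piece of data encountered.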
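A running-total variant of the parser might look like the sketch below. The class name `WordCounter` and its attributes `inside` and `total` are illustrative, not from the original answer, and like the parser above it targets Python 3 (`html.parser`).

```python
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Count words inside <replace-add> tags without storing them."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while between <replace-add> and </replace-add>
        self.total = 0       # running word count

    def handle_starttag(self, tag, attrs):
        if tag == "replace-add":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "replace-add":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.total += len(data.split())


counter = WordCounter()
counter.feed("<replace-add>clear dire</replace-add> my daughter <replace-add>us</replace-add>")
counter.close()
print(counter.total)  # 3
```

Keeping only an integer avoids building the full word list, which matters mainly for very large inputs.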

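If you really cannot use any imports, you will have to make some sacrifices or implement your own regex engine or HTML parser. If this is a homework requirement, you should really show some effort of your own first.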