Question

I want to parse HTML, extract the text, and return a list of tags for each word (or possibly for each text fragment). For example, given this HTML:

<em>Blah blah blah</em> blah again <i>and then again</i>

it would return something like:

(("Blah", "em"),
 ("blah", "em"),
 ("blah", "em"),
 ("blah", ""),
 ("again", ""),
 ("and", "i"),
 ("then", "i"),
 ("again", "i"))

or:

(("Blah blah blah", "em"),
  ("blah again", ""),
  ("and then again", "i"))

Is there a tool or a simple way to do this?

Thanks.

Answer 1

You can use Scrapy for this: https://scrapy.org/

For example, given this HTML:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

you can do something like this (assuming quote is a Scrapy selector for the div.quote element):

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

Answer 2

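You can write a loop that keeps a stack of tags. When you reach an opening tag, push it onto the stack. When you get a regular word, take the last item on the stack together with the word and append them to a list as a tuple; if the stack is empty, use an empty string in the tuple instead of a tag. When you reach a closing tag, pop the last item off the stack. (By "stack" I mean a Python list, using append and pop to add and remove items.)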
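The stack approach described above could be sketched with the standard-library html.parser.HTMLParser. This is only a minimal sketch, not a definitive implementation: the names TagTextParser, tag_words, and split_words are made up for illustration, and it assumes well-formed HTML without void elements such as <br>, which would otherwise stay on the stack.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect (text, tag) pairs, where tag is the innermost open tag."""

    def __init__(self, split_words=True):
        super().__init__()
        self.split_words = split_words  # hypothetical option: words vs. fragments
        self.stack = []                 # currently open tags, innermost last
        self.pairs = []                 # collected (text, tag) tuples

    def handle_starttag(self, tag, attrs):
        # Opening tag: push it onto the stack.
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Closing tag: pop only if it matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Text: pair it with the top of the stack, or "" if no tag is open.
        current = self.stack[-1] if self.stack else ""
        pieces = data.split() if self.split_words else [data.strip()]
        for piece in pieces:
            if piece:  # skip whitespace-only text
                self.pairs.append((piece, current))


def tag_words(html, split_words=True):
    """Return a list of (text, tag) tuples for the given HTML string."""
    parser = TagTextParser(split_words)
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Calling tag_words(html) on the HTML from the question gives one tuple per word, and tag_words(html, split_words=False) gives one tuple per text fragment. Only the innermost tag is recorded; to get every enclosing tag, you could store a copy of the whole stack instead of its last item.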

Python: parse HTML and return words with a list of tags
