Python: parse HTML and return words with a list of tags

Date: 2017-10-07 15:07:34

Tags: python html

I want to parse HTML, extract the text, and return the list of tags for each word (or possibly for each text fragment). For example, given this HTML:

<em>Blah blah blah</em> blah again <i>and then again</i>

it would return something like:

(("Blah", "em"),
 ("blah", "em"),
 ("blah", "em"),
 ("blah", ""),
 ("again", ""),
 ("and", "i"),
 ("then", "i"),
 ("again", "i"))

or:

(("Blah blah blah", "em"),
  ("blah again", ""),
  ("and then again", "i"))

Is there a tool or a simple way to do this?

Thanks

2 answers:

Answer 0 (score: 0)

You can use Scrapy: https://scrapy.org/

For example, given

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

you can do something like this

>>> # Assumes a Scrapy shell session where `response` holds the page above
>>> quote = response.css("div.quote")[0]
>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

Answer 1 (score: 0)

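You can write a loop that keeps a stack of tags. When you reach an opening tag, push it onto the stack. When you get a regular word, take the last item on the stack together with the word and add them to the result list as a tuple; if the stack is empty, use an empty string in place of the tag. When you reach a closing tag, pop the last item off the stack. (By stack I mean a Python list, using append and pop to add and remove items.)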
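The stack approach described in this answer could be sketched with the standard library's html.parser module. This is only a minimal illustration, not a definitive implementation: the names TagStackParser and words_with_tags are made up for this example, and it assumes well-formed HTML in which every opening tag has a matching closing tag (void elements such as <br> without a closing tag would stay on the stack).

```python
from html.parser import HTMLParser


class TagStackParser(HTMLParser):
    """Collect (word, tag) pairs, where tag is the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.pairs = []  # resulting (word, tag) tuples

    def handle_starttag(self, tag, attrs):
        # Opening tag: push it onto the stack
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Closing tag: pop only if it matches the innermost open tag
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Text: pair each word with the top of the stack, or "" if empty
        current = self.stack[-1] if self.stack else ""
        for word in data.split():
            self.pairs.append((word, current))


def words_with_tags(html):
    """Hypothetical helper: return a list of (word, tag) tuples."""
    parser = TagStackParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


html = "<em>Blah blah blah</em> blah again <i>and then again</i>"
print(words_with_tags(html))
# [('Blah', 'em'), ('blah', 'em'), ('blah', 'em'), ('blah', ''),
#  ('again', ''), ('and', 'i'), ('then', 'i'), ('again', 'i')]
```

To get the full list of enclosing tags for each word rather than only the innermost one, handle_data could store tuple(self.stack) instead of current.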