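I want to parse HTML, extract the text, and get back a list pairing each word (or possibly each text segment) with the tag that encloses it. For example, given this HTML: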
<em>Blah blah blah</em> blah again <i>and then again</i>
it would return something like:
(("Blah", "em"),
("blah", "em"),
("blah", "em"),
("blah", ""),
("again", ""),
("and", "i"),
("then", "i"),
("again", "i"))
or:
(("Blah blah blah", "em"),
("blah again", ""),
("and then again", "i"))
Is there a tool or a simple way to do this?
Thanks
Answer 0 (score: 0)
You can use Scrapy for this: https://scrapy.org/
For example, given this HTML:
<div class="quote">
<span class="text">“The world as we have created it is a process of our
thinking. It cannot be changed without changing our thinking.”</span>
<span>
by <small class="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
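you can do something like this (assuming `quote` is a Scrapy selector for the `div.quote` element, for example one obtained in `scrapy shell`):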
>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'
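Answer 1 (score: 0)
You can write a loop that keeps a stack of tags. When you reach an opening tag, push it onto the stack. When you get a regular word, take the last item on the stack together with the word and add them to the result list as a tuple; if the stack is empty, use an empty string in place of the tag. When you reach a closing tag, pop the last item off the stack. (By "stack" I mean a Python list, using append and pop to add and remove items.)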
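The stack approach described above can be sketched with the standard library's `html.parser.HTMLParser`, which calls a handler for each opening tag, closing tag, and run of text. This is only a minimal sketch: the names `TagTextParser`, `words_with_tags`, `segments_with_tags`, and `VOID_TAGS` are made up for illustration, the list of void tags is deliberately incomplete, and only the innermost enclosing tag is recorded for each word.

```python
from html.parser import HTMLParser
from itertools import groupby

# Assumption: a small, incomplete set of tags that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagTextParser(HTMLParser):
    """Collect (word, innermost_tag) pairs while walking the HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.pairs = []  # resulting (word, tag) tuples

    def handle_starttag(self, tag, attrs):
        # Push every opening tag except void ones, which are never closed.
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Text outside any tag is paired with an empty string.
        current = self.stack[-1] if self.stack else ""
        for word in data.split():
            self.pairs.append((word, current))


def words_with_tags(html):
    """Return one (word, tag) tuple per whitespace-separated word."""
    parser = TagTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.pairs


def segments_with_tags(html):
    """Merge consecutive words that share a tag into one segment."""
    return [
        (" ".join(word for word, _ in group), tag)
        for tag, group in groupby(words_with_tags(html), key=lambda p: p[1])
    ]


if __name__ == "__main__":
    sample = "<em>Blah blah blah</em> blah again <i>and then again</i>"
    print(words_with_tags(sample))
    print(segments_with_tags(sample))
```

Note that `segments_with_tags` groups purely by tag name, so two adjacent elements with the same tag (such as `<em>a</em><em>b</em>`) end up merged into a single segment; if that matters, the grouping would need to track element boundaries instead.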