Question

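I'm using Beautiful Soup to extract "content" from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup, which is how I got started with it.

I was able to get most of the content successfully, but I'm running into some challenges with tags that are part of the content. (I'm starting off with a basic strategy: if there are more than x chars in a node, then it is content.) Let's take the HTML code below as an example: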

<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>

results = soup.findAll(text=lambda(x): len(x) > 20)

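When I use the above code to get at the long text, it breaks at the tags (the identified text will start from 'and hopefully...'). So I tried to replace the tag with plain text as follows: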

anchors = soup.findAll('a')

for a in anchors:
  a.replaceWith('plain text')

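The above does not work, since Beautiful Soup inserts the string as a NavigableString, and that causes the same problem when I use findAll with len(x) > 20. I could use regular expressions to parse the HTML as plain text first, clear out all the unwanted tags, and then call Beautiful Soup. But I would like to avoid processing the same content twice. I am trying to parse these pages so that I can show a snippet of content for a given link (very much like Facebook Share), and if everything is done with Beautiful Soup, I presume it will be faster.

So my question: is there a way to 'clear tags' and replace them with 'plain text' using Beautiful Soup? If not, what would be the best way to do so?

Thanks for your suggestions!

Update: Alex's code worked very well for the sample example. I also tried various edge cases and they all worked fine (with the modification below). So I gave it a shot on a real-life website, and I ran into issues that puzzle me.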
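
A small sketch of the splitting described here, written for Python 3 and using the standard library's html.parser as a stand-in (an assumption for illustration; Beautiful Soup likewise keeps the text before, inside and after a tag as separate NavigableString nodes). The names TextNodeLister and text_nodes are hypothetical.

```python
from html.parser import HTMLParser


class TextNodeLister(HTMLParser):
    """Record every text chunk reported between tags (a stand-in for text nodes)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.nodes = []

    def handle_data(self, data):
        # Called once per run of text between two tags.
        self.nodes.append(data)


def text_nodes(html):
    lister = TextNodeLister()
    lister.feed(html)
    lister.close()
    return lister.nodes


snippet = ('<div id="abc">some long text goes <a href="/"> here </a>'
           ' and hopefully it will get picked up</div>')

# Each text run is a separate node, so a per-node length test sees only pieces.
for node in text_nodes(snippet):
    print(len(node), repr(node))
```

Because the length test is applied to each piece separately, a sentence interrupted by a link is never seen as one long string.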

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
soup = BeautifulSoup(page)

anchors = soup.findAll('a')
i = 0
for a in anchors:
    print str(i) + ":" + str(a)
    for a in anchors:
        if (a.string is None): a.string = ''
        if (a.previousSibling is None and a.nextSibling is None):
            a.previousSibling = a.string
        elif (a.previousSibling is None and a.nextSibling is not None):
            a.nextSibling.replaceWith(a.string + a.nextSibling)
        elif (a.previousSibling is not None and a.nextSibling is None):
            a.previousSibling.replaceWith(a.previousSibling + a.string)
        else:
            a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
            a.nextSibling.extract()
    i = i+1

When I run the above code, I get the following error:

0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with 
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
  File "parselink.py", line 44, in <module>
  a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
 TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'

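When I look at the HTML code, 'Stay up to date...' does not have any previous sibling. (I did not know how previousSibling worked until I saw Alex's code, and based on my testing it looks like it is looking for 'text' before the tag.) So, if there is no previous sibling, I'm surprised that it is not going through the if logic of a.previousSibling is None and a.nextSibling is None.

Could you please let me know what I am doing wrong?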

-ecognium

Answer 1

An approach that works for your specific example is:

from BeautifulSoup import BeautifulSoup

ht = '''
<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)

anchors = soup.findAll('a')
for a in anchors:
  a.previousSibling.replaceWith(a.previousSibling + a.string)

results = soup.findAll(text=lambda(x): len(x) > 20)

print results

This emits:

$ python bs.py
[u'\n    some long text goes  here ', u' and hopefully it \n    will get picked up by the parser as content\n']

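Of course, you'll probably need to take a bit more care: if there is no a.string, or if a.previousSibling is None, you'll need suitable if statements to take care of such corner cases. But I hope this general idea can help you. (In fact, you may also want to merge the next sibling if it is a string. I'm not sure how that plays with your len(x) > 20 heuristic, but say, for example, that you have two 9-character strings with an <a> containing a 5-character string in the middle: perhaps you'd want to pick up the lot as a "23-character string"? I can't tell, because I don't understand the motivation for your heuristic.)

I do think that, besides <a> tags, you'll want to remove other tags as well...? I guess that, too, depends on the actual idea behind your heuristic!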
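
The merging idea, including the corner cases mentioned above (an anchor with no text, no string before or after it, other inline tags), might be sketched as follows. This is not Beautiful Soup code: to stay runnable without extra packages it swaps in Python 3's standard html.parser, and the names TextRunCollector, long_text_runs, inline_tags and min_len are hypothetical. Tags listed in inline_tags are treated as if they were absent, while any other tag ends the current run of text.

```python
from html.parser import HTMLParser


class TextRunCollector(HTMLParser):
    """Collect text runs, treating the given inline tags as if they were absent."""

    def __init__(self, inline_tags=("a",)):
        super().__init__(convert_charrefs=True)
        self.inline_tags = set(inline_tags)
        self.runs = []      # finished text runs
        self._buffer = []   # pieces of the run currently being built

    def _flush(self):
        text = "".join(self._buffer)
        self._buffer = []
        if text:
            self.runs.append(text)

    def handle_starttag(self, tag, attrs):
        # Only tags that are NOT being flattened end the current run.
        if tag not in self.inline_tags:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in self.inline_tags:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def long_text_runs(html, min_len=20, inline_tags=("a",)):
    """Return merged text runs longer than min_len characters."""
    collector = TextRunCollector(inline_tags)
    collector.feed(html)
    collector.close()
    return [run for run in collector.runs if len(run) > min_len]


# Two 9-character strings around a 5-character link merge into one 23-character run.
print(long_text_runs("<p>123456789<a href='/'>12345</a>123456789</p>"))
```

Since missing neighbours simply contribute nothing to the buffer, no special cases for a None sibling are needed. Passing a longer inline_tags tuple extends the same treatment to other tags.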

Answer 2

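When I tried to flatten tags in a document, so that a tag's entire content would be pulled up into its parent node in place (I wanted to reduce the content of a p tag with all its sub-paragraphs, lists, div and span elements, etc. inside, but get rid of the style and font tags and some horrible Word-to-HTML generator remnants), I found it rather complicated to do with BeautifulSoup itself, since extract() also removes the content and replaceWith() unfortunately does not accept None as an argument. After some wild recursion experiments, I finally decided to use regular expressions, either before or after processing the document with BeautifulSoup, with the following method: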

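import re

def flatten_tags(s, tags):
    # Accept a single tag name or a list of tag names (Python 2: basestring).
    names = tags if isinstance(tags, basestring) else "|".join(tags)
    # Match opening, closing and self-closing forms; \b keeps "b" from matching "<br>".
    pattern = re.compile(r"</?\s*(?:%s)\b[^<>]*>" % names)
    return pattern.sub("", s)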

The tags argument is either a single tag or a list of tags to be flattened.
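
A usage sketch of this regex approach, ported to Python 3 (an assumption, since the thread's code is Python 2). The name flatten_tags_py3 is hypothetical, and the pattern here is a simplified one that adds a word boundary so that a short name such as b does not also match br.

```python
import re


def flatten_tags_py3(s, tags):
    """Remove the given tags from s while keeping the text they enclose."""
    # Accept either a single tag name or an iterable of tag names.
    names = [tags] if isinstance(tags, str) else list(tags)
    alternation = "|".join(re.escape(name) for name in names)
    # \b stops a short name such as "b" from also matching "<br>" or "<body>".
    pattern = re.compile(r"</?\s*(?:%s)\b[^<>]*>" % alternation, re.IGNORECASE)
    return pattern.sub("", s)


html = '<p>some <font size="2">long <b>text</b></font><br>goes <a href="/">here</a></p>'
print(flatten_tags_py3(html, ["font", "b", "a"]))
# prints: <p>some long text<br>goes here</p>
```

Like any regex over HTML, this is fragile: an attribute value that contains ">" would end the match too early, so it is best kept for markup you have already inspected.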

Replace tags with plain text using the Beautiful Soup Python module
