Question

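Prior to 3.0.5, BeautifulSoup used to treat the contents of <textarea> as HTML. It now treats them as text. The document I am parsing has HTML inside its textarea tags, and I am trying to process it.

I have tried: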

    for textarea in soup.findAll('textarea'):
        contents = BeautifulSoup.BeautifulSoup(textarea.contents)
        textarea.replaceWith(contents.html(text=True))

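But I'm getting errors. I can't find anything about this in the documentation, and the alternative parsers don't help. Does anyone know how I can parse the textareas as HTML?

Edit:

The sample HTML is: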

<textarea class="ks-lazyload-custom">
  <div class="product-view product-view-rug">
    Foobar Womble
    <div class="product-view-head">
      <img src="tps/i1/fo-25.gif" />
    </div>
  </div>
</textarea>

The error is:

File "D:\src\cross\tserver\src\tools\sitecrawl\BeautifulSoup.py", line 1913, in _detectEncoding
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer

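I'm looking for a way to take the element, extract its contents, parse them with BeautifulSoup, collapse the result to text, and then replace the contents of the original element with that text (or replace the whole element).

As for real world versus the spec, that isn't particularly relevant here. The data needs to be parsed, and I'm looking for a way to do that.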

Answer 1

This seems to work fairly well (if I understood correctly what you wanted):

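for textarea in soup.findAll('textarea'):
    if not textarea.contents:
        continue  # an empty textarea has no text node to re-parse
    # contents is a list; hand its first text node (a string) to the parser
    contents = BeautifulSoup.BeautifulSoup(textarea.contents[0]).renderContents()
    textarea.replaceWith(contents)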
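
For what it's worth, the `[0]` is probably what makes the difference. `textarea.contents` is a list of child nodes, while the traceback in the question shows the encoding detection calling a regex `.match()` on whatever the constructor was given. Here is a minimal sketch of that failure mode using only the standard `re` module under Python 3. The `try_match` helper and the sample markup are made up for illustration; the pattern is copied from the traceback.

```python
import re

# Pattern copied from the traceback in the question; the traceback shows
# _detectEncoding matching it against the markup it receives.
ENCODING_DECLARATION = re.compile(r'''^<\?.*encoding=['"](.*?)['"].*\?>''')

# Stand-in for textarea.contents: a list holding a single text node.
contents = ['<div class="product-view">Foobar Womble</div>']


def try_match(markup):
    """Return 'ok' if the regex accepts the input type, else the error name."""
    try:
        ENCODING_DECLARATION.match(markup)
        return "ok"
    except TypeError as exc:
        return type(exc).__name__


list_result = try_match(contents)       # the whole list is rejected
string_result = try_match(contents[0])  # the first element is a plain string
print(list_result, string_result)
```

This prints `TypeError ok`: the list is rejected before any matching happens, while a single string is accepted even though the pattern itself does not match it.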

Answer 2

I'm now using the following code, which mostly works. Your mileage may vary.

def _extractText(self, data, encoding):
    """Return the text of an HTML document, one text node per line."""
    if self.isDebug:
        self._output("_extractText")
    soup = BeautifulSoup.BeautifulSoup(data, fromEncoding=encoding)
    # Drop comments, scripts and stylesheets before collecting text.
    for comment in soup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment)):
        comment.extract()
    for script in soup.findAll('script'):
        script.extract()
    for css in soup.findAll('style'):
        css.extract()
    # Textarea contents are plain text to the parser, so re-parse them recursively.
    for textarea in soup.findAll('textarea'):
        textarea.string = self._extractText(textarea.renderContents(), 'UTF-8')
    text = u''
    for line in soup.findAll(text=True):
        line = line.replace('&nbsp;', ' ').strip()
        # Skip blank nodes and doctype declarations.
        if not line or line.startswith(('doctype', 'DOCTYPE')):
            continue
        text += line + '\n'
    return text
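
The last loop is the part that collapses everything to text, and it can be tried on its own without a parser. A rough Python 3 sketch, assuming the text nodes arrive as plain strings; `collapse_text` is a hypothetical name for illustration, not something BeautifulSoup provides:

```python
def collapse_text(chunks):
    """Join text nodes into one string, one non-empty node per line."""
    lines = []
    for chunk in chunks:
        # Same cleanup as the loop above: literal &nbsp; becomes a space.
        line = chunk.replace('&nbsp;', ' ').strip()
        # Skip blank nodes and doctype declarations.
        if not line or line.startswith(('doctype', 'DOCTYPE')):
            continue
        lines.append(line)
    # Each kept line ends with a newline, matching the original loop.
    return ''.join(line + '\n' for line in lines)


sample = ['  Foobar&nbsp;Womble ', '\n', 'DOCTYPE html', 'second']
print(repr(collapse_text(sample)))
```

Collecting the lines in a list and joining once avoids repeated string concatenation, but the output is meant to be the same as what the loop in `_extractText` builds.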

How can I make BeautifulSoup parse the contents of textarea tags as HTML?
