Question

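I am using BeautifulSoup (version 4.4) to preprocess a Wikipedia text dump from https://dumps.wikimedia.org/enwiki/ for further parsing.

The dump document contains many articles, each wrapped in a <page> tag.

Unfortunately, something about the document structure seems to be incompatible with BeautifulSoup: within each <page>, the body of the article is contained in a <text> block: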

<text xml:space="preserve">...</text>

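Having selected a particular <page> block, I should be able to access the content of the text block as page.text.string.

In BeautifulSoup, .text used to be reserved for the content of a tag between its brackets. In more recent versions, .string is used for that purpose.

Unfortunately, it seems that page.text is still interpreted the same as page.string, for backwards compatibility. (Edit: getattr(page, "text") does the same.)

Is there any way to work around this and access the HTML tag named <text>?

(Edit: for a syntax example, see https://pastebin.com/WQvJn0gf.)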
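
The behaviour described above can be reproduced with a short sketch. The sample markup is a hypothetical miniature page, not an excerpt from a real dump, and the html.parser backend is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, not taken from a real dump.
xml = (
    "<page><title>Example</title>"
    '<revision><text xml:space="preserve">Body</text></revision></page>'
)

page = BeautifulSoup(xml, "html.parser").find("page")

# Attribute access does not return the <text> child: it returns a plain
# string holding all of the text found inside <page>.
print(type(page.text).__name__)  # str
print(page.text)                 # ExampleBody

# getattr() goes through the same attribute lookup, so it behaves identically.
print(getattr(page, "text") == page.text)  # True

# Consequently page.text.string fails, because str has no .string attribute.
error = None
try:
    page.text.string
except AttributeError as exc:
    error = exc
    print("AttributeError:", exc)
```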

Answer 1

Using .find and .text works as expected:

from bs4 import BeautifulSoup

string = '''<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>...</siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>854851586</id>
      <parentid>834079434</parentid>
      <timestamp>2018-08-14T06:47:24Z</timestamp>
      <contributor>
        <username>Godsy</username>
        <id>23257138</id>
      </contributor>
      <comment>remove from category for seeking instructions on rcats</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
      <sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
    </revision>
  </page>
...
</mediawiki>'''

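# html.parser copes with this XML snippet and lowercases tag names.
soup = BeautifulSoup(string, 'html.parser')
page_tag = soup.find('page')
# Look the tag up by name; page_tag.text would return all text inside <page> instead.
text_tag = page_tag.find('text')
print(text_tag.text)
# Output:
# #REDIRECT [[Computer accessibility]]
#
# {{R from move}}
# {{R from CamelCase}}
# {{R unprintworthy}}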
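
The explicit lookup works because find() searches by tag name, whereas .text is a property defined on every tag (equivalent to get_text()) and therefore shadows the page.text shortcut. As a rough sketch of how the same idea might extend to every article in a document: the helper name iter_page_texts and the two-page sample below are hypothetical, and html.parser is assumed as in the snippet above.

```python
from bs4 import BeautifulSoup


def iter_page_texts(markup):
    """Yield (title, body) pairs for every <page> element in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    for page in soup.find_all("page"):
        title_tag = page.find("title")
        # find("text") searches by tag name, so it is unaffected by the
        # Tag.text property that shadows the page.text shortcut.
        text_tag = page.find("text")
        title = title_tag.get_text() if title_tag is not None else None
        body = text_tag.get_text() if text_tag is not None else None
        yield title, body


# Hypothetical two-page sample standing in for a real dump file.
sample = """<mediawiki>
  <page><title>A</title><revision><text xml:space="preserve">first body</text></revision></page>
  <page><title>B</title><revision><text xml:space="preserve">second body</text></revision></page>
</mediawiki>"""

pages = list(iter_page_texts(sample))
for title, body in pages:
    print(title, "->", body)
```

Pages without a <title> or <text> child yield None for that field rather than raising an error. Parsing a whole document in one call may not be practical for a full dump, which this sketch does not address.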
