Question

import requests
import string
from bs4 import BeautifulSoup, Tag
[...]
def disease_spider(maxpages):
    i = 0
    while i <= maxpages:
        url = 'http://www.cdc.gov/DiseasesConditions/az/' + alpha[i] + '.html'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for l in soup.findAll('a', {'class': 'noLinking'}):
            x = l.find("em")
            if x is not None:
                return x.em.replaceWith(Tag('a'))

        i += 1

Some of the text on the website uses <em> tags instead of <a> tags, and I wanted to replace them with <a> tags. Using this code, I get this error:

AttributeError: 'NoneType' object has no attribute 'replaceWith'

Answer 1

As far as I understand, you want to replace each em with its text.

In other words, an a element containing:

<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
    including Hib Infection (<em>Haemophilus influenzae</em> Infection)   
</a>

should be replaced with:

<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
    including Hib Infection (Haemophilus influenzae Infection) 
</a>

In this case, I would locate all em tags directly under the a tags and, for every em tag found, replace it with its text using replace_with():

for em in soup.select('a.noLinking > em'):
    em.replace_with(em.text)
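
For a self-contained check, here is a minimal sketch that runs the same loop on the snippet above. It assumes the built-in "html.parser" backend (the original code does not name a parser). It also suggests where the error in the question comes from: l.find("em") already returns the em tag, so x.em looks for a second em nested inside it and gets None, which has no replaceWith attribute.

```python
from bs4 import BeautifulSoup

html = """
<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
    including Hib Infection (<em>Haemophilus influenzae</em> Infection)
</a>
"""

# Parser choice is an assumption; the question does not specify one.
soup = BeautifulSoup(html, "html.parser")

# Reproduce the lookup from the question.
link = soup.find("a", {"class": "noLinking"})
x = link.find("em")
# x is already the <em> tag. It contains no nested <em>, so x.em is None,
# and calling x.em.replaceWith(...) would raise the AttributeError.
nested = x.em

# Replace every <em> that is a direct child of a.noLinking with its own text.
for em in soup.select("a.noLinking > em"):
    em.replace_with(em.text)

remaining_em = soup.find_all("em")

# Collapse the surrounding whitespace before printing.
link_text = " ".join(soup.a.get_text().split())
print(link_text)
```

After the loop, the a element keeps its attributes and the italicised name becomes plain text inside it.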

As a side note, the replacement may not be necessary, since the .text of the a tag gives you the full text of the node, including its children:

In [1]: from bs4 import BeautifulSoup

In [2]: data = """
   ...:     <a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
   ...:         including Hib Infection (<em>Haemophilus influenzae</em> Infection)   
   ...:     </a>
   ...: """

In [3]: soup = BeautifulSoup(data)

In [4]: print soup.a.text

        including Hib Infection (Haemophilus influenzae Infection)
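
The session above uses the Python 2 print statement. A rough Python 3 equivalent might look like the sketch below. The helper name link_text is hypothetical, and the "html.parser" backend is an assumption. Whitespace is collapsed with split() and join() because the raw text keeps the indentation and newlines from the markup.

```python
from bs4 import BeautifulSoup


def link_text(tag):
    """Return the tag's full text, nested tags included, with whitespace collapsed."""
    # Hypothetical helper for illustration; not part of BeautifulSoup.
    return " ".join(tag.get_text().split())


data = """
<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
    including Hib Infection (<em>Haemophilus influenzae</em> Infection)
</a>
"""

soup = BeautifulSoup(data, "html.parser")

# No replacement step: get_text() already includes the text of the <em> child.
names = [link_text(a) for a in soup.find_all("a", class_="noLinking")]
print(names)
```

If only the disease names are needed, this avoids modifying the tree at all.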

Trying to replace the <em> tag with <a>
