Trying to replace <em> tags with <a> tags

Time: 2015-06-28 18:20:39

Tags: python tags beautifulsoup replacewith

import requests
import string
from bs4 import BeautifulSoup, Tag
[...]
def disease_spider(maxpages):
    i = 0
    while i <= maxpages:
        url = 'http://www.cdc.gov/DiseasesConditions/az/' + alpha[i] + '.html'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for l in soup.findAll('a', {'class': 'noLinking'}):
            x = l.find("em")
            if x is not None:
                return x.em.replaceWith(Tag('a'))

        i += 1

Some of the text on the website uses <em> tags instead of <a> tags, and I wanted to replace them with <a> tags. With this code I get the following error:

AttributeError: 'NoneType' object has no attribute 'replaceWith'

1 answer:

Answer 0 (score: 0)

As far as I understand, you want to replace each em tag with its text.

In other words, the a element containing:

<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
    including Hib Infection (<em>Haemophilus influenzae</em> Infection)   
</a>

should be replaced with:

<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
    including Hib Infection (Haemophilus influenzae Infection) 
</a>

In this case, I would find all em tags located directly under the a tags and, for every em tag found, replace it with its text using replace_with():

for em in soup.select('a.noLinking > em'):
    em.replace_with(em.text)
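
A minimal self-contained sketch of the loop above, run against the sample markup from this answer rather than the live page. The `html.parser` choice and the names `html`, `link`, `remaining_em` and `link_text` are assumptions made for illustration; whitespace is collapsed only to make the result easier to read.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the answer; no network request is made.
html = """
<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
    including Hib Infection (<em>Haemophilus influenzae</em> Infection)
</a>
"""

# Naming the parser explicitly is an assumption; the original code relies on the default.
soup = BeautifulSoup(html, "html.parser")

# Replace every <em> that is a direct child of a.noLinking with its own text.
for em in soup.select("a.noLinking > em"):
    em.replace_with(em.get_text())

link = soup.find("a", class_="noLinking")
remaining_em = link.find("em")  # None once the <em> has been replaced

# Collapse runs of whitespace so the text prints on one line.
link_text = " ".join(link.get_text().split())
print(link_text)  # including Hib Infection (Haemophilus influenzae Infection)
```

As for the AttributeError in the question: `x` is already the em tag returned by `l.find("em")`, so `x.em` searches for another em nested inside it and comes back as None, which is what the traceback reports.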

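As a side note, the replacement may not be necessary at all, since the .text of the a tag gives you the full text of the node, including the text of its children: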

In [1]: from bs4 import BeautifulSoup

In [2]: data = """
   ...:     <a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
   ...:         including Hib Infection (<em>Haemophilus influenzae</em> Infection)   
   ...:     </a>
   ...: """

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: print(soup.a.text)

        including Hib Infection (Haemophilus influenzae Infection)
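
If only the link text is needed, the side note can be sketched as a small helper that reads the text without modifying the tree. The function name `link_texts`, the `html.parser` choice and the second link in the sample are invented for illustration and do not come from the CDC page.

```python
from bs4 import BeautifulSoup


def link_texts(html):
    """Return the whitespace-normalized text of every a.noLinking element."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() already includes the text of child tags such as <em>.
    return [" ".join(a.get_text().split()) for a in soup.select("a.noLinking")]


# The first link is the sample from the answer; the second is a made-up placeholder.
sample = """
<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
    including Hib Infection (<em>Haemophilus influenzae</em> Infection)
</a>
<a class="noLinking" href="#">Placeholder <em>italic</em> entry</a>
"""

texts = link_texts(sample)
for text in texts:
    print(text)
```

This leaves the em tags in place, which may be preferable when the markup is needed again later.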