import requests
import string
from bs4 import BeautifulSoup, Tag
[...]
def disease_spider(maxpages):
    i = 0
    while i <= maxpages:
        url = 'http://www.cdc.gov/DiseasesConditions/az/' + alpha[i] + '.html'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for l in soup.findAll('a', {'class': 'noLinking'}):
            x = l.find("em")
            if x is not None:
                return x.em.replaceWith(Tag('a'))
        i += 1
Some of the link text on the website is wrapped in <em> tags, and I wanted to replace those with <a> tags. With this code I get the following error:
AttributeError: 'NoneType' object has no attribute 'replaceWith'
Answer 0 (score: 0)
As I understand it, you want to replace each em element with its text. In other words, an a element containing:
<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
including Hib Infection (<em>Haemophilus influenzae</em> Infection)
</a>
should be replaced with:
<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
including Hib Infection (Haemophilus influenzae Infection)
</a>
In this case, I would find all em tags directly under the a tags and, for each em tag found, replace it with its text using replace_with():
for em in soup.select('a.noLinking > em'):
    em.replace_with(em.text)
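To see why the original code raised the AttributeError and what the loop above changes, here is a minimal, self-contained sketch. It assumes the sample markup from this answer and the built-in "html.parser" backend; the variable names (link, nested, remaining_em, link_text) are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
including Hib Infection (<em>Haemophilus influenzae</em> Infection)
</a>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", {"class": "noLinking"})

# find("em") already returns the <em> tag itself ...
em = link.find("em")
# ... so em.em searches for an <em> nested inside that <em>, and there is none.
nested = em.em  # None, hence "'NoneType' object has no attribute 'replaceWith'"

# Replace each <em> that is a direct child of a.noLinking with its own text.
for tag in soup.select("a.noLinking > em"):
    tag.replace_with(tag.get_text())

remaining_em = soup.find_all("em")  # no <em> tags should be left
link_text = link.get_text().strip()
```

In the question's code, x is already the em tag, so calling replace_with on x directly (rather than on x.em) would avoid the error.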
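As a side note, the replacement may not be necessary at all, since .text on the a tag gives you the full text of the node, including the text of its children: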
In [1]: from bs4 import BeautifulSoup
In [2]: data = """
...: <a class="noLinking" href="http://www.cdc.gov/hi-disease/index.html">
...: including Hib Infection (<em>Haemophilus influenzae</em> Infection)
...: </a>
...: """
In [3]: soup = BeautifulSoup(data, "html.parser")
In [4]: print(soup.a.text)
including Hib Infection (Haemophilus influenzae Infection)
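If only the link text is needed, the side note above could be folded into a small helper that the spider calls once per page. This is a rough sketch, not the asker's code: link_texts is a hypothetical name, and "html.parser" is an assumed choice of parser.

```python
from bs4 import BeautifulSoup


def link_texts(html):
    """Return the stripped text of every a.noLinking element in html."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() includes the text of child tags such as <em>,
    # so no replacement step is needed just to read the names.
    return [a.get_text().strip() for a in soup.find_all("a", class_="noLinking")]
```

Inside the question's while loop, this would be called as link_texts(source_code.text), and the results collected in a list instead of returning from inside the loop.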