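I'm trying to extract some data from a Wikipedia page, and I only want the non-empty links. Empty links carry a class named "new", and I want to filter the results on that condition. To do this, I use the following code: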
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
visa_req_table = soup.findAll("table", "nowraplinks hlist collapsible autocollapse navbox-inner")[1]
tables_regions = visa_req_table.findAll("table", "nowraplinks navbox-subgroup")
for single_table in tables_regions:
    for a in single_table.findAll('a', href=True):
        if a.find(attrs={'class': 'new'}):
            a.extract()
        print a.text, a['href']
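But the code above does not remove the empty links from the final result. Can you tell me what I'm doing wrong?
UPD: After I corrected the code to the following: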
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
visa_req_table = soup.findAll("table", "nowraplinks hlist collapsible autocollapse navbox-inner")[1]
tables_regions = visa_req_table.findAll("table", "nowraplinks navbox-subgroup")
for single_table in tables_regions:
    non_new_links = lambda tag: (getattr(tag, 'name') == 'a' and
                                 'href' in a.attrs and
                                 'new' not in a.attrs.get('class', []))
    for a in single_table.find_all(non_new_links):
        print a.text, a['href']
I get the following error message:
Traceback (most recent call last):
  File ".../2.py", line 16, in <module>
    for a in single_table.find_all(non_new_links):
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1180, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 505, in _find_all
    found = strainer.search(i)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1540, in search
    found = self.search_tag(markup)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1496, in search_tag
    or (markup and self._matches(markup, self.name))
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1578, in _matches
    return match_against(markup)
  File ".../2.py", line 14, in <lambda>
    'href' in a.attrs and
NameError: global name 'a' is not defined
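What should I correct in the code to make it work?
Answer 0 (score: 1)
The only way to ask BeautifulSoup for elements that do not match a criterion is to give it a function that tests each element: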
non_new_links = lambda tag: (getattr(tag, 'name') == 'a' and
                             'href' in tag.attrs and
                             'new' not in tag.attrs.get('class', []))

for a in single_table.find_all(non_new_links):
    print a.text, a['href']  # loop body taken from the question's code
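The non_new_links function only matches tags that meet all three conditions: the tag is an a element, it has an href attribute, and its class list does not contain new.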
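To make the difference concrete, here is a small self-contained sketch. It is written for Python 3 (unlike the Python 2 snippets in this thread) and runs against made-up markup rather than the live Wikipedia page, so SAMPLE_HTML and the name non_new_link are my own illustrative assumptions. It shows that a.find(attrs={'class': 'new'}) only looks at the descendants of each link, which is why the first attempt removed nothing, while a function passed to find_all() tests each tag itself. It also reads everything from the function's own parameter, which avoids the NameError from the update, where the lambda body used a although its parameter is named tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one navbox table;
# class="new" marks an empty (red) link.
SAMPLE_HTML = """
<table>
  <tr><td>
    <a href="/wiki/Armenia">Armenia</a>
    <a href="/wiki/Missing_page" class="new">Missing page</a>
    <a name="anchor-only">No href</a>
    <a href="/wiki/Georgia" class="mw-redirect">Georgia</a>
  </td></tr>
</table>
"""


def non_new_link(tag):
    """Return True for <a> tags that have an href and lack the "new" class."""
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and "new" not in tag.get("class", [])
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() searches the descendants of each <a>, not the <a> itself,
# so this list stays empty even though one link has class="new".
descendant_hits = [
    a for a in soup.find_all("a", href=True)
    if a.find(attrs={"class": "new"})
]

# Passing the function to find_all() applies the test to every tag.
links = [(a.get_text(), a["href"]) for a in soup.find_all(non_new_link)]

for text, href in links:
    print(text, href)
```

With this sample markup, only the Armenia and Georgia links survive the filter; the link with class="new" and the anchor without an href are both skipped.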
I simplified your table search to:
for cell in soup.find_all('td', class_='nav-inner'):