Question

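I am trying to extract some data from a Wikipedia page, and I only want the non-empty links. Empty links carry a class named "new", and I want to filter the results on that condition. To do so, I am using the following code: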

import urllib2

from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)

soup = BeautifulSoup(page)

visa_req_table = soup.findAll("table", "nowraplinks hlist collapsible autocollapse navbox-inner")[1]
tables_regions = visa_req_table.findAll("table", "nowraplinks navbox-subgroup")
for single_table in tables_regions:
    for a in single_table.findAll('a', href=True):
        if a.find(attrs={'class': 'new'}):
            a.extract()
        print a.text, a['href']

But with the code above I cannot remove the empty links from the final output. Can you tell me what I am doing wrong?

UPD: After correcting the code to the following:

import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Visa_requirements_for_Belarusian_citizens"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
visa_req_table = soup.findAll("table", "nowraplinks hlist collapsible autocollapse navbox-inner")[1]
tables_regions = visa_req_table.findAll("table", "nowraplinks navbox-subgroup")
for single_table in tables_regions:
    non_new_links = lambda tag: (getattr(tag, 'name') == 'a' and
                                 'href' in a.attrs and
                                 'new' not in a.attrs.get('class', []))
    for a in single_table.find_all(non_new_links):
        print a.text, a['href']

I see the following error message:

Traceback (most recent call last):
  File ".../2.py", line 16, in <module>
    for a in single_table.find_all(non_new_links):
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1180, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 505, in _find_all
    found = strainer.search(i)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1540, in search
    found = self.search_tag(markup)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1496, in search_tag
    or (markup and self._matches(markup, self.name))
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1578, in _matches
    return match_against(markup)
  File ".../2.py", line 14, in <lambda>
    'href' in a.attrs and
NameError: global name 'a' is not defined

What should I correct in the code to make it work?

Answer 1

The only way to ask BeautifulSoup for elements that do not match a condition is to give it a function that tests each element:

non_new_links = lambda tag: (getattr(tag, 'name') == 'a' and
                             'href' in tag.attrs and
                             'new' not in tag.attrs.get('class', []))
for a in single_table.find_all(non_new_links):
    print a.text, a['href']

The non_new_links function matches only tags that satisfy all three conditions.
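The NameError in the question's update comes from the lambda body referring to `a` while its parameter is named `tag`; every test has to use the parameter. A minimal, self-contained sketch of the same filter follows, written for Python 3 and run against a small inline HTML fragment instead of the live Wikipedia page. The markup in `SAMPLE_HTML` and the choice of the built-in `html.parser` are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one navbox sub-table.
SAMPLE_HTML = """
<table>
  <tr><td>
    <a href="/wiki/Russia">Russia</a>
    <a href="/w/index.php?title=Nowhere&amp;action=edit" class="new">Nowhere</a>
    <a name="anchor-only">no href</a>
    <a href="/wiki/Ukraine" class="mw-redirect">Ukraine</a>
  </td></tr>
</table>
"""


def non_new_links(tag):
    """Match <a> tags that have an href and do not carry the class 'new'."""
    return (tag.name == 'a'
            and tag.has_attr('href')
            and 'new' not in tag.get('class', []))


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find_all() calls the function once per tag and keeps the tags it accepts.
links = [(a.get_text(), a['href']) for a in soup.find_all(non_new_links)]
for text, href in links:
    print(text, href)
```

A named function is used here instead of a lambda only for readability; both forms are passed to `find_all()` in the same way.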

I simplified your table search to:

for cell in soup.find_all('td', class_='nav-inner'):
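As for why the first snippet in the question still prints the empty links: `a.find(attrs={'class': 'new'})` searches the descendants of `a`, not `a` itself, so for a plain link it returns None. Even if `extract()` did run, it only detaches the tag from the tree, and the `print` on the next line would still execute. The sketch below illustrates this in Python 3 on a hypothetical two-link fragment (the markup is an assumption, not taken from the page), and shows the smaller fix of checking the tag's own class and skipping it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one ordinary link and one "new" (empty) link.
html = '<p><a href="/wiki/A">A</a> <a href="/w/B" class="new">B</a></p>'
soup = BeautifulSoup(html, 'html.parser')

red = soup.find('a', class_='new')

# find() looks inside the tag, so the class on the tag itself is never seen.
found_inside = red.find(attrs={'class': 'new'})

# Checking the tag's own attributes and skipping with `continue` works.
kept = []
for a in soup.find_all('a', href=True):
    if 'new' in a.get('class', []):
        continue
    kept.append((a.get_text(), a['href']))

print(found_inside)
print(kept)
```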
