I want to grab the age, place of birth, and previous occupation of each senator. Wikipedia has information on each senator on their respective pages, and there is another page with a table listing all the senators' names. How can I go through that list, follow the link to each senator's page, and grab the information I want?
Here's what I've done so far.
1. (No Python) Found out that DBpedia exists and wrote a query to search for senators. Unfortunately, DBpedia hasn't categorized most (if any) of them:
SELECT ?senator ?country
WHERE {
    ?senator rdf:type <http://dbpedia.org/ontology/Senator> .
    ?senator <http://dbpedia.org/ontology/nationality> ?country
}
The query results were unsatisfactory.
2. Found out that there is a Python module called wikipedia that lets me search and retrieve information from individual wiki pages. Used it to get the list of senator names from the table by looking at the hyperlinks.
import wikipedia as w
w.set_lang('pt')
# Grab page with table of senator names.
s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])
# Get links to senator names by removing links of no interest.
# For each link in the page, check if it's a link to a senator page:
senators = [name for name in s.links
            # senator names don't contain digits nor commas,
            if not (any(char.isdigit() or char == ',' for char in name)
                    # and full names always contain spaces.
                    or ' ' not in name)]
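The filtering rule above can be checked in isolation. A minimal sketch, using made-up link titles (the actual titles returned by `s.links` will differ):

```python
def looks_like_senator_name(name):
    # Senator names don't contain digits or commas...
    if any(char.isdigit() or char == ',' for char in name):
        return False
    # ...and full names always contain at least one space.
    return ' ' in name

# Hypothetical link titles: a senator-like name, a party acronym,
# a year-bearing title, and a comma-bearing title.
links = ['Aécio Neves', 'PSDB', 'Eleições de 2014', 'Brasília, DF']
print([name for name in links if looks_like_senator_name(name)])
# → ['Aécio Neves']
```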
At this point I'm a bit lost. The list senators contains all the senators' names, but it also contains other names, e.g., party names. And the wikipedia module (at least from what I could find in its API documentation) doesn't implement any functionality for following links or searching tables.
I've seen two related entries here on StackOverflow that seem helpful, but they (here and here) both extract information from a single page.
Can anyone point me toward a solution?
Thanks!
Answer 0 (score: 1)
OK, so I figured it out (thanks to the comments pointing me to BeautifulSoup).
There's really no big secret to achieving what I wanted. I just had to go through the list with BeautifulSoup and store all the links, then open each stored link with urllib2, call BeautifulSoup on the response, and done. Here's the solution:
import urllib2 as url
import wikipedia as w
from bs4 import BeautifulSoup as bs
import re
# A dictionary to store the data we'll retrieve.
d = {}
# 1. Grab the list from wikipedia.
w.set_lang('pt')
s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])
html = url.urlopen(s.url).read()
soup = bs(html, 'html.parser')
# 2. Names and links are on the second column of the second table.
table2 = soup.findAll('table')[1]
for row in table2.findAll('tr'):
    for colnum, col in enumerate(row.find_all('td')):
        if (colnum+1) % 5 == 2:
            a = col.find('a')
            link = 'https://pt.wikipedia.org' + a.get('href')
            d[a.get('title')] = {}
            d[a.get('title')]['link'] = link
# 3. Now that we have the links, we can iterate through them,
# and grab the info from the table.
for senator, data in d.iteritems():
    page = bs(url.urlopen(data['link']).read(), 'html.parser')
    # (flatten list trick: [a for b in nested for a in b])
    rows = [item for table in
            [item.find_all('td') for item in page.find_all('table')[0:3]]
            for item in table]
    for rownumber, row in enumerate(rows):
        if row.get_text() == 'Nascimento':
            birthinfo = rows[rownumber+1].getText().split('\n')
            try:
                d[senator]['birthplace'] = birthinfo[1]
            except IndexError:
                d[senator]['birthplace'] = ''
            birth = re.search(r'(.*\d{4}).*\((\d{2}).*\)', birthinfo[0])
            d[senator]['birthdate'] = birth.group(1)
            d[senator]['age'] = birth.group(2)
        if row.get_text() == 'Partido':
            d[senator]['party'] = rows[rownumber + 1].getText()
        if 'Profiss' in row.get_text():
            d[senator]['profession'] = rows[rownumber + 1].getText()
Pretty straightforward. BeautifulSoup works wonders =)
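One detail worth a note is the regex in step 3: it captures everything up to a four-digit year as the birth date, and the two digits inside parentheses as the age. A quick check against a made-up "Nascimento" cell (the real cell text on the senator pages may vary):

```python
import re

# Hypothetical content of the cell following 'Nascimento'.
text = '15 de março de 1955 (61 anos)'
birth = re.search(r'(.*\d{4}).*\((\d{2}).*\)', text)
print(birth.group(1))  # → 15 de março de 1955
print(birth.group(2))  # → 61
```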