Question

I have been trying to extract infobox content using the wikipedia Python package.

My code is as follows (for the Aldi page):

import wikipedia
Aldi = wikipedia.page('Aldi')

When I enter:

Aldi.content

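I get the article text but not the infobox.

I tried getting the data from DBpedia but had no luck. I also tried extracting the page with BeautifulSoup4, but the table is structured oddly (an image spans two columns and is followed by unnamed columns).

This is as far as I got with BeautifulSoup: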

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

site = "http://en.wikipedia.org/wiki/Aldi"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
print(soup)

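I also looked at Wikidata, but it doesn't contain most of the information I need.

I'm not necessarily set on a Python package as the solution. Anything that can parse the table would be great.

Ideally, I'd like a dictionary containing the infobox values: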

Type     Private
Industry Retail

and so on...

Answer 1

A BeautifulSoup-based solution:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

site = "http://en.wikipedia.org/wiki/Aldi"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page.read(), 'html.parser')
# Match on the 'infobox' class alone; the full class string varies between pages
table = soup.find('table', class_='infobox')
result = {}
exceptional_row_count = 0
for tr in table.find_all('tr'):
    th, td = tr.find('th'), tr.find('td')
    if th and td:
        # label/value row: the header cell is the key, the data cell is the value
        result[th.text.strip()] = td.text.strip()
    else:
        # rows without a label/value pair (e.g. the logo row) fall here
        exceptional_row_count += 1
if exceptional_row_count > 1:
    print('WARNING ExceptionalRow>1: ', table)
print(result)

Tested on http://en.wikipedia.org/wiki/Aldi, but not thoroughly tested on other Wikipedia pages.
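Since this was only tried on one page, it may help to have a version of the row logic that can be checked offline against a small HTML fragment. The following is a minimal sketch, not the answer's own code: it swaps BeautifulSoup for the standard library's html.parser so it has no dependencies, and the names InfoboxParser, parse_infobox and SAMPLE are made up for illustration. It assumes the infobox is the first table whose class list contains "infobox" and that rows are explicitly closed with </tr>.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect th/td pairs from the first table whose classes include 'infobox'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # table nesting depth inside the infobox (0 = outside)
        self.cell = None    # 'th' or 'td' while inside a cell of a top-level row
        self.skip = False   # True while inside <style> or <script>
        self.done = False   # True once the infobox table has been closed
        self.row = {'th': [], 'td': []}
        self.result = {}

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == 'table':
            classes = (dict(attrs).get('class') or '').split()
            if self.depth or 'infobox' in classes:
                self.depth += 1
        elif tag in ('style', 'script'):
            self.skip = True
        elif tag == 'br' and self.cell:
            self.row[self.cell].append(' ')  # keep words on either side apart
        elif self.depth == 1:
            if tag == 'tr':
                self.row = {'th': [], 'td': []}
            elif tag in ('th', 'td'):
                self.cell = tag

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag == 'table':
            if self.depth:
                self.depth -= 1
                self.done = self.depth == 0
        elif tag in ('style', 'script'):
            self.skip = False
        elif self.depth == 1:
            if tag in ('th', 'td'):
                self.cell = None
            elif tag == 'tr':
                key = ' '.join(''.join(self.row['th']).split())
                value = ' '.join(''.join(self.row['td']).split())
                # Rows lacking either a label or a value (title, logo) are skipped
                if key and value:
                    self.result[key] = value

    def handle_data(self, data):
        if self.cell and not self.skip and not self.done:
            self.row[self.cell].append(data)


def parse_infobox(html):
    """Return a dict of infobox labels to values for an HTML string."""
    parser = InfoboxParser()
    parser.feed(html)
    parser.close()
    return parser.result


# Hypothetical fragment shaped like the table described in the question
SAMPLE = """
<table class="infobox vcard">
<tr><td colspan="2"><img src="logo.png"></td></tr>
<tr><th>Type</th><td>Private</td></tr>
<tr><th>Industry</th><td><a href="/wiki/Retail">Retail</a></td></tr>
</table>
"""

print(parse_infobox(SAMPLE))  # {'Type': 'Private', 'Industry': 'Retail'}
```

The logo row that spans both columns has no th cell, so it is dropped rather than raising an error. To use it on a live page, pass the decoded response body (for example page.read().decode('utf-8')) to parse_infobox.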

Answer 2

My solution:

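from bs4 import BeautifulSoup as bs
from urllib.parse import quote
from urllib.request import Request, urlopen

query = 'Albert Einstein'
# Wikipedia article URLs use underscores in place of spaces
url = 'https://en.wikipedia.org/wiki/' + quote(query.replace(' ', '_'))

def infobox():
    # Send a User-Agent header, as in the question's code
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = bs(urlopen(req).read(), 'html.parser')
    # Match on 'infobox' alone; the full class string differs between pages
    table = soup.find('table', {'class': 'infobox'})
    for tr in table.find_all('tr'):
        print(tr.text)

infobox()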
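The row text printed this way tends to carry embedded line breaks and numeric footnote markers such as [1]. A small clean-up step, sketched here under the assumption that only numeric markers should be removed, makes the text usable as dictionary keys and values. The helper name clean_cell is hypothetical and is not part of BeautifulSoup or the wikipedia package.

```python
import re


def clean_cell(text):
    """Tidy raw infobox cell text for use as a dictionary key or value."""
    # Drop numeric footnote markers such as [1] or [23]; other brackets are kept
    text = re.sub(r'\[\d+\]', '', text)
    # Collapse runs of whitespace (including newlines and non-breaking spaces)
    return ' '.join(text.split())


print(clean_cell('Discount\n supermarket[3]'))  # Discount supermarket
```

Inside the loop, print(clean_cell(tr.text)) would give one tidy line per row.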

How to extract the infobox vcard from Wikipedia using the Python wikipedia library
