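I have the following code to fetch the small table on the right side of Apple's Wikipedia page (the one containing basic company information):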
import requests
from bs4 import BeautifulSoup
WIKI_URL = "https://en.wikipedia.org/wiki/Apple_Inc."
req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
table_classes = {"class": ["infobox vcard"]}
wikitables = soup.find("table", table_classes)
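I would like to add a few extra lines of code to extract the itemized list in the table row labelled Products. In other words, I want to create a list that looks like this: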
['Macintosh',
'iPod',
'iPhone',
'iPad',
'Apple Watch',
'Apple TV',
'HomePod',
'macOS',
'iOS',
'watchOS',
'tvOS',
'iLife']
I want this to be part of the code so that I can use it to extract similar information from the Wikipedia pages of other companies. How can I do this?
Answer 0: (score: 0)
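You can scrape all the data from the Wikipedia infobox, clean each string by removing extraneous whitespace and newlines, and then group the strings to find each header row and its corresponding list of values: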
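# Python 2 version (urllib.urlopen only exists in Python 2)
from bs4 import BeautifulSoup as soup
import urllib
import re
import itertools
s = soup(urllib.urlopen('https://en.wikipedia.org/wiki/Apple_Inc.').read(), 'lxml')
# Header cells of the infobox, with runs of whitespace collapsed to one space
rows = map(lambda x: re.sub(r'\s+', ' ', x.text), s.find_all('th', {'scope': 'row'}))
# Text of the first infobox table, split on newlines
info_box = [re.split(r'\n+', i.text) for i in s.find_all('table', {'class': 'infobox vcard'})][0]
# Alternating runs of header strings and value strings
final_data = [list(b) for a, b in itertools.groupby(info_box, key=lambda x: x in rows)]
# The run of values that follows the 'Products' header
products = [a for i, a in enumerate(final_data) if final_data[i-1][0] == u'Products'][0]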
Output:
[u'Macintosh', u'iPod', u'iPhone', u'iPad', u'Apple Watch', u'Apple TV', u'HomePod', u'macOS', u'iOS', u'watchOS', u'tvOS', u'iLife', u'iWork']
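The grouping step is the least obvious part of the approach, so here is a minimal sketch of it in isolation. The `info_box` and `rows` lists are hand-written stand-ins (assumptions, not real scraped data), and `values_after` is a hypothetical helper name that does not appear in the answer's code.

```python
import itertools

# Hand-written stand-in for the infobox text after splitting on newlines
info_box = ['Founded', 'April 1, 1976',
            'Products', 'Macintosh', 'iPod', 'iPhone',
            'Website', 'apple.com']
# Hand-written stand-in for the header strings found in <th scope="row"> cells
rows = ['Founded', 'Products', 'Website']

# groupby collects consecutive items that share the same key, so the flat list
# becomes alternating runs of header strings and value strings
final_data = [list(group)
              for _, group in itertools.groupby(info_box, key=lambda x: x in rows)]


def values_after(header, groups):
    """Return the run of values that directly follows `header` (hypothetical helper)."""
    for previous, current in zip(groups, groups[1:]):
        if previous[-1] == header:
            return current
    return []


print(values_after('Products', final_data))
```

Pairing each run with its successor through `zip` avoids the `final_data[i-1]` lookup, which wraps around to the last run when `i` is 0, and it returns an empty list instead of raising `IndexError` when the header is missing.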
Edit: for Python 3.x, this solution uses the more powerful requests module instead of urllib:
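from bs4 import BeautifulSoup as soup
import requests
import re
import itertools
# .text is already a str, so no extra str() call is needed
s = soup(requests.get('https://en.wikipedia.org/wiki/Apple_Inc.').text, 'html.parser')
# Header cells of the infobox; list() is needed because map is lazy in Python 3
rows = list(map(lambda x: re.sub(r'\s+', ' ', x.text), s.find_all('th', {'scope': 'row'})))
# Text of the first infobox table, split on newlines
info_box = [re.split(r'\n+', i.text) for i in s.find_all('table', {'class': 'infobox vcard'})][0]
# Alternating runs of header strings and value strings
final_data = [list(b) for a, b in itertools.groupby(info_box, key=lambda x: x in rows)]
# The run of values that follows the 'Products' header
products = [a for i, a in enumerate(final_data) if final_data[i-1][0] == 'Products'][0]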
Output:
['Macintosh', 'iPod', 'iPhone', 'iPad', 'Apple Watch', 'Apple TV', 'HomePod', 'macOS', 'iOS', 'watchOS', 'tvOS', 'iLife', 'iWork']
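Since the question asks for something reusable across company pages, one possible alternative is to walk the infobox markup directly rather than its flattened text. The sketch below is not the answer's method: `infobox_values` is a hypothetical function name, the `sample` HTML is a hand-written stand-in for a real page, and it assumes the row has a `<th>` header followed by a sibling `<td>`, which Wikipedia's markup may not always provide.

```python
from bs4 import BeautifulSoup


def infobox_values(html, header):
    """Return the entries of the infobox row whose header cell reads `header`."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches a single CSS class, so "infobox vcard" tables are found too
    table = soup.find("table", class_="infobox")
    if table is None:
        return []
    for th in table.find_all("th"):
        if th.get_text(" ", strip=True) == header:
            td = th.find_next_sibling("td")
            if td is None:
                return []
            items = td.find_all("li")
            if items:
                return [li.get_text(" ", strip=True) for li in items]
            # No <li> markup in this cell: fall back to its text fragments
            return list(td.stripped_strings)
    return []


# Hand-written sample standing in for a downloaded page (assumption)
sample = """
<table class="infobox vcard">
  <tr><th scope="row">Founded</th><td>April 1, 1976</td></tr>
  <tr><th scope="row">Products</th><td><div><ul>
    <li><a href="#">Macintosh</a></li>
    <li><a href="#">iPhone</a></li>
  </ul></div></td></tr>
</table>
"""

print(infobox_values(sample, "Products"))
```

For a live page you would pass `requests.get(url).text` as `html`. A company whose infobox has no such row, or has no infobox at all, yields an empty list rather than an exception.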