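I'm working on a project where I'm trying to scrape data from this Wikipedia page. I want the column containing the years (which happen to be in <th> elements) and the fourth column, "Walt Disney Parks and Resorts".
Code: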
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://en.wikipedia.org/wiki/The_Walt_Disney_Company#Revenues")
bsObj = BeautifulSoup(html, "html.parser")
t = open("scrape_project.txt", "w")
year = bsObj.find("table", {"class":"wikitable"}).tr.next_sibling.next_sibling.th
money = bsObj.find("table", {"class":"wikitable"}).td.next_sibling.next_sibling.next_sibling.next_sibling
for year_data in year:
    year.sup.clear()
    print(year.get_text())
for revenue in money:
    print(money.get_text())
t.close()
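Right now, when I run this from the terminal, all it prints is 1991 (twice) and 2,794. I need it to print all the years and the associated revenue for Walt Disney Parks and Resorts. I'm also trying to get it to write to the file "scrape_project.txt".
Any help would be greatly appreciated!
Answer 0 (score: 0)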
from pprint import pprint as pp
from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen("https://en.wikipedia.org/wiki/The_Walt_Disney_Company#Revenues")
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})
# get all rows, skipping the first (empty) one
data = table.select("tr")[1:]
# the year is in the <th> cell that carries a scope attribute; keep its first four characters
years = [row.select("th[scope]")[0].text[:4] for row in data]
# Walt Disney Parks and Resorts is the third <td> element in each row
rec = [row.select("td")[2].text for row in data]
pp(years)
pp(rec)
This gives you the data:
['1991',
'1992',
'1993',
'1994',
'1995',
'1996',
'1997',
'1998',
'1999',
'2000',
'2001',
'2002',
'2003',
'2004',
'2005',
'2006',
'2007',
'2008',
'2009',
'2010',
'2011',
'2012',
'2013',
'2014']
['2,794.0',
'3,306',
'3,440.7',
'3,463.6',
'3,959.8',
'4,142[Rev 3]',
'5,014',
'5,532',
'6,106',
'6,803',
'6,009',
'6,691',
'6,412',
'7,750',
'9,023',
'9,925',
'10,626',
'11,504',
'10,667',
'10,761',
'11,797',
'12,920',
'14,087',
'15,099']
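The question also asks for the results to be written to scrape_project.txt. A minimal sketch of that step, assuming years and rec are the two lists printed above; write_revenue is a hypothetical helper name, and the tab-separated layout is just one possible choice (a tab is used because the amounts themselves contain commas):

```python
def write_revenue(path, years, revenues):
    """Write one "year<TAB>revenue" line per pair and return the line count."""
    if len(years) != len(revenues):
        raise ValueError("years and revenues must have the same length")
    # the with-block closes the file even if a write fails
    with open(path, "w", encoding="utf-8") as f:
        for year, revenue in zip(years, revenues):
            f.write(f"{year}\t{revenue}\n")
    return len(years)


# first three values copied from the output above, for illustration only
years = ["1991", "1992", "1993"]
rec = ["2,794.0", "3,306", "3,440.7"]
write_revenue("scrape_project.txt", years, rec)
```

Opening the file inside a with-block means there is no separate close() call to forget.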
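The years are sliced with text[:4] to cut off the revision markers; if you want to keep that information, don't slice. If you also want to remove it from the money values, i.e. remove Rev 3 from '4,142[Rev 3]', you can use a regex: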
import re

# raw string so the backslashes reach the regex engine unchanged
m = re.compile(r"\d+,\d+")
rec = [m.search(row.select("td")[2].text).group() for row in data]
Which gives you (note that this pattern also drops the digits after the decimal point):
['2,794',
'3,306',
'3,440',
'3,463',
'3,959',
'4,142',
'5,014',
'5,532',
'6,106',
'6,803',
'6,009',
'6,691',
'6,412',
'7,750',
'9,023',
'9,925',
'10,626',
'11,504',
'10,667',
'10,761',
'11,797',
'12,920',
'14,087',
'15,099']
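If the decimals should survive, one alternative is to delete the bracketed marker instead of matching the digits. This is only a sketch of that idea, not part of the original answer: clean_amount and to_number are hypothetical helper names, and it assumes every marker is enclosed in square brackets like '[Rev 3]'.

```python
import re

# anything between "[" and the next "]", brackets included
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def clean_amount(text):
    """Remove bracketed markers such as "[Rev 3]" and surrounding whitespace."""
    return FOOTNOTE.sub("", text).strip()


def to_number(text):
    """Convert an amount such as "3,440.7" to a float."""
    return float(clean_amount(text).replace(",", ""))


# sample values copied from the first output above
raw = ["2,794.0", "4,142[Rev 3]", "3,440.7"]
print([clean_amount(value) for value in raw])  # ['2,794.0', '4,142', '3,440.7']
print([to_number(value) for value in raw])     # [2794.0, 4142.0, 3440.7]
```

Because nothing is matched by position, this also works for amounts without a thousands separator, which the \d+,\d+ pattern would not find.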
Answer 1 (score: -1)
There must be a cleaner way to get there, but this will do.
from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen("https://en.wikipedia.org/wiki/The_Walt_Disney_Company#Revenues")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})
# every data row starts with a <th scope="row"> cell holding the year
rows = table.find_all("th", {"scope": "row"})
for each in rows:
    # six next_sibling hops reach the third <td> after the year,
    # assuming a whitespace text node sits between neighbouring cells
    string = each.text[:4] + ", $" + \
        each.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.text
    print(string)
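One possibly cleaner variant replaces the chain of next_sibling hops with Beautiful Soup's find_next_siblings, which returns only tag siblings and so does not depend on the whitespace between cells. The sketch below runs against a small inline table rather than the live page: SAMPLE_HTML is made up for illustration (only the years and the Parks values are taken from the output earlier in this thread; the column names and other cells are placeholders), and parks_revenue_by_year is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# made-up stand-in for the Wikipedia table; placeholder cells are "a1", "b1", ...
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>Year</th><th>Col A</th><th>Col B</th><th>Parks and Resorts</th></tr>
<tr>
<th scope="row">1991<sup>[1]</sup></th>
<td>a1</td>
<td>b1</td>
<td>2,794.0</td>
</tr>
<tr>
<th scope="row">1992</th>
<td>a2</td>
<td>b2</td>
<td>3,306</td>
</tr>
</table>
"""


def parks_revenue_by_year(html):
    """Return (year, revenue) pairs, taking the third <td> of each data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    pairs = []
    for header in table.find_all("th", {"scope": "row"}):
        # only <td> tags are returned, so text nodes between cells are ignored
        cells = header.find_next_siblings("td")
        pairs.append((header.get_text()[:4], cells[2].get_text(strip=True)))
    return pairs


for year, revenue in parks_revenue_by_year(SAMPLE_HTML):
    print(year + ", $" + revenue)
```

Indexing cells[2] makes the choice of column explicit, so switching to another column is a one-character change.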