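I am trying to scrape the Billboard Year-End Hot 100 singles pages on Wikipedia for 1992 through 2014 and then clean the data. I end up with an "invalid literal" error: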
years = range(1992, 2015)
yearstext = dict()
for year in years:
    t_1992 = requests.get('http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_%(year)s' % {"year": year})
    soup = BeautifulSoup(t_1992.text, "html.parser")
    yearstext[year] = soup

def parse_year(year, ytextdixt):
    rows = soup.find("table", attrs={"class": "wikitable"}).find_all("tr")[1:]
    cleaner = lambda r: [r[0].get_text(), int(r[1].get_text()), r[2].get_text(), r[2].find("a").get("href"), r[3].get_text(), r[3].find("a").get("href")]
    fields = ["band_singer", "ranking", "song", "songurl", "titletext", "url"]
    songs = [dict(zip(fields, cleaner(row.find_all("td")))) for row in rows]
ValueError: invalid literal for int() with base 10: 'Pharrell Williams'
Does anyone know why this happens?
Answer 0: (score: 0)
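In some cases `r[1].get_text()` returns 'Pharrell Williams',
and `int(r[1].get_text())` then raises this exception because that string is not a number.
Re-check what you are actually getting back from the URL.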
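One way to find the offending rows without aborting the whole run is to guard the conversion. This is only a minimal sketch: `to_int_or_none` is a hypothetical helper, not part of the question's code, and the sample cell texts are made up apart from the value shown in the traceback.

```python
def to_int_or_none(text):
    """Return int(text) if the text is a whole number, otherwise None."""
    try:
        return int(text.strip())
    except ValueError:
        # Non-numeric cell text, such as an artist name, ends up here.
        return None


# Cell texts as they might come back from r[1].get_text() (assumed samples).
cells = ["1", " 2\n", "Pharrell Williams"]
for cell in cells:
    print(repr(cell), "->", to_int_or_none(cell))
```

Any row for which the helper returns None is one where the second cell is not the ranking, so the column positions assumed by `cleaner` do not hold for that table.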
Answer 1: (score: 0)
After some experimenting, I found that:
from bs4 import BeautifulSoup
import requests
year = 1992
t_1992 = requests.get('http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_%(year)s' % {"year": year})
soup = BeautifulSoup(t_1992.content, "html.parser")
rows = soup.find("table", attrs={"class": "wikitable"}).find_all("tr")[1:]
rows[0].get_text()
gives:
u'\n1\n"End of the Road"\nBoyz II Men\n'
So using:
rows[0].get_text().strip().split('\n')
should get you on track.
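Building on the split above, a row's text can be turned into a small record. This is a sketch under stated assumptions: `parse_row_text` is a hypothetical helper, and it assumes every row has the same shape as the 1992 sample (rank, quoted title, artist, one per line). Rows with a different layout are rejected rather than guessed at.

```python
def parse_row_text(row_text):
    """Split one table row's text into ranking, song and band_singer.

    Assumes the layout of the sample row: rank, quoted title, artist,
    separated by newlines.
    """
    parts = [p.strip() for p in row_text.strip().split("\n") if p.strip()]
    if len(parts) != 3 or not parts[0].isdigit():
        raise ValueError("unexpected row layout: %r" % (parts,))
    ranking, song, band_singer = parts
    return {
        "ranking": int(ranking),
        "song": song.strip('"'),  # drop the surrounding quotation marks
        "band_singer": band_singer,
    }


# The row text reported above for the 1992 page.
sample = u'\n1\n"End of the Road"\nBoyz II Men\n'
print(parse_row_text(sample))
```

Checking the layout before calling `int()` means a row whose first field is not a number fails with a message showing the actual fields, which is easier to diagnose than the bare "invalid literal" error.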