I'm scraping this page: https://www.usatoday.com/sports/ncaaf/sagarin/, and it's just a jumble of font tags. I've been able to scrape the data I need successfully, but I'm curious whether I could write this "cleaner", for lack of a better word. When I clean up the scraped data I have to use three different temporary lists, which seems silly.
For example, here's the snippet of my code that gets the overall rating for every team in the table on that page:
import re
import urllib.request

import bs4 as bs

source = urllib.request.urlopen('https://www.usatoday.com/sports/ncaaf/sagarin/').read()
soup = bs.BeautifulSoup(source, "lxml")
page_source = soup.find("font", {"color": "#000000"})
sagarin_raw_rating_list = page_source.find_all("font", {"color": "#9900ff"})
raw_ratings = sagarin_raw_rating_list[:-1]
temp_list = [element.text for element in raw_ratings]
temp_list_cleanup1 = [element for element in temp_list if element != 'RATING']
temp_list_cleanup2 = re.findall(r" \s*(-?\d+\.\d+)", str(temp_list_cleanup1))
final_ratings_list = [element for element in temp_list_cleanup2 if element != home_team_advantage]  # home_team_advantage is scraped by another piece of code
print(final_ratings_list)
This is a private project for me and some friends, so I'm the only one maintaining it, but it still feels a bit convoluted. Part of the problem is the site itself, since I have to do a lot of work just to extract the relevant data.
Answer (score: 1)
The main thing I see is that you convert temp_list_cleanup1 into one huge string unnecessarily. I don't think there will be much difference between re.findall on one giant string and re.search on a bunch of smaller strings. After that, you can swap most of the list comprehensions [...] for generator expressions (...). It won't eliminate any lines of code, but you won't store extra lists that you never need again:
temp_iter = (element.text for element in raw_ratings)
temp_iter_cleanup1 = (element for element in temp_iter if element != 'RATING')
# search each element individually, rather than one large string
temp_iter_cleanup2 = (re.search(r" \s*(-?\d+\.\d+)", element).group(1)
                      for element in temp_iter_cleanup1)
# here do a list comprehension so that you have the scrubbed data stored
final_ratings_list = [element for element in temp_iter_cleanup2 if element != home_team_advantage]
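If you want to take it one step further, the whole chain can also be collapsed into a single pass with a precompiled pattern. This is just a sketch under the question's assumptions (raw_ratings and home_team_advantage are defined as in the original snippet, with home_team_advantage being a string like the extracted ratings):

import re

# Compile the shared pattern once instead of re-parsing it per element.
rating_re = re.compile(r" \s*(-?\d+\.\d+)")

# Lazily attempt a match on every non-header element...
matches = (rating_re.search(element.text)
           for element in raw_ratings
           if element.text != 'RATING')
# ...then keep only the captured numbers, minus the home-field advantage entry.
final_ratings_list = [m.group(1) for m in matches
                      if m is not None and m.group(1) != home_team_advantage]
print(final_ratings_list)

One design note: the is-not-None guard means elements without a parsable number are silently dropped, whereas the .group(1) call inside the generator version above would raise AttributeError on a failed match.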