The website is: https://pokemongo.gamepress.gg/best-attackers-type
My code currently looks like this:
from bs4 import BeautifulSoup
import requests
import re
site = 'https://pokemongo.gamepress.gg/best-attackers-type'
headers = {'User-Agent': 'Mozilla/5.0'}  # define the request headers used below
page_data = requests.get(site, headers=headers)
soup = BeautifulSoup(page_data.text, 'html.parser')
check_gamepress = soup.body.findAll(text=re.compile("Strength"))
print(check_gamepress)
However, I really want to scrape specific pieces of data, and I'm having trouble doing it. For example, how would I scrape the part that shows the following for the best Bug type:
"Good typing and lightning-fast attacks. Though cool-looking, Scizor is somewhat fragile."
This information can obviously be updated, as it has been in the past, whenever a better Pokemon of that type comes along. So how can I scrape this data, which may be updated in the future, without having to change my code when that happens?
Thanks in advance for reading!
Answer 0 (score: 2):
This particular site is a bit tricky because of the way the HTML is organized. The relevant tags holding the information don't really have many distinguishing features, so we have to be a little clever. To complicate matters, the divs that hold the information are all siblings of one another throughout the page. We'll have to make up for this crooked web design with some ingenuity.
I did notice a (nearly) consistent pattern throughout the page. Each 'type' and its underlying section is split across 3 divs: one holding a heading such as
Dark Type: Tyranitar
one holding the speciality/move table, and one holding the ratings and commentary. The basic idea from there is that we can start organizing this markup mess with a loose process: locate each type heading, step over to its sibling divs, and parse each of them. A minimal sketch of that sibling hop is shown below.
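To make the sibling layout concrete, here is a small sketch against made-up, simplified markup (the class name, tables, and nesting below are placeholders, not the page's real HTML):

from bs4 import BeautifulSoup

# Simplified stand-in for one type section: a heading div followed by
# two sibling divs, one for the moves and one for the ratings.
html = """
<div class="field__item"><h3>Dark Type: Tyranitar</h3></div>
<div><table><tr><th>Specialty</th></tr></table></div>
<div><table><tr><th>Strength</th></tr></table></div>
"""

soup = BeautifulSoup(html, 'html.parser')
heading = soup.find('div', class_='field__item')
# next_siblings also yields the whitespace between the divs, so keep only Tags
divs = [sib for sib in heading.next_siblings if sib.name == 'div']
speciality_div, rating_div = divs[:2]
print(speciality_div.table.th.text, '/', rating_div.table.th.text)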
With that in mind, I put together a working solution. The meat of the code consists of 5 functions: one to find each section, one to extract the sibling divs, and three more to parse each of those divs.
import re
import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup
def type_section(tag):
    """Find the tags that have the move type and pokemon name"""
    pattern = r"[A-z]{3,} Type: [A-z]{3,}"
    # if all these things are true, it should be the right tag
    return all((tag.name == 'div',
                len(tag.get('class', '')) == 1,
                'field__item' in tag.get('class', []),
                re.findall(pattern, tag.text),
                ))

def parse_type_pokemon(tag):
    """Parse out the move type and pokemon from the tag text"""
    s = tag.text.strip()
    poke_type, pokemon = s.split(' Type: ')
    return {'type': poke_type, 'pokemon': pokemon}

def parse_speciality(tag):
    """Parse the tag containing the speciality and moves"""
    table = tag.find('table')
    rows = table.find_all('tr')
    speciality_row, fast_row, charge_row = rows
    speciality_types = []
    for anchor in speciality_row.find_all('a'):
        # Each type 'badge' has a href with the type name at the end
        href = anchor.get('href')
        speciality_types.append(href.split('#')[-1])
    fast_move = fast_row.find('td').text
    charge_move = charge_row.find('td').text
    return {'speciality': speciality_types,
            'fast_move': fast_move,
            'charge_move': charge_move}

def parse_rating(tag):
    """Parse the tag containing categorical ratings and commentary"""
    table = tag.find('table')
    category_tags = table.find_all('th')
    strength_tag, meta_tag, future_tag = category_tags
    str_rating = strength_tag.parent.find('td').text.strip()
    meta_rating = meta_tag.parent.find('td').text.strip()
    future_rating = future_tag.parent.find('td').text.strip()
    blurb_tags = table.find_all('td', {'colspan': '2'})
    if blurb_tags:
        # `if` to accommodate fire section bug
        str_blurb_tag, meta_blurb_tag, future_blurb_tag = blurb_tags
        str_blurb = str_blurb_tag.text.strip()
        meta_blurb = meta_blurb_tag.text.strip()
        future_blurb = future_blurb_tag.text.strip()
    else:
        str_blurb = meta_blurb = future_blurb = None
    return {'strength': {
                'rating': str_rating,
                'commentary': str_blurb},
            'meta': {
                'rating': meta_rating,
                'commentary': meta_blurb},
            'future': {
                'rating': future_rating,
                'commentary': future_blurb}
            }

def extract_divs(tag):
    """
    Get the divs containing the moves/ratings,
    determined based on sibling position from the type tag
    """
    _, speciality_div, _, rating_div, *_ = tag.next_siblings
    return speciality_div, rating_div

def main():
    """All together now"""
    url = 'https://pokemongo.gamepress.gg/best-attackers-type'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    types = {}
    for type_tag in soup.find_all(type_section):
        type_info = {}
        type_info.update(parse_type_pokemon(type_tag))
        speciality_div, rating_div = extract_divs(type_tag)
        type_info.update(parse_speciality(speciality_div))
        type_info.update(parse_rating(rating_div))
        type_ = type_info.get('type')
        types[type_] = type_info
    pprint(types)  # We did it
    with open('pokemon.json', 'w') as outfile:
        json.dump(types, outfile)

if __name__ == '__main__':
    main()
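To pull out the specific piece of information you asked about, the Bug-type strength blurb, you can read it straight out of the types dict or the saved pokemon.json. A small sketch, assuming the live page's heading really does split into a 'Bug' key the way parse_type_pokemon would produce:

import json

with open('pokemon.json') as infile:
    types = json.load(infile)

# Nesting follows parse_rating's return value; the 'Bug' key is assumed
# from how parse_type_pokemon splits a heading like 'Bug Type: Scizor'.
print(types['Bug']['strength']['commentary'])

Because the lookup is by type name rather than by Pokemon or by the blurb text itself, the same line keeps working when the site updates its recommendation to a different Pokemon.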
Now, there is just one small wrench in the whole thing. Remember when I said the pattern was almost entirely consistent? Well, the Fire type is the odd ball here, because it includes two Pokemon for that type, so the results for the Fire type come out incorrect. I, or some brave soul, may figure out a way around that. Or perhaps they'll settle on a single Fire Pokemon in the future.
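One possible direction, offered purely as a sketch since I don't have the Fire section's markup in front of me, would be to stop depending on exact sibling positions in extract_divs and instead walk forward to the next sibling divs that actually contain a table; whether that alone untangles the two-Pokemon Fire section is not guaranteed:

def extract_divs_robust(tag):
    """Scan forward for the next two sibling divs that contain a <table>,
    rather than relying on fixed sibling offsets like extract_divs does."""
    table_divs = [sib for sib in tag.find_next_siblings('div')
                  if sib.find('table') is not None]
    speciality_div, rating_div = table_divs[:2]
    return speciality_div, rating_div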
This code, the resulting JSON (prettified), and an archive of the HTML response used can be found in this gist.