Scraping the English version of a website

Time: 2016-06-16 17:09:54

Tags: python web-scraping beautifulsoup

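I am trying to scrape the English version of a Japanese website. The problem is that the Japanese and English versions share the same link. Is there a way to tell BeautifulSoup to scrape the English version rather than the Japanese one?

The link I want to scrape: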

https://data.j-league.or.jp/SFMS02/?match_card_id=17975

1 answer:

Answer 0 (score: 2)

To show that adding the lang=en URL query parameter does work:

>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> url = "https://data.j-league.or.jp/SFMS02/?match_card_id=17975"
>>> english_url = "https://data.j-league.or.jp/SFMS02/?match_card_id=17975&lang=en"
>>>
>>> print(BeautifulSoup(requests.get(url).content, "html.parser").find(class_="team-name").get_text(strip=True))
サガン鳥栖
>>> print(BeautifulSoup(requests.get(english_url).content, "html.parser").find(class_="team-name").get_text(strip=True))
Sagan Tosu
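
Rather than concatenating "&lang=en" by hand, the parameter can be added with the standard library so that a URL which already carries a lang value is not given a second one. This is only a minimal sketch: the helper name with_lang is hypothetical, and the only thing it assumes about the site is the lang parameter shown above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_lang(url, lang="en"):
    """Return url with its lang query parameter set to the given value."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous "lang" value.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "lang"]
    query.append(("lang", lang))
    return urlunsplit(parts._replace(query=urlencode(query)))


english_url = with_lang("https://data.j-league.or.jp/SFMS02/?match_card_id=17975")
print(english_url)
```

The resulting english_url can be passed to requests.get exactly as in the session above.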

Note that you can also set the SFCM01LANG cookie to the value en instead:

>>> url = "https://data.j-league.or.jp/SFMS02/?match_card_id=17975"
>>> response = requests.get(url, cookies={'SFCM01LANG': 'en'})
>>> soup = BeautifulSoup(response.content, "html.parser")
>>> print(soup.find(class_="team-name").get_text(strip=True)) 
Sagan Tosu
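
When several match pages are scraped, the cookie can be set once on a requests.Session instead of being passed to every call. The following is a rough sketch under that assumption: the helper names make_english_session and team_name are hypothetical, and the cookie name, URL, and "team-name" class are taken from the examples above.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://data.j-league.or.jp/SFMS02/"


def make_english_session():
    """Create a session that sends the SFCM01LANG=en cookie with each request."""
    session = requests.Session()
    # No domain is given, so the cookie is sent to any host this session contacts.
    session.cookies.set("SFCM01LANG", "en")
    return session


def team_name(session, match_card_id):
    """Fetch one match page and return the text of the first team-name element."""
    response = session.get(BASE_URL, params={"match_card_id": match_card_id})
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    element = soup.find(class_="team-name")
    # find() returns None when the page has no such element.
    return element.get_text(strip=True) if element is not None else None
```

A call such as team_name(make_english_session(), 17975) would then request the same page as above with the language cookie attached.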