I am trying to process this page:
https://play.google.com/store/movies/details?id=3B6EBBD94D13B4DCMV
I am using the following code to read the HTML:
    from BeautifulSoup import BeautifulSoup as BS
    import requests

    def read_html(url):
        try:
            res = requests.get(url)
            if res.status_code == 200:
                html_content = res.content
                soup = BS(html_content)
                return _get_type(soup)
            else:
                print res.status_code
        except ValueError, e:
            print e

    def _get_type(soup):
        """Read Movie."""
        mydivs = soup.findAll("span", {"class": "DBzzzb"})
        if mydivs:
            return 'AVAILABLE'
        mydivs = soup.findAll("span", {"class": "DBzzzb"})
        if mydivs:
            return 'PREORDER'
        mydivs = soup.findAll("div", {"class": "Wc4pU"})
        if mydivs:
            return 'NOT_AVAILABLE'
        return 'INVALID'
My condition never matches: soup.findAll("div", {"class": "Wc4pU"})
even though the page does contain this HTML:
<div class="Wc4pU">We'll notify you on your wishlist when movies become available</div>
Source HTML:
view-source:https://play.google.com/store/movies/details?id=3B6EBBD94D13B4DCMV
Answer 0 (score: 2)
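You need to specify the parser explicitly. Note that the two-argument form below is the bs4 API (from bs4 import BeautifulSoup as BS) and that the html5lib package must be installed: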
soup = BS(html_content, 'html5lib')
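This may also make the process faster, though that depends on the parser: html5lib is the most lenient option but also the slowest, and 'lxml' is usually quicker if it is installed.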
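For reference, here is a minimal Python 3 sketch of the same check with the parser named explicitly. It assumes the bs4 package, and it runs against a static copy of the div quoted in the question rather than the live page, so it says nothing about what Google Play actually serves to requests. The names SAMPLE_HTML and get_type are made up for this example, and it defaults to the built-in "html.parser" only so it runs without extra packages; pass "html5lib" to match the line above. The PREORDER branch is left out because the question uses the same class for it as for AVAILABLE.

```python
from bs4 import BeautifulSoup

# Static sample built from the div quoted in the question (no network access).
SAMPLE_HTML = (
    '<html><body>'
    '<div class="Wc4pU">We\'ll notify you on your wishlist '
    'when movies become available</div>'
    '</body></html>'
)


def get_type(html, parser="html.parser"):
    """Classify a page by the marker elements the question looks for."""
    # Naming the parser keeps the resulting tree the same on every machine;
    # "html5lib" or "lxml" can be passed here if those packages are installed.
    soup = BeautifulSoup(html, parser)
    if soup.find_all("span", {"class": "DBzzzb"}):
        return "AVAILABLE"
    if soup.find_all("div", {"class": "Wc4pU"}):
        return "NOT_AVAILABLE"
    return "INVALID"


result = get_type(SAMPLE_HTML)
print(result)
```

If this returns the expected value for the static sample but not for the downloaded page, the difference is in the HTML that requests receives, not in the selector.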