python - BeautifulSoup find_all() returns "Invalid date"

Time: 2016-09-21 08:52:44

Tags: python beautifulsoup

My code:

import requests
import re
from bs4 import BeautifulSoup

r = requests.get(
    "https://www.traveloka.com/hotel/detail?spec=22-9-2016.24-9-2016.2.1.HOTEL.3000010016588.&nc=1474427752464")

data = r.content
soup = BeautifulSoup(data, "html.parser")
ratingdates = soup.find_all("div", {"class": "reviewDate"})

for i in range(0,10):
    print(ratingdates[i].get_text())

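This code prints "Invalid date" for every entry. How can I get the actual dates?

Additional note:

It seems the solution is to use Selenium or spynner, but I don't know how to use either. I also can't install spynner; it always gets stuck installing lxml.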

1 answer:

Answer 0: (score: 1)

If you use Selenium, it's really simple. Here is a basic example with some explanation:

To install Selenium, run pip install selenium

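import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Use Firefox as the WebDriver browser
driver = webdriver.Firefox()

# Load the page in the browser
driver.get(
    "https://www.traveloka.com/hotel/detail?spec=22-9-2016.24-9-2016.2.1.HOTEL.3000010016588.&nc=1474427752464")

# Pause 5 seconds after the page loads so the dates have time to be filled in.
# (implicitly_wait() would not pause here; it only sets a timeout for later element lookups.)
time.sleep(5)
# Get the rendered page source
data = driver.page_source
# Close the browser now that the source has been captured
driver.quit()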

# The rest is pretty much the same as in the question
soup = BeautifulSoup(data, "html.parser")
ratingdates = soup.find_all("div", {"class": "reviewDate"})

# Iterate over the results directly so every date is printed without index errors
for date_div in ratingdates:
    print(date_div.get_text())
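
Instead of waiting a fixed number of seconds, one option is to poll the page source until the dates have actually been rendered. The following is only a minimal sketch under a few assumptions: the helper names extract_review_dates and wait_for_review_dates are hypothetical, the placeholder text is assumed to be exactly "Invalid date" as reported in the question, and get_source is any callable that returns the current HTML (for example lambda: driver.page_source).

```python
import time

from bs4 import BeautifulSoup

# Assumption: this is the placeholder text shown before the real dates load,
# as reported in the question.
PLACEHOLDER = "Invalid date"


def extract_review_dates(html):
    """Return the text of every div.reviewDate, skipping empty or placeholder entries."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (
        div.get_text(strip=True)
        for div in soup.find_all("div", class_="reviewDate")
    )
    return [text for text in texts if text and text != PLACEHOLDER]


def wait_for_review_dates(get_source, timeout=10.0, interval=0.5):
    """Poll get_source() until real dates appear or `timeout` seconds have passed.

    Returns the list of dates found, which is empty if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        dates = extract_review_dates(get_source())
        if dates or time.monotonic() >= deadline:
            return dates
        time.sleep(interval)
```

With a Selenium driver that has already loaded the page, this could be called as wait_for_review_dates(lambda: driver.page_source). Polling returns as soon as the dates show up, so it is usually faster than a fixed pause and tolerates slower page loads up to the timeout.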

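For more information on using Selenium, see the documentation here: http://selenium-python.readthedocs.io/

If you don't want Firefox to pop up every time you run the script, you can use PhantomJS, a lightweight headless browser. After downloading it and completing the setup, change driver = webdriver.Firefox() to driver = webdriver.PhantomJS() in the example above.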