Question

I'm following an online tutorial (https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/) on scraping HTML tables. When I follow the tutorial I can scrape the table data, but when I try to scrape data from this site (https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11) it doesn't work.

I tried using scrapy before, but got the same result.

Here is the code I used.

import urllib.request

from bs4 import BeautifulSoup

# Results-history page to scrape
wiki = "https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "lxml")

# Every <table> in the downloaded HTML
all_tables = soup.find_all('table')

# The table that should hold the draw results
right_table = soup.find('table', class_='zebra-body-only')
print(right_table)

This is what I get when I run this code in the terminal:

<table cellspacing="0" class="zebra-body-only">
<tbody id="target-area">
</tbody>
</table>

However, when I browse the Mass Lottery site in Google Chrome, this is what I see:

<table cellspacing="0" class="zebra-body-only">
<tbody id="target-area">
<tr class="odd">
<th>Draw #</th>
<th>Draw Date</th>
<th>Winning Number</th>
<th>Bonus</th>
</tr>
<tr><td>2107238</td>
<td>03/04/2019</td>
<td>01-04-05-16-23-24-27-32-34-41-42-44-47-49-52-55-63-65-67-78</td><td>No Bonus</td>
</tr>
<tr class="odd">
<td>2107239</td>
<td>03/04/2019</td>
<td>04-05-11-15-19-20-23-24-25-28-41-45-52-63-64-68-71-72-73-76</td><td>4x</td>
</tr> 
....(And so on)

I'd like to be able to extract the data from this table.

Answer 1

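This happens because the site makes a second call to load the results. The initial link loads only the page, not the results. Inspect the requests with the Chrome dev tools and you'll be able to find the request you need to replicate to get the results.

This means that to get the results you can call that request directly, without loading the web page at all.

Fortunately, the endpoint you have to call already returns nicely formatted JSON.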

GET https://www.masslottery.com/data/json/search/dailygames/history/15/201903.json?_=1555083561238

I assume 1555083561238 is a timestamp.
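For what it's worth, here is a minimal sketch of calling that endpoint with only the standard library. The helper names (`build_history_url`, `fetch_history`, `decode_millis`) are made up for illustration. The URL layout is an assumption inferred from the single example above: `15` looks like the `game_id` and `201903` like the year and month of `selected_date`. The `_` value does decode to a plausible date when read as milliseconds since the Unix epoch (12 April 2019, UTC), which fits a cache-busting parameter of the kind jQuery appends, so the sketch simply sends the current time. Nothing is assumed about the structure of the JSON that comes back.

```python
import json
import time
import urllib.request
from datetime import datetime, timezone

# Base path taken from the request shown above.
BASE = "https://www.masslottery.com/data/json/search/dailygames/history"


def build_history_url(game_id, year, month, cache_buster=None):
    """Build the JSON URL. The game_id/YYYYMM layout is inferred from one example."""
    if cache_buster is None:
        cache_buster = int(time.time() * 1000)  # current time in milliseconds
    return f"{BASE}/{game_id}/{year:04d}{month:02d}.json?_={cache_buster}"


def fetch_history(game_id, year, month):
    """Download and decode the JSON document; its structure is not assumed here."""
    with urllib.request.urlopen(build_history_url(game_id, year, month)) as resp:
        return json.load(resp)


def decode_millis(ms):
    """Interpret a number as milliseconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```

Calling `fetch_history(15, 2019, 3)` should return the decoded document if the assumptions hold; print its top-level keys before relying on any particular field.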

Answer 2

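The page is dynamic, so it is rendered after you make the request. You can either a) use JC1's solution and access the JSON response, or b) use Selenium to simulate opening a browser, let it render the page, and then grab the table: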

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11'

# Open a real browser so the page's JavaScript runs
driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source
driver.quit()

soup = BeautifulSoup(page, "lxml")

# Every <table> on the rendered page
all_tables = soup.find_all('table')

# The table that holds the draw results
right_table = soup.find('table', class_='zebra-body-only')
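Once the table has actually been rendered, pulling the rows out is straightforward. The following is only a sketch: `table_to_records` is a hypothetical helper, and the sample markup is the fragment posted in the question, closed by hand so that it parses. It uses the built-in `html.parser` so the sample runs without lxml.

```python
from bs4 import BeautifulSoup


def table_to_records(table):
    """Turn a <table> Tag into a list of dicts keyed by the header cells."""
    rows = table.find_all("tr")
    if not rows:
        return []
    # The first row carries the <th> header cells.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Skip rows whose cell count does not match the header.
        if len(cells) == len(headers):
            records.append(dict(zip(headers, cells)))
    return records


# Markup copied from the question, with the closing tags added by hand.
SAMPLE = """
<table cellspacing="0" class="zebra-body-only">
<tbody id="target-area">
<tr class="odd"><th>Draw #</th><th>Draw Date</th><th>Winning Number</th><th>Bonus</th></tr>
<tr><td>2107238</td><td>03/04/2019</td>
<td>01-04-05-16-23-24-27-32-34-41-42-44-47-49-52-55-63-65-67-78</td><td>No Bonus</td></tr>
<tr class="odd"><td>2107239</td><td>03/04/2019</td>
<td>04-05-11-15-19-20-23-24-25-28-41-45-52-63-64-68-71-72-73-76</td><td>4x</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
records = table_to_records(soup.find("table", class_="zebra-body-only"))
print(records[0]["Draw #"], records[0]["Bonus"])
```

With the real page you would pass `right_table` to the helper instead of the sample.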

As a side note: usually, when I see <table> tags, I let Pandas do the work for me (note that I've been blocked from accessing the site, so I can't test this):

from io import StringIO

import pandas as pd
from selenium import webdriver

url = 'https://www.masslottery.com/games/lottery/search/results-history.html?game_id=15&mode=2&selected_date=2019-03-04&x=12&y=11'

driver = webdriver.Chrome()
driver.get(url)
page = driver.page_source
driver.quit()

# read_html returns a list of DataFrames, one per <table>;
# recent pandas versions expect a file-like object rather than a raw HTML string
tables = pd.read_html(StringIO(page))

# choose the DataFrame you want from the list by its position
df = tables[0]

Answer 3

Yes, I would save the data you get to a file to see whether what you're looking for is really there: with open('stuff.html', 'w') as f: f.write(response.text)

For unicode, try: import codecs, then with codecs.open(fp, 'w', 'utf-8') as f:
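Roughly, the save-and-inspect step could look like the sketch below. `download` and `save_and_check` are made-up names, and using `"<td>"` as the marker is just one possible way of asking "did any result rows arrive?". In Python 3, passing `encoding="utf-8"` when writing covers the unicode case without needing `codecs`.

```python
import urllib.request
from pathlib import Path


def download(url):
    """Fetch a page and decode it as UTF-8, replacing undecodable bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def save_and_check(html, path, marker):
    """Write html to path as UTF-8 and report whether marker occurs in it."""
    Path(path).write_text(html, encoding="utf-8")
    return marker in html
```

For example, `save_and_check(download(url), "stuff.html", "<td>")` returning `False` would mean the raw response contains no table cells, which matches the empty table shown in the question.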

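If you don't see what you're looking for, you'll have to figure out the correct URL that the page is loading. Check the Chrome developer tools; this is usually the hard part.

The easy way is to use Selenium and make sure you wait until what you're looking for appears on the page (it's dynamic).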
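A sketch of that waiting step, using Selenium's explicit waits. The function name `fetch_rendered_html` is made up, and the selector `#target-area tr` is an assumption based on the `<tbody id="target-area">` markup in the question; the 15-second timeout is arbitrary. I haven't run this against the site.

```python
def fetch_rendered_html(url, row_selector="#target-area tr", timeout=15):
    """Open url in Chrome, wait for a result row, and return the rendered HTML."""
    # Imported here so this module still loads on machines without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Blocks until at least one matching element exists;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, row_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned string can then be handed to BeautifulSoup or pandas in place of the HTML fetched with `urllib`.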

Unable to scrape table data from a website using BeautifulSoup
