Question

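For days I've been trying to get some data from a website that uses an ASMX POST request to retrieve the data I want. I've tried PHP cURL, Python, and now an HTML parser, but still no luck... The POST request is: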

https://sports-itainment.biahosted.com/WebServices/SportEvents.asmx/GetEvents

{"champIds":["38"],"eventIds":[],"dateFilter":"All","marketsId":-1,"skinId":"betrebels"}
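For reference, the request above could be reproduced from Python roughly as follows. This is only a minimal sketch using the standard library. It assumes the endpoint expects a JSON body sent with a `Content-Type: application/json` header, which is common for ASMX script services but is not confirmed anywhere in this question. `build_get_events_request` and `fetch_events` are hypothetical helper names, and the server may additionally require cookies or other headers.

```python
import json
import urllib.request

URL = "https://sports-itainment.biahosted.com/WebServices/SportEvents.asmx/GetEvents"

# Payload copied verbatim from the question.
PAYLOAD = {
    "champIds": ["38"],
    "eventIds": [],
    "dateFilter": "All",
    "marketsId": -1,
    "skinId": "betrebels",
}


def build_get_events_request(payload=PAYLOAD, url=URL):
    """Build (but do not send) a JSON POST request for the GetEvents endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        # Assumption: the service wants JSON; this is not stated in the question.
        "Content-Type": "application/json; charset=utf-8",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def fetch_events(timeout=30):
    """Send the request and return the decoded JSON response (does network I/O)."""
    request = build_get_events_request()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_events()` performs the actual network request. If the server rejects it, comparing these headers with the ones the browser sends (visible in the developer tools' network tab) is a reasonable next step.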

After many attempts, I found that this link gives me the data I want:

https://sports-itainment.biahosted.com/generic/prelive.aspx?token=&clientTimeZoneOffset=-180&lang=en-Gb&walletcode=508729&skinid=betrebels&parentUrl=https://ps.equalsystem.com/ps/game/BIASportbook.action#sportids=&catids=28&champids=91

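But when I try to open it with cURL, or simply parse it with simple_html_dom, it doesn't show the data; it only shows some text. Any idea how I can get it? I have more than 50 files trying different approaches with no result, so it's hard to post my code.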

Answer 1

I know this question is tagged php, but you seem open to using Python as well, so I hope this answer meets your needs!

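The problem you're running into is that the site is generated dynamically (its content is loaded by JavaScript after the page itself loads). So your earlier attempts to load the page in Python (with requests, as you mentioned) did work, but the HTML they returned didn't contain any of the data!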

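To scrape the site you linked to in your question, I strongly recommend using the Python Selenium module paired with PhantomJS. This SO question has some good answers on how to set up PhantomJS with Selenium. PhantomJS lets the page load fully (including the JS that actually populates it with the table data you want).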

Then, once both dependencies are installed, you can run the following code:

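from selenium import webdriver
from bs4 import BeautifulSoup

# PhantomJS renders the page, including the JavaScript that fills in the tables.
# Note: webdriver.PhantomJS was removed in Selenium 4, so this needs an older Selenium release.
driver = webdriver.PhantomJS()
driver.get('https://sports-itainment.biahosted.com/generic/prelive.aspx?token=&clientTimeZoneOffset=-180&lang=en-Gb&walletcode=508729&skinid=betrebels&parentUrl=https://ps.equalsystem.com/ps/game/BIASportbook.action#sportids=&catids=28&champids=91')
# Parse the rendered HTML; naming the parser explicitly avoids a bs4 warning.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
tables = soup.find_all('tbody')
print(tables)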

Then use BeautifulSoup to work with the page!
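As a rough illustration of what working with the parsed page might look like, here is a small sketch that pulls the cell text out of every `<tbody>` row. The sample markup and the `extract_rows` helper are hypothetical; the real page's table structure was not shown in the question and may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>Team A - Team B</td><td>1.85</td><td>3.40</td></tr>
    <tr><td>Team C - Team D</td><td>2.10</td><td>3.10</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return the cell text of every <tbody> row as a list of lists."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tbody in soup.find_all("tbody"):
        for tr in tbody.find_all("tr"):
            # Collect the stripped text of each data or header cell in the row.
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            if cells:
                rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

With the Selenium code above, passing `driver.page_source` to `extract_rows` (before quitting the driver) would apply the same extraction to the rendered page.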

If you need more, this is a good source of further information:

scrape html generated by javascript with python

Hope it helps!

How can I scrape page data dynamically?
