Question

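I want to scrape the URL https://www.horsedeathwatch.com/index.php and dump the data into a pandas DataFrame.

The columns are Horse / Date / Course / Cause of Death. I tried calling pandas read_html directly on this URL, but it cannot find the table even though the page has a table tag.

I tried using: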

  import requests
  from bs4 import BeautifulSoup

  url = 'https://www.horsedeathwatch.com/index.php'
  # Create a handle, page, to hold the contents of the website
  page = requests.get(url)
  # print(page.text)
  soup = BeautifulSoup(page.content, 'lxml')

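and then the findall('tr') method, but for some reason it does not work.

The second thing I want to do: each Horse (the first column of the table on the web page) has a hyperlink that leads to additional attributes.

Any suggestions on how to retrieve those additional attributes into a pandas DataFrame?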

Answer 1

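On the site, I can see that the data is loaded by a POST request to /loaddata.php that passes the page number. That is why index.php itself contains no table rows for read_html to find. Combining this with pandas.read_html: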

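from io import StringIO

import pandas
import requests

# The table rows come from a POST to loaddata.php; 'page' selects the page.
res = requests.post('https://www.horsedeathwatch.com/loaddata.php', data={'page': '3'})
# read_html returns a list of DataFrames, one per table found.
tables = pandas.read_html(StringIO(res.text))
df = tables[0]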
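To collect more than one page, the same POST request can be repeated in a loop and the results concatenated. The following is only a sketch under a few assumptions that the site itself does not confirm here: that page numbers start at 1, that each response contains a single table, and that a response without a table means there are no more pages. The helper names parse_page and fetch_pages are made up for this example.

```python
from io import StringIO

import pandas as pd
import requests

LOAD_URL = "https://www.horsedeathwatch.com/loaddata.php"


def parse_page(html_text):
    """Return the first table in an HTML fragment as a DataFrame."""
    # read_html returns a list of DataFrames, one per <table> it finds.
    return pd.read_html(StringIO(html_text))[0]


def fetch_pages(first=1, last=3):
    """POST each page number to loaddata.php and stack the results."""
    frames = []
    for page in range(first, last + 1):
        res = requests.post(LOAD_URL, data={"page": str(page)}, timeout=30)
        res.raise_for_status()
        try:
            frames.append(parse_page(res.text))
        except ValueError:
            # read_html raises ValueError when it finds no table.
            # Assumption: that means we ran past the last page.
            break
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Calling fetch_pages(1, 3) would then return one DataFrame with the rows of pages 1 to 3, reindexed from 0.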

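BeautifulSoup may give you a richer data structure, though. If you want to extract more attributes for each horse, you need to get the 'href' of the anchor element and make another request. That one is a GET request, and you then need to parse the content of <div class="view"> in the response.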
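A rough sketch of that second step is shown here. It assumes the anchor sits in the first cell of each row, that the href may be relative to the site root, and that the detail page keeps its attributes as text inside <div class="view">. The function names are hypothetical, and the built-in 'html.parser' is used only so that no extra parser has to be installed.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.horsedeathwatch.com/"


def horse_links(table_html, base_url=BASE_URL):
    """Return (horse name, absolute detail URL) pairs from the table HTML."""
    soup = BeautifulSoup(table_html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        first_cell = row.find("td")  # header rows have no <td> and are skipped
        anchor = first_cell.find("a", href=True) if first_cell else None
        if anchor:
            name = anchor.get_text(strip=True)
            links.append((name, urljoin(base_url, anchor["href"])))
    return links


def view_text(detail_html):
    """Return the text pieces inside <div class="view">, or [] if absent."""
    soup = BeautifulSoup(detail_html, "html.parser")
    view = soup.find("div", class_="view")
    return list(view.stripped_strings) if view else []


def fetch_details(table_html):
    """GET every horse's page and collect its <div class="view"> text."""
    return {name: view_text(requests.get(url, timeout=30).text)
            for name, url in horse_links(table_html)}
```

Passing the text of the POST response (res.text) to fetch_details would give a dict keyed by horse name. How to split those text pieces into separate DataFrame columns depends on the actual layout of the detail page, which is not shown here.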
