Python requests POST: how do I get the correct table data for my request?

Asked: 2020-02-04 19:02:43

Tags: post web-scraping html-table beautifulsoup python-requests

I am trying to get historical economic calendar data from https://www.investing.com/economic-calendar/ for a specific date range (February 1, 2020 to February 5, 2020).

Today is February 4, 2020.

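If I use the URL https://www.investing.com/economic-calendar/ as below, I can extract the table with BeautifulSoup, but I cannot select any date other than the current one. The script below saved the table for today (February 4, 2020).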

import requests
import pandas as pd
from bs4 import BeautifulSoup

payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
                "dateFrom":"2020-02-01",
                "dateTo":"2020-02-05",
                "timeZone":"8",
                "timeFilter":"timeRemain",
                "currentTab":"custom",
                "limit_from":"0"}

urlheader = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

url = "https://www.investing.com/economic-calendar/"

req = requests.post(url, data=payload, headers=urlheader)
print(req)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")

The table variable looks like this: [image: table variable]

I can see that whenever I change the date range or the filter settings on the page, the browser sends a POST request to "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData".

This is the request data I found: [image: request data]

This is the POST link: [image: post link]

So I switched to the code below, because I want to choose the dates.

import requests
import pandas as pd
from bs4 import BeautifulSoup

payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
                "dateFrom":"2020-02-01",
                "dateTo":"2020-02-05",
                "timeZone":"8",
                "timeFilter":"timeRemain",
                "currentTab":"custom",
                "limit_from":"0"}

urlheader = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"

req = requests.post(url, data=payload, headers=urlheader)
print(req)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")

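This time, however, there is no element with the id economicCalendarData, so the table variable is empty (None). The soup variable does contain data, but not the table data.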

This is the table I want to save: [image: table to save]

As I said above, if I use https://www.investing.com/economic-calendar/ as the URL, I only get the table for the current day (February 4, 2020), no matter which dates I put in the payload (dateFrom, dateTo).

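For some reason, when I POST to https://www.investing.com/economic-calendar/Service/getCalendarFilteredData instead, the table comes back empty. The soup variable contains data, but not the data I requested. What am I doing wrong? How can I save the table for the dates I choose?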

1 answer:

Answer 0 (score: 2)

You are really close. If I understand your requirement correctly, the following should get you there:

import requests
from bs4 import BeautifulSoup

url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"

payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
                "dateFrom":"2020-02-01",
                "dateTo":"2020-02-05",
                "timeZone":"8",
                "timeFilter":"timeRemain",
                "currentTab":"custom",
                "limit_from":"0"}

req = requests.post(url, data=payload, headers={
    "User-Agent":"Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest"
    })
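# The service answers with JSON; the HTML rows are stored under the "data" key.
soup = BeautifulSoup(req.json()['data'],"lxml")
# Print the stripped text of every header/data cell, one list per table row.
for items in soup.select("tr"):
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)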