Handling a pound sign (#) in Python requests

Question

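I'm using requests to build a custom URL, and one of the parameters contains a pound sign. Can anyone explain how to pass that parameter without the pound sign being encoded?

This returns the correct CSV file: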

import io

import pandas as pd
import requests

results_url = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2019%7C&hfSit=&player_type=batter&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfPull=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_abs=0&type=#results'
results = requests.get(results_url, timeout=30).content
results_df = pd.read_csv(io.StringIO(results.decode('utf-8')))

This does not:

from requests import get

URL = 'https://baseballsavant.mlb.com/statcast_search/csv?'

def _get_statcast(params):
    # requests percent-encodes the values in params, so '#' goes out as %23
    _get = get(URL, params=params, timeout=30)
    _get.raise_for_status()
    return _get.content

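The problem seems to be that when "#results" is passed through requests, everything after the "#" is ignored, which causes the wrong CSV to be downloaded. If anyone has ideas for another way to solve this, I would appreciate it.

EDIT2: I also asked this question on the Python forum: https://python-forum.io/Thread-Handling-pound-sign-within-custom-URL?pid=75946#pid75946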

Answer 1

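Basically, nothing after a literal pound sign in a URL is sent to the server. This applies to both browsers and requests.

The format of your URL suggests that the type=#results part is actually a query parameter.

requests automatically encodes query parameters, whereas a browser does not. Below are various queries and what the server receives in each case: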

URL parameter in the browser

When a pound sign is used in the browser, nothing after it is sent to the server:

https://httpbin.org/anything/type=#results

Returns:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7", 
    "Cache-Control": "max-age=0", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "*redacted*"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "*redacted*", 
  "url": "https://httpbin.org/anything/type="
}

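The URL received by the server is https://httpbin.org/anything/type=.
The requested page is called type=, which does not seem right.

Query parameter in the browser

The <key>=<value> format suggests that this may be a query parameter you want to pass. Even so, nothing after the pound sign is sent to the server:

https://httpbin.org/anything?type=#results

Returns: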

{
  "args": {
    "type": ""
  }, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "*redacted*"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "*redacted*", 
  "url": "https://httpbin.org/anything?type="
}

The URL received by the server is https://httpbin.org/anything?type=.
The requested page is called anything.
An argument type is received with no value.
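
The split described above can be reproduced offline with the standard library. This is a minimal sketch using urllib.parse, not what the browser runs internally. It only shows how a URL containing a literal # breaks down into path, query and fragment:

```python
from urllib.parse import urlsplit, parse_qs

url = "https://httpbin.org/anything?type=#results"
parts = urlsplit(url)

print(parts.path)      # /anything
print(parts.query)     # type=
print(parts.fragment)  # results

# keep_blank_values=True keeps "type" even though its value is empty
args = parse_qs(parts.query, keep_blank_values=True)
print(args)            # {'type': ['']}
```

The fragment ("results") is a separate component from the query, which matches the empty "type" argument in the response above.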

Encoded query parameter in the browser

https://httpbin.org/anything?type=%23results

Returns:

{
  "args": {
    "type": "#results"
  }, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "*redacted*"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "*redacted*", 
  "url": "https://httpbin.org/anything?type=%23results"
}

The URL received by the server is https://httpbin.org/anything?type=%23results.
The requested page is called anything.
An argument type is received with the value #results.
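
To see why the encoded form survives, here is a small sketch that percent-encodes the value with urllib.parse.quote and parses the resulting URL again. The variable names are illustrative only:

```python
from urllib.parse import quote, urlsplit, parse_qs

value = "#results"
encoded = quote(value, safe="")  # '#' becomes %23
url = f"https://httpbin.org/anything?type={encoded}"

parts = urlsplit(url)
decoded_args = parse_qs(parts.query)

print(url)             # https://httpbin.org/anything?type=%23results
print(repr(parts.fragment))  # '' (no fragment, so nothing is cut off)
print(decoded_args)    # {'type': ['#results']}
```

Because %23 is not a literal pound sign, the URL has no fragment and the whole value stays inside the query string.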

Python requests with a URL parameter

requests does not send anything after the pound sign to the server either:

import requests

r = requests.get('https://httpbin.org/anything/type=#results')
print(r.url)
print(r.json())

Returns:

https://httpbin.org/anything/type=#results
{
    "args": {},
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
    },
    "json": null,
    "method": "GET",
    "origin": "*redacted*",
    "url": "https://httpbin.org/anything/type="
}

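The URL received by the server is https://httpbin.org/anything/type=.
The requested page is called type=, which does not seem right.

Python requests with a query parameter

requests automatically encodes query parameters: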

import requests

r = requests.get('https://httpbin.org/anything', params={'type': '#results'})
print(r.url)
print(r.json())

Returns:

https://httpbin.org/anything?type=%23results
{
    "args": {
        "type": "#results"
    },
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
    },
    "json": null,
    "method": "GET",
    "origin": "*redacted*",
    "url": "https://httpbin.org/anything?type=%23results"
}

The URL received by the server is https://httpbin.org/anything?type=%23results.
The requested page is called anything.
An argument type is received with the value #results.
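
The same encoding can be illustrated without a network call. The sketch below uses the standard library's urlencode as a stand-in for what requests does with a params dict; treat it as an approximation of the behaviour shown above rather than as requests' exact code path. The hfSea value is taken from the question's URL:

```python
from urllib.parse import urlencode

# '|' and '#' are both reserved characters and get percent-encoded
params = {"hfSea": "2019|", "type": "#results"}
query = urlencode(params)

print(query)  # hfSea=2019%7C&type=%23results
```

This also explains the %7C sequences in the question's working URL: they are encoded pipe characters.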

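Python requests with a double-encoded query parameter

If you encode the query parameter manually and then pass it to requests, it will encode the already-encoded query parameter a second time: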

import requests

r = requests.get('https://httpbin.org/anything', params={'type': '%23results'})
print(r.url)
print(r.json())

Returns:

https://httpbin.org/anything?type=%2523results
{
    "args": {
        "type": "%23results"
    },
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
    },
    "json": null,
    "method": "GET",
    "origin": "*redacted*",
    "url": "https://httpbin.org/anything?type=%2523results"
}

The URL received by the server is https://httpbin.org/anything?type=%2523results.
The requested page is called anything.
An argument type is received with the value %23results.
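
Double encoding is easy to reproduce with urllib.parse alone. A minimal sketch, with illustrative variable names:

```python
from urllib.parse import quote, unquote

once = quote("#results", safe="")  # '#' -> %23
twice = quote(once, safe="")       # '%' -> %25, giving %2523

print(once)                     # %23results
print(twice)                    # %2523results
print(unquote(twice))           # %23results (what the server decodes)
print(unquote(unquote(twice)))  # #results
```

The server decodes only once, so it ends up with the literal text %23results instead of #results.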

Answer 2

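The answer by Cloudomation provides a lot of interesting information, but I think it may not be what you are looking for. Assuming that this identical thread on the Python forum was also written by you, read on:

From the information you provided, it seems that type=#results is being used to filter the original CSV and return only part of the data.
If that is the case, the type= part is not really necessary (try the URL without it and check that you get the same result).

Let me explain:

The # symbol in URLs is called a fragment identifier, and it serves different purposes in different types of pages. In text/csv pages, it can filter the CSV table by column, by row, or by some combination of both. You can read more about it here.

In your case, results may be a query parameter used to filter the CSV table in a custom way.

Unfortunately, as Cloudomation's answer shows, fragment data is not available on the server side, so you will not be able to access the data through Python request parameters in the way you are trying to.

You can try accessing it with JavaScript as suggested here, or simply download the whole (unfiltered) CSV table and filter it yourself.

There are many ways to do this easily and efficiently in Python. Look here for more information or, if you need more control, import the CSV into a pandas DataFrame.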
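
As a rough illustration of filtering the table yourself, the sketch below uses only the standard library's csv module. The column names and sample rows are hypothetical stand-ins for the downloaded data, not the real Statcast schema:

```python
import csv
import io

# Hypothetical sample standing in for the downloaded CSV text
csv_text = "player_name,pitches\nA,10\nB,25\nC,7\n"

reader = csv.DictReader(io.StringIO(csv_text))
# Keep only rows with at least 10 pitches
rows = [row for row in reader if int(row["pitches"]) >= 10]
names = [row["player_name"] for row in rows]

print(names)  # ['A', 'B']
```

With real data you would pass the decoded response body in place of csv_text. pandas offers the same kind of filtering with boolean indexing once the CSV is loaded into a DataFrame.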

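Edit:

I see that you found a workaround by joining strings and sending a second request. Since that works, you can probably get away with converting the parameters to a string (as suggested here). If you do, you get a more efficient solution, and perhaps a slightly more elegant one: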

from requests import session

statcast_url = 'https://baseballsavant.mlb.com/statcast_search/csv?'  # base URL from the question
params = {'key1': 'value1', 'key2': 'value2'}  # sample params dict

def _get_statcast_results(params):
    # convert params to a string - alternatively you can use %-formatting
    params_str = "&".join(f"{k}={v}" for k, v in params.items())

    s = session()
    data = s.get(statcast_url, params=params_str, timeout=30)

    return data.content

Answer 3

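I have only run one trial, but I hope I have found a solution. Instead of passing "#results" through params, I start a session with the base URL and all the other parameters, append "#results" to the resulting URL, and then run that through a second get.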

from requests import session

statcast_url = 'https://baseballsavant.mlb.com/statcast_search/csv?'
results_url = '&type=#results&'

def _get_statcast_results(params):

    s = session()
    _get = s.get(statcast_url, params=params, timeout=30, allow_redirects=True)

    new_url = _get.url+results_url
    data = s.get(new_url, timeout=30)

    return data.content

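More testing is still needed, but I think this should work. Thanks to everyone who chimed in. Even though I did not get a direct answer, the replies still helped a lot.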
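
One way to sanity-check this workaround offline is to look at what the joined URL contains once the fragment is stripped. The sketch below assumes that requests follows standard URL parsing and drops the fragment, as Answer 1 demonstrates. player_type=batter is a stand-in for the full parameter list:

```python
from urllib.parse import urldefrag

# Stand-in for _get.url + results_url
new_url = ("https://baseballsavant.mlb.com/statcast_search/csv"
           "?player_type=batter" + "&type=#results&")

without_fragment, fragment = urldefrag(new_url)

print(without_fragment)  # https://baseballsavant.mlb.com/statcast_search/csv?player_type=batter&type=
print(fragment)          # results&
```

If that assumption holds, the second request reaches the server with an empty type= parameter. That would be consistent with Answer 2's suggestion that the fragment itself is not needed to get the right CSV.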
