Question

I am trying to fetch data from this API endpoint:
https://api.etherscan.io/api?module=account&action=tokentx&contractaddress=0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2&page=1&offset=100&sort=asc&apikey=YourApiKeyToken
However, I always get an error message when I run the following code:

import pandas as pd
import json
import urllib.request
from urllib.request import FancyURLopener

url = 'https://api.etherscan.io/api?module=account&action=tokentx&contractaddress=0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2&page='
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)     Chrome/37.0.2049.0 Safari/537.36'}
request_interval = 2  # interval

urls = []
df = []
if __name__ == '__main__':
    for i in range(1, 2):
        url = urllib.parse.urljoin(url, '&page='+str(i)+'&offset=10000&sort=asc&apikey=YourApiKeyToken')
        urls.append(str(url))

    for url in urls:
        headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}
        request = urllib.request.Request(url=url, headers=headers)
        html = urllib.request.urlopen(request).read()
        result = json.loads(html.decode('utf-8'))['blockNumber']
        df.extend(json.loads(html.decode('utf-8'))['blockNumber'])
        print('Completed URL : ', url)

pdf = pd.DataFrame(df)

pdf.to_csv("output.csv")

I tried some of the solutions I found on Stack Overflow:
urllib2.HTTPError: HTTP Error 400: Bad Request - Python
urllib2 HTTP Error 400: Bad Request

I also changed

headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}

to

{'Authorization': auth,
             'Content-Type':'application/json',
             'Accept':'application/json'}

but I still get the same error.

Thanks

Answer 1

urljoin does not do what you are trying to use it for.

From the docs:

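Construct a full ("absolute") URL by combining a "base URL" (base) with another URL (url). Informally, this uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL. For example: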
>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'

As far as I can tell, it is not meant for combining a URL's query parameters.

As a result, the URL you get back from urljoin looks like

https://api.etherscan.io/&page=1&offset=10000&sort=asc&apikey=YourApiKeyToken

which is wrong: the /api path and the original query string (module, action, contractaddress) are gone.
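
This can be checked directly by calling urljoin on the two pieces from the question. A minimal sketch, assuming Python 3's urllib.parse; the variable names base, relative and joined are only illustrative:

```python
from urllib.parse import urljoin

base = ('https://api.etherscan.io/api?module=account&action=tokentx'
        '&contractaddress=0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2&page=')
relative = '&page=1&offset=10000&sort=asc&apikey=YourApiKeyToken'

# The second argument contains no '?', so urljoin treats it as a relative
# *path*. It replaces the last path segment of the base ('api'), and the
# base's query string is discarded.
joined = urljoin(base, relative)
print(joined)
# https://api.etherscan.io/&page=1&offset=10000&sort=asc&apikey=YourApiKeyToken
```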

Use string concatenation instead. In the first for loop, change

url = urllib.parse.urljoin(url, '&page='+str(i)+'&offset=10000&sort=asc&apikey=YourApiKeyToken')

to

url = url + str(i) + '&offset=10000&sort=asc&apikey=YourApiKeyToken'

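Note that you are reassigning the main url variable inside the for loop. On the next iteration, the page and offset parameters would therefore be appended to the URL already built in the first iteration.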
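
The effect of the reassignment only shows up once the loop runs more than once. A small illustration, assuming a shortened placeholder URL (example.invalid is hypothetical, not the real endpoint) and range(1, 3) so that two iterations happen:

```python
base = 'https://example.invalid/api?page='  # hypothetical placeholder URL

# Reassigning `url` inside the loop: each pass builds on the previous result.
url = base
reassigned = []
for i in range(1, 3):
    url = url + str(i) + '&offset=10'
    reassigned.append(url)

# Leaving the base untouched: each pass starts from the same prefix.
fresh = [base + str(i) + '&offset=10' for i in range(1, 3)]

print(reassigned[1])  # https://example.invalid/api?page=1&offset=102&offset=10
print(fresh[1])       # https://example.invalid/api?page=2&offset=10
```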

So, in addition to the change above, instead of

for i in range(1, 2):
    url = urllib.parse.urljoin(url, '&page='+str(i)+'&offset=10000&sort=asc&apikey=YourApiKeyToken')
    urls.append(str(url))

you can write

for i in range(1, 2):
    urls.append(url + str(i) + '&offset=10000&sort=asc&apikey=YourApiKeyToken')

Also note that the first loop only runs once: the stop value of range is exclusive, so list(range(1, 2)) is [1], not [1, 2].
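
As an alternative to manual concatenation, the query string can be built from a dict with urllib.parse.urlencode, which avoids both the urljoin problem and the reassignment problem. This is only a sketch: build_page_url is a hypothetical helper name, the parameter values are copied from the question, and range(1, 3) is chosen here just to show two pages being generated.

```python
from urllib.parse import urlencode

BASE = 'https://api.etherscan.io/api'


def build_page_url(page, offset=10000, apikey='YourApiKeyToken'):
    """Hypothetical helper: return the tokentx URL for a single page."""
    params = {
        'module': 'account',
        'action': 'tokentx',
        'contractaddress': '0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2',
        'page': page,
        'offset': offset,
        'sort': 'asc',
        'apikey': apikey,
    }
    # urlencode joins the key/value pairs with '&' and escapes them if needed.
    return BASE + '?' + urlencode(params)


# The stop value is exclusive, so range(1, 3) covers pages 1 and 2.
urls = [build_page_url(page) for page in range(1, 3)]
for u in urls:
    print(u)
```

Because the base URL is never reassigned, each generated URL differs only in its page value.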
