HTTP Error 400: Bad Request (urllib)

Date: 2016-06-18 19:17:15

Tags: python beautifulsoup urllib

I am writing a script to get information about buildings in New York City. I know my code runs and returns the data I want; I tested it earlier with manually entered input and it worked. Now that I am trying to have it read addresses from a text file and use that information to access the website, I get this error:

urllib.error.HTTPError: HTTP Error 400: Bad Request

I think it has to do with the website not liking heavy traffic from non-browsers. I have heard about user agents but don't know how to use them. Here is my code:

from bs4 import BeautifulSoup
import urllib.request

f = open("FILE PATH GOES HERE")

def getBuilding(link):
    r = urllib.request.urlopen(link).read()
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find("b",text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)


def main():
    for line in f:
        num, name = line.split(" ", 1)
        newName = name.replace(" ", "+")
        link = "LINK GOES HERE (constructed from num and newName variables)"
        getBuilding(link)      
    f.close()

if __name__ == "__main__":
    main()
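As a side note on how the link is built: replacing spaces by hand only covers one of the characters that must be escaped in a URL, and an unescaped character in an address is itself a common cause of a 400 response. A minimal sketch using the standard library's `urllib.parse.quote_plus` (the address string below is made up for illustration):

```python
from urllib.parse import quote_plus

# A hypothetical address containing "&", which a plain
# str.replace(" ", "+") would leave unescaped in the URL.
name = "5th Ave & 34th St"

# quote_plus encodes spaces as "+" and escapes reserved characters.
newName = quote_plus(name)
print(newName)  # 5th+Ave+%26+34th+St
```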

1 Answer:

Answer 0 (score: 2):

A 400 error means that the server cannot understand your request (e.g., malformed syntax). That said, it's up to the developers which status code they want to return and, unfortunately, not everyone strictly follows the intended meanings.

Check out this page for more details on HTTP Status Codes.
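Since urllib surfaces these statuses as `urllib.error.HTTPError`, you can also catch the exception yourself and inspect the numeric code and the server's reason phrase. A minimal sketch (the link handling is whatever your script already does):

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def fetch(link):
    try:
        return urlopen(link).read()
    except HTTPError as e:
        # e.code is the numeric status; e.reason is the server's phrase.
        print("Server returned %d: %s" % (e.code, e.reason))
        return None
```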

As for how to set a User-Agent: a user agent is set in the request header and, basically, identifies the client making the request. Here is a list of recognized User Agents. In Python 2 you will need to use urllib2 rather than urllib, but urllib2 is also a built-in package. I will show you how to update the getBuilding function to set the header with that module. I would also recommend checking out the requests library; I find it super straightforward, and it is highly adopted and well supported.

Python 2:

from urllib2 import Request, urlopen

def getBuilding(link):
    q = Request(link)
    # Identify the client as a browser so the server accepts the request
    q.add_header('User-Agent', 'Mozilla/5.0')
    r = urlopen(q).read()
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find("b", text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)

Python 3:

from urllib.request import Request, urlopen

def getBuilding(link):
    q = Request(link)
    # Identify the client as a browser so the server accepts the request
    q.add_header('User-Agent', 'Mozilla/5.0')
    r = urlopen(q).read()
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find("b", text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)

Note: The only difference between the Python 2 and Python 3 versions is the import statement.
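For completeness, here is the same function using the requests library mentioned above (a sketch, assuming requests and bs4 are installed; the keyword and link placeholders still come from your own script):

```python
import requests
from bs4 import BeautifulSoup

def getBuilding(link):
    # requests takes per-request headers as a plain dict.
    r = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
    soup = BeautifulSoup(r.text, "html.parser")
    print(soup.find("b", text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)
```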