Question

When I try to send a request to this website:

import requests
requests.get('https://www.ldoceonline.com/')

it raises an exception:

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

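Oddly, the site works fine and responds quickly when you visit it the usual way (through a browser). This response only occurs when you try to retrieve information with web-scraping techniques.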

Any ideas on how to scrape it successfully?

Answer 1

Try sending request headers to get a valid response.

Example:

import requests

res = requests.get('https://www.ldoceonline.com/', headers={"User-Agent": "Mozilla/5.0"})
print(res.status_code)

Answer 2

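If you inspect the source code of the requests module, you will find the default headers used when making a request. The User-Agent header mentioned above is among them.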
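
As a quick way to see those defaults without reading the source, the sketch below prints what requests would send. It assumes the `requests.utils.default_headers()` helper available in current releases of the library; the exact version string depends on the installed version.

```python
import requests

# Headers that requests attaches to every request unless they are overridden.
defaults = requests.utils.default_headers()

for name, value in defaults.items():
    print(f"{name}: {value}")

# The User-Agent identifies the client as python-requests/<installed version>.
default_user_agent = defaults["User-Agent"]
```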

When the User-Agent header is set to "python-requests/2.21.0", a number of web resources (intentionally or not) seem to fail to handle the request properly.

So the practical solution is to use a custom User-Agent header. User-agent strings for different browsers are listed here.

import requests

url = 'https://www.ldoceonline.com/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"}

r = requests.get(url, headers=headers)
r.raise_for_status()
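
If several requests go to the same site, setting the header once on a session avoids repeating it on every call. This is a minimal sketch and not part of the original answer: the helper name `make_session` is hypothetical, and the user-agent string is simply the one from the example above.

```python
import requests

# Browser-like User-Agent copied from the example above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
)


def make_session(user_agent=BROWSER_UA):
    """Return a Session that sends `user_agent` on every request (hypothetical helper)."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()

# Build the request without sending it, to confirm which header would go out.
prepared = session.prepare_request(
    requests.Request("GET", "https://www.ldoceonline.com/")
)
print(prepared.headers["User-Agent"])

# To actually fetch the page:
# r = session.get("https://www.ldoceonline.com/")
# r.raise_for_status()
```

Preparing the request first is only a way to check the outgoing header offline; in normal use `session.get(...)` is enough.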

Can't request.get() a website, "Remote end closed connection without response"
