Working web crawler suddenly stopped working

Date: 2019-07-20 20:18:57

Tags: python web beautifulsoup web-crawler

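I was following this tutorial, and the code ran fine.

Now, after finishing some other projects, I came back and wanted to re-run the same code. Suddenly I got an error message that forced me to add features="html.parser" to the BeautifulSoup call that creates the soup variable.

I did that, but now when I run the code, nothing happens at all. Why is that, and what am I doing wrong?

I checked whether I had somehow uninstalled the beautifulsoup4 module, but no, it is still there. I retyped the whole code from scratch, but that did not help either.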

import requests
from bs4 import BeautifulSoup

def spider():
    url = "https://www.amazon.de/s?k=laptop+triton&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss"
    source = requests.get(url)
    plain_text = source.text
    soup = BeautifulSoup(plain_text, features="html.parser")

    for mylink in soup.findAll('img', {'class':'s-image'}):
        mysrc = mylink.get('src')
        print(mysrc)

spider()

Ideally, I want the crawler to print about 10 to 20 lines of src="..." from the Amazon page in question. This code worked just a few hours ago...

1 answer:

Answer 0 (score: 1)

The solution is to add headers={'User-Agent':'Mozilla/5.0'} to requests.get() (without it, Amazon does not send the correct page):

import requests
from bs4 import BeautifulSoup

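def spider():
    url = "https://www.amazon.de/s?k=laptop+triton&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss"
    # Send a browser-like User-Agent; without it, Amazon does not return the normal results page.
    source = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
    plain_text = source.text
    soup = BeautifulSoup(plain_text, features="html.parser")

    # Each search-result thumbnail is an <img> tag with class "s-image".
    # find_all is the current name for the older findAll alias.
    for mylink in soup.find_all('img', {'class':'s-image'}):
        mysrc = mylink.get('src')
        print(mysrc)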

spider()

Prints:

https://m.media-amazon.com/images/I/71YPEDap2lL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81fyVgZuQxL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71VmlANJMOL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71rAT5E7DfL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71cEKKNfb3L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/61aWXuLIEBL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71B7NyjuU9L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81s822PQUcL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71fBKuAiQzL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71hXTUR-oRL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81-Lf6jX-OL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81B85jUARqL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/8140E7+uhZL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/8140E7+uhZL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71ROCddvJ2L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71ROCddvJ2L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/41bB8HuoBYL._AC_UL436_.jpg
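
When a scraper like this silently prints nothing, it can help to look at the response itself before blaming the parser. The following is only a rough sketch of that idea: the helper names diagnose and fetch_and_diagnose, the 10-second timeout, and the choice of "s-image" as the marker string are assumptions made for illustration, not part of the original answer.

```python
def diagnose(status_code, html, marker="s-image"):
    """Return a short, human-readable guess at why no images were found."""
    if status_code != 200:
        return f"HTTP {status_code}: the server did not return the normal page"
    if marker not in html:
        return f"HTTP 200, but '{marker}' is missing: probably not the expected page"
    return "response looks like the expected page"


def fetch_and_diagnose(url, user_agent=None):
    """Fetch url (optionally with a User-Agent header) and describe the response."""
    # Imported here so diagnose() can be used without requests installed.
    import requests

    # With no user_agent given, requests falls back to its own default headers.
    headers = {"User-Agent": user_agent} if user_agent else {}
    response = requests.get(url, headers=headers, timeout=10)
    return diagnose(response.status_code, response.text)
```

Calling fetch_and_diagnose(url) once without a user agent and once with user_agent="Mozilla/5.0" should show whether the header changes what the server sends back; the exact status codes and page contents depend on the site and may change over time.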