Implementing a web crawler in Python

Date: 2021-02-04 05:11:55

Tags: python python-3.x web-scraping web-crawler

When I tried to run a simple web crawler in Colab, the following code gave me the syntax error shown below. How can I fix it so that it runs?

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page=1 
    while page <= max_pages:
      url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn='+ str(page)
      source_code= requests.get(url)
      plain_text=source_code.text
      soup = BeautifulSoup(plain_text)
        for link in soup.findALL('a', {'class':'s-item__title s-item__title--has-tags'})
          href = link.get('href')
          print(href)
        page+=1   

trade_spider(1)

Error:

File "<ipython-input-4-5d567ac26fb5>", line 11
    for link in soup.findALL('a', {'class':'s-item__title s-item__title--has-tags'})
    ^
IndentationError: unexpected indent
 

1 Answer:

Answer 0 (score: 0)

There are quite a few things wrong with this code, but I can help. The `for` loop has one extra level of indentation, so remove one indent from its start, and add a `:` at the end of the `for` statement. BeautifulSoup's method is also spelled `find_all`, not `findALL`. It also looks like you just copied this from the internet, but anyway. Here is the corrected code:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        # Print the href of every result-title link on the page.
        for link in soup.find_all('a', {'class': 's-item__title s-item__title--has-tags'}):
            href = link.get('href')
            print(href)
        page += 1

trade_spider(1)
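For reference, `find_all` is the BeautifulSoup 4 name (`findAll` survives only as a legacy alias, and `findALL` does not exist). When the `class` filter is a string containing spaces, bs4 matches it against the tag's full `class` attribute value. A tiny self-contained check, using made-up sample HTML:

import requests  # not needed here; shown only to mirror the spider's imports
from bs4 import BeautifulSoup

# Made-up one-line document mimicking an eBay result link.
html = '<a class="s-item__title s-item__title--has-tags" href="https://example.com/item">2" Butterfly Valve</a>'
soup = BeautifulSoup(html, 'html.parser')

# The space-separated string must match the tag's full class attribute exactly.
for link in soup.find_all('a', {'class': 's-item__title s-item__title--has-tags'}):
    print(link.get('href'))  # prints: https://example.com/item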

Edit: after running this code, I got this warning:

main.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 10 of the file main.py. To get rid of this warning, pass the additional argument 'features="html5lib"' to the BeautifulSoup constructor.

The corrected code is as follows:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        # Name the parser explicitly to silence the GuessedAtParserWarning.
        soup = BeautifulSoup(plain_text, features="html5lib")
        for link in soup.find_all('a', {'class': 's-item__title s-item__title--has-tags'}):
            href = link.get('href')
            print(href)
        page += 1

trade_spider(1)
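As a side note, `html5lib` is a third-party package (`pip install html5lib`), while `"html.parser"` ships with Python and avoids the extra dependency. Below is a minimal sketch of the same spider using the built-in parser, with a hypothetical `User-Agent` header and basic HTTP error handling added; eBay's markup and anti-bot behavior can change, so treat this as an illustration rather than a guaranteed scraper:

import requests
from bs4 import BeautifulSoup

# Hypothetical UA string; some sites reject the default requests user agent.
HEADERS = {"User-Agent": "Mozilla/5.0"}

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = ('https://www.ebay.com/sch/i.html'
               '?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn=' + str(page))
        response = requests.get(url, headers=HEADERS)
        response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
        soup = BeautifulSoup(response.text, 'html.parser')  # built-in parser, no extra install
        for link in soup.find_all('a', {'class': 's-item__title s-item__title--has-tags'}):
            print(link.get('href'))
        page += 1

trade_spider(1)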