Question

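I wrote this simple web scraper to scrape newegg.com. It uses a for loop to print each product's name, price, and shipping cost.

However, when I run the for loop, it prints nothing and raises no error. Before writing the for loop, I ran the lines that are now commented out, and they printed the details of only one product.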

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text

soup = BeautifulSoup(source, 'lxml')

#prod = soup.find('a', class_='item-title').text
#price = soup.find('li', class_='price-current').text.strip()
#ship = soup.find('li', class_='price-ship').text.strip()
#print(prod.strip())
#print(price.strip())
#print(ship)

for info in soup.find_all('div', class_='item-container  '):
    prod = soup.find('a', class_='item-title').text
    price = soup.find('li', class_='price-current').text.strip()
    ship = soup.find('li', class_='price-ship').text.strip()
    print(prod.strip())
    #price.splitlines()[3].replace('\xa0', '')
    print(price.strip())
    print(ship)

Answer 1

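Apart from the stray-space typo and the indentation, you never actually use info inside the for loop, so it keeps printing the first item. Use info in the for loop wherever you currently have soup.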

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text

soup = BeautifulSoup(source, 'lxml')

for info in soup.find_all('div', class_='item-container'):
    # Search inside the current item (info), not the whole page (soup)
    prod = info.find('a', class_='item-title').text.strip()
    # The price is usually on the second line of the text, sometimes on the first
    lines = info.find('li', class_='price-current').text.strip().splitlines()
    price = lines[1] if len(lines) > 1 and u'$' in lines[1] else lines[0]
    price = price.replace(u'\xa0', '')
    ship = info.find('li', class_='price-ship').text.strip()
    print(prod)
    print(price)
    print(ship)

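Because the body of your for info in soup.....: loop calls soup.find(..) instead of info.find(..), every iteration searches the whole document again and returns the same first match, for example soup.find('a', class_='item-title'). If you use info.find(....), each iteration searches inside the next <div> element instead.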
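To make the difference between soup.find and info.find concrete, here is a minimal sketch that runs offline. The HTML string is hypothetical stand-in markup, not Newegg's real page, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the listing page (an assumption,
# not Newegg's actual HTML).
html = """
<div class="item-container"><a class="item-title">Console A</a></div>
<div class="item-container"><a class="item-title">Console B</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("div", class_="item-container")

# Searching from `soup` starts at the top of the document on every pass,
# so the first title comes back each time.
from_soup = [soup.find("a", class_="item-title").text for _ in items]

# Searching from `info` stays inside the current <div>.
from_info = [info.find("a", class_="item-title").text for info in items]

print(from_soup)  # ['Console A', 'Console A']
print(from_info)  # ['Console A', 'Console B']
```

The loop itself is fine in both cases; only the object that find is called on decides which part of the tree gets searched.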

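Edit: I also found that after .splitlines() the price is not always the second item; sometimes it is the first. To handle that, I added a check for whether the item contains a "$" sign. If it does not, the first list item is used instead.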
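If the position of the price line keeps moving around, one option is to scan every line for the "$" sign rather than checking fixed indexes. This is only a sketch of that idea, not what the code above does: pick_price is a hypothetical helper name, and the sample strings are made up to imitate the kind of text the price element might contain.

```python
def pick_price(text):
    """Return the first line of `text` that contains '$', or '' if none does."""
    for line in text.strip().splitlines():
        # Drop non-breaking spaces, then surrounding whitespace.
        cleaned = line.replace("\xa0", "").strip()
        if "$" in cleaned:
            return cleaned
    return ""


# Made-up samples: price on the second line, price on the first line,
# and no price at all.
print(pick_price("|\n$299.99\xa0\n"))       # $299.99
print(pick_price("$319.00\n(2 Offers)"))    # $319.00
print(repr(pick_price("Out of stock")))     # ''
```

Returning an empty string when no line contains "$" also avoids an IndexError on items whose price text has only one line.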

Answer 2

The same thing with less code:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text
soup = BeautifulSoup(source, 'lxml')

for info in soup.find_all('div', class_='item-container'):
    print(info.find('a', class_='item-title').text)
    print(info.find('li', class_='price-current').text.strip())
    print(info.find('li', class_='price-ship').text.strip())

Answer 3

@Rick You mistakenly added an extra space after the attribute value in the line for info in soup.find_all('div', class_='item-container '):. Check the code below; it will work as expected.

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text

soup = BeautifulSoup(source, 'lxml')

for info in soup.find_all('div', class_='item-container'):
    # Search inside the current item (info) so each pass reads a different product
    prod = info.find('a', class_='item-title').text
    price = info.find('li', class_='price-current').text.strip()
    ship = info.find('li', class_='price-ship').text.strip()
    print(prod.strip())
    print(price.strip())
    print(ship)

Hope this solves your problem...
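For anyone who wants to see the whitespace problem without hitting the site, here is a small offline sketch. The one-line HTML is a hypothetical stand-in, and html.parser is used so lxml is not needed. How a trailing space in the class_ string is treated can depend on the Beautiful Soup version and on the exact attribute text in the page, so matching the bare class name is the safer choice.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item page (an assumption, not Newegg's actual HTML).
html = '<div class="item-container"><a class="item-title">Console A</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Bare class name: compared against each class on the element.
exact = soup.find_all("div", class_="item-container")

# Same name padded with two trailing spaces, as in the question.
padded = soup.find_all("div", class_="item-container  ")

print(len(exact))
print(len(padded))
```

With this markup the padded search should come back empty, which is why the original loop body never ran and nothing was printed or raised.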

Web scraping program's for loop returns nothing
