Question

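I'm running into a problem with bs4 when reading the second value of an array inside a for loop. I'll paste the code below.

When I use line 19 (the single-item URL_Array = [IEL_IDS]), I get no errors. When I swap it out for the whole array (line 18), I get an error as soon as the script tries to collect the second value. Note that the second value in the array is the same as the value on line 19.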

import requests
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

 
SmartLiving_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=Smart%20Living&selectedFacets=Brand%7CSmart%20Living&sortBy="
IEL_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=IEL&selectedFacets=Brand%7CIts%20Exciting%20Lighting&sortBy="
TD_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=two%20dogs&selectedFacets=Brand%7CTwo%20Dogs%20Designs&sortBy="

Headers = "Description, URL, Price \n"

text_file = open("HayneedlePrices.csv", "w")
text_file.write(Headers)
text_file.close()


URL_Array = [SmartLiving_IDS, IEL_IDS, TD_IDS]
#URL_Array = [IEL_IDS]
for URL in URL_Array:
  print("\n" + "Loading New URL:" "\n" + URL + "\n" + "\n")
  
  uClient = uReq(URL)
  page_html = uClient.read()
  uClient.close() 
  soup = soup(page_html, "html.parser")
  
  Containers = soup.findAll("div", {"product-card__container___1U2Sb"})
  for Container in Containers:

    
    Title             = Container.div.img["alt"]    
    Product_URL       = Container.a["href"]
    
    Price_Container   = Container.findAll("div", {"class":"product-card__productInfo___30YSc body no-underline txt-black"})[0].findAll("span", {"style":"font-size:20px"})

    Price_Dollars     = Price_Container[0].get_text()
    Price_Cents       = Price_Container[1].get_text()


    print("\n" + "#####################################################################################################################################################################################################" + "\n")
    # print("   Container: " + "\n" + str(Container))
    # print("\n" + "-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------" + "\n")
    print(" Description: " + str(Title))
    print(" Product URL: " + str(Product_URL))
    print("       Price: " + str(Price_Dollars) + str(Price_Cents))
    print("\n" + "#####################################################################################################################################################################################################" + "\n")
 
    text_file = open("HayneedlePrices.csv", "a")
    text_file.write(str(Title) +  ", " + str(Product_URL) + ", " + str(Price_Dollars) + str(Price_Cents) + "\n")
    text_file.close()

  print("Information gathered and Saved from URL Successfully.")
  print("Looking for Next URL..")
print("No Additional URLs to Gather. Process Completed.")

Answer 1

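The problem is that you import BeautifulSoup as soup and then also define a variable with the same name: soup = soup(page_html, "html.parser")! After the first pass through the loop, soup no longer refers to the BeautifulSoup class but to the parsed document, so the second pass cannot build a new parser from it.

I refactored your code a little. Let me know whether it works as expected!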
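To make the name clash concrete without needing bs4 or network access, here is a minimal sketch using only the standard library. FakeSoup, run_shadowed and run_fixed are hypothetical names invented for this illustration; FakeSoup is only a stand-in that mimics one relevant behaviour of a parsed document (calling an instance delegates to find_all), not the real BeautifulSoup class.

```python
class FakeSoup:
    """Hypothetical stand-in for BeautifulSoup, for illustration only."""

    def __init__(self, html):
        self.html = html

    def find_all(self, name):
        # Very rough "search": one entry per opening tag with this name.
        return ["<" + name] * self.html.count("<" + name)

    def __call__(self, *args):
        # Calling an instance delegates to find_all and returns a plain list.
        return self.find_all(*args)


def run_shadowed(pages):
    """Reuse one name for both the class and the parsed page (the bug)."""
    soup = FakeSoup  # plays the role of "import BeautifulSoup as soup"
    errors = []
    for html in pages:
        try:
            # Pass 1: builds an instance. Pass 2: calls that instance and
            # gets a list back. Pass 3: tries to call the list.
            soup = soup(html)
            soup.find_all("div")
        except (AttributeError, TypeError) as exc:
            errors.append(type(exc).__name__)
    return errors


def run_fixed(pages):
    """Keep the class name intact and bind each parsed page to another name."""
    counts = []
    for html in pages:
        page = FakeSoup(html)
        counts.append(len(page.find_all("div")))
    return counts


if __name__ == "__main__":
    print(run_shadowed(["<div></div>"] * 3))
    print(run_fixed(["<div></div>", "<p></p>"]))
```

The first page is processed normally in run_shadowed, which matches the symptom in the question: a single-URL list works, and the failure only appears from the second URL onward.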

import csv

import requests
from bs4 import BeautifulSoup

smart_living_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=Smart%20Living&selectedFacets=Brand%7CSmart%20Living&sortBy="
IEL_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=IEL&selectedFacets=Brand%7CIts%20Exciting%20Lighting&sortBy="
TD_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=two%20dogs&selectedFacets=Brand%7CTwo%20Dogs%20Designs&sortBy="

site_URLs = [smart_living_IDS, IEL_IDS, TD_IDS]

sess = requests.Session()

prod_data = []

for curr_URL in site_URLs:
    req = sess.get(url=curr_URL)
    soup = BeautifulSoup(req.content, "lxml")

    containers = soup.find_all("div", {"class": "product-card__container___1U2Sb"})
    for curr_container in containers:
        prod_title = curr_container.div.img["alt"]
        prod_URL = curr_container.a["href"]

        price_container = curr_container.find(
            "div",
            {"class": "product-card__productInfo___30YSc body no-underline txt-black"},
        )

        dollars_elem = price_container.find("span", {"class": "main-price-dollars"})
        cents_elem = dollars_elem.find_next("span")

        prod_price = dollars_elem.get_text() + cents_elem.get_text()
        prod_price = float(prod_price[1:])

        prod_data.append((prod_title, prod_URL, prod_price))

CSV_headers = ("title", "URL", "price")

with open("../out/hayneedle_prices.csv", "w", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(CSV_headers)
    writer.writerows(prod_data)

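I tested it by repeating the current URL list 10 times, and it took longer than I expected. There is certainly room for improvement: I may rewrite it over the next few days to use lxml, and multiprocessing might also be a good option. Of course, it all depends on how you use it :)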
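As a rough illustration of the speed-up idea, here is a sketch that runs the downloads concurrently. It swaps in threads (concurrent.futures.ThreadPoolExecutor from the standard library) rather than the multiprocessing mentioned above, on the assumption that most of the time is spent waiting on the network. fetch_all and its fetch parameter are hypothetical names for this example; the caller supplies whatever function downloads one page.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Apply fetch to every URL concurrently.

    Results come back in the same order as urls, because
    Executor.map preserves input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    # Stand-in fetch function so the sketch runs without network access.
    print(fetch_all(["a", "b", "c"], str.upper))
```

With the code above, fetch could be a small function that wraps requests.get(url).content, and each returned page could then be handed to BeautifulSoup in the existing loop. Whether this actually helps depends on how the site responds to parallel requests.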
