I am trying to web-scrape multiple websites for different types of products. I was able to scrape a single URL successfully. I then created a list so I could scrape multiple URLs and export the product names and prices to a CSV file. However, it doesn't seem to work as intended.
Here is my code:
#imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
#Product Websites For Consolidation
urls = ['https://www.aeroprecisionusa.com/ar15/lower-receivers/stripped-lowers?product_list_limit=all', 'https://www.aeroprecisionusa.com/ar15/lower-receivers/complete-lowers?product_list_limit=all']
for url in urls:
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0"}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    #Locating All Products On Page
    all_products_on_page = soup.find(class_='products wrapper container grid products-grid')
    individual_items = all_products_on_page.find_all(class_='product-item-info')
    #Breaking Down Product By Name And Price
    aero_product_name = [item.find(class_='product-item-link').text for item in individual_items]
    aero_product_price = [p.text if (p := item.find(class_='price')) is not None else 'no price' for item in individual_items]
    Aero_Stripped_Lowers_Consolidated = pd.DataFrame(
        {'Aero Product': aero_product_name,
         'Prices': aero_product_price,
        })
    Aero_Stripped_Lowers_Consolidated.to_csv('MasterPriceTracker.csv')
The code exports the desired product names and prices to a CSV file, but only for the second URL, the "complete-lowers" page. I'm not sure what I messed up in the for loop that prevents it from scraping both URLs. I have confirmed that the HTML structure is the same on both pages.
Any help would be greatly appreciated!
Answer 0 (score: 3)
Move the to_csv call outside the loop. Because it sits inside the loop, the CSV file is rewritten on every iteration, so only the last URL's results end up in the file.
Inside the loop, append each page's data to a DataFrame created before the loop starts. The headers dictionary also doesn't need to be redefined on every iteration, so I pulled it outside as well.
import pandas as pd
import requests
from bs4 import BeautifulSoup
#Product Websites For Consolidation
urls = ['https://www.aeroprecisionusa.com/ar15/lower-receivers/stripped-lowers?product_list_limit=all', 'https://www.aeroprecisionusa.com/ar15/lower-receivers/complete-lowers?product_list_limit=all']
Aero_Stripped_Lowers_Consolidated = pd.DataFrame(columns=['Aero Product', 'Prices'])
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0"}
for url in urls:
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    #Locating All Products On Page
    all_products_on_page = soup.find(class_='products wrapper container grid products-grid')
    individual_items = all_products_on_page.find_all(class_='product-item-info')
    #Breaking Down Product By Name And Price
    aero_product_name = [item.find(class_='product-item-link').text for item in individual_items]
    aero_product_price = [p.text if (p := item.find(class_='price')) is not None else 'no price' for item in individual_items]
    #Accumulate this page's rows (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
    Aero_Stripped_Lowers_Consolidated = pd.concat([Aero_Stripped_Lowers_Consolidated, pd.DataFrame(
        {'Aero Product': aero_product_name,
         'Prices': aero_product_price,
        })], ignore_index=True)
Aero_Stripped_Lowers_Consolidated.to_csv('MasterPriceTracker.csv')
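A side note on the accumulation pattern: repeatedly concatenating onto a growing DataFrame copies the data on every iteration. A common alternative is to collect one small frame per URL in a list and concatenate once at the end. A minimal sketch of that pattern, with hard-coded sample rows standing in for the scraped name/price lists (the real loop would build them from BeautifulSoup as above):

```python
import pandas as pd

# One small DataFrame per scraped page, collected in a list.
# (Sample rows stand in for the per-URL name/price lists.)
frames = []
for names, prices in [(['Lower A'], ['$99.99']), (['Lower B'], ['$199.99'])]:
    frames.append(pd.DataFrame({'Aero Product': names, 'Prices': prices}))

# Single concat after the loop, then a single CSV write.
consolidated = pd.concat(frames, ignore_index=True)
consolidated.to_csv('MasterPriceTracker.csv', index=False)
```

Passing index=False to to_csv also keeps the pandas row index out of the output file, which is usually what you want for a price-tracking sheet.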