Question

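I am working on a web scraping project in which I have to search a website for a product and append all of that product's details to the corresponding lists.

For example, the first page of this URL lists 10 products named "CLOSE UP". I have to add the product titles to one list, the product barcodes to another list, and so on.

I also have to do this for multiple pages.

This is my code so far: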

import requests
from bs4 import BeautifulSoup


def find_items(base_url, item_to_find, num_of_pages):

    title_list = []
    barcode_list = []
    category_list = []
    manufacturer_list = []

    url = base_url + item_to_find + '/'

    for num in range(1, num_of_pages+1):
        url = url + str(num)
        print(url)
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        a_tags = soup.find_all('a', {"class": 'product-search-item'})

        for tag in a_tags:
            p_tags = tag.find_all('p')
            try:
                title_list.append(p_tags[0].contents[0])
                barcode_list.append(p_tags[1].contents[0])
                category_list.append(p_tags[2].contents[0])
                manufacturer_list.append(p_tags[3].contents[0])
            except Exception as e:
                title_list.append('NaN')
                barcode_list.append('NaN')
                category_list.append('NaN')
                manufacturer_list.append('NaN')
        url = base_url + item_to_find + '/'

    return (title_list, barcode_list, category_list, manufacturer_list)

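In the code above, I use a try/except block to append the information to the lists, because not every product has all of the information. If the information is available it is appended to the list; otherwise 'NaN' is appended. That is what the code is supposed to do, so that the lists always stay the same length.

But when I run the following code, the lists do not have the same length.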

title_list, barcode_list, category_list, manufacturer_list = find_items("https://www.barcodelookup.com/", 'close-up', 20)

I don't know what I am doing wrong.

Answer 1

In your try/except, if any one of the appends fails, NaN is appended to every list. Change the code to this:

for tag in a_tags:
    p_tags = tag.find_all('p')
    try:
        title_list.append(p_tags[0].contents[0])
    except Exception as e:
        title_list.append('NaN')
    try:
        barcode_list.append(p_tags[1].contents[0])
    except Exception as e:
        barcode_list.append('NaN')
    try:
        category_list.append(p_tags[2].contents[0])
    except Exception as e:
        category_list.append('NaN')
    try:
        manufacturer_list.append(p_tags[3].contents[0])
    except Exception as e:
        manufacturer_list.append('NaN')

Answer 2

Perhaps the code in your try sometimes fails partway through, and you then append more items in the except clause:

try:
    title_list.append(p_tags[0].contents[0])
except IndexError:
    title_list.append('NaN')
try:
    barcode_list.append(p_tags[1].contents[0])
except IndexError:
    barcode_list.append('NaN')
try:
    category_list.append(p_tags[2].contents[0])
except IndexError:
    category_list.append('NaN')
try:
    manufacturer_list.append(p_tags[3].contents[0])
except IndexError:
    manufacturer_list.append('NaN')

Answer 3

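The problem is in your try/except logic. Suppose p_tags[3] does not exist. You have already appended p_tags[0].contents[0], p_tags[1].contents[0] and p_tags[2].contents[0] when the "list index out of range" exception is raised. In the except clause you then append NaN to all four lists. Note that title_list, barcode_list and category_list have already received their real values, so each of them gains two entries for that product while manufacturer_list gains only one.

The fix depends on what you want. A reasonable option is to append NaN only when you cannot access that particular value.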
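
The mismatch can be reproduced without any network access. The following is a minimal sketch that mimics the question's single try/except using plain lists; the function name buggy_append and the sample rows are made up for illustration and stand in for the scraped p tags.

```python
def buggy_append(rows):
    """Mimic the question's single try/except, using plain lists of strings."""
    titles, barcodes, categories, manufacturers = [], [], [], []
    for fields in rows:
        try:
            titles.append(fields[0])
            barcodes.append(fields[1])
            categories.append(fields[2])
            manufacturers.append(fields[3])  # IndexError if the 4th field is missing
        except IndexError:
            # This runs after the first three appends have already succeeded.
            titles.append('NaN')
            barcodes.append('NaN')
            categories.append('NaN')
            manufacturers.append('NaN')
    return titles, barcodes, categories, manufacturers


# Hypothetical data: the second product has no manufacturer.
rows = [
    ['Close Up A', '111', 'Toothpaste', 'Acme'],
    ['Close Up B', '222', 'Toothpaste'],
]
lengths = [len(lst) for lst in buggy_append(rows)]
print(lengths)  # [3, 3, 3, 2]
```

The first three lists end up one entry longer than the last one, which is the symptom described in the question.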

import requests
from bs4 import BeautifulSoup


def find_items(base_url, item_to_find, num_of_pages):

    title_list = []
    barcode_list = []
    category_list = []
    manufacturer_list = []

    a_tag_count = 0

    url = base_url + item_to_find + '/'

    for num in range(1, num_of_pages+1):
        url = url + str(num)
        print(url)
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        a_tags = soup.find_all('a', {"class": 'product-search-item'})
        a_tag_count += len(a_tags)
        for tag in a_tags:
            p_tags = tag.find_all('p')
            safe_append(title_list, 0, p_tags)
            safe_append(barcode_list, 1, p_tags)
            safe_append(category_list, 2, p_tags)
            safe_append(manufacturer_list, 3, p_tags)

        url = base_url + item_to_find + '/'

    return (title_list, barcode_list, category_list, manufacturer_list)


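def safe_append(list_to_append, tag_index, p_tags, default_to='NaN'):
    # Append the first child of the tag_index-th <p> tag, or the default
    # when that tag is missing or has no contents.
    try:
        list_to_append.append(p_tags[tag_index].contents[0])
    except IndexError:
        list_to_append.append(default_to)

    return list_to_append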
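
Another option, if it suits your use case, is to build one record per product instead of four parallel lists, so the columns cannot drift apart. This is only a sketch under assumptions: FIELDS and to_record are hypothetical names, and the sample rows stand in for whatever text you extract from each product's p tags (for example with get_text(strip=True) on every tag).

```python
FIELDS = ('title', 'barcode', 'category', 'manufacturer')


def to_record(values, default='NaN'):
    """Pad (or truncate) values to len(FIELDS) and map them to field names."""
    padded = list(values[:len(FIELDS)])
    padded += [default] * (len(FIELDS) - len(padded))
    return dict(zip(FIELDS, padded))


# Hypothetical data: the second product has no manufacturer.
rows = [
    ['Close Up A', '111', 'Toothpaste', 'Acme'],
    ['Close Up B', '222', 'Toothpaste'],
]
records = [to_record(row) for row in rows]

# Column lists, if still needed, are derived from the records,
# so they always have the same length.
title_list = [r['title'] for r in records]
manufacturer_list = [r['manufacturer'] for r in records]
print(manufacturer_list)  # ['Acme', 'NaN']
```

Unlike safe_append, this sketch only fills in fields that are missing at the end of the row; a p tag that exists but is empty would produce an empty string rather than 'NaN'.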

Lists have different lengths when appending items
