My Python script only pulls the first result and then moves on to the next URL

Time: 2018-01-26 17:45:14

Tags: python web-scraping

I'm building a Python scraper for a website to pull the price, product number, catalog number, and description. When I run this script it only pulls the first item on a page and then moves on to the next URL. I'm new to Python and would like to know how to modify it so it extracts every product on the page. For clarification: the first URL has only one product, but the second and third URLs have many products that are not being pulled.

import requests
from bs4 import BeautifulSoup
import random
import time

product_urls = [
'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-precursor-assays/#orderinginformation',
'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation',
'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assays/#orderinginformation', 
]

for URL in product_urls:
    page = requests.get(URL)
    soup = BeautifulSoup(page.text,"lxml")
    timeDelay = random.randrange(5, 25)

    for item in soup.select('.content'):
        cat_name = item.select_one('.title').text.strip()
        cat_discription = item.select_one('.copy').text.strip()
        product_name = (item.find('div',{'class':'headline'}).text.strip())
        product_discription = (item.find('div',{'class': 'copy'}).text.strip())
        product_number = (item.find('td',{'class': 'textLeft paddingTopLess'}).text.strip())
        cat_number = (item.find('td',{'class': 'textRight paddingTopLess2'}).text.strip())
        product_price = (item.find('span',{'class': 'prc'}).text.strip())
        print("Catagory Name: {}\n\nCatagory Discription:  {}\n\nProduct Name:  {}\n\nProduct Discription:  {}\n\nProduct Number:  {}\n\nCat No:  {}\n\nPrice:  {}\n\n".format(cat_name,cat_discription,product_name,product_discription,product_number,cat_number,product_price))
        time.sleep(timeDelay)

1 Answer:

Answer 0: (score: 0)

You can get the table elements from the div with class pane. The fourth table holds the main product, and the fifth (where present) holds the additional products.

In the example below I use a list comprehension to build a list of tuples containing the title, description, product number, category number, and price:

from bs4 import BeautifulSoup
import requests

product_urls = [
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-precursor-assays/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assays/#orderinginformation', 
    'https://www.qiagen.com/us/shop/pcr/real-time-pcr-enzymes-and-kits/miscript-target-protectors/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/real-time-pcr-enzymes-and-kits/two-step-qrt-pcr/miscript-sybr-green-pcr-kit/#orderinginformation'
]

session = requests.Session()

for URL in product_urls:

    response = session.get(URL)
    soup = BeautifulSoup(response.content, "html.parser")

    tables = soup.find_all("div", {"class":"pane"})[0].find_all("table")

    if (len(tables) > 4):
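        # each row's <td> cells become one tuple; the trailing "if t" skips rows with no <td> (e.g. header rows)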
        product_list = [
            (
                t[0].findAll("div", {"class":"headline"})[0].text.strip(), #title
                t[0].findAll("div", {"class":"copy"})[0].text.strip(),     #description
                t[1].text.strip(),                                         #product number
                t[2].text.strip(),                                         #category number
                t[3].text.strip()                                          #price
            )
            for t in (t.find_all('td') for t in tables[4].find_all('tr'))
            if t
        ]
    elif (len(tables) == 1):
        product_list = [
            (
                t[0].findAll("div", {"class":"catNo"})[0].text.strip(),    #catNo
                t[0].findAll("div", {"class":"headline"})[0].text.strip(), #headline
                t[0].findAll("div", {"class":"price"})[0].text.strip(),    #price
                t[0].findAll("div", {"class":"copy"})[0].text.strip()      #description
            )
            for t in (t.find_all('td') for t in tables[0].find_all('tr'))
            if t
        ]
    else:
        print("could not parse main product")
        product_list = []  # reset so results from a previous URL are not printed again

    print(product_list)

    if len(tables) > 5:
        add_product_list = [
            (
                t[0].findAll("div", {"class":"title"})[0].text.strip(), #title
                t[0].findAll("div", {"class":"copy"})[0].text.strip(),  #description
                t[1].text.strip(),                                      #product number
                t[2].text.strip(),                                      #category number
                t[3].text.strip()                                       #price
            )
            for t in (t.find_all('td') for t in tables[5].find_all('tr'))
            if t
        ]
        print(add_product_list)

If you want to convert the tuple indexes into a separate list per field, check this answer.
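As a minimal sketch of that conversion (assuming product_list holds the five-field tuples built in the main branch above), the built-in zip can transpose the list of tuples into one sequence per field:

if product_list:
    # zip(*rows) turns a list of row tuples into one tuple per column
    titles, descriptions, product_numbers, cat_numbers, prices = zip(*product_list)
    print(list(titles))   # every product title in one list
    print(list(prices))   # every price in one list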