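Parsing HTML by looping over an array of URLs: the loop doesn't seem to iterate

Time: 2019-04-18 01:55:25

Tags: for-loop beautifulsoup save screen-scraping

I'm trying to extract product descriptions. The first loop goes through each product on the listing page, and a nested loop visits each product page to fetch its description.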

  for page in range(1, 2):
      guitarPage = requests.get('https://www.guitarguitar.co.uk/guitars/acoustic/page-{}'.format(page)).text
      soup = BeautifulSoup(guitarPage, 'lxml')
      guitars = soup.find_all(class_='col-xs-6 col-sm-4 col-md-4 col-lg-3')

This is the loop over each product:

for guitar in guitars:

    title_text = guitar.h3.text.strip()
    print('Guitar Name: ', title_text)
    price = guitar.find(class_='price bold small').text.strip()
    print('Guitar Price: ', price)

    priceSave = guitar.find('span', {'class': 'price save'})
    if priceSave is not None:
        priceOf = priceSave.text
        print(priceOf)
    else:
        print("No discount!")

    image = guitar.img.get('src')
    print('Guitar Image: ', image)

    productLink = guitar.find('a').get('href')
    linkProd = url + productLink
    print('Link of product', linkProd)

Here I add the collected link to the array:

    productsPage.append(linkProd)

This is my attempt to visit each product page and extract the description:

    for products in productsPage:
        response = requests.get(products)
        soup = BeautifulSoup(response.content, "lxml")
        productsDetails = soup.find("div", {"class":"description-preview"})
        if productsDetails is not None:
            description = productsDetails.text
            # print('product detail: ', description)
        else:
            print('none')
        time.sleep(0.2)

    if None not in(title_text,price,image,linkProd, description):
        products = {
            'title': title_text,
            'price': price,
            'discount': priceOf,
            'image': image,
            'link': linkProd,
            'description': description,

        }
        result.append(products)
        with open('datas.json', 'w') as outfile:
            json.dump(result, outfile, ensure_ascii=False, indent=4, separators=(',', ': '))
        # print(result)
        print('--------------------------')
    time.sleep(0.5)

The result should be:

{
        "title": "Yamaha NTX700 Electro Classical Guitar (Pre-Owned) #HIM041005",
        "price": "£399.00",
        "discount": null,
        "image": "https://images.guitarguitar.co.uk/cdn/large/150/PXP190415342158006-3115645f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
        "link": "https://www.guitarguitar.co.uk/product/pxp190415342158006-3115645--yamaha-ntx700-electro-classical-guitar-pre-owned-him",
        "description": "\nProduct Overview\nThe versatile, contemporary styled NTX line is designed with thinner bodies, narrower necks, 14th fret neck joints, and cutaway designs to provide greater comfort and playability f... read more\n"
    },

But the description is the one for the first product, and it doesn't change afterwards.

[
    {
        "title": "Yamaha APX600FM Flame Maple Tobacco Sunburst",
        "price": "£239.00",
        "discount": "Save £160.00",
        "image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340677008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
        "link": "https://www.guitarguitar.co.uk/product/190315340677008--yamaha-apx600fm-flame-maple-tobacco-sunburst",
        "description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
    },
    {
        "title": "Yamaha APX600FM Flame Maple Amber",
        "price": "£239.00",
        "discount": "Save £160.00",
        "image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340676008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
        "link": "https://www.guitarguitar.co.uk/product/190315340676008--yamaha-apx600fm-flame-maple-amber",
        "description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
    },
    {
        "title": "Yamaha AC1R Acoustic Electric Concert Size Rosewood Back And Sides with SRT Pickup",
        "price": "£399.00",
        "discount": "Save £267.00",
        "image": "https://images.guitarguitar.co.uk/cdn/large/105/11012414211132.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
        "link": "https://www.guitarguitar.co.uk/product/11012414211132--yamaha-ac1r-acoustic-electric-concert-size-rosewood-back-and-sid",
        "description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
    }
]

This is the result I get. It keeps changing, and sometimes it shows the previous product's description.

1 answer:

Answer 0 (score: 0)

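It does loop, but there appears to be some protection in place on the server side, and which pages fail changes from run to run. I checked the failing pages and searched them for the content. In my testing, no single measure was enough on its own (I did not try sleep times above 2, but I did try some IP and User-Agent changes with sleeps below 2).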
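One detail in the question's code is worth noting alongside this: `description` is only assigned when the `description-preview` block is found, so when a page fails the variable still holds the previous product's text, and the inner loop re-requests every link collected so far for each guitar. Below is a minimal sketch of one way to avoid that, assuming one lookup per product and a fresh value each time. `build_records`, `fetch_description` and the fake pages are hypothetical names used only for illustration; `fetch_description` stands in for the `requests.get` plus `soup.find` step.

```python
from typing import Callable, Dict, List, Optional


def build_records(
    items: List[Dict[str, str]],
    fetch_description: Callable[[str], Optional[str]],
) -> List[Dict[str, str]]:
    """Attach a description to each listing item, one lookup per item."""
    records = []
    for item in items:
        # Assigned fresh for every product, so a failed lookup cannot
        # silently reuse the previous product's description.
        description = fetch_description(item["link"])
        if description is None:
            print("no description for", item["link"])
            continue  # skip incomplete records, as the original `if` does
        records.append({**item, "description": description})
    return records


# Hypothetical stand-in for the real page fetch: the second page "fails".
fake_pages = {
    "/product/a": "Description A",
    "/product/b": None,
    "/product/c": "Description C",
}

items = [
    {"title": "A", "link": "/product/a"},
    {"title": "B", "link": "/product/b"},
    {"title": "C", "link": "/product/c"},
]

records = build_records(items, fake_pages.get)
print([r["description"] for r in records])
```

With this shape, a failed page produces a missing record instead of a wrong one, which makes intermittent failures much easier to spot in `datas.json`.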

You could try alternating IPs and User-Agents, backing off on retries, and changing the time between requests.

Changing proxies: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/

Changing the User-Agent: https://pypi.org/project/fake-useragent/
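The suggestions above (rotating the User-Agent, backing off between retries, varying the delay) could be combined roughly as follows. This is only a sketch under assumptions: `fetch_with_backoff`, `requests_get` and `flaky_get` are hypothetical helpers, the agent strings are placeholders, and treating a page as failed when `description-preview` is absent is a guess about how the failures show up. The demo uses a fake `get` so it runs offline.

```python
import itertools
import random
import time


def fetch_with_backoff(url, get, user_agents, max_attempts=4,
                       base_delay=1.0, sleep=time.sleep):
    """Call get(url, headers) until it returns HTML or attempts run out."""
    agents = itertools.cycle(user_agents)
    for attempt in range(max_attempts):
        html = get(url, {"User-Agent": next(agents)})
        if html is not None:
            return html
        if attempt < max_attempts - 1:
            # Exponential backoff plus a little jitter: ~1s, ~2s, ~4s, ...
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    return None


def requests_get(url, headers):
    """One possible `get`: the page counts as failed if the block is missing."""
    import requests  # third-party; imported lazily so the demo runs without it
    response = requests.get(url, headers=headers, timeout=10)
    if response.ok and "description-preview" in response.text:
        return response.text
    return None


# Offline demo: a fake `get` that fails twice, then succeeds.
seen_agents = []


def flaky_get(url, headers):
    seen_agents.append(headers["User-Agent"])
    if len(seen_agents) == 3:
        return "<div class='description-preview'>ok</div>"
    return None


delays = []  # record the requested sleeps instead of actually sleeping
html = fetch_with_backoff("https://example.invalid/product/1", flaky_get,
                          ["agent-one", "agent-two"], sleep=delays.append)
print(html, seen_agents, len(delays))
```

For real use, `requests_get` would be passed as `get`, and proxies could be rotated in the same way through the `proxies=` argument of `requests.get`. The fake-useragent package linked above is one possible source of realistic agent strings.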