Why are multiple prices being saved per product?

Time: 2015-10-06 08:13:24

Tags: python csv xpath web-scraping mechanize

I've been trying to work out why this code generates multiple prices per product when the data is saved to the CSV. It seems that every price found in a row of the page gets saved under each product in that row. What I want is one price per product, not three or four.

What needs to be changed so that only the correct price for each product is stored?

import mechanize
from lxml import html
import csv
import io
from time import sleep

def save_products (products, writer):

    for product in products:

        writer.writerow([ product["title"][0].encode('utf-8') ])
        for price in product['prices']:
            writer.writerow([ price["value"][0].encode('utf-8') ])

f_out = open('ssdResult.csv', 'wb')
writer = csv.writer(f_out)

links = ["http://sciencesuppliesdirect.com/research-chemicals", "http://sciencesuppliesdirect.com/research-chemicals?p=2", "http://sciencesuppliesdirect.com/research-chemicals?p=3","http://sciencesuppliesdirect.com/research-chemicals?p=4","http://sciencesuppliesdirect.com/research-chemicals?p=5","http://sciencesuppliesdirect.com/research-chemicals?p=6","http://sciencesuppliesdirect.com/research-chemicals?p=7","http://sciencesuppliesdirect.com/research-chemicals?p=8","http://sciencesuppliesdirect.com/research-chemicals?p=9","http://sciencesuppliesdirect.com/research-chemicals?p=10","http://sciencesuppliesdirect.com/research-chemicals?p=11","http://sciencesuppliesdirect.com/research-chemicals?p=12","http://sciencesuppliesdirect.com/research-chemicals?p=13","http://sciencesuppliesdirect.com/research-chemicals?p=14","http://sciencesuppliesdirect.com/research-chemicals?p=15","http://sciencesuppliesdirect.com/research-chemicals?p=16","http://sciencesuppliesdirect.com/research-chemicals?p=17","http://sciencesuppliesdirect.com/research-chemicals?p=18","http://sciencesuppliesdirect.com/research-chemicals?p=19","http://sciencesuppliesdirect.com/research-chemicals?p=20","http://sciencesuppliesdirect.com/research-chemicals?p=21","http://sciencesuppliesdirect.com/research-chemicals?p=22","http://sciencesuppliesdirect.com/research-chemicals?p=23","http://sciencesuppliesdirect.com/research-chemicals?p=24","http://sciencesuppliesdirect.com/cannabinoids","http://sciencesuppliesdirect.com/cannabinoids?p=2","http://sciencesuppliesdirect.com/cannabinoids?p=3","http://sciencesuppliesdirect.com/cannabinoids?p=4","http://sciencesuppliesdirect.com/cannabinoids?p=5","http://sciencesuppliesdirect.com/cannabinoids?p=6","http://sciencesuppliesdirect.com/cannabinoids?p=7","http://sciencesuppliesdirect.com/pellets","http://sciencesuppliesdirect.com/pellets?p=2","http://sciencesuppliesdirect.com/pellets?p=3","http://sciencesuppliesdirect.com/herbal-blends","http://sciencesuppliesdirect.com/herbal-blends?p=2","http://sciencesuppliesdirect.com/branded-products","http://sciencesuppliesdirect.com/branded-products?p=2"]

br = mechanize.Browser() 

for link in links:

    print(link)
    r = br.open(link)

    content = r.read()

    products = []        
    tree = html.fromstring(content)        
    product_nodes = tree.xpath('//div[@class="category-products"]/ul')

    for product_node in product_nodes:

        product = {}
        try:
            product['title'] = product_node.xpath('.//li/div[2]/h2/a/text()')

        except:
            product['title'] = ""

        price_nodes = product_node.xpath('.//li/div[2]/div[1]/span')

        product['prices'] = []
        for price_node in price_nodes:

            price = {}
            try:
                price['value'] = price_node.xpath('.//span/text()')

            except:
                price['value'] = ""


            product['prices'].append(price)
        products.append(product)
    save_products(products, writer)

f_out.close() 

1 Answer:

Answer 0 (score: 0)

Take a close look at the data structure you are creating. Frankly, it is a mess. At a quick glance, as far as I can tell, it looks like this:

[
{
'prices': [{'value': [u'\xa35.00']}, {'value': [u'\xa35.00']}, {'value': [u'\xa36.00']}
],
'title': ['500mg Nitracaine', '5 x 4mg Flubromazepam Pellets', '1 Bk-2C-B Pellet', '10 0.5mg Pyrazolam Pellets']
}
]

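That is a list containing a dictionary of prices and titles, where the prices are stored as a list of dictionaries that each hold a list, and the title is itself a list.
I think!!!
Just looking at it makes my head hurt. The upshot is that your CSV-writing routine has no hope with the data structured this way. You will have to sort that out before you have any chance of producing what you want.
The other thing is that, even if you change your code to store everything in a usable structure, it makes no allowance for old-price and special-price, because product_node.xpath('.//li/div[2]/div[1]/span') is not the right way to get the prices you want. Instead it returns a subset that depends on where the first old-price is, so the number of prices does not match the number of products.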
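One way to untangle this is to treat each li, rather than each ul, as a product, and to read the title and the price relative to that single node. The sketch below is only an illustration under stated assumptions: the SAMPLE markup is hypothetical (loosely modelled on the XPath expressions in the question, not copied from the real site), extract_rows is a made-up helper name, and picking the last price in the box as the current one is a guess about how old-price and special-price are ordered. It uses the standard library's xml.etree.ElementTree and Python 3 so that it runs without extra packages; the same per-li idea should carry over to lxml's xpath().

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for one category page.
SAMPLE = """
<div class="category-products">
  <ul>
    <li>
      <div>image</div>
      <div>
        <h2><a>Product A</a></h2>
        <div class="price-box">
          <span class="regular-price"><span class="price">5.00</span></span>
        </div>
      </div>
    </li>
    <li>
      <div>image</div>
      <div>
        <h2><a>Product B</a></h2>
        <div class="price-box">
          <p class="old-price"><span class="price">8.00</span></p>
          <p class="special-price"><span class="price">6.00</span></p>
        </div>
      </div>
    </li>
  </ul>
</div>
"""


def extract_rows(markup):
    """Return one (title, price) tuple per <li>, i.e. per product."""
    root = ET.fromstring(markup)
    rows = []
    # Iterate over the individual <li> items, not over the <ul> row.
    for item in root.findall("./ul/li"):
        title = item.findtext("./div[2]/h2/a", default="").strip()
        box = item.find("./div[2]/div[1]")
        prices = []
        if box is not None:
            # Collect only the prices that live inside this product's box.
            prices = [
                span.text.strip()
                for span in box.iter("span")
                if span.get("class") == "price" and span.text
            ]
        # Assumption: when an old and a special price are both present,
        # the last one listed is the price currently charged.
        rows.append((title, prices[-1] if prices else ""))
    return rows


rows = extract_rows(SAMPLE)

# One flat CSV row per product: title, price.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because each product becomes a flat (title, price) pair, the writer needs no nested loops, and the number of prices always matches the number of products.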