Python Spider: matching from a partial match

Time: 2016-05-11 21:31:30

Tags: python

I have a list of all current TV models. As an example, I am using Sony TV models. The official model numbers look like this:

KDL40W705CBAEP KDL55W755CBAEP KD75XD8505BAEP KDL40WD650BAEP

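With my script (see below), I manage to match everything, as long as the text on the web page is a 100% match for the model number.

The problem I have is that the website writes some of the models like this: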

KDL40W705 CBAEP KDL55-W755 CBAEP KD75XD8505 KDL40 WD650BAEP

How can I get those matched as well?
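To make the mismatch concrete, here is a minimal sketch (not part of the original script) that compares the two lists above after removing spaces and hyphens. The helper name `normalize` and the list names `OFFICIAL` and `SITE` are made up for illustration. It assumes that separators are the only noise and that a shortened name is a prefix of the official one.

```python
import re

# The official model numbers and the site's spellings, copied from the question.
OFFICIAL = ["KDL40W705CBAEP", "KDL55W755CBAEP", "KD75XD8505BAEP", "KDL40WD650BAEP"]
SITE = ["KDL40W705 CBAEP", "KDL55-W755 CBAEP", "KD75XD8505", "KDL40 WD650BAEP"]


def normalize(text):
    """Uppercase the text and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


for raw in SITE:
    key = normalize(raw)
    # A full model number matches itself; a shortened one matches as a prefix.
    hits = [model for model in OFFICIAL if model.startswith(key)]
    print(raw, "->", hits)
```

With these eight strings, each site spelling maps to exactly one official model, so normalising both sides before comparing looks like a reasonable starting point.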

from scrapy.spiders import Spider
from scrapy.http import Request
from MediaMarkt.items import MediamarktItem

# Load the official model numbers, one per line, into a set for fast lookup.
with open("Sonytv.txt", "r") as f:
    models = [line.strip("\n\\-") for line in f if line.strip()]
d = set(models)

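def complete_url(path):
    # Turn a site-relative link into an absolute URL.
    return "http://www.mediamarkt.de" + path


def encode(text):
    # Encode scraped unicode text as UTF-8 (a plain str under Python 2).
    return text.encode('utf8', 'ignore')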


class MshbeSpider(Spider):
    name = "mshdetv"
    start_urls = ['http://www.mediamarkt.de/mcs/productlist/_led-lcd-fernseher,48353,460668.html?langId=-3&searchParams=&sort=&view=&page=']

    def parse(self, response):
        items = response.xpath('//ul[@class="products-list"]/li/div')
        for item in items:
            mshtv = MediamarktItem()
            mshtv['item_3_price'] = encode(item.xpath('normalize-space(.//aside/div/div/div/text())').extract()[0]).replace("-","")
            mshtv['item_2_name'] = encode(item.xpath('normalize-space(.//div/h2/a/text())').extract()[0])
            mshtv['item_a_link'] = item.xpath('.//div/h2/a/@href').extract()
            mshtv['item_4_avai'] = encode(item.xpath('normalize-space(.//aside/div/div/ul/li//text())').extract()[0])
            #mshtv['item_1_cat'] = encode(item.xpath('normalize-space(//*[@id="category"]/hgroup/h1/text())').extract()[0])
            # Exact match only: a model is found only when it appears as a
            # single space-delimited word in the product name.
            for word in mshtv['item_2_name'].split(" "):
                if word in d:
                    mshtv['item_model'] = word
            yield mshtv

        # Follow the pagination link only if one exists; the last page may have none.
        new_link = response.xpath('//li[@class="pagination-next"]/a/@href').extract()
        if new_link:
            yield Request(complete_url(new_link[0]), callback=self.parse)
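
One possible way to handle the split and shortened spellings is sketched below. It is not from the original post, and it only assumes things about how the site writes names. It normalises each word of the product name, joins runs of up to `max_tokens` neighbouring words, and looks the result up among the official models. If there is no exact hit, it falls back to an unambiguous prefix of at least `min_prefix` characters. The names `find_model`, `normalize`, `max_tokens` and `min_prefix` are hypothetical, and so are the sample product names.

```python
import re


def normalize(text):
    """Uppercase the text and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def find_model(name, models, max_tokens=3, min_prefix=8):
    """Return the official model mentioned in a product name, or None."""
    by_key = {normalize(model): model for model in models}
    # Normalise word by word so "KDL55-W755" becomes "KDL55W755".
    tokens = [t for t in (normalize(word) for word in name.split()) if t]
    best = None  # (length of matched prefix, model) for the fallback
    for start in range(len(tokens)):
        for size in range(1, max_tokens + 1):
            chunk = tokens[start:start + size]
            if len(chunk) < size:
                break  # ran past the end of the name
            key = "".join(chunk)
            if key in by_key:
                return by_key[key]  # exact match after normalisation
            if len(key) >= min_prefix:
                candidates = [m for k, m in by_key.items() if k.startswith(key)]
                # Accept a prefix only when it points at a single model.
                if len(candidates) == 1 and (best is None or len(key) > best[0]):
                    best = (len(key), candidates[0])
    return best[1] if best else None


MODELS = ["KDL40W705CBAEP", "KDL55W755CBAEP", "KD75XD8505BAEP", "KDL40WD650BAEP"]
for product in ["SONY KDL55-W755 CBAEP LED TV", "SONY KD75XD8505 LED TV"]:
    print(product, "->", find_model(product, MODELS))
```

In the spider, the `for word in ...` loop could be replaced by a single call such as `find_model(mshtv['item_2_name'], models)`. The prefix fallback is a trade-off: a short `min_prefix` catches more shortened names but raises the risk of false matches, and a prefix shared by several models is deliberately ignored.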

0 answers:

No answers yet