I don't understand how to write a link pattern for Scrapy

Asked: 2016-04-28 08:39:16

Tags: python python-2.7 web-scraping scrapy

I'm having a hard time understanding how link patterns work in Scrapy. I was given the example below. Does anyone have any idea how to write one?

def parse(self, response):
    hxs = scrapy.Selector(response)
    links = hxs.xpath("//a/@href").extract()
    # Already-crawled links are stored in this list
    crawledLinks = []

    # Pattern to check for a proper link
    linkPattern = re.compile(r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

    for link in links:
        # If it is a proper link and has not been checked yet, yield it to the spider
        if linkPattern.match(link) and link not in crawledLinks:
            crawledLinks.append(link)
            yield Request(link, self.parse)

    item = MS_homeItem()
    item['name'] = hxs.xpath('//*[@id="product-detail-page"]/li[4]/div/div[2]/h1').extract()
    yield item

Any help would be great. Thanks, James

1 Answer:

Answer 0 (score: 0)

Python supports a very common feature called regular expressions. Here is a simpler example:

import re

strings = [
    "cat",
    "cut",
    "bird",
    "catnip",
    "cute",
]

pattern = r"c.t"
regex = re.compile(pattern)

for string in strings:
    if regex.match(string):
        result = 'yes'
    else:
        result = 'no'

    print("{} => {}".format(string, result))

--output:--
cat => yes
cut => yes
bird => no
catnip => yes
cute => yes

In a pattern, . matches any single character.
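The output above reports catnip and cute as matches even though they are longer than three characters. That is because match() only requires the pattern to fit at the start of the string. The following is a small sketch of the difference between match(), search() and an anchored pattern; the names words, prefix, whole and classify are made up for this illustration.

```python
import re

words = ["cat", "catnip", "scat"]

prefix = re.compile(r"c.t")    # used with match(): must fit at the start
whole = re.compile(r"^c.t$")   # ^ and $ require the whole string to fit


def classify(word):
    """Return (match at start, match anywhere, match whole string)."""
    return (
        bool(prefix.match(word)),
        bool(prefix.search(word)),
        bool(whole.match(word)),
    )


for word in words:
    print("{} => start={} anywhere={} whole={}".format(word, *classify(word)))
```

The linkPattern in the question ends with $, which is why it only accepts strings that are a complete URL rather than strings that merely begin with one.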

linkPattern is a much more complex pattern that can match things like:

http://www.google.com

as well as more complicated URLs. URLs can get very complicated, as you can read here.
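One way to read linkPattern is to spread it over several lines with re.VERBOSE and label each piece. The sketch below does that and checks the result against the original pattern on a few sample strings. The names ORIGINAL, ANNOTATED, samples and is_proper_link, and the example.com URLs, are assumptions made for this illustration and do not come from the question.

```python
import re

# The pattern from the question, unchanged apart from the raw-string prefix.
ORIGINAL = re.compile(r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

# The same pattern written with re.VERBOSE so each piece can be labelled.
ANNOTATED = re.compile(r"""
    ^(?:ftp|http|https)://          # scheme: ftp, http or https
    (?:[\w.\-+]+:?[\w.\-+]*@)?      # optional user:password@
    [a-z0-9\-.]+                    # host: lowercase letters, digits, - and .
    (?::[0-9]+)?                    # optional :port
    (?:                             # optional remainder, one of:
        /                           #   a single trailing slash
      | /[\w#!:.?+=&%@\-/()]+       #   a path, possibly with a query string
      | \?[\w#!:.?+=&%@\-/()]+      #   a query string directly after the host
    )?$
""", re.VERBOSE)


def is_proper_link(link):
    """Return True if the link is an absolute ftp/http/https URL."""
    return bool(ANNOTATED.match(link))


samples = [
    "http://www.google.com",
    "https://example.com/products?page=2",
    "ftp://user:pass@ftp.example.com:21/",
    "/products/42",                  # relative link: no scheme
    "http://Example.com",            # uppercase letter in the host
    "mailto:someone@example.com",    # scheme not in the list
]

for link in samples:
    print("{} => {}".format(link, "yes" if is_proper_link(link) else "no"))
```

The first three samples are accepted and the last three are rejected. Relative links such as /products/42 are common in href attributes, so a spider that filters with this pattern would skip them; one option is to resolve them against the page URL first (for example with the standard library's urljoin) before testing them.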