I'm having trouble understanding how the link pattern works in Scrapy. I was given the example below. Does anyone have any idea how to write one?
import re
import scrapy
from scrapy import Request
# MS_homeItem is the project's own Item class; import it from your items module.

def parse(self, response):
    hxs = scrapy.Selector(response)
    links = hxs.xpath("//a/@href").extract()
    # Links already seen on this page (the list is rebuilt on every call to parse)
    crawledLinks = []
    # Pattern to check for a proper absolute link (raw string so the backslashes reach re unchanged)
    linkPattern = re.compile(r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")
    for link in links:
        # If it is a proper link and has not been checked yet, yield it to the spider
        if linkPattern.match(link) and link not in crawledLinks:
            crawledLinks.append(link)
            yield Request(link, callback=self.parse)
    item = MS_homeItem()
    item['name'] = hxs.xpath('//*[@id="product-detail-page"]/li[4]/div/div[2]/h1').extract()
    yield item
Any help would be great. Thanks, James
Answer 0: (score: 0)
Python supports a very common feature called regular expressions, through its re module. Here is a simpler example:
import re

strings = [
    "cat",
    "cut",
    "bird",
    "catnip",
    "cute",
]

pattern = r"c.t"
regex = re.compile(pattern)

for string in strings:
    if regex.match(string):
        result = 'yes'
    else:
        result = 'no'
    print("{} => {}".format(string, result))
--output:--
cat => yes
cut => yes
bird => no
catnip => yes
cute => yes
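In the pattern, . matches any single character except a newline. Because regex.match() only anchors the pattern at the start of the string, "catnip" and "cute" also count as matches: their first three characters fit c.t, and whatever follows is ignored.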
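To see why "catnip" counts as a match, it can help to compare re.match with re.search and re.fullmatch. The following is only a minimal sketch that reuses the c.t pattern; the describe helper and the extra word "scatter" are my own additions for illustration.

```python
import re

regex = re.compile(r"c.t")

def describe(s):
    """Report how the three matching functions treat the string s."""
    return {
        "match": bool(regex.match(s)),          # pattern must fit at the start
        "search": bool(regex.search(s)),        # pattern may fit anywhere
        "fullmatch": bool(regex.fullmatch(s)),  # pattern must cover the whole string
    }

for s in ["cat", "catnip", "scatter", "bird"]:
    print("{} => {}".format(s, describe(s)))
```

If you want a pattern to reject trailing characters when using match, end it with $ (as linkPattern does) or use fullmatch.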
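linkPattern is a much more complex pattern. Read left to right, it matches an absolute URL that starts with ftp://, http://, or https://, optionally followed by credentials (user:password@), then a host name made of lowercase letters, digits, hyphens, and dots, an optional port such as :8080, and an optional path or query string. The ^ and $ anchors mean the whole link has to fit the pattern. URLs can get very complicated, as you can read here.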
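One way to get a feel for linkPattern is to run it against a handful of links outside of Scrapy. This is just a sketch: the pattern is copied from the question, while the sample URLs and the is_proper_link helper are hypothetical and exist only for this illustration.

```python
import re

# The pattern from the question, written as a raw string.
linkPattern = re.compile(
    r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$"
)

# Hypothetical sample links, similar to what //a/@href might return.
samples = [
    "http://example.com",
    "https://example.com/products/item?id=42",
    "ftp://user:secret@files.example.com:21/",
    "/relative/path",
    "mailto:someone@example.com",
    "http://Example.com",
]

def is_proper_link(link):
    """Return True if the link fits linkPattern from start to end."""
    return linkPattern.match(link) is not None

for link in samples:
    print("{} => {}".format(link, "yes" if is_proper_link(link) else "no"))
```

The first three samples are accepted and the last three are rejected. Two things are worth noticing: relative links such as /relative/path never match, so the spider in the question silently skips them, and the host part only allows lowercase letters, so http://Example.com is rejected unless the pattern is compiled with re.IGNORECASE. If relative links matter to you, Scrapy's response.urljoin(link) can turn them into absolute URLs before the check.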