I'm using Scrapy to crawl URLs from a website. Currently it returns all URLs, but I'd like it to return only the URLs that contain the word "download". How can I do that?
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy

DOMAIN = 'somedomain.com'
URL = 'http://' + str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)
EDIT:
I implemented the suggestion below. It still throws some errors, but at least it now returns only the links containing "download".
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy
from scrapy.linkextractors import LinkExtractor

DOMAIN = 'somedomain.com'
URL = 'http://' + str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    # The first parse returns all the links of the website and feeds them to parse2
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            yield Request(url, callback=self.parse2)

    # The second parse selects only the links that contain "download"
    def parse2(self, response):
        le = LinkExtractor(allow=("download"))
        for link in le.extract_links(response):
            yield Request(url=link.url, callback=self.parse2)
            print link.url
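A likely source of the remaining errors (this is an editor's assumption, not something stated in the post) is the manual concatenation URL + url, which produces malformed URLs for hrefs that are not root-relative, such as ../page or page.html. A minimal sketch of the first callback using response.urljoin, which resolves any href against the page it was found on, assuming Scrapy 1.0+ where response.xpath and response.urljoin are available:

    def parse(self, response):
        # resolve every href against the URL of the page it appeared on,
        # instead of blindly prefixing the site root
        for url in response.xpath('//a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse2)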
Answer 0 (score: 2)
A more Pythonic and cleaner solution is to use LinkExtractor:
from scrapy.linkextractors import LinkExtractor
...
le = LinkExtractor(allow="download")
for link in le.extract_links(response):
    yield Request(url=link.url, callback=self.parse)
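For completeness, LinkExtractor pairs naturally with CrawlSpider rules, which remove the need for a hand-written link-following loop altogether. A hedged sketch of that pattern; the spider name and the parse_item callback are placeholders chosen for illustration, not from the original post:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DownloadSpider(CrawlSpider):
        # placeholder names; substitute the real domain
        name = 'downloads'
        allowed_domains = ['somedomain.com']
        start_urls = ['http://somedomain.com']

        rules = (
            # follow only links whose URL matches the regex 'download'
            Rule(LinkExtractor(allow=r'download'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # every response here came from a link containing 'download'
            print(response.url)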
Answer 1 (score: 1)
You are trying to check whether a substring exists within a string.

Example:
>>> string = 'this is a simple string'
>>> 'simple' in string
True
>>> 'zimple' in string
False
So you only need to add an if statement like:

if 'download' in url:

after the line:

for url in hxs.select('//a/@href').extract():
That is:
for url in hxs.select('//a/@href').extract():
    if 'download' in url:
        if not (url.startswith('http://') or url.startswith('https://')):
            url = URL + url
        print url
        yield Request(url, callback=self.parse)
This way, the code will only check whether the link starts with http:// when the condition 'download' in url returns True.
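One caveat, an editor's assumption about the target site rather than something from the original answer: the in check is case-sensitive, so a page linking to /Download/file.zip would be skipped. Lower-casing the href first is a cheap safeguard:

    for url in hxs.select('//a/@href').extract():
        # lower-case the href so 'Download' and 'DOWNLOAD' also match
        if 'download' in url.lower():
            yield Request(url, callback=self.parse)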