I am scraping a local page_source file. Scrapy completely skips the parse_nextfile() function, but it does run the parse() function. I don't understand why this happens.
from scrapy import Spider
from scrapy.loader import ItemLoader
from linkedin.items import LinkedinItem
import glob, os

class ProfilesSpider(Spider):
    name = 'profiles'
    allowed_domains = ["file://127.0.0.1"]
    start_urls = ["file://127.0.0.1/path/to/file/text.txt"]

    def parse_nextfile(self, response):
        # retrieve local files directory
        request(url, callback=self.parse)

    def parse(self, response):
        # scraping the page_source file
Answer 0 (score: 0)
parse is the default callback for any Scrapy Request. If you want a different method to handle a response, you have to name that method as the callback of the Request you yield. Since nothing in your spider ever schedules a Request with parse_nextfile as its callback, Scrapy never calls it.