How can I get my rule to work in my CrawlSpider and follow links? I added the rule below, but nothing happens and I don't get any errors either. The URLs I want the rule to match are included as comments in the rule code.
Rule #1
Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item',follow=True)
# looking for domains like in my rule:
#http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
#http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
I also tried the rule below, but again nothing works and there are no errors:
Rule #2
rules = (
    Rule(SgmlLinkExtractor(allow=('\/company\/[0-9][0-9][0-9][0-9]\?',)), callback='parse_item'),
)
Code
class LinkedPySpider(CrawlSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results"]

    Rule(SgmlLinkExtractor(allow=r'\/company\/.*\?goback=.*'), callback='parse_item', follow=True)
    # looking for domains like in my rule:
    #http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits
    #http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # def init_request(self):
    #     """This function is called before crawling starts."""
    #     return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'session_key': 'yescobar2012@gmail.com', 'session_password': 'yescobar01'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        # """Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            self.log('Hi, this is an response page! %s' % response.url)
            return Request(url='http://www.linkedin.com/csearch/results')
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedconvItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items
Output
C:\Users\ye831c\Documents\Big Data\Scrapy\linkedconv>scrapy crawl LinkedPy
2013-07-15 12:05:15-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedconv)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Spider opened
2013-07-15 12:05:15-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-15 12:05:15-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
2013-07-15 12:05:16-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-15 12:05:17-0500 [LinkedPy] DEBUG: Hi, this is an item page! http://www.linkedin.com/nhome/
2013-07-15 12:05:18-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/csearch/results> (referer: http://www.linkedin.com/nhome/)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Closing spider (finished)
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2171,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 87904,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 15, 17, 5, 18, 941000),
'log_count/DEBUG': 12,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 7, 15, 17, 5, 15, 820000)}
2013-07-15 12:05:18-0500 [LinkedPy] INFO: Spider closed (finished)
Answer 0 (score: 2)
SgmlLinkExtractor uses Python's re module to find matches in the link URLs: whatever you pass in allow= is run through re.compile(), and every link found on the page is then checked with _matches, which calls .search() on the compiled regexes:

_matches = lambda url, regexs: any((r.search(url) for r in regexs))

See https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py

When I test your regexes in a Python shell, they both work: they return an SRE_Match object for URL 1 and URL 2 (I added a failing regex for comparison):

>>> import re
>>> url1 = 'http://www.linkedin.com/company/1009?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> url2 = 'http://www.linkedin.com/company/1033?goback=.fcs_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2&trk=ncsrch_hits'
>>> regex1 = re.compile(r'\/company\/.*\?goback=.*')
>>> regex2 = re.compile('\/company\/[0-9][0-9][0-9][0-9]\?')
>>> regex_fail = re.compile(r'\/company\/.*\?gobackbogus=.*')
>>> regex1.search(url1)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url1)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url1)
>>> regex1.search(url2)
<_sre.SRE_Match object at 0xe6c308>
>>> regex2.search(url2)
<_sre.SRE_Match object at 0xe6c2a0>
>>> regex_fail.search(url2)
>>>
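So the regexes themselves are fine. If you want to go one step further and exercise the link extractor itself rather than the bare patterns, a quick check from the Scrapy shell is sketched below; this is only a sketch, assuming you run scrapy shell on a page you can reach while logged in, and it uses the Scrapy 0.16 import path from the file linked above:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# inside `scrapy shell <url>`, `response` is the page that was fetched
extractor = SgmlLinkExtractor(allow=(r'\/company\/.*\?goback=.*',))
for link in extractor.extract_links(response):
    print link.url  # every URL this rule would actually extract from the page

If this prints nothing for the search-results page, the extractor is not the problem; the links simply are not in the HTML it receives.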
To check whether your page contains links at all (and that everything is not generated by Javascript), I would add a very generic Rule matching every link (set allow=() or don't set allow at all); a minimal sketch is given below.
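As a sketch only, assuming the same imports (Rule, SgmlLinkExtractor) and the same parse_item callback as in your spider (note that rules is a class attribute of the spider):

rules = (
    # follow every link found and hand each crawled page to parse_item
    Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
)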
See http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor
But in the end, you're probably better off using the LinkedIn API for company search: http://developer.linkedin.com/documents/company-search