This is a Scrapy spider. It is supposed to collect the names from all div nodes whose class attribute is 5d-5, essentially building a list of everyone named x from location y, but the file it creates is empty. Here is the code:
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class fb_spider(scrapy.Spider):
    name = "fb"
    allowed_domains = ["facebook.com"]
    start_urls = [
        "https://www.facebook.com/search/people/?q=jaslyn%20california"]

    def parse(self, response):
        x = response.xpath('//div[@class="_5d-5"]'.extract())
        with open("asdf.txt", 'wb') as f:
            f.write(u"".join(x).encode("UTF-8"))
In the command prompt, I see this line:
2016-08-14 12:06:57 [scrapy] DEBUG: Forbidden by robots.txt: <GET https://www.facebook.com/search/top/?q=jaslyn%20california>
2016-08-14 12:06:58 [scrapy] INFO: Closing spider (finished)
It says the request was forbidden by robots.txt. Something is suspicious here: this URL is not the same as the one I told the spider to crawl.
I changed settings.py to ROBOTSTXT_OBEY = False, but the spider still crawled a page slightly different from the specified URL. Now I get:
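For reference, the robots.txt override lives in the project's settings.py. The change from /search/people/ to /search/top/ is presumably Facebook issuing a server-side redirect, which Scrapy's RedirectMiddleware follows by default, so the spider ends up fetching a different page than the one in start_urls. The relevant setting looks like this:

```python
# settings.py of the Scrapy project

# Stop the RobotsTxtMiddleware from dropping requests that robots.txt
# disallows (check the site's terms of service before doing this)
ROBOTSTXT_OBEY = False
```

If following the redirect is itself the problem, a request can opt out with meta={'dont_redirect': True}, though a redirect like this often just means the page requires a logged-in session.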
2016-08-14 12:55:37 [scrapy] DEBUG: Crawled (200) <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
2016-08-14 12:55:38 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\admin\Downloads\dist\Scrapy-1.1.1\scrapy\fb\fb\spiders\fbspider.py", line 15, in parse
NameError: global name 'x' is not defined