Why does Scrapy crawl a different Facebook page? And why is the file empty?

Time: 2016-08-13 16:48:15

Tags: python scrapy

This is a Scrapy spider. It is supposed to collect the names from all div nodes whose class attribute is _5d-5, essentially building a list of all people named x from location y, but the file it creates is empty. Here is the code:

import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class fb_spider(scrapy.Spider):
    name = "fb"
    allowed_domains = ["facebook.com"]
    start_urls = [
        "https://www.facebook.com/search/people/?q=jaslyn%20california"]

    def parse(self, response):
        x = response.xpath('//div[@class="_5d-5"]'.extract())
        with open("asdf.txt", 'wb') as f:
            f.write(u"".join(x).encode("UTF-8"))

In the command prompt I see this line:

    2016-08-14 12:06:57 [scrapy] DEBUG: Forbidden by robots.txt: <GET https://www.facebook.com/search/top/?q=jaslyn%20california>
    2016-08-14 12:06:58 [scrapy] INFO: Closing spider (finished)

It says the request is forbidden by robots.txt. Something is fishy here: this URL is not the same as the one I programmed the spider to crawl.
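Scrapy's robots.txt handling is controlled by a single flag in the project's settings.py, so the change is one line:

    # settings.py -- tell Scrapy not to honour facebook.com's robots.txt
    ROBOTSTXT_OBEY = False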

With ROBOTSTXT_OBEY = False in place the request is no longer blocked, but the spider still crawls a page slightly different from the URL I specified. Now I get this:

    2016-08-14 12:55:37 [scrapy] DEBUG: Crawled (200) <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
    2016-08-14 12:55:38 [scrapy] ERROR: Spider error processing <GET https://www.facebook.com/search/top/?q=jaslyn%20california> (referer: None)
    Traceback (most recent call last):
      File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "C:\Users\admin\Downloads\dist\Scrapy-1.1.1\scrapy\fb\fb\spiders\fbspider.py", line 15, in parse
    NameError: global name 'x' is not defined
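
Update: looking at the traceback again, the .extract() in my parse is attached to the XPath string literal instead of to the selector returned by response.xpath(), so the assignment to x can never succeed. If that is really what line 15 refers to, a corrected parse would presumably look like this (untested sketch):

    def parse(self, response):
        # call extract() on the SelectorList, not on the XPath string
        x = response.xpath('//div[@class="_5d-5"]').extract()
        with open("asdf.txt", 'wb') as f:
            f.write(u"".join(x).encode("UTF-8"))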

0 Answers:

No answers yet.