How can I use Scrapy to get a person's followers and the comments on their Instagram photos?

Time: 2017-03-02 01:00:34

Tags: python json web-scraping scrapy scrapy-spider

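As you can see, the following JSON contains the follower count as well as the number of comments on each post, but how can I access the data inside each comment and the IDs of the followers so that I can crawl them?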

{
    "logging_page_id": "profilePage_20327023", 
    "user": {
        "biography": null, 
        "blocked_by_viewer": false, 
        "connected_fb_page": null, 
        "country_block": false, 
        "external_url": null, 
        "external_url_linkshimmed": null, 
        "followed_by": {
            "count": 2585
        }, 
        "followed_by_viewer": false, 
        "follows": {
            "count": 561
        }, 
        "follows_viewer": false, 
        "full_name": "LeAnne Barengo", 
        "has_blocked_viewer": false, 
        "has_requested_viewer": false, 
        "id": "20327023", 
        "is_private": false, 
        "is_verified": false, 
        "media": {
            "count": 1904, 
            "nodes": [
                {
                    "__typename": "GraphImage", 
                    "caption": "The coop was literally blown away. #wtf #sicktomystomach", 
                    "code": "BRHDfFHAUg3", 
                    "comments": {
                        "count": 18
                    }, 
                    "comments_disabled": false, 
                    "date": 1488402905, 
                    "dimensions": {
                        "height": 1080, 
                        "width": 1080
                    }, 
                    "display_src": "https://scontent.cdninstagram.com/t51.2885-15/e35/16908727_1139679066131441_6607786783801344000_n.jpg", 
                    "id": "1461151934034561079", 
                    "is_video": false, 
                    "likes": {
                        "count": 46
                    }, 
                    "owner": {
                        "id": "20327023"
                    }, 
                    "thumbnail_src": "https://scontent.cdninstagram.com/t51.2885-15/s640x640/sh0.08/e35/16908727_1139679066131441_6607786783801344000_n.jpg"

Here is the code related to this:

import scrapy
print(scrapy.__file__)
import json
from instagram.items import UserItem
from instagram.items import PostItem
from scrapy.spider import BaseSpider as Spider

class InstagramSpider(Spider):

    name = 'instagramspider'
    allowed_domains = ['instagram.com']
    start_urls = []

    def __init__(self):
        #self.start_urls = ["https://www.instagram.com/_spataru/?__a=1"]
        #self.start_urls = ["https://www.instagram.com/mona_of_green_gables/?__a=1"]
        self.start_urls = ["https://www.instagram.com/ducks_love_sun/?__a=1"]
    def parse(self, response):
        #get the json file
        json_response = {}
        try:
            json_response = json.loads(response.body_as_unicode())
            print json.dumps(json_response, indent=4, sort_keys=True)

        except:
            self.logger.info('%s doesnt exist', response.url)
            pass
        if json_response["user"]["is_private"]:
            return;
        #check if the username even worked
        try:
            json_response = json_response["user"]

            item = UserItem()


            #get User Info
            item["username"] = json_response["username"]
            item["follows_count"] = json_response["follows"]["count"]
            item["followed_by_count"] = json_response["followed_by"]["count"]
            item["is_verified"] = json_response["is_verified"]
            item["biography"] = json_response.get("biography")
            item["external_link"] = json_response.get("external_url")
            item["full_name"] = json_response.get("full_name")
            item["posts_count"] = json_response.get("media").get("count")

            # iterate through each post
            item["posts"] = []

            json_response = json_response.get("media").get("nodes")
            if json_response:
                for post in json_response:
                    items_post = PostItem()
                    items_post["code"]=post["code"]
                    items_post["likes"]=post["likes"]["count"]
                    items_post["caption"]=post["caption"]
                    items_post["thumbnail"]=post["thumbnail_src"]
                    item["posts"].append(dict(items_post))

            return item
        except:
            self.logger.info("Error during parsing %s", response.url)

When I use something like

print json.dumps(json_response["user"]["media"]["nodes"][1]["comments"][1], indent=4, sort_keys=True)

I get this error:

mona@pascal:~/computer_vision/instagram/instagram$ scrapy crawl instagramspider
2017-03-01 20:38:39-0600 [scrapy] INFO: Scrapy 0.14.4 started (bot: instagram)
2017-03-01 20:38:39-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
/usr/lib/python2.7/dist-packages/scrapy/__init__.pyc
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Enabled item pipelines: 
2017-03-01 20:38:40-0600 [instagramspider] INFO: Spider opened
2017-03-01 20:38:40-0600 [instagramspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2017-03-01 20:38:40-0600 [instagramspider] DEBUG: Crawled (200) <GET https://www.instagram.com/ducks_love_sun/?__a=1> (referer: None)
monamona
2017-03-01 20:38:40-0600 [instagramspider] ERROR: Spider error processing <GET https://www.instagram.com/ducks_love_sun/?__a=1>
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 1201, in mainLoop
        self.runUntilCurrent()
      File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 380, in callback
        self._startRunCallbacks(result)
      File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/mona/computer_vision/instagram/instagram/instagram/spiders/spider.py", line 31, in parse
        self.logger.info('%s doesnt exist', response.url)
    exceptions.AttributeError: 'InstagramSpider' object has no attribute 'logger'

2017-03-01 20:38:40-0600 [instagramspider] INFO: Closing spider (finished)
2017-03-01 20:38:40-0600 [instagramspider] INFO: Dumping spider stats:
    {'downloader/request_bytes': 223,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 2985,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2017, 3, 2, 2, 38, 40, 460277),
     'scheduler/memory_enqueued': 1,
     'spider_exceptions/AttributeError': 1,
     'start_time': datetime.datetime(2017, 3, 2, 2, 38, 40, 206785)}
2017-03-01 20:38:40-0600 [instagramspider] INFO: Spider closed (finished)
2017-03-01 20:38:40-0600 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 120844288, 'memusage/startup': 120844288}

With

print json.dumps(json_response["user"]["media"]["nodes"][1]["comments"], indent=4, sort_keys=True)

I get:

{
    "count": 19
}

1 Answer:

Answer 0: (score: 0)

Using code similar to the following, I was able to grab the text of a comment:

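import requests
from pprint import pprint

# Short code of the second post listed in the profile JSON
code = json_response["user"]["media"]["nodes"][1]["code"]
# The per-post JSON carries the comment nodes, not just a count
sub_post_url = "https://www.instagram.com/p/" + code + "/?__a=1"
print(sub_post_url)
sub_response = requests.get(sub_post_url).json()
pprint(sub_response)
# Text of the eighth comment (index 7)
print(sub_response["media"]["comments"]["nodes"][7]["text"])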
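The snippet above handles a single post (index 1). As a rough sketch of the same URL-building step applied to every post in the profile JSON, something like the following might work. The helper name `post_json_urls` and the trimmed-down sample payload are hypothetical; the key layout (`user` -> `media` -> `nodes` -> `code`) and the `?__a=1` URL pattern are taken from the question and the snippet above, and the endpoint may not behave the same way today.

```python
def post_json_urls(profile_json):
    """Return one '?__a=1' post URL per media node in the profile JSON."""
    # Key layout (user -> media -> nodes -> code) follows the JSON in the question.
    user = profile_json.get("user") or {}
    media = user.get("media") or {}
    nodes = media.get("nodes") or []
    return [
        "https://www.instagram.com/p/{}/?__a=1".format(node["code"])
        for node in nodes
        if node.get("code")  # skip nodes that carry no short code
    ]


# Hypothetical, trimmed-down profile payload; only the "code" value
# is copied from the JSON in the question.
sample_profile = {"user": {"media": {"nodes": [{"code": "BRHDfFHAUg3"}, {}]}}}
urls = post_json_urls(sample_profile)
print(urls)
```

Each resulting URL could then be fetched the same way as `sub_post_url` above.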

To iterate over all the comments instead of printing a single one:

# Print the text of every comment on the post
for comment in sub_response["media"]["comments"]["nodes"]:
    print(comment["text"])
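A note on why the attempt in the question fails: the profile JSON only holds `{"count": ...}` under `comments`, so indexing it with `[1]` raises a `KeyError`. In the spider that exception is most likely swallowed by the bare `except`, and the `self.logger` call inside it then fails with the `AttributeError` shown in the traceback, apparently because the Scrapy 0.14.4 in the log does not provide that attribute. Below is a minimal sketch of a more tolerant extraction. The helper name `comment_texts` and both sample payloads are made up for illustration; only the key layout (`media` -> `comments` -> `nodes` -> `text`) comes from the snippets above.

```python
def comment_texts(post_json):
    """Return the text of every comment found in a per-post JSON payload."""
    # Layout media -> comments -> nodes -> text follows the snippets above.
    comments = (post_json.get("media") or {}).get("comments") or {}
    # "nodes" is absent when only a count is provided, as in the profile JSON.
    return [node["text"] for node in comments.get("nodes", []) if "text" in node]


# Hypothetical payloads for illustration; the comment texts are invented.
post_payload = {
    "media": {
        "comments": {"count": 2, "nodes": [{"text": "first"}, {"text": "second"}]}
    }
}
count_only = {"media": {"comments": {"count": 19}}}

print(comment_texts(post_payload))
print(comment_texts(count_only))

# Indexing the count-only dict the way the question does raises KeyError,
# because the dict has no key 1.
try:
    count_only["media"]["comments"][1]
except KeyError as exc:
    print("KeyError:", exc)
```

Returning an empty list for a count-only payload keeps the caller from having to distinguish between "no comments" and "comments not included in this response"; whether that is the right trade-off depends on how the results are used.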