As you can see, the JSON below contains follower and comment counts, but how do I access the data inside each comment, and the followers' IDs, so that I can scrape them?
{
    "logging_page_id": "profilePage_20327023",
    "user": {
        "biography": null,
        "blocked_by_viewer": false,
        "connected_fb_page": null,
        "country_block": false,
        "external_url": null,
        "external_url_linkshimmed": null,
        "followed_by": {
            "count": 2585
        },
        "followed_by_viewer": false,
        "follows": {
            "count": 561
        },
        "follows_viewer": false,
        "full_name": "LeAnne Barengo",
        "has_blocked_viewer": false,
        "has_requested_viewer": false,
        "id": "20327023",
        "is_private": false,
        "is_verified": false,
        "media": {
            "count": 1904,
            "nodes": [
                {
                    "__typename": "GraphImage",
                    "caption": "The coop was literally blown away. #wtf #sicktomystomach",
                    "code": "BRHDfFHAUg3",
                    "comments": {
                        "count": 18
                    },
                    "comments_disabled": false,
                    "date": 1488402905,
                    "dimensions": {
                        "height": 1080,
                        "width": 1080
                    },
                    "display_src": "https://scontent.cdninstagram.com/t51.2885-15/e35/16908727_1139679066131441_6607786783801344000_n.jpg",
                    "id": "1461151934034561079",
                    "is_video": false,
                    "likes": {
                        "count": 46
                    },
                    "owner": {
                        "id": "20327023"
                    },
                    "thumbnail_src": "https://scontent.cdninstagram.com/t51.2885-15/s640x640/sh0.08/e35/16908727_1139679066131441_6607786783801344000_n.jpg"
Here is the code that goes with this:
import scrapy
print(scrapy.__file__)
import json
from instagram.items import UserItem
from instagram.items import PostItem
from scrapy.spider import BaseSpider as Spider

class InstagramSpider(Spider):
    name = 'instagramspider'
    allowed_domains = ['instagram.com']
    start_urls = []

    def __init__(self):
        #self.start_urls = ["https://www.instagram.com/_spataru/?__a=1"]
        #self.start_urls = ["https://www.instagram.com/mona_of_green_gables/?__a=1"]
        self.start_urls = ["https://www.instagram.com/ducks_love_sun/?__a=1"]

    def parse(self, response):
        # get the json file
        json_response = {}
        try:
            json_response = json.loads(response.body_as_unicode())
            print json.dumps(json_response, indent=4, sort_keys=True)
        except:
            self.logger.info("%s doesn't exist", response.url)
            pass
        if json_response["user"]["is_private"]:
            return
        # check if the username even worked
        try:
            json_response = json_response["user"]
            item = UserItem()
            # get user info
            item["username"] = json_response["username"]
            item["follows_count"] = json_response["follows"]["count"]
            item["followed_by_count"] = json_response["followed_by"]["count"]
            item["is_verified"] = json_response["is_verified"]
            item["biography"] = json_response.get("biography")
            item["external_link"] = json_response.get("external_url")
            item["full_name"] = json_response.get("full_name")
            item["posts_count"] = json_response.get("media").get("count")
            # iterate through each post
            item["posts"] = []
            json_response = json_response.get("media").get("nodes")
            if json_response:
                for post in json_response:
                    items_post = PostItem()
                    items_post["code"] = post["code"]
                    items_post["likes"] = post["likes"]["count"]
                    items_post["caption"] = post["caption"]
                    items_post["thumbnail"] = post["thumbnail_src"]
                    item["posts"].append(dict(items_post))
            return item
        except:
            self.logger.info("Error during parsing %s", response.url)
When I use print json.dumps(json_response["user"]["media"]["nodes"][1]["comments"][1], indent=4, sort_keys=True)
I get this error:
mona@pascal:~/computer_vision/instagram/instagram$ scrapy crawl instagramspider
2017-03-01 20:38:39-0600 [scrapy] INFO: Scrapy 0.14.4 started (bot: instagram)
2017-03-01 20:38:39-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
/usr/lib/python2.7/dist-packages/scrapy/__init__.pyc
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Enabled item pipelines:
2017-03-01 20:38:40-0600 [instagramspider] INFO: Spider opened
2017-03-01 20:38:40-0600 [instagramspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2017-03-01 20:38:40-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2017-03-01 20:38:40-0600 [instagramspider] DEBUG: Crawled (200) <GET https://www.instagram.com/ducks_love_sun/?__a=1> (referer: None)
monamona
2017-03-01 20:38:40-0600 [instagramspider] ERROR: Spider error processing <GET https://www.instagram.com/ducks_love_sun/?__a=1>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 380, in callback
self._startRunCallbacks(result)
File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/Twisted-13.1.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/mona/computer_vision/instagram/instagram/instagram/spiders/spider.py", line 31, in parse
self.logger.info('%s doesnt exist', response.url)
exceptions.AttributeError: 'InstagramSpider' object has no attribute 'logger'
2017-03-01 20:38:40-0600 [instagramspider] INFO: Closing spider (finished)
2017-03-01 20:38:40-0600 [instagramspider] INFO: Dumping spider stats:
{'downloader/request_bytes': 223,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2985,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 3, 2, 2, 38, 40, 460277),
'scheduler/memory_enqueued': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2017, 3, 2, 2, 38, 40, 206785)}
2017-03-01 20:38:40-0600 [instagramspider] INFO: Spider closed (finished)
2017-03-01 20:38:40-0600 [scrapy] INFO: Dumping global stats:
{'memusage/max': 120844288, 'memusage/startup': 120844288}
whereas with print json.dumps(json_response["user"]["media"]["nodes"][1]["comments"], indent=4, sort_keys=True)
I get:
{
    "count": 19
}
Answer 0 (score: 0)
I was able to grab the text of the comments by using code similar to the following:
import requests
from pprint import pprint

code = json_response["user"]["media"]["nodes"][1]["code"]
sub_post_url = "https://www.instagram.com/p/" + code + "/?__a=1"
print(sub_post_url)
sub_response = requests.get(sub_post_url).json()
pprint(sub_response)
print(sub_response["media"]["comments"]["nodes"][7]["text"])
or by iterating over all the comments like this:
for node in sub_response["media"]["comments"]["nodes"]:
    print(node["text"])
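The lookups above can be wrapped in two small helpers so the URL building and comment extraction are reusable per post. This is only a sketch assuming the old `?__a=1` JSON shape shown in the question (Instagram has since changed this endpoint), and `post_url` / `extract_comment_texts` are hypothetical helper names, not part of any library:

```python
def post_url(code):
    """Build the per-post JSON URL from a post's shortcode (old ?__a=1 endpoint)."""
    return "https://www.instagram.com/p/" + code + "/?__a=1"

def extract_comment_texts(post_json):
    """Return the text of every comment node in a post's JSON payload.

    Uses .get() with defaults at each level so a missing key yields an
    empty list instead of a KeyError.
    """
    nodes = post_json.get("media", {}).get("comments", {}).get("nodes", [])
    return [node["text"] for node in nodes]

# usage (network call, needs the `requests` package):
#   import requests
#   texts = extract_comment_texts(requests.get(post_url("BRHDfFHAUg3")).json())
```

Note that the profile feed only carries the comment count per post; the individual comment nodes are only present in the per-post response, which is why the extra request per shortcode is needed.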