Question

我在json中获得了一个twitter流数据文件。现在我试图在python中加载它：

import json

tweets_data=[]
tweets_file=open('test1.txt',"r")
for line in tweets_file:
    try:
        tweet=json.load(line)
        tweets_data.append(tweet)
    except:
        continue 

print(len(tweets_data))

结果始终为0.如果＆＃34;尝试＆＃34;和＆＃34;除了＆＃34;被删除，然后错误是＆＃34; ValueError：期望值：第2行第1列（字符1）＆＃34;。但是，根据在线验证器，文件的每一行都是有效的JSON。

以下是test1.txt的一段：

{"created_at":"Fri Jul 24 16:35:22 +0000 2015","id":624618886277640192,"id_str":"624618886277640192","text":"RT @nodenow: Essential Steps: Long Term Support for Node.js\nhttp:\/\/t.co\/MzPfvenwtT\n+1 micshasan #javascript","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":3290861609,"id_str":"3290861609","name":"Rajiin","screen_name":"Rajiin_07","location":"Pokhara city","url":"http:\/\/www.pokharacity.com","description":null,"protected":false,"verified":false,"followers_count":1101,"friends_count":1119,"listed_count":155,"favourites_count":2048,"statuses_count":5498,"created_at":"Wed May 20 04:58:23 +0000 2015","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"4A913C","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/617620457336893440\/3HTEKnMx_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/617620457336893440\/3HTEKnMx_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/3290861609\/1435854327","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Fri Jul 24 16:33:04 +0000 2015","id":624618308050915328,"id_str":"624618308050915328","text":"Essential Steps: Long Term Support for Node.js\nhttp:\/\/t.co\/MzPfvenwtT\n+1 micshasan #javascript","source":"\u003ca href=\"http:\/\/ifttt.com\" rel=\"nofollow\"\u003eIFTTT\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":3243544179,"id_str":"3243544179","name":"Javascript Digest","screen_name":"nodenow","location":"","url":null,"description":null,"protected":false,"verified":false,"followers_count":1238,"friends_count":1,"listed_count":1148,"favourites_count":2,"statuses_count":130923,"created_at":"Sat May 09 15:45:13 +0000 2015","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/597066594334941184\/Xe4tTtU8_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/597066594334941184\/Xe4tTtU8_normal.jpg","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":1,"favorite_count":0,"entities":{"hashtags":[{"text":"javascript","indices":[83,94]}],"trends":[],"urls":[{"url":"http:\/\/t.co\/MzPfvenwtT","expanded_url":"http:\/\/bit.ly\/1LH81ly","display_url":"bit.ly\/1LH81ly","indices":[47,69]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"javascript","indices":[96,107]}],"trends":[],"urls":[{"url":"http:\/\/t.co\/MzPfvenwtT","expanded_url":"http:\/\/bit.ly\/1LH81ly","display_url":"bit.ly\/1LH81ly","indices":[60,82]}],"user_mentions":[{"screen_name":"nodenow","name":"Javascript Digest","id":3243544179,"id_str":"3243544179","indices":[3,11]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1437755722003"}


{"created_at":"Fri Jul 24 16:35:22 +0000 2015","id":624618888387432449,"id_str":"624618888387432449","text":"python \u041c\u043e\u0441\u043a\u0432\u0430  http:\/\/t.co\/itYJmgVvgD","source":"\u003ca href=\"http:\/\/gdepraktika.ru\" rel=\"nofollow\"\u003egdepraktika-trfnslator\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":623605809,"id_str":"623605809","name":"\u0413\u0434\u0435 \u043f\u0440\u0430\u043a\u0442\u0438\u043a\u0430?","screen_name":"gdepraktika","location":"\u0420\u043e\u0441\u0441\u0438\u044f","url":"http:\/\/gdepraktika.ru","description":"\u041f\u0440\u0430\u043a\u0442\u0438\u043a\u0430, \u0441\u0442\u0430\u0436\u0438\u0440\u043e\u0432\u043a\u0430, \u0440\u0430\u0431\u043e\u0442\u0430 \u0434\u043b\u044f \u0441\u0442\u0443\u0434\u0435\u043d\u0442\u043e\u0432, \u043e\u0431\u0443\u0447\u0435\u043d\u0438\u0435 \u0432 \u043a\u043e\u043c\u043f\u0430\u043d\u0438\u044f\u0445","protected":false,"verified":false,"followers_count":17,"friends_count":9,"listed_count":0,"favourites_count":0,"statuses_count":902069,"created_at":"Sun Jul 01 07:53:36 +0000 2012","utc_offset":10800,"time_zone":"Moscow","geo_enabled":false,"lang":"ru","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000420815111\/bba61a6dcd4272794a4af41dd8a44cf5_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000420815111\/bba61a6dcd4272794a4af41dd8a44cf5_normal.png","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"trends":[],"urls":[{"url":"http:\/\/t.co\/itYJmgVvgD","expanded_url":"http:\/\/bit.ly\/1GqpqOg","display_url":"bit.ly\/1GqpqOg","indices":[15,37]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"und","timestamp_ms":"1437755722506"}

Answer 1

这是因为有效的json行之间有两个空行。只需添加一个空行检查，你就应该好了。

import json
tweets_data = []
notParsed = []
tweets_file = open('test1.txt',"r")
for line in tweets_file:    
    if line.strip():    
        try:
            tweet=json.load(line)
            tweets_data.append(tweet)
        except:
            notParsed.append(line)
            continue
print(len(tweets_data))
print('Could not parse: ', len(notParsed))

这不是必需的，我只是因为您的答案而修改Python，但您可以按如下方式编辑代码：

map(json.loads, [x for x in open('test1.txt').read().split('\n') if x.strip()])

Answer 2

两件事。首先，你的片段看起来并不像有效的json。（并将其复制粘贴到验证器http://jsonlint.com中确认了这一点。从语法上讲，两条推文之间需要有逗号，原因是它们需要是某些更高级别数据结构中的元素（json） - 等价的python列表==一个“数组”，或者一个dict的json等价物，一个“对象”）。在json文件中只能有一个根元素。请参阅前面的问题：How to read a JSON file containing multiple root elements? 。

其次，如果你试图获得对json数据结构的普通访问，并且没有尝试做任何特殊的，取决于行的概念（或担心内存管理等），那么你真的不需要像这样逐行阅读它。相反，你可以将整个shebang绑定到一个变量，它将根据明显的语法将json转换为嵌套列表和dicts（即json花括号/括号函数与python函数相同）。

因此，一旦json有效，代码就可以如下：

import json
with open('test1.txt') as json_file:
  myjson = json.load(json_file)

然后只需通过list indexes / dict键访问json中的每个元素。

此方法忽略空格，因此您应该没有换行符问题。

最终，这表明解决方案是将您的推文包装在顶级列表（“数组”）中并在它们之间粘贴逗号（可能将正则表达式作为字符串操作？），然后只需通过列表索引来访问它们而不是一行一行。

Answer 3

一行可能有\ r \ n空格.etc 你应该使用json.loads()

import json

tweets_data=[]
tweets_file=open('test1.txt',"r")
for line in tweets_file:
    try:
        tweet=json.loads(line.strip())
        tweets_data.append(tweet)
    except:
        continue

print(len(tweets_data))

无法在python

3 个答案: