I'm currently developing a Python program that filters out some keywords from the "text" tag of a JSON file. My pipeline is as follows: .gz -> open with gzip in 'rb' mode -> convert b'' to str -> json.loads(str)
def gzworker(fullpath, condition):
    """Worker opens one .gz file"""
    print('Opening {}'.format(fullpath))
    buffer = []
    with gzip.open(fullpath, 'rb') as infile:
        for _line in infile:
            result = filter(json.loads(str(_line).split('|',1)[1][:-5]), condition)
            if result:
                buffer.append(result)
    print('Closing {}'.format(fullpath))
    return buffer
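Here the filter function takes the JSON data as a parameter.
After running this code several times, I realised that the reason it doesn't work is that some commas seem to disappear. Does anyone know whether some information can be lost during this process?
Result obtained with the method above (invalid JSON) [the result is the same if I use decode]: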
{"created_at":"Thu Apr 17 04:45:03 +0000 2014","id":456654551114735616,"id_str":"456654551114735616","text":"@cam_clay1 come visit us soon plz \\ud83d\\ude18","source":"\\u003ca href=\\"http:\\/\\/twitter.com\\/download\\/iphone\\" rel=\\"nofollow\\"\\u003eTwitter for iPhone\\u003c\\/a\\u003e","truncated":false,"in_reply_to_status_id":456654343781892098,"in_reply_to_status_id_str":"456654343781892098","in_reply_to_user_id":427007607,"in_reply_to_user_id_str":"427007607","in_reply_to_screen_name":"cam_clay1","user":{"id":335107310,"id_str":"335107310","name":"Roger Krick","screen_name":"roger_krick","location":"Atlanta GA","url":null,"description":"I pushed Regina George in front of the bus.","protected":false,"followers_count":772,"friends_count":235,"listed_count":3,"created_at":"Thu Jul 14 04:49:29 +0000 2011","favourites_count":7192,"utc_offset":-18000,"time_zone":"Quito","geo_enabled":true,"verified":false,"statuses_count":9518,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\\/\\/pbs.twimg.com\\/profile_background_images\\/378800000021719152\\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_background_images\\/378800000021719152\\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_tile":true,"profile_image_url":"http:\\/\\/pbs.twimg.com\\/profile_images\\/453031044393222144\\/7vIvMWvk_normal.jpeg","profile_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_images\\/453031044393222144\\/7vIvMWvk_normal.jpeg","profile_banner_url":"https:\\/\\/pbs.twimg.com\\/profile_banners\\/335107310\\/1352964715","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"noti
fications":null},"geo":{"type":"Point","coordinates":[33.75781394,-84.38479358]},"coordinates":{"type":"Point","coordinates":[-84.38479358,33.75781394]},"place":{"id":"8173485c72e78ca5","url":"https:\\/\\/api.twitter.com\\/1.1\\/geo\\/id\\/8173485c72e78ca5.json","place_type":"city","name":"Atlanta","full_name":"Atlanta, GA","country_code":"US","country":"United States","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[-84.5464728,33.647845],[-84.5464728,33.8868859],[-84.289385,33.8868859],[-84.289385,33.647845]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"cam_clay1","name":"Cameron Clay","id":427007607,"id_str":"427007607","indices":[0,10]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
What I should get (valid JSON):
{"created_at":"Thu Apr 17 04:45:03 +0000 2014","id":456654551114735616,"id_str":"456654551114735616","text":"@cam_clay1 come visit us soon plz \ud83d\ude18","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":456654343781892098,"in_reply_to_status_id_str":"456654343781892098","in_reply_to_user_id":427007607,"in_reply_to_user_id_str":"427007607","in_reply_to_screen_name":"cam_clay1","user":{"id":335107310,"id_str":"335107310","name":"Roger Krick","screen_name":"roger_krick","location":"Atlanta GA","url":null,"description":"I pushed Regina George in front of the bus.","protected":false,"followers_count":772,"friends_count":235,"listed_count":3,"created_at":"Thu Jul 14 04:49:29 +0000 2011","favourites_count":7192,"utc_offset":-18000,"time_zone":"Quito","geo_enabled":true,"verified":false,"statuses_count":9518,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/335107310\/1352964715","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","
coordinates":[33.75781394,-84.38479358]},"coordinates":{"type":"Point","coordinates":[-84.38479358,33.75781394]},"place":{"id":"8173485c72e78ca5","url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/8173485c72e78ca5.json","place_type":"city","name":"Atlanta","full_name":"Atlanta, GA","country_code":"US","country":"United States","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[-84.5464728,33.647845],[-84.5464728,33.8868859],[-84.289385,33.8868859],[-84.289385,33.647845]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"cam_clay1","name":"Cameron Clay","id":427007607,"id_str":"427007607","indices":[0,10]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
Answer 0 (score: 1)
You are decoding the bytes incorrectly:
str(_line)
converts the object to its representation, which is useful for debugging but not for processing data:
>>> 'Føo'.encode('utf8')
b'F\xc3\xb8o'
>>> str('Føo'.encode('utf8'))
"b'F\\xc3\\xb8o'"
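Note the b' prefix, the ' suffix, and the doubled backslashes in the escape sequences! All of these are now literal characters in the string, which is why the result is no longer valid JSON.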
Decode the bytes object instead:
_line.decode('utf8')
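I'm assuming that, since this is JSON data, it is encoded as UTF-8 (the JSON standard states that this is the default, and the only other options permitted are UTF-16 and UTF-32).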
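To make the difference concrete, here is a minimal sketch comparing the two conversions. The sample bytes are made up for illustration and are not taken from the question's data:

```python
import json

# Hypothetical sample: one UTF-8 encoded JSON line, as read in binary mode.
raw = b'{"text": "caf\xc3\xa9", "url": "http:\\/\\/example.com"}\n'

as_repr = str(raw)            # debugging representation: b'...' with doubled backslashes
as_text = raw.decode('utf8')  # actual decoding of the bytes

print(as_repr)
print(as_text)

# The representation is not valid JSON (it starts with b' to begin with).
try:
    json.loads(as_repr)
    repr_is_valid = True
except json.JSONDecodeError:
    repr_is_valid = False

# The decoded text parses cleanly; json.loads() ignores the trailing newline.
data = json.loads(as_text)
print(repr_is_valid, data)
```

Slicing characters off the representation can remove the b' and ' wrapper, but it cannot undo the doubled backslashes, so decoding is the more reliable route.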
Better still, use an io.TextIOWrapper() object to handle the decoding for you.
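As a rough sketch of what that looks like, the example below writes a small throwaway .gz file and reads it back as text in two ways. The file name and its contents are assumptions made up for the demonstration. The second variant relies on gzip.open() accepting a text mode ('rt') with an encoding argument, which is available in Python 3.3 and later:

```python
import gzip
import io
import os
import tempfile

# Write a tiny throwaway .gz file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.gz')
with gzip.open(path, 'wb') as out:
    out.write('id1|{"text": "Føo"}\n'.encode('utf8'))

# Option 1: open in binary mode and wrap the stream in a TextIOWrapper.
with gzip.open(path, 'rb') as infile:
    decoded = io.TextIOWrapper(infile, encoding='utf8')
    wrapped_lines = list(decoded)

# Option 2: let gzip.open() do the wrapping by asking for text mode.
with gzip.open(path, 'rt', encoding='utf8') as infile:
    text_lines = list(infile)

print(wrapped_lines)
print(text_lines)
```

Both variants yield str lines, so no manual decode() call is needed inside the loop.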
Next, you appear to have swapped your condition and your data. filter() takes the condition first and the sequence of data second.
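A small illustration of the argument order, assuming the built-in filter() is meant. The has_keyword function and the word list are hypothetical stand-ins for the question's condition and data:

```python
def has_keyword(word):
    """Hypothetical condition: keep words that mention 'python'."""
    return 'python' in word.lower()

words = ['Python', 'rocks', 'CPython']

# Condition first, data second.
kept = list(filter(has_keyword, words))

# With the arguments reversed, filter() tries to iterate over the function.
try:
    filter(words, has_keyword)
    reversed_ok = True
except TypeError:
    reversed_ok = False

# In Python 3 filter() returns a lazy iterator, which is truthy even when
# nothing matches, so test the materialised list rather than the iterator.
lazy_is_truthy = bool(filter(has_keyword, []))

print(kept, reversed_ok, lazy_is_truthy)
```

If the emptiness check matters, converting the result with list() before testing it is one way to get the expected behaviour.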
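Corrected code:
import gzip
import io
import json

def gzworker(fullpath, condition):
    """Worker opens one .gz file"""
    print('Opening {}'.format(fullpath))
    buffer = []
    with gzip.open(fullpath, 'rb') as infile:
        # Let TextIOWrapper decode the binary stream to str, line by line
        decoded = io.TextIOWrapper(infile, encoding='utf8')
        for line in decoded:
            # Keep everything after the first '|', minus the trailing characters
            json_data = line.split('|', 1)[1][:-4]
            # filter() takes the condition first, then the data
            result = filter(condition, json.loads(json_data))
            # Note: if this is the built-in filter() on Python 3, the result
            # is an iterator and is always truthy
            if result:
                buffer.append(result)
    print('Closing {}'.format(fullpath))
    return buffer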
I adjusted your slice operation, on the assumption that you were previously cutting off the ' character introduced by the str() call.
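If the exact number of trailing characters is uncertain, one possible alternative to a fixed slice is json.JSONDecoder.raw_decode(), which parses a JSON document from the start of a string and reports where it ended. This is only a sketch: the parse_payload helper and the sample line format (a prefix, a '|', the JSON object, then some trailing characters) are assumptions, not something the question confirms:

```python
import json

decoder = json.JSONDecoder()

def parse_payload(line):
    """Hypothetical helper: parse the JSON object that follows the first '|'.

    Returns the parsed object and whatever characters came after it.
    """
    payload = line.split('|', 1)[1]
    # raw_decode() stops at the end of the first JSON document and
    # returns the index where parsing finished.
    obj, end = decoder.raw_decode(payload)
    return obj, payload[end:]

# Made-up sample line: prefix, JSON object, then arbitrary trailing characters.
obj, trailing = parse_payload('prefix|{"id": 1, "text": "a|b"}xyz\n')
print(obj, repr(trailing))
```

This avoids hard-coding a slice length, at the cost of silently ignoring whatever follows the JSON object.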