Extracting a tweets JSON file to a CSV file in Python

Time: 2015-07-21 11:29:37

Tags: python json csv

I got a JSON file from the Twitter API, and the JSON looks like this:

{"in_reply_to_user_id_str": null, "geo": null, "id": 100689407677440000, "lang": "en", "in_reply_to_user_id": null, "contributors": null, "source": "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>", "place": null, "user": {"profile_background_image_url": "http://pbs.twimg.com/profile_background_images/512692008/Screen_shot_2012-01-11_at_11.58.58_PM.png", "id_str": "181218735", "profile_link_color": "0084B4", "profile_image_url_https": "https://pbs.twimg.com/profile_images/378800000007522671/f4552422d443160c075fb3d521ffb3c2_normal.jpeg", "default_profile": false, "id": 181218735, "name": "Author Al King", "contributors_enabled": false, "is_translation_enabled": false, "profile_use_background_image": true, "friends_count": 109, "notifications": false, "utc_offset": -14400, "statuses_count": 3521, "listed_count": 3, "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "default_profile_image": false, "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/512692008/Screen_shot_2012-01-11_at_11.58.58_PM.png", "profile_text_color": "333333", "description": "Order your eBook copy of the Truth-Selling LET IT BE KNOWN at http://t.co/9m4DBuSatZ", "time_zone": "Eastern Time (US & Canada)", "geo_enabled": false, "follow_request_sent": false, "profile_background_color": "C0DEED", "favourites_count": 1, "lang": "en", "verified": false, "profile_image_url": "http://pbs.twimg.com/profile_images/378800000007522671/f4552422d443160c075fb3d521ffb3c2_normal.jpeg", "followers_count": 117, "screen_name": "LetItBeKnownAKP", "url": "http://t.co/zl3wbvRpco", "is_translator": false, "profile_background_tile": true, "has_extended_profile": false, "following": false, "created_at": "Sat Aug 21 16:25:13 +0000 2010", "protected": false, "location": "", "entities": {"description": {"urls": [{"expanded_url": "http://www.thealkingpointofview.com", "display_url": "thealkingpointofview.com", "url": 
"http://t.co/9m4DBuSatZ", "indices": [62, 84]}]}, "url": {"urls": [{"expanded_url": "http://www.thealkingpointofview.com", "display_url": "thealkingpointofview.com", "url": "http://t.co/zl3wbvRpco", "indices": [0, 22]}]}}}, "truncated": false, "retweet_count": 0, "id_str": "100689407677440000", "retweeted": false, "created_at": "Mon Aug 08 22:06:40 +0000 2011", "favorited": false, "entities": {"urls": [], "hashtags": [{"text": "LETITBEKNOWNLIVERADIO", "indices": [14, 36]}], "user_mentions": [], "symbols": []}, "in_reply_to_status_id": null, "coordinates": null, "favorite_count": 0, "is_quote_status": false, "text": "WED at 8pm on #LETITBEKNOWNLIVERADIO our Discussion \"Violence among the YOUTH PT 2..... Bullying, Gang Violenc\u2026 (cont) http://deck.ly/~ZN88q", "in_reply_to_status_id_str": null, "in_reply_to_screen_name": null}

All I want to do is extract "id", "lang", and "text" from the JSON above, but when I load the JSON an error occurs. Here is my code:

    import json
    with open("tweet.json_test_1") as json_data:
        dataText = json.load(json_data)
        print(dataText)

The error is: ValueError: Extra data: line 1 column 2836 - line 2 column 1 (char 2835 - 2868)

Sorry if this is a duplicate question; I am new to Python and ML. Thanks.

1 Answer:

Answer 0 (score: 0)

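It looks like your tweet.json_test_1 file contains multiple JSON objects, in fact one per line. json.load expects the whole file to be a single JSON document, which is why it reports "Extra data" as soon as the first object ends. You are better off reading the file line by line and parsing each line separately with json.loads. I suggest wrapping the call in try/except to catch lines that do not contain valid JSON. Keep in mind, though, that with this approach getting no output at all means that none of the lines contained valid JSON.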

    import json

    with open("tweet.json_test_1") as json_data:
        for line in json_data:
            try:
                dataText = json.loads(line)
            except ValueError:
                # Skip lines that do not contain valid JSON
                continue
            print(dataText)
            # Do other stuff here, especially if you want to retain all the JSON objects
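
Since the goal in the question is a CSV file with "id", "lang", and "text", the "other stuff" in the loop could be a csv.writer call. The following is only a minimal sketch, assuming Python 3 and a UTF-8 input file with one tweet per line; the function name tweets_to_csv and the output file name are made up for illustration and do not come from the question.

```python
import csv
import json


def tweets_to_csv(json_path, csv_path, fields=("id", "lang", "text")):
    """Read one JSON object per line and write the chosen fields to a CSV file.

    Returns the number of tweets written. Hypothetical helper for illustration.
    """
    written = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(fields)  # header row
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip lines that are not valid JSON
            if not isinstance(tweet, dict):
                continue  # skip valid JSON that is not an object
            # dict.get leaves the cell empty when a key is missing
            writer.writerow([tweet.get(name) for name in fields])
            written += 1
    return written
```

You would call it as tweets_to_csv("tweet.json_test_1", "tweets.csv"). The csv module takes care of quoting, so commas, quotes, and newlines inside the tweet text should not break the columns.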

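By the way, if you are the one creating the tweet.json_test_1 file, I would suggest putting all the JSON objects in a single list before writing them out; then json.load should work fine.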
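
To illustrate the list approach, here is a rough sketch, assuming Python 3 and that the tweets are already available as Python dicts when the file is written. The names save_tweets and load_tweets are hypothetical helpers, not part of any library.

```python
import json


def save_tweets(tweets, path):
    """Write all tweets as one JSON array so the file is a single JSON document."""
    with open(path, "w", encoding="utf-8") as out:
        json.dump(list(tweets), out)


def load_tweets(path):
    """Read the whole file back with a single json.load call; returns a list."""
    with open(path, encoding="utf-8") as src:
        return json.load(src)
```

The trade-off is that the whole list has to fit in memory on both sides, whereas the one-object-per-line layout can be processed line by line.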