Using Python

Time: 2017-07-20 01:12:48

Tags: python json pandas csv

I have a test.dat file with these 5 columns:

  • ['user_id', 'item_id', 'rating', 'scraping_time', 'tweet_in_json_format']

I want to move these three columns into test2.csv:

  • ['user_id', 'scraping_time', 'tweet_in_json_format']

Here is the header and one sample row from test.dat:

user_id,item_id,rating,scraping_time,tweet_in_json_format
819099800,0993846,10,1391278544,{"contributors": null, "truncated": false, "text": "", "in_reply_to_status_id": null, "id": 426902385735520256, "favorite_count": 0, "source": "<a href=\"http://itunes.apple.com/us/app/imdb-movies-tv/id342792525?mt=8&uo=4\" rel=\"nofollow\">IMDb Movies & TV on iOS</a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [{"indices": [61, 66], "text": "IMDb"}], "urls": [{"url": "http://tweeter.com/xQuwO8KJP1", "indices": [38, 60], "expanded_url": "http://www.imdb.com/title/tt0993846", "display_url": "imdb.com/title/tt0993846"}]}, "in_reply_to_screen_name": null, "id_str": "426902385735520256", "retweet_count": 0, "in_reply_to_user_id": null, "favorited": false, "user": {"follow_request_sent": false, "profile_use_background_image": true, "id": 819099800, "verified": false, "profile_text_color": "333333", "profile_image_url_https": "https://pbs.twimg.com/profile_images/420936276607791104/KVrTuNU9_normal.jpeg", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "geo_enabled": false, "entities": {"description": {"urls": []}}, "followers_count": 116, "protected": false, "location": "in my dreams ", "default_profile_image": false, "id_str": "819099800", "lang": "ar", "utc_offset": -36000, "statuses_count": 1169, "description": "\u0646\u0628\u0649 \u0627\u0644\u0623\u062c\u0631 .", "friends_count": 86, "profile_link_color": "0084B4", "profile_image_url": "http://pbs.twimg.com/profile_images/420936276607791104/KVrTuNU9_normal.jpeg", "notifications": false, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/812779345/23ae1c7da01b05a6d5d9b0be28fe14c9.jpeg", "profile_background_color": "C0DEED", "profile_banner_url": "https://pbs.twimg.com/profile_banners/819099800/1390618207", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/812779345/23ae1c7da01b05a6d5d9b0be28fe14c9.jpeg", "name": "vivo per lei ", 
"is_translation_enabled": false, "profile_background_tile": false, "favourites_count": 155, "screen_name": "Orkida__", "url": null, "created_at": "Wed Sep 12 08:08:06 +0000 2012", "contributors_enabled": false, "time_zone": "Hawaii", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": false, "listed_count": 0}, "geo": null, "in_reply_to_user_id_str": null, "possibly_sensitive": false, "lang": "en", "created_at": "Sat Jan 25 02:20:34 +0000 2014", "in_reply_to_status_id_str": null, "place": null}

The problem is that 'tweet_in_json_format' is JSON, and pandas cannot treat it as a single column...

What should I do?

1 answer:

Answer 0: (score: 1)

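Your main problem is that your input is not actually CSV. If it were, the JSON data in the last column would have to be quoted so that its internal commas are not interpreted as CSV delimiters.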
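To illustrate what properly quoted CSV would look like, here is a minimal sketch using Python's standard csv module. The tweet value is a shortened, hypothetical stand-in for the real JSON in the question, not data from the file.

```python
import csv
import io
import json

# Hypothetical, shortened stand-in for the tweet JSON in the question.
tweet = json.dumps({"text": "", "lang": "en", "retweeted": False})

# csv.writer quotes any field containing commas or double quotes,
# and doubles the embedded double quotes.
buf = io.StringIO()
csv.writer(buf).writerow(["819099800", "1391278544", tweet])
quoted_line = buf.getvalue()
print(quoted_line)

# Reading it back gives exactly three fields, with the JSON intact.
row = next(csv.reader(io.StringIO(quoted_line)))
print(len(row))
print(json.loads(row[2])["lang"])
```

Because the JSON field is wrapped in quotes, a CSV parser sees it as one column even though it contains commas.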

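If you just want to perform the conversion you describe, and you can be confident that the input format stays the same (that is, user ID, item ID, rating, scraping time, and JSON data, in that order, separated by commas), then this can be done relatively simply without pandas (which is really overkill for this job):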

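with open('test.dat') as f_in, open('test2.csv', 'w') as f_out:
    for line in f_in:
        # Drop the trailing newline, then split on the first four commas only,
        # so the commas inside the JSON are left alone.
        parts = line.rstrip('\n').split(',', 4)
        # Keep user_id, scraping_time and the JSON data.
        f_out.write('{},{},{}\n'.format(parts[0], parts[3], parts[4]))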

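In short, this opens the input and output files; then, for each line of the input file, it splits the line on commas at most four times, which separates the line into its individual fields without breaking up the JSON. It then writes the first, fourth, and fifth fields (corresponding to the user ID, scraping time, and JSON data) to the output file, separated by commas.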
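The effect of limiting the number of splits can be seen on a single line. This is a small sketch; the JSON here is a shortened, hypothetical stand-in for the real tweet data.

```python
# Shortened, hypothetical row in the same layout as test.dat.
line = '819099800,0993846,10,1391278544,{"text": "", "lang": "en", "retweeted": false}\n'

# An unlimited split also cuts the JSON apart at its internal commas.
naive_parts = line.rstrip('\n').split(',')
print(len(naive_parts))

# Limiting the split to four commas yields five fields, JSON last.
parts = line.rstrip('\n').split(',', 4)
print(len(parts))
print(parts[4])
```

With the limit in place, everything after the fourth comma stays together as the final field.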

Note that this is a somewhat fragile solution, since it will break if the column order changes.
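One way to reduce that fragility is to look the columns up by name in the header row instead of hard-coding their positions. The following is only a sketch: the function name extract_columns is hypothetical, and it still assumes the unquoted JSON is the last column of the input. It also writes the output through csv.writer so the JSON ends up properly quoted.

```python
import csv


def extract_columns(in_path, out_path, wanted):
    """Copy the named columns from a comma-separated file whose LAST
    column is unquoted JSON (an assumption) into a quoted CSV file."""
    with open(in_path, newline='') as f_in, \
            open(out_path, 'w', newline='') as f_out:
        # Read column names from the header row.
        header = [name.strip()
                  for name in f_in.readline().rstrip('\r\n').split(',')]
        indices = [header.index(name) for name in wanted]
        # Split on all commas except those inside the trailing JSON.
        max_splits = len(header) - 1

        writer = csv.writer(f_out)
        writer.writerow(wanted)
        for line in f_in:
            line = line.rstrip('\r\n')
            if not line:
                continue  # skip blank lines
            parts = line.split(',', max_splits)
            writer.writerow([parts[i] for i in indices])
```

It could be called as extract_columns('test.dat', 'test2.csv', ['user_id', 'scraping_time', 'tweet_in_json_format']). Since the output uses standard CSV quoting, a CSV parser such as pandas.read_csv should then be able to load the JSON as a single column.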