I am trying to tokenize a .json file of tweets using the following code:
*from nltk.corpus import brown
brown.words()
from nltk.tokenize import word_tokenize
import json
import re
...
with open('volby2018_1.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])*
I keep getting:
**KeyError Traceback (most recent call last)
<ipython-input-2-daba85c11858> in <module>()
52 for line in f:
53 tweet = json.loads(line)
---> 54 tokens = preprocess(tweet['text'])
55
56 print(preprocess(tweet))
KeyError: 'text'**
The .json looks like this:
* {"statuses":[{"created_at":"Sun Feb 04 09:26:24 +0000 2018","id":960082100341919744,"id_str":"960082100341919744","text":"@PREZIDENTmluvci Voli\u010d\u016fm Zemana. Sd\u00edlejte. #Zeman v 辩论电视台 p\u0159ed #volby2018 potvrdil, \u017ee je\u2026 https://t.co/bZOlX2DjqK","truncated":true,"entities":{"hashtags":[{"text":"Zeman","indices":[43,49]},{"text":"volby2018","indices":[74,84]}],"symbols":[],"user_mentions":[{"screen_name":"PREZIDENTmluvci","name":"Ji\u0159\u00ed Ov\u010d\u00e1\u010dek","id":3055366126,"id_str":"3055366126","indices":[0,16]}],"urls":[{"url":"https://t.co/bZOlX2DjqK","expanded_url":"https://twitter.com/i/web/status/960082100341919744","display_url":"twitter.com/i/web/status/9\u2026","indices":[102,125]}]},"metadata":{"iso_language_code":"cs","result_type":"recent"},"source":"\u003ca href=\"http://twitter.com/download/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c/a\u003e","in_reply_to_status_id":960080132764467200,"in_reply_to_status_id_str":"960080132764467200","in_reply_to_user_id":3055366126,"in_reply_to_user_id_str":"3055366126","in_reply_to_screen_name":"PREZIDENTmluvci","user":{"id":1915891352,"id_str":"1915891352","name":"Zden\u011bk Bub\u00e1k","screen_name":"ZdenekBubak","location":"\u0160\u00e9fredaktor / 主编","description":"Redaktor se specializac\u00ed na finan\u010dn\u00ed 导航产品","url":"https://t.co/nvivnZXApP","entities":{"url":{"urls":[{"url":"https://t.co/nvivnZXApP","expanded_url":"http://www.finparada.cz","display_url":"finparada.cz","indices":[0,23]}]},"description":{"urls":[]}},"protected":false,"followers_count":196,"friends_count":201,"listed_count":3,"created_at":"Sun Sep 29 02:09:56 +0000 2013","favourites_count":782,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":false,"statuses_count":4821,"lang":"cs","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/843934466175356928/94cCpcLK_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/843934466175356928/94cCpcLK_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/1915891352/1380992989","profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"has_extended_profile":false,"default_profile":true,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":{"id":"018e2bf71a3ef896","url":"https://api.twitter.com/1.1/geo/id/018e2bf71a3ef896.json","place_type":"city","name":"Prague","full_name":"Prague, Czech Republic","country_code":"CZ","country":"Czech Republic","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[14.2252428,49.9419037],[14.7065078,49.9419037],[14.7065078,50.1772562],[14.2252428,50.1772562]]]},"attributes": * ....
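Judging by the sample above, the tweets appear to sit in a list under a top-level key rather than at the top level of each parsed line, which would explain why `tweet['text']` raises `KeyError`. Below is a minimal sketch of reading the nested text under that assumption. The helper `iter_tweet_texts`, its `list_key` parameter (assumed here to be `statuses`), and the trimmed `sample` string are all hypothetical and not part of my actual code; `str.split()` only stands in for `preprocess`.

```python
import json


def iter_tweet_texts(raw, list_key="statuses"):
    """Yield the 'text' field of every tweet nested under `list_key`.

    `list_key` is an assumption about the file layout; adjust it to
    whatever top-level key the file really uses.
    """
    doc = json.loads(raw)
    # The tweets are nested one level down, so doc['text'] does not exist.
    for tweet in doc.get(list_key, []):
        # .get avoids a KeyError for entries that lack a 'text' field.
        text = tweet.get("text")
        if text is not None:
            yield text


# Hypothetical stand-in for one line of the file, trimmed to the fields used.
sample = (
    '{"statuses": [{"id": 1, "text": "first tweet"},'
    ' {"id": 2, "text": "second tweet"}]}'
)

for text in iter_tweet_texts(sample):
    # str.split() is a placeholder for the preprocess() function.
    print(text.split())
```

If the whole file is one JSON document rather than one document per line, reading it with `json.load(f)` and iterating over the same list would be the equivalent approach.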