KeyError: tokenization

Time: 2018-02-14 16:53:14

Tags: text twitter tokenize keyerror

I am trying to tokenize a .json file of tweets with the following code:

    from nltk.corpus import brown
    brown.words()
    from nltk.tokenize import word_tokenize
    import json
    import re

    ...

    with open('volby2018_1.json', 'r') as f:
        for line in f:
            tweet = json.loads(line)
            tokens = preprocess(tweet['text'])

I keep getting:

    KeyError                                  Traceback (most recent call last)
    <ipython-input-2-daba85c11858> in <module>()
         52     for line in f:
         53         tweet = json.loads(line)
    ---> 54         tokens = preprocess(tweet['text'])
         55 
         56 print(preprocess(tweet))
    KeyError: 'text'

The .json file looks like this:

  

    {"status":[{"created_at":"Sun Feb 04 09:26:24 +0000 2018","id":960082100341919744,"id_str":"960082100341919744","text":"@PREZIDENTmluvci Voli\u010d\u016fm Zemana. Sd\u00edlejte. #Zeman v debate TV station p\u0159ed #volby2018 potvrdil, \u017ee je\u2026 https://t.co/bZOlX2DjqK","truncated":true,"entities":{"hashtags":[{"text":"Zeman","indices":[43,49]},{"text":"Volby2018","indices":[74,84]}],"symbols":[],"user_mentions":[{"screen_name":"PREZIDENTmluvci","name":"Ji\u0159\u00ed Ov\u010d\u00e1\u010dek","id":3055366126,"id_str":"3055366126","indices":[0,16]}],"urls":[{"url":"https://t.co/bZOlX2DjqK","expanded_url":"https://twitter.com/i/web/status/960082100341919744","display_url":"twitter.com/i/web/status/9\u2026","indices":[102,125]}]},"metadata":{"iso_language_code":"cs","result_type":"recent"},"source":"\u003ca href=\"http://twitter.com/download/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c/a\u003e","in_reply_to_status_id":960080132764467200,"in_reply_to_status_id_str":"960080132764467200","in_reply_to_user_id":3055366126,"in_reply_to_user_id_str":"3055366126","in_reply_to_screen_name":"PREZIDENTmluvci","user":{"id":1915891352,"id_str":"1915891352","name":"Zden\u011bk Bub\u00e1k","screen_name":"ZdenekBubak","location":"\u0160\u00e9fredaktor / editor-in-chief","description":"Redaktor se specializac\u00ed na finan\u010dn\u00ed navigation products","url":"https://t.co/nvivnZXApP","entities":{"url":{"urls":[{"url":"https://t.co/nvivnZXApP","expanded_url":"http://www.finparada.cz","display_url":"finparada.cz","indices":[0,23]}]},"description":{"urls":[]}},"protected":false,"followers_count":196,"friends_count":201,"listed_count":3,"created_at":"Sun Sep 29 02:09:56 +0000 2013","favourites_count":782,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":false,"statuses_count":4821,"lang":"cs","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/843934466175356928/94cCpcLK_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/843934466175356928/94cCpcLK_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/1915891352/1380992989","profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"has_extended_profile":false,"default_profile":true,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":{"id":"018e2bf71a3ef896","url":"https://api.twitter.com/1.1/geo/id/018e2bf71a3ef896.json","place_type":"city","name":"Prague","full_name":"Prague, Czech Republic","country_code":"CZ","country":"Czech Republic","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[14.2252428,49.9419037],[14.7065078,49.9419037],[14.7065078,50.1772562],[14.2252428,50.1772562]]]},"attributes": ....
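Judging only from the sample above, the tweets appear to sit in a list under a top-level key (shown as `status`; the Twitter search API itself names this key `statuses`, so the real file may differ), which would mean the parsed top-level object has no `text` key of its own. The following is a minimal sketch under that assumption, and under the further assumption that the file holds a single JSON document rather than one tweet per line. The helper name `iter_tweet_texts` and its `list_key` parameter are hypothetical and only for illustration.

```python
import json


def iter_tweet_texts(path, list_key="status"):
    """Yield the 'text' field of every tweet found under a top-level list key.

    Assumes the file is one JSON object of the form {list_key: [tweet, ...]}.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)  # parse the whole file as a single JSON document

    # .get() returns an empty list when the key is absent, so nothing is yielded
    for tweet in data.get(list_key, []):
        text = tweet.get("text")  # None instead of KeyError when 'text' is missing
        if text is not None:
            yield text
```

Under these assumptions, the loop in the question could become `for text in iter_tweet_texts('volby2018_1.json'): tokens = preprocess(text)`, where `preprocess` is the function from the question that is not shown here.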

0 answers:

No answers