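I need help resolving the following error. I know there are many questions here with solutions for this kind of error, but I have tried almost all of them with little success.
Error: 'ascii' codec can't encode character u'\u2026' in position 132: ordinal not in range(128)
What I am doing:
I am trying to load JSON data per partition using json.loads. The data comes from reading a file in PySpark, as shown below. The file contains raw tweets in JSON format.
The data file was created by another Spark job using saveAsTextFile and is compressed as bz2.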
rdd1 = sc.textFile(datafile)
#to get only the json portion of raw tweets
rdd2 = rdd1.map(lambda x : x[x.find('{'):len(x)])
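A slightly more defensive version of the slicing step above, as a sketch only. `str.find` returns -1 when there is no '{' in the line, and slicing from -1 would silently keep just the last character. The helper name `json_portion` is hypothetical and not part of the original job.

```python
def json_portion(line):
    """Return the substring starting at the first '{', or None if there is none.

    json_portion is a hypothetical helper name used only for illustration.
    """
    start = line.find('{')
    if start == -1:
        # No JSON object on this line; signal that instead of returning
        # the last character, which is what line[-1:] would give.
        return None
    return line[start:]


# Possible use in the pipeline (assumes the rdd1 defined above):
# rdd2 = rdd1.map(json_portion).filter(lambda x: x is not None)
```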
I then run mapPartitions on rdd2 to process the text field of the JSON data.
The function definition looks like this.
import json

def func(raw_tweets):
    for record in raw_tweets:  # record is of type 'str'
        try:
            record = record.decode('utf-8')  # record is now a 'unicode' object
            t = json.loads(record, encoding='utf-8')
            ....
        except:
            ...
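For comparison, here is a minimal sketch of a partition function that decodes explicitly and catches only parse errors, so that any other exception (including a Unicode error from later processing) surfaces with a full traceback instead of being swallowed. The name `parse_partition` is hypothetical, and the processing hidden behind the `....` above is not reproduced; the sketch only yields the `text` field.

```python
import json


def parse_partition(raw_tweets):
    """Yield the 'text' field of each tweet in one partition.

    parse_partition is a hypothetical name used only for illustration.
    """
    for record in raw_tweets:
        # A byte string (Python 2 'str', or bytes) is decoded explicitly so
        # that no implicit ASCII conversion is involved.
        if isinstance(record, bytes):
            record = record.decode('utf-8')
        try:
            tweet = json.loads(record)
        except ValueError:
            # Malformed JSON: skip this record only; other errors propagate.
            continue
        if isinstance(tweet, dict) and tweet.get('text') is not None:
            yield tweet['text']


# Possible use (assumes the rdd2 defined above):
# texts = rdd2.mapPartitions(parse_partition)
```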
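This fails with the "'ascii' codec can't encode character" error.
Here are the possible workarounds I have tried, without success:
Passing use_unicode=False to sc.textFile
Trying to load the record as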
# to enforce record as type unicode, when read as str
u_record = unicode(record)
t = json.loads(u_record)
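Encoding and decoding the record in various ways. All of the above fail with the same ASCII encoding error.
What I think is happening is that json.loads implicitly tries to convert the text to ASCII before applying the Unicode encoding. Maybe it is doing something like this?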
str(u'\u2026')
since no explicit encoding has any effect on it and it still fails with the ASCII codec error.
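One observation that may narrow this down: "can't encode" is the wording of UnicodeEncodeError, which is raised when a unicode value is converted to bytes with the ASCII codec. The sketch below reproduces that message without Spark, using a tiny stand-in record rather than a real tweet. Whether the same implicit conversion happens in the elided part of func is an assumption, not something confirmed here.

```python
import json

# Stand-in record; the doubled backslash keeps \u2026 as a JSON escape.
parsed = json.loads('{"text": "new\\u2026"}')
text = parsed['text']  # unicode text containing U+2026 (horizontal ellipsis)

message = None
try:
    # Explicit form of what str(text) does implicitly on Python 2.
    text.encode('ascii')
except UnicodeEncodeError as exc:
    message = str(exc)

# Encoding explicitly with UTF-8 does not involve the ASCII codec.
utf8_bytes = text.encode('utf-8')
```

In this sketch json.loads itself succeeds, and the error only appears at the point where the parsed text is pushed through the ASCII codec.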
Can someone guide me and help me understand how to make json.loads read the data as Unicode rather than ASCII? Thanks!
json.loads('{"created_at":"Fri Apr 28 17:12:54 +0000 2017","id":858006119486062593,"id_str":"858006119486062593","text":"RT @richardgibson74: Rashid Khan fabulous to watch at the IPL again. Just one of the overseas players you won't be seeing in England's new\u2026","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2803663827,"id_str":"2803663827","name":"danesh bhopi","screen_name":"DaneshBhope","location":"Latur, India","url":null,"description":"4 Feb 1997 *wrestling* and *cricket*","protected":false,"verified":false,"followers_count":170,"friends_count":352,"listed_count":30,"favourites_count":30781,"statuses_count":23038,"created_at":"Sat Oct 04 03:29:03 +0000 2014","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/790885528073744384\/o1Nen52M_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/790885528073744384\/o1Nen52M_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2803663827\/1478675979","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Fri 
Apr 28 17:08:25 +0000 2017","id":858004991474372608,"id_str":"858004991474372608","text":"Rashid Khan fabulous to watch at the IPL again. Just one of the overseas players you won't be seeing in England's new T20 competition.","source":"\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":57331172,"id_str":"57331172","name":"Richard Gibson","screen_name":"richardgibson74","location":"Leeds","url":"http:\/\/www.theguardian.com\/profile\/richard-gibson","description":"Football and cricket reporter; Daily Mail; Mail On Sunday; Guardian; Observer; Cricketer magazine; author of Bumble, Anderson, Root, Stokes and Swann books.","protected":false,"verified":false,"followers_count":2880,"friends_count":198,"listed_count":97,"favourites_count":57,"statuses_count":5428,"created_at":"Thu Jul 16 13:16:51 +0000 
2009","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/544621780363534336\/3_7ZOBvx_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/544621780363534336\/3_7ZOBvx_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/57331172\/1473706117","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":1,"favorite_count":1,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"richardgibson74","name":"Richard Gibson","id":57331172,"id_str":"57331172","indices":[3,19]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1493399574525"}',encoding=('utf-8'))
is one such record that fails to load during json.loads with the ASCII encoding error (the position in the error message may not refer to this record!).
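Since the reported position may belong to a different record, a small helper along these lines could be run on a sample of records to find which ones actually fail. This is only a sketch: `find_failures` and `process` are hypothetical names, and `process` stands for whatever is done to each record.

```python
def find_failures(records, process):
    """Return (index, shortened repr, error) for each record where process raises.

    Both names are hypothetical; process is any callable applied to a record.
    """
    failures = []
    for index, record in enumerate(records):
        try:
            process(record)
        except ValueError as exc:
            # UnicodeEncodeError and UnicodeDecodeError are subclasses of
            # ValueError, as are JSON parse errors, so all are caught here.
            failures.append((index, repr(record)[:80], exc))
    return failures


# Possible use on the driver (assumes the rdd2 defined above):
# for index, preview, error in find_failures(rdd2.take(1000), json.loads):
#     print(index, preview, repr(error))
```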
PS: Python 2.7 on Spark 1.6.