我正在使用disco运行数十种地图缩减作业,用于多种不同目的。我的数据变得越来越大,我想我会尝试使用DDFS进行更改而不是标准的txt文件。
我已经按照DISCO地图/缩小示例Counting Words as a map/reduce job,毫不费力地在其他人的帮助下Reading JSON specific data into DISCO我已经解决了我最近的一个问题。
我正在尝试读取/退出ddfs中的数据以更好地分块和分发它,但是遇到了一些麻烦。
这是一个示例文件:file.txt
{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "I'll call him back tomorrow I guess", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016843603968", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/305726905/FASHION-3.png", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1818996723/image_normal.jpg", "profile_sidebar_fill_color": "292727", "is_translator": false, "id": 113532729, "profile_text_color": "000000", "followers_count": 78, "protected": false, "location": "With My Niggas In Paris!", "default_profile_image": false, "listed_count": 0, "utc_offset": -21600, "statuses_count": 6733, "description": "Made in CHINA., Educated && Making My Own $$. Fear GOD && Put Him 1st. #TeamFollowBack #TeamiPhone\n", "friends_count": 74, "profile_link_color": "b03f3f", "profile_image_url": "http://a2.twimg.com/profile_images/1818996723/image_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "1f9199", "id_str": "113532729", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/305726905/FASHION-3.png", "name": "Bee'Jay", "lang": "en", "profile_background_tile": true, "favourites_count": 19, "screen_name": "OohMyBEEsNice", "url": "http://www.bitchimpaid.org", "created_at": "Fri Feb 12 03:32:54 +0000 2010", "contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "000000", "default_profile": false, "following": null}, "in_reply_to_screen_name": null, "retweet_count": 0, "geo": null, "id": 168931016843603968, "source": "<a href=\"http://twitter.com/#!/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"}
{"favorited": false, "in_reply_to_user_id": 50940453, "contributors": null, "truncated": false, "text": "@LegaMrvica @MimozaBand makasi om artis :D kadoo kadoo", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": "168653037894770688", "coordinates": null, "in_reply_to_user_id_str": "50940453", "entities": {"user_mentions": [{"indices": [0, 11], "screen_name": "LegaMrvica", "id": 50940453, "name": "Lega_thePianis", "id_str": "50940453"}, {"indices": [12, 23], "screen_name": "MimozaBand", "id": 375128905, "name": "Mimoza", "id_str": "375128905"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": 168653037894770688, "id_str": "168931016868761600", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "profile_sidebar_fill_color": "DDFFCC", "is_translator": false, "id": 48293450, "profile_text_color": "333333", "followers_count": 182, "protected": false, "location": "\u00dcT: -6.906799,107.622383", "default_profile_image": false, "listed_count": 0, "utc_offset": -28800, "statuses_count": 3052, "description": "Fashion design maranatha '11 // traditional dancer (bali) at sanggar tampak siring & Natya Nataraja", "friends_count": 206, "profile_link_color": "0084B4", "profile_image_url": "http://a3.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "9AE4E8", "id_str": "48293450", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "name": "nana afiff", "lang": "en", "profile_background_tile": true, "favourites_count": 2, "screen_name": "hasnfebria", "url": null, "created_at": "Thu Jun 18 08:50:29 +0000 2009", "contributors_enabled": false, "time_zone": "Pacific Time (US & Canada)", "profile_sidebar_border_color": "BDDCAD", "default_profile": false, "following": null}, "in_reply_to_screen_name": "LegaMrvica", "retweet_count": 0, "geo": null, "id": 168931016868761600, "source": "<a href=\"http://blackberry.com/twitter\" rel=\"nofollow\">Twitter for BlackBerry\u00ae</a>"}
{"favorited": false, "in_reply_to_user_id": 27260086, "contributors": null, "truncated": false, "text": "@justinbieber u were born to be somebody, and u're super important in beliebers' life. thanks for all biebs. I love u. follow me? 84", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": "27260086", "entities": {"user_mentions": [{"indices": [0, 13], "screen_name": "justinbieber", "id": 27260086, "name": "Justin Bieber", "id_str": "27260086"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016856178688", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/416005864/Captura.JPG", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "profile_sidebar_fill_color": "f5e7f3", "is_translator": false, "id": 406750700, "profile_text_color": "333333", "followers_count": 1122, "protected": false, "location": "Adentro de una supra.", "default_profile_image": false, "listed_count": 0, "utc_offset": -14400, "statuses_count": 20966, "description": "Mi \u00eddolo es @justinbieber , si te gusta \u00a1genial!, si no, solo respetalo. El cambi\u00f3 mi vida completamente y mi sue\u00f1o es conocerlo #TrueBelieber . ", "friends_count": 1015, "profile_link_color": "9404b8", "profile_image_url": "http://a1.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "f9fcfa", "id_str": "406750700", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/416005864/Captura.JPG", "name": "neversaynever,right?", "lang": "es", "profile_background_tile": false, "favourites_count": 22, "screen_name": "True_Belieebers", "url": "http://www.wehavebieber-fever.tumblr.com", "created_at": "Mon Nov 07 04:17:40 +0000 2011", "contributors_enabled": false, "time_zone": "Santiago", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "in_reply_to_screen_name": "justinbieber", "retweet_count": 0, "geo": null, "id": 168931016856178688, "source": "<a href=\"http://yfrog.com\" rel=\"nofollow\">Yfrog</a>"}
我将它加载到DDFS中:
# ddfs chunk data:test1 ./file.txt
created: disco://localhost/ddfs/vol0/blob/44/file_txt-0$549-db27b-125e1
我测试该文件确实已加载到ddfs中:
# ddfs xcat data:test1
{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "I'll call him back tomorrow I guess", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016843603968", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/305726905/FASHION-3.png", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1818996723/image_normal.jpg", "profile_sidebar_fill_color": "292727", "is_translator": false, "id": 113532729, "profile_text_color": "000000", "followers_count": 78, "protected": false, "location": "With My Niggas In Paris!", "default_profile_image": false, "listed_count": 0, "utc_offset": -21600, "statuses_count": 6733, "description": "Made in CHINA., Educated && Making My Own $$. Fear GOD && Put Him 1st. #TeamFollowBack #TeamiPhone\n", "friends_count": 74, "profile_link_color": "b03f3f", "profile_image_url": "http://a2.twimg.com/profile_images/1818996723/image_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "1f9199", "id_str": "113532729", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/305726905/FASHION-3.png", "name": "Bee'Jay", "lang": "en", "profile_background_tile": true, "favourites_count": 19, "screen_name": "OohMyBEEsNice", "url": "http://www.bitchimpaid.org", "created_at": "Fri Feb 12 03:32:54 +0000 2010", "contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "000000", "default_profile": false, "following": null}, "in_reply_to_screen_name": null, "retweet_count": 0, "geo": null, "id": 168931016843603968, "source": "<a href=\"http://twitter.com/#!/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"}
{"favorited": false, "in_reply_to_user_id": 50940453, "contributors": null, "truncated": false, "text": "@LegaMrvica @MimozaBand makasi om artis :D kadoo kadoo", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": "168653037894770688", "coordinates": null, "in_reply_to_user_id_str": "50940453", "entities": {"user_mentions": [{"indices": [0, 11], "screen_name": "LegaMrvica", "id": 50940453, "name": "Lega_thePianis", "id_str": "50940453"}, {"indices": [12, 23], "screen_name": "MimozaBand", "id": 375128905, "name": "Mimoza", "id_str": "375128905"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": 168653037894770688, "id_str": "168931016868761600", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "profile_sidebar_fill_color": "DDFFCC", "is_translator": false, "id": 48293450, "profile_text_color": "333333", "followers_count": 182, "protected": false, "location": "\u00dcT: -6.906799,107.622383", "default_profile_image": false, "listed_count": 0, "utc_offset": -28800, "statuses_count": 3052, "description": "Fashion design maranatha '11 // traditional dancer (bali) at sanggar tampak siring & Natya Nataraja", "friends_count": 206, "profile_link_color": "0084B4", "profile_image_url": "http://a3.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "9AE4E8", "id_str": "48293450", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "name": "nana afiff", "lang": "en", "profile_background_tile": true, "favourites_count": 2, "screen_name": "hasnfebria", "url": null, "created_at": "Thu Jun 18 08:50:29 +0000 2009", "contributors_enabled": false, "time_zone": "Pacific Time (US & Canada)", "profile_sidebar_border_color": "BDDCAD", "default_profile": false, "following": null}, "in_reply_to_screen_name": "LegaMrvica", "retweet_count": 0, "geo": null, "id": 168931016868761600, "source": "<a href=\"http://blackberry.com/twitter\" rel=\"nofollow\">Twitter for BlackBerry\u00ae</a>"}
{"favorited": false, "in_reply_to_user_id": 27260086, "contributors": null, "truncated": false, "text": "@justinbieber u were born to be somebody, and u're super important in beliebers' life. thanks for all biebs. I love u. follow me? 84", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": "27260086", "entities": {"user_mentions": [{"indices": [0, 13], "screen_name": "justinbieber", "id": 27260086, "name": "Justin Bieber", "id_str": "27260086"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016856178688", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/416005864/Captura.JPG", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "profile_sidebar_fill_color": "f5e7f3", "is_translator": false, "id": 406750700, "profile_text_color": "333333", "followers_count": 1122, "protected": false, "location": "Adentro de una supra.", "default_profile_image": false, "listed_count": 0, "utc_offset": -14400, "statuses_count": 20966, "description": "Mi \u00eddolo es @justinbieber , si te gusta \u00a1genial!, si no, solo respetalo. El cambi\u00f3 mi vida completamente y mi sue\u00f1o es conocerlo #TrueBelieber . ", "friends_count": 1015, "profile_link_color": "9404b8", "profile_image_url": "http://a1.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "f9fcfa", "id_str": "406750700", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/416005864/Captura.JPG", "name": "neversaynever,right?", "lang": "es", "profile_background_tile": false, "favourites_count": 22, "screen_name": "True_Belieebers", "url": "http://www.wehavebieber-fever.tumblr.com", "created_at": "Mon Nov 07 04:17:40 +0000 2011", "contributors_enabled": false, "time_zone": "Santiago", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "in_reply_to_screen_name": "justinbieber", "retweet_count": 0, "geo": null, "id": 168931016856178688, "source": "<a href=\"http://yfrog.com\" rel=\"nofollow\">Yfrog</a>
此时一切都很棒,我加载了之前的Stack Post产生的脚本:
from disco.core import Job, result_iterator
import gzip
def map(line, params):
import unicodedata
import json
r = json.loads(line).get('text')
s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
for word in s.split():
yield word, 1
def reduce(iter, params):
from disco.util import kvgroup
for word, counts in kvgroup(sorted(iter)):
yield word, sum(counts)
if __name__ == '__main__':
job = Job().run(input=["tag://data:test1"],
map=map,
reduce=reduce)
for word, count in result_iterator(job.wait(show=True)):
print word, count
注意:如果input=["file.txt"]
,此脚本会运行文件,但是当我使用"tag://data:test1"
运行时,我收到以下错误:
# DISCO_EVENTS=1 python count_normal_words.py
Job@549:db30e:25bd8:
Status: [map] 0 waiting, 1 running, 0 done, 0 failed
2012/11/25 21:43:26 master New job initialized!
2012/11/25 21:43:26 master Starting job
2012/11/25 21:43:26 master Starting map phase
2012/11/25 21:43:26 master map:0 assigned to solice
2012/11/25 21:43:26 master ERROR: Job failed: Worker at 'solice' died: Traceback (most recent call last):
File "/home/DISCO/data/solice/01/Job@549:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 329, in main
job.worker.start(task, job, **jobargs)
File "/home/DISCO/data/solice/01/Job@549:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 290, in start
self.run(task, job, **jobargs)
File "/home/DISCO/data/solice/01/Job@549:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 286, in run
getattr(self, task.mode)(task, params)
File "/home/DISCO/data/solice/01/Job@549:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 299, in map
for key, val in self['map'](entry, params):
File "count_normal_words.py", line 12, in map
File "/usr/lib64/python2.7/json/__init__.py", line 326, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
2012/11/25 21:43:26 master WARN: Job killed
Status: [map] 1 waiting, 0 running, 0 done, 1 failed
Traceback (most recent call last):
File "count_normal_words.py", line 28, in <module>
for word, count in result_iterator(job.wait(show=True)):
File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 348, in wait
timeout, poll_interval * 1000)
File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 309, in check_results
raise JobError(Job(name=jobname, master=self), "Status %s" % status)
disco.error.JobError: Job Job@549:db30e:25bd8 failed: Status dead
错误说明:ValueError: No JSON object could be decoded
。同样,这可以正常使用文本文件作为输入,但现在是DDFS。
任何想法,我愿意接受建议吗?
答案 0 :(得分:1)
加载到DDFS时,数据似乎不完整。加载的最后一行没有}作为最终字符。因此,这项工作失败了。我想我需要尝试继续处理非JSON行。