I am new to both Hadoop and Python, so my question is how to run a Python script in Hadoop. I am also writing a word-count program in Python. Can this script be executed without using MapReduce? With the code I have written, I see output like the following:
Darkness 1 Heaven 2 It 3 Light 4 age 5 age 6 all 7 all 8 authorities 9 before 10 before 11 being 12 belief 13 best 14 comparison 15 degree 16 despair 17 direct 18 direct 19
It is just numbering the words in the list, but what I need to achieve is to group the words, remove the duplicates, and count the number of times each word occurs.
Below is my code. Can somebody please tell me where I have made the mistake?
********************************************************
Wordcount.py
********************************************************
import urllib2
import random
from operator import itemgetter
current_word = {}
current_count = 0
story = 'http://sixty-north.com/c/t.txt'
request = urllib2.Request(story)
response = urllib2.urlopen(request)
each_word = []
words = None
count = 1
same_words ={}
word = []
""" looping the entire file """
for line in response:
    line_words = line.split()
    for word in line_words: # looping each line and extracting words
        each_word.append(word)
random.shuffle(each_word)
Sort_word = sorted(each_word)
for words in Sort_word:
    same_words = words.lower(),int(count)
    #print same_words
    #print words
    if not words in current_word :
        current_count = current_count +1
        print '%s\t%s' % (words, current_count)
    else:
        current_count = 1
        #if Sort_word == words.lower():
        #current_count += count
current_count = count
current_word = word
#print '2. %s\t%s' % (words, current_count)
Answer 0 (score: 0)
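To run a Python-based MR job, take a look at:
http://hadoop.apache.org/docs/r1.1.2/streaming.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
You need to design your code in terms of a Mapper and a Reducer for Hadoop to be able to execute your Python script. Before you start writing code, read up on the Map-Reduce programming paradigm. It is important to understand the MR programming paradigm and the role that {key, value} pairs play in solving a problem.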
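As a rough illustration of the {key, value} idea, here is a minimal sketch of a word-count mapper and reducer in the style Hadoop Streaming expects: the mapper emits one (word, 1) pair per word, and the reducer sums the counts of adjacent pairs that share a key, relying on the input being sorted by key. The function names `map_words`, `reduce_counts` and `local_word_count` are made up for this example, the code is written for Python 3, and `sorted()` only stands in for Hadoop's sort step so that the sketch can be tried locally.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: yield a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_counts(sorted_pairs):
    """Reducer step: sum the counts of consecutive pairs that share a key.

    The input must already be sorted by key; in a real job the framework
    does this sorting between the map and reduce steps.
    """
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def local_word_count(lines):
    """Simulate map -> sort -> reduce locally (no Hadoop involved)."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    sample = ["it was the best of times", "it was the worst of times"]
    for word, count in local_word_count(sample):
        # Hadoop Streaming exchanges key/value pairs as tab-separated lines.
        print("%s\t%d" % (word, count))
```

In an actual streaming job the two steps would live in separate scripts that read lines from standard input and print tab-separated pairs to standard output; the tutorial linked above walks through that setup.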
# Your code above, modified to generate the required output
import urllib2
import random
from operator import itemgetter
current_word = {}
current_count = 0
story = 'http://sixty-north.com/c/t.txt'
request = urllib2.Request(story)
response = urllib2.urlopen(request)
each_word = []
words = None
count = 1
same_words ={}
word = []
""" looping the entire file """
# Collect all the words into a list
for line in response:
    #print "Line = " , line
    line_words = line.split()
    for word in line_words: # loop over each line and extract its words
        each_word.append(word)

# For every collected word, update the dict same_words:
# if the lower-cased word is already a key, increment its mapped value by 1;
# otherwise add the word as a new key with a mapped value of 1
for words in each_word:
    key = words.lower()
    if key not in same_words:
        same_words[key] = 1
    else:
        same_words[key] = same_words[key] + 1

# Print each distinct word together with its number of occurrences
for each in same_words:
    print "word = ", each, ", count = ", same_words[each]
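For comparison, the same grouping and counting can be sketched with `collections.Counter` from the standard library. This is a variant under a few assumptions rather than the code above: it targets Python 3, it takes the text as a string instead of downloading it, and the helper name `count_words` is hypothetical.

```python
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lower-cased word to its number of occurrences."""
    return Counter(word.lower() for word in text.split())


if __name__ == "__main__":
    sample = "we had everything before us, we had nothing before us"
    counts = count_words(sample)
    # Print the distinct words in alphabetical order, one "word<TAB>count" per line.
    for word in sorted(counts):
        print("%s\t%d" % (word, counts[word]))
```

As with the dictionary-based version, `split()` leaves punctuation attached to the words, so `us,` and `us` are counted as two different keys unless the punctuation is stripped first.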