Question

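I have a Python file that uses mrjob on Hadoop (version 2.6.0) to count bigrams, but I'm not getting the output I hoped for, and I'm having trouble deciphering the terminal output to see where things went wrong.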

I wrote the mapper function on my local machine first (basically everything up to the "yield" line) to make sure my code picks up bigrams as expected, so I thought it should work fine... but, of course, something goes wrong.

When I run the code on the Hadoop server, a large amount of information is printed to the screen, and I'm not yet sure which parts are helpful for homing in on the problem area.

My code:

import re

import mrjob.protocol
from mrjob.job import MRJob

regex_for_words = re.compile(r"\b[\w']+\b")

class BiCo(MRJob):
  OUTPUT_PROTOCOL = mrjob.protocol.RawProtocol

  def mapper(self, _, line):
    words = regex_for_words.findall(line)
    wordsinline = list()
    for word in words:
        wordsinline.append(word.lower()) 
    wordscounter = 0
    totalwords = len(wordsinline)
    for word in wordsinline:
        if wordscounter < (totalwords - 1):
            nextword_pos = wordscounter+1
            nextword = wordsinline[nextword_pos]
            bigram = word, nextword
            wordscounter +=1
            yield (bigram, 1)

  def combiner(self, bigram, counts):
    yield (bigram, sum(counts))

  def reducer(self, bigram, counts):
    yield (bigram, str(sum(counts)))

if __name__ == '__main__':
  BiCo.run()

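I'm confused about why no counters were found for Step 1 (I'm assuming Step 1 is the mapper part of my code, which may be a wrong assumption). If I'm reading the Hadoop output correctly, the job at least made it to the reduce phase (since there are Reduce input groups) and it did not report any shuffle errors. I suspect the answer lies in "Unencodable output: TypeError=79518", but none of my Google searches has helped me understand what this error is.

Any help or insight would be greatly appreciated.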

Answer 1

One problem is the encoding of the bigram in the mapper. The way it is coded above, bigram is the Python type "tuple":

>>> word = 'the'
>>> word2 = 'boy'
>>> bigram = word, word2
>>> type(bigram)
<type 'tuple'>

Typically, plain strings are used as keys, so build the bigram as a string instead. One way you could do this is:

bigram = '-'.join((word, nextword))

When I make that change in your program, I see output like this:

automatic-translation   1
automatic-vs    1
automatically-focus 1
automatically-learn 1
automatically-learning  1
automatically-translate 1
available-including 1
available-without   1
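
To illustrate why the tuple key is a problem, here is a minimal sketch that does not use mrjob at all. `raw_line` is a hypothetical stand-in for a raw text writer that joins key and value with a tab; that is my assumption about how such a protocol behaves, not mrjob's actual code. `WORD_RE` and `bigrams` are likewise names made up for this example.

```python
import re

# Same word pattern as in the question.
WORD_RE = re.compile(r"\b[\w']+\b")


def bigrams(line):
    """Return the bigrams of a line as 'first-second' strings."""
    words = [w.lower() for w in WORD_RE.findall(line)]
    # zip pairs each word with its successor; a one-word line yields nothing.
    return ["-".join(pair) for pair in zip(words, words[1:])]


def raw_line(key, value):
    """Hypothetical raw writer: key and value must both be strings."""
    return "\t".join((key, value))


if __name__ == "__main__":
    print(bigrams("The boy saw the dog"))
    print(raw_line("the-boy", "1"))
    try:
        raw_line(("the", "boy"), "1")  # tuple key, as in the question
    except TypeError as exc:
        print("TypeError: %s" % exc)
```

With a string key the writer produces one tab-separated line; with a tuple key the join raises a TypeError, which is consistent with the "Unencodable output: TypeError" message in the question.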

One more tip: try -q on the command line to silence all the intermediate Hadoop noise. Sometimes it just gets in the way.

HTH.

Answer 2

This is a caching error. I have mostly run into it with the Hortonworks sandbox. The simple fix is to log out of the sandbox and ssh in again.

"Counters from Step 1: No Counters found" using Hadoop and mrjob
