Why won't Python read stdin input as a dictionary?

Asked: 2014-08-12 18:11:13

Tags: python hadoop dictionary mapreduce stdin

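I'm sure I'm doing something silly here, but here goes. I'm working on a class assignment for my Udacity course, "Intro to Map Reduce and Hadoop". Our assignment is to write a mapper/reducer that counts the occurrences of words in our data set (the bodies of forum posts). I have an idea of how to do this, but I can't get Python to read the stdin data into the reducer as a dictionary.

Here is my approach so far: the mapper reads through the data (in this case, test data embedded in the code) and spits out a dictionary of word: count for each forum post: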

#!/usr/bin/python
import sys
import csv
import re
from collections import Counter


def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        body = line[4]
        #Counter(body)
        words = re.findall(r'\w+', body.lower())
        c = Counter(words)
        #print c.items()
        print dict(c)





test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"This is one sentence sentence\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Also one sentence!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Hey!\nTwo sentences!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One. Two! Three?\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One Period. Two Sentences\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Three\nlines, one sentence\n\"\t\"\"
"""

# This function allows you to test the mapper with the provided test string
def main():
    import StringIO
    sys.stdin = StringIO.StringIO(test_text)
    mapper()
    sys.stdin = sys.__stdin__

if __name__ == "__main__":
    main()

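The word counts for each forum post then go to stdout, one dictionary per line, like: {'this': 1, 'is': 1, 'one': 1, 'sentence': 2}

The reducer should then read this stdin input in as a dictionary: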

#!/usr/bin/python
import sys
from collections import Counter, defaultdict
for line in sys.stdin.readlines():
    print dict(line)

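But that fails, giving me this error message: ValueError: dictionary update sequence element #0 has length 1; 2 is required

That means (if I understand correctly) that each line is being read in not as a dictionary but as a text string. How can I get Python to understand that the input line is a dictionary? I've tried using Counter and defaultdict, but I still get the same problem, or each character gets read in as a list element, which is not what I want either.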
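To illustrate the reading above, here is a minimal sketch of why dict(line) raises this error. It is written for Python 3 (an assumption; the code in the question is Python 2), and the sample line is simply the mapper output quoted earlier. dict() iterates over the string one character at a time and expects each element to be a (key, value) pair, whereas ast.literal_eval() parses the text as a Python literal:

```python
import ast

# One line of mapper output, as it would arrive on stdin (plain text).
line = "{'this': 1, 'is': 1, 'one': 1, 'sentence': 2}\n"

# dict() iterates over the string character by character. The first
# element is "{", which has length 1 rather than being a (key, value) pair.
try:
    dict(line)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# ast.literal_eval() parses the text as a Python literal and returns a dict.
parsed = ast.literal_eval(line.strip())

print(error_message)
print(parsed)
```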

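Ideally, I want the reducer to read in the dict on each line and then add the values from the next line, so that after the second line the values are {'this':1,'is':1,'one':2,'sentence':3,'also':1}, and so on.

Thanks, JR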

1 Answer:

Answer 0: (score: 1)

Thanks to @keyser, the ast.literal_eval() method worked for me. Here's what I have now:

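#!/usr/bin/python
import sys
import ast
from collections import Counter

c = Counter()  # running word totals across all posts
for line in sys.stdin:
    # Each line is the text of a dict literal; parse it into a real dict.
    lineDict = ast.literal_eval(line)
    c.update(lineDict)  # add this post's counts to the running totals
print c.most_common()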
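For anyone on Python 3, the same idea might look roughly like the sketch below. The function name reduce_counts and the sample input are hypothetical, chosen only for illustration; io.StringIO stands in for sys.stdin so the snippet can be run on its own. The second sample line is made up so that the totals match the example given in the question.

```python
import ast
import io
from collections import Counter


def reduce_counts(stream):
    """Sum per-post word counts read from a stream of dict literals."""
    totals = Counter()
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # literal_eval only accepts Python literals; it does not run code.
        totals.update(ast.literal_eval(line))
    return totals


# Hypothetical sample input standing in for the mapper's output on stdin.
sample = io.StringIO(
    "{'this': 1, 'is': 1, 'one': 1, 'sentence': 2}\n"
    "{'also': 1, 'one': 1, 'sentence': 1}\n"
)
totals = reduce_counts(sample)
print(totals.most_common())
```

In an actual streaming job, sys.stdin would be passed in place of the StringIO object. Counter.update() adds counts rather than replacing them, which is what gives the running totals.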