Question

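I am trying to write word-count code for a given file. When I run it, I get an empty-string entry in the dictionary; I just want the words and their frequencies. I am not sure what is wrong.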

import collections, re

class Wordcount(object):
    def __init__(self):
        self.freq_dict = collections.defaultdict(int)

    def count(self,input_file):
        with open(input_file) as f:
            for line in f:
                words = line.rstrip().strip().split()
                for word in words:
                    word = word.lower()
                    word = re.sub("[^A-Za-z0-9]+",'',word)
                    self.freq_dict[word]+=1
        print self.freq_dict

def Main():
    c1 = Wordcount()
    c1.count('out.txt')

My out.txt looks like this:

The quick brown fox jumps over the lazy dog

--
 asd
 asdasd


The quick brown fox jumps over the lazy dog's

The quick brown fox jumps over the lazy dog

The space before asd seems to be parsed into the dictionary:

defaultdict(<type 'int'>, {'': 1, 'brown': 3, 'lazy': 3, 'over': 3, 'fox': 3, 'dog': 2, 'asdasd': 1, 'dogs': 1, 'asd': 1, 'quick': 3, 'the': 6, 'jumps': 3})

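Also, I want to extend this to ssh into nearly 1000 machines, read the file on each one, and increment the word frequencies. What is the best way to do that? Should I create a thread T1 that logs in to a machine, hands the login to another thread that reads the file, and then hands off to yet another thread that only increments the hash?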

Any suggestions on how to scale this would be really helpful.

Answer 1

Here is a simple example using Fabric. Fabric is a framework for executing commands over ssh on multiple machines.

from fabric.api import task, run, get
from collections import Counter
from StringIO import StringIO


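def worlds(data):
    # Split the file contents into whitespace-separated words
    return data.split()


@task
def count_worlds():
    s_fp = StringIO()
    # For big files it is better to download to a temporary file
    get('/some/remote/file', s_fp)
    # Split into words first; Counter on the raw string would count characters
    world_count = Counter(worlds(s_fp.getvalue()))
    # do something with world_count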
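
The `worlds` helper above splits on whitespace only. With the normalization from the question, the `--` line in out.txt contains no alphanumeric characters, so `re.sub` reduces it to an empty string, and that is what ends up as the `''` key (the leading space before `asd` is already discarded by `split()`). A minimal sketch of a replacement helper that skips such empty tokens follows; the name `count_words` is hypothetical and not part of the original answer.

```python
import re
from collections import Counter

# Same character class as in the question, compiled once.
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def count_words(text):
    """Return a Counter of lower-cased, alphanumeric-only words in text."""
    counts = Counter()
    for token in text.split():
        word = _NON_ALNUM.sub("", token.lower())
        if word:  # tokens such as "--" normalize to "" and are skipped
            counts[word] += 1
    return counts
```

Passing the downloaded file contents through a helper like this instead of a plain `split()` would keep the per-host counts free of the empty key.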

To run this task on many machines, save the script as fabfile.py and run:

$ fab count_worlds -H host1,host2,host3

You can also define the hosts in the fabfile; see the Fabric documentation for details. Of course, you need to install Fabric first.
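
To get a single frequency table for all machines, the per-host results still have to be combined. The sketch below assumes each host's task returns a `Counter` and that the results are available as a dict keyed by host name (as far as I know, Fabric 1.x's `execute` returns such a mapping of host strings to return values, but treat that as an assumption to verify). The function name `merge_counts` is hypothetical.

```python
from collections import Counter


def merge_counts(per_host_counts):
    """Sum per-host word Counters into one Counter.

    per_host_counts: dict mapping host name -> Counter. Values that are not
    Counters (for example None for a host that failed) are skipped.
    """
    total = Counter()
    for host, counts in per_host_counts.items():
        if isinstance(counts, Counter):
            total.update(counts)  # update() adds counts rather than replacing
    return total
```

Because merging is just addition of counters, it does not need its own thread; only the ssh and file transfer step benefits from running in parallel.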

Python word count of files over ssh on 100 machines
