Being Pythonic: gathering arbitrary-length strings - indexer

Time: 2016-01-01 13:21:21

Tags: python python-2.x

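First, the code below works as is. I'm primarily a Ruby programmer, so I'm still feeling my way around Python, and I'm sure there must be a DRYer way to accomplish what I did below.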

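I'm building an indexer that creates a dictionary of the terms that repeat in a document, along with a count, and then outputs the terms with their counts. Right now it supports phrases of up to four words. Is there a better way to abstract this logic so that I can do the same thing for phrases of arbitrary length without adding more and more conditionals?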

import sys
file=open(sys.argv[1],"r")
wordcount = {}
last_word = ""
last_last_word = ""
last_last_last_word = ""

for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

    if last_last_last_word != "":
        if "{} {} {} {}".format(last_last_last_word,last_last_word,last_word,word) not in wordcount:
            wordcount[last_last_last_word + " " + last_last_word + " " + last_word + " " + word ] = 1
        else: 
            wordcount[last_last_last_word + " " + last_last_word + " " + last_word + " " + word ] += 1
    last_last_last_word = last_last_word

    if last_last_word != "":
        if last_last_word + " " + last_word + " " + word not in wordcount:
            wordcount[last_last_word + " " + last_word + " " + word ] = 1
        else: 
            wordcount[last_last_word + " " + last_word + " " + word ] += 1
    last_last_word = last_word

    if last_word != "":
        if last_word + " " + word not in wordcount:
            wordcount[last_word + " " + word] = 1
        else: 
            wordcount[last_word + " " + word] += 1
    last_word = word

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True):
    print k,v

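I'm including a more extensive sample input and output. I apologize for the length, but the nature of this code tends to produce a lot of output.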

This input:

this is a sample input file an input file will always be all lower case with no punctuation

produces this output:

file 2
input 2
input file 2
an input file 1
all 1
lower case 1
be 1
is 1
file will always 1
an 1
sample 1
case 1
always be all lower 1
this is a 1
will always be 1
sample input file 1
will always 1
is a sample 1
all lower 1
lower case with no 1
no 1
with 1
with no 1
file will always be 1
with no punctuation 1
lower 1
be all lower case 1
no punctuation 1
an input file will 1
input file an 1
file an 1
input file an input 1
always be 1
file an input file 1
be all 1
is a 1
input file will 1
file will 1
an input 1
input file will always 1
will always be all 1
always be all 1
lower case with 1
a sample 1
a sample input file 1
a sample input 1
is a sample input 1
be all lower 1
a 1
sample input file an 1
sample input 1
case with no punctuation 1
all lower case with 1
this 1
always 1
file an input 1
case with 1
case with no 1
will 1
all lower case 1
punctuation 1
this is 1
this is a sample 1
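Note that every single word is counted, as is every pair of words, every triple of words, and every quadruple of words. I'd like to DRY up this code so that I can extend the counting to word groups of arbitrary length.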

6 answers:

Answer 0: (score: 3)

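If you're worried about a large file (perhaps one that doesn't even have line endings to allow line-by-line iteration), you can memory-map it (keeping memory usage low), use a regular expression to isolate all the lowercase words, create a sliding window of N words, and then update a Counter accordingly, for example: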

import re
import mmap
from itertools import islice, izip, tee
from collections import Counter
from pprint import pprint

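def word_grouper(filename, size):
    counts = Counter()
    with open(filename) as fin:
        # Memory-map the file so it is not read into memory all at once.
        mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
        # Lazily yield each run of lowercase letters as a word.
        words = (m.group() for m in re.finditer('[a-z]+', mm))
        # size+1 staggered copies of the word stream form a sliding window.
        sliding = [islice(w, n, None) for n, w in enumerate(tee(words, size+1))]
        for slide in izip(*sliding):
            # Count the prefixes of length 1..size of each window. Phrases
            # starting within the last `size` words never begin a full
            # window, so they are not counted.
            counts.update(slide[:n] for n in range(1, len(slide)))

    return counts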

counts = word_grouper('input filename', 4)
# do appropriate formatting instead of just `pprint`ing
pprint(counts.most_common())

Sample output (the input file contains the sample string):

[(('file',), 2),
 (('input', 'file'), 2),
 (('input',), 2),
 (('a', 'sample', 'input'), 1),
 (('file', 'will', 'always', 'be'), 1),
 (('sample', 'input', 'file', 'an'), 1),
 (('this', 'is', 'a', 'sample'), 1),
 (('this', 'is'), 1),
 (('will',), 1),
 (('lower', 'case', 'with'), 1),
 (('an', 'input', 'file', 'will'), 1),
 (('sample', 'input'), 1),
 (('is', 'a'), 1),
 (('all', 'lower', 'case', 'with'), 1),
 (('input', 'file', 'will'), 1),
 (('an',), 1),
 (('always', 'be'), 1),
 (('lower', 'case', 'with', 'no'), 1),
 (('an', 'input'), 1),
 (('be', 'all', 'lower'), 1),
 (('this',), 1),
 (('be', 'all', 'lower', 'case'), 1),
 (('this', 'is', 'a'), 1),
 (('sample',), 1),
 (('sample', 'input', 'file'), 1),
 (('will', 'always', 'be', 'all'), 1),
 (('a',), 1),
 (('a', 'sample'), 1),
 (('is', 'a', 'sample'), 1),
 (('will', 'always'), 1),
 (('lower',), 1),
 (('lower', 'case'), 1),
 (('file', 'an'), 1),
 (('file', 'an', 'input'), 1),
 (('file', 'will'), 1),
 (('is',), 1),
 (('all', 'lower'), 1),
 (('input', 'file', 'an', 'input'), 1),
 (('always', 'be', 'all', 'lower'), 1),
 (('an', 'input', 'file'), 1),
 (('input', 'file', 'an'), 1),
 (('be', 'all'), 1),
 (('input', 'file', 'will', 'always'), 1),
 (('be',), 1),
 (('all',), 1),
 (('always', 'be', 'all'), 1),
 (('is', 'a', 'sample', 'input'), 1),
 (('always',), 1),
 (('all', 'lower', 'case'), 1),
 (('file', 'an', 'input', 'file'), 1),
 (('file', 'will', 'always'), 1),
 (('a', 'sample', 'input', 'file'), 1),
 (('will', 'always', 'be'), 1)]
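
Note that ('case',), ('with',), ('no',) and ('punctuation',) are missing from this output: the window is size+1 words wide, so phrases that start within the last `size` words are never counted. A minimal sketch of one way to include them, assuming the words are already available as an iterable; `count_ngrams` is a hypothetical helper name, not part of the code above, and it swaps the `tee`/`izip` window for a `collections.deque`:

```python
from collections import Counter, deque


def count_ngrams(words, max_len):
    """Count every phrase of 1..max_len consecutive words (hypothetical helper)."""
    counts = Counter()
    window = deque(maxlen=max_len)  # keeps only the most recent max_len words
    for word in words:
        window.append(word)
        recent = tuple(window)
        # Count every phrase that ends at the current word.
        for start in range(len(recent)):
            counts[recent[start:]] += 1
    return counts


words = "this is a sample input file an input file".split()
counts = count_ngrams(words, 4)
print(counts[("input", "file")])  # 2
```

Because each word is consumed once and only the last `max_len` words are kept, this should also work with the lazy `words` generator from the memory-mapped version.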

Answer 1: (score: 0)

Here's a quick refactoring of your code; defaultdict is your friend.

It takes the number of words you want as the second argument.

import sys
from collections import defaultdict

file=open(sys.argv[1],"r")

wordcount = defaultdict(int)
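# Keep the previous N-1 words; once the current word is appended this
# gives phrases of at most N words.
wordlist = ["" for i in range(int(sys.argv[2]) - 1)]

def check(wordcount, wordlist, word):

    wordlist.append(word)
    for i, previous in enumerate(wordlist):
        if previous != "":
            # Join with single spaces so keys carry no trailing space.
            current = " ".join(wordlist[i:])
            wordcount[current] += 1

    return wordlist[1:]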

for word in file.read().split():
    wordlist = check(wordcount, wordlist, word)

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True):
    print k,v

Answer 2: (score: 0)

Updated to make it lazier.

from collections import Counter
import itertools
import operator as op


def count_phrases(words, phrase_len):
    return reduce(op.add, 
    (Counter(tuple(words[i:i+l]) for i in xrange(len(words)-l+1)) for l in phrase_len))

Example:

words = "a b c a a".split()
for phrase, count in count_phrases(words, [1, 2]).iteritems():
    print " ".join(phrase), count

Output:

b c 1
a 3
c 1
b 1
c a 1
a a 1
a b 1

Answer 3: (score: 0)

Check this out:

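from collections import Counter

def parser(data,size):
    chunked = data.split()
    phrases = []
    # Note: this bound stops one short, so the last phrase of each line
    # is skipped (compare the output below).
    for i in xrange(len(chunked)-size):
        phrase=' '.join(chunked[i:size+i])
        phrases.append(phrase)
    return phrases

def parse_file(fname,size):
    result = []
    with open(fname,'r') as f:
        for data in f.readlines():
            # Note: xrange(1,size) covers phrase lengths 1..size-1 only.
            for i in xrange(1,size):
                result+=parser(data.strip(),i)

    return Counter(result)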


result= parse_file('file.txt',4) 
print sorted(result.items(),key=lambda x:x[1],reverse=True)

[('file', 2),
 ('input', 2),
 ('input file', 2),
 ('an input file', 1),
 ('all', 1),
 ('always be all', 1),
 ('is', 1),
 ('an', 1),
 ('sample', 1),
 ('this is a', 1),
 ('will always be', 1),
 ('sample input file', 1),
 ('will always', 1),
 ('is a sample', 1),
 ('all lower', 1),
 ('no', 1),
 ('with no', 1),
 ('lower case', 1),
 ('case', 1),
 ('input file will', 1),
 ('case with no', 1),
 ('input file an', 1),
 ('file an', 1),
 ('be', 1),
 ('always be', 1),
 ('be all lower', 1),
 ('be all', 1),
 ('lower', 1),
 ('is a', 1),
 ('an input', 1),
 ('a sample input', 1),
 ('lower case with', 1),
 ('a sample', 1),
 ('file will', 1),
 ('with', 1),
 ('a', 1),
 ('file will always', 1),
 ('sample input', 1),
 ('this', 1),
 ('always', 1),
 ('file an input', 1),
 ('case with', 1),
 ('will', 1),
 ('all lower case', 1),
 ('this is', 1)]
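
The list above stops short in two places: it has no four-word phrases and nothing ending in "punctuation", because `xrange(1,size)` stops at size-1 and `xrange(len(chunked)-size)` skips the last start position. A sketch with inclusive bounds, assuming the same line-by-line splitting; `phrases` and `count_phrases_in_lines` are illustrative names, and the second one takes any iterable of lines rather than a filename so it can be tried without a file:

```python
from collections import Counter


def phrases(line, size):
    """Return every run of `size` consecutive words in `line`."""
    words = line.split()
    # +1 so that the phrase ending on the last word is included.
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def count_phrases_in_lines(lines, max_size):
    """Count phrases of 1..max_size words, line by line."""
    counts = Counter()
    for line in lines:
        for size in range(1, max_size + 1):  # inclusive upper bound
            counts.update(phrases(line, size))
    return counts


sample = ["this is a sample input file an input file will always be all lower case with no punctuation"]
result = count_phrases_in_lines(sample, 4)
print(result["with no punctuation"])  # 1
```

An open file object can be passed as `lines`, since iterating over it yields one line at a time.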

Answer 4: (score: 0)

Here you go, man. I think this is what you've been looking for.

string="this is a sample input file an input file will always be all lower case with no punctuation"

def words(count):
    return [" ".join(string.split()[a:b]) for a in range(len(string.split())) for b in range(a+count+1) if len(string.split()[a:b]) == count]

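It works by slicing the input text and returns a list of phrases of the given length.

Call the function with the length of the sequences you are looking for.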

lst = words(3)

Then loop over the result to get the counts:

for word in set(lst):
    print word, lst.count(word)

an input file 1
file will always 1
is a sample 1
be all lower 1
file an input 1
with no punctuation 1
input file will 1
lower case with 1
this is a 1
always be all 1
will always be 1
sample input file 1
a sample input 1
all lower case 1
case with no 1
input file an 1

Yes, as the comments said, this is an inefficient approach, so I have to apologize for that.
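
For a single phrase length, the repeated `lst.count(word)` calls (each one rescans the whole list) can be replaced by one pass with `collections.Counter`. A small sketch, assuming the same space-separated input; `phrase_counts` is a hypothetical name:

```python
from collections import Counter


def phrase_counts(text, length):
    """Count every run of `length` consecutive words in `text`."""
    words = text.split()
    # Staggered slices words[0:], words[1:], ... zipped together give
    # every run of `length` consecutive words.
    runs = zip(*(words[i:] for i in range(length)))
    return Counter(" ".join(run) for run in runs)


text = "this is a sample input file an input file will always be all lower case with no punctuation"
counts = phrase_counts(text, 2)
print(counts["input file"])  # 2
```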

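You stated that you want phrases of arbitrary length, so if my first assumption wasn't right, here is another solution that counts the phrase combinations without using the .count() method.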

But with this approach the whole text is also treated as one phrase, so make sure you really know which phrase lengths you want.

words_list = string.split()
words_dict = {}

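# a is the exclusive end index of the slice, so it has to run up to
# len(words_list) for phrases ending on the last word to be counted.
for a in range(1, len(words_list) + 1):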
    for b in range(a):
        phrase = " ".join(words_list[b:a])
        if phrase in words_dict:
            words_dict[phrase] += 1
        else:
            words_dict[phrase] = 1

for i in words_dict:
    print i, words_dict[i]

All the best.

Answer 5: (score: 0)

A humble contribution:

import sys
file=open(sys.argv[1],"r")
wordcount = {}
nb_words = 4
last_words = []

for word in file.read().split():
    last_words = [word] + last_words 
    if len (last_words) > nb_words:
        last_words.pop()
    for i in range(len(last_words)-1,-1,-1):
        if last_words[i] != "":
            key = ' '.join(last_words[:i+1])
            if key not in wordcount:
                wordcount[key] = 1
            else: 
                wordcount[key] += 1

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True):
    print k,v

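I wrote a loop to replace the variables, so now you have a parameter (nb_words) that can go beyond 4 words. Edit: after a few bug fixes, I'm now sure it produces the same output.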