如何制作它以便我只能读取特定单词的文本文件?

时间:2013-04-11 05:41:16

标签: python regex text-files word-count

如何让我的代码只读取文本文件中的特定单词并显示单词和计数(单词出现在文本文件中的次数)?

from collections import Counter
import re

def openfile(filename):
 fh = open(filename, "r+")
 str = fh.read()
 fh.close()
 return str

def removegarbage(str):
 str = re.sub(r'\W+', ' ', str)
 str = str.lower()
 return str

def getwordbins(words):
 cnt = Counter()
 for word in words:
    cnt[word] += 1
 return cnt

 def main(filename, topwords):
   txt = openfile(filename)
   txt = removegarbage(txt)
   words = txt.split(' ')
   bins = getwordbins(words)
   for key, value in bins.most_common(topwords):
    print key,value

  main('filename.txt', 10)

4 个答案:

答案 0 :(得分:1)

我认为做那么多功能太复杂了,为什么不在一个功能中做呢?

# def function if desired
# you may have the filepath/specific words etc as parameters

 f = open("filename.txt")
 counter=0
 for line in f:
     # you can remove punctuation, translate them to spaces,
     # now any interesting words will be surrounded by spaces and
     # you can detect them
     line = line.translate(maketrans(".,!? ","     "))
     words = line.split() # splits on any number of whitespaces
     for word in words:
         if word == specificword:
             # of use a list of specific words: 
             # if word in specificwordlist:
             counter+=1
             print word
             # you could also append the words to some list, 
             # create a dictionary etc
 f.close()

答案 1 :(得分:1)

生成文件中所有单词的生成器派上用场:

from collections import Counter
import re

def words(filename):
    regex = re.compile(r'\w+')
    with open(filename) as f:
        for line in f:
            for word in regex.findall(line):
                yield word.lower()

然后,要么:

wordcount = Counter(words('filename.txt'))               
for word in ['foo', 'bar']:
    print word, wordcount[word]

words_to_count = set(['foo', 'bar'])
wordcount = Counter(word for word in words('filename.txt') 
                    if word in words_to_count)               
print wordcount.items()

答案 2 :(得分:1)

我认为你要找的是一个简单的字典结构。这样您不仅可以跟踪您正在寻找的单词,还可以跟踪它们的计数。

字典将事物存储为键/值对。因此,例如,您可以使用密钥“alice”(您要查找的单词,并将其值设置为您找到该关键字的次数。

检查字典中是否有内容的最简单方法是使用Python的in关键字。即。

if 'pie' in words_in_my_dict: do something

有了这些信息,设置一个字计数器非常容易!

def get_word_counts(words_to_count, filename):
    words = filename.split(' ')
    for word in words:
        if word in words_to_count:
            words_to_count[word] += 1
    return words_to_count

if __name__ == '__main__':

    fake_file_contents = (
        "Alice's Adventures in Wonderland (commonly shortened to "
        "Alice in Wonderland) is an 1865 novel written by English"
        " author Charles Lutwidge Dodgson under the pseudonym Lewis"
        " Carroll.[1] It tells of a girl named Alice who falls "
        "down a rabbit hole into a fantasy world populated by peculiar,"
        " anthropomorphic creatures. The tale plays with logic, giving "
        "the story lasting popularity with adults as well as children."
        "[2] It is considered to be one of the best examples of the literary "
        "nonsense genre,[2][3] and its narrative course and structure, "
        "characters and imagery have been enormously influential[3] in "
        "both popular culture and literature, especially in the fantasy genre."
        )

    words_to_count = {
        'alice' : 0,
        'and' : 0,
        'the' : 0
        }

    print get_word_counts(words_to_count, fake_file_contents)

这给出了输出:

{'and': 4, 'the': 5, 'alice': 0}

由于dictionary存储了我们想要计算的单词它们出现的次数。整个算法只是检查每个单词是否在dict中,如果结果是,我们将1添加到该单词的值。

阅读词典here.

编辑:

如果你想计算所有单词,然后然后找到一个特定的集合,那么这个任务的字典仍然很棒(而且很快!)。

我们需要做的唯一改变是首先检查字典key是否存在,如果不存在,则将其添加到字典中。

实施例

def get_all_word_counts(filename):
    words = filename.split(' ')

    word_counts = {}
    for word in words: 
        if word not in word_counts:     #If not already there
            word_counts[word] = 0   # add it in.
        word_counts[word] += 1          #Increment the count accordingly
    return word_counts

这给出了输出:

and : 4
shortened : 1
named : 1
popularity : 1
peculiar, : 1
be : 1
populated : 1
is : 2
(commonly : 1
nonsense : 1
an : 1
down : 1
fantasy : 2
as : 2
examples : 1
have : 1
in : 4
girl : 1
tells : 1
best : 1
adults : 1
one : 1
literary : 1
story : 1
plays : 1
falls : 1
author : 1
giving : 1
enormously : 1
been : 1
its : 1
The : 1
to : 2
written : 1
under : 1
genre,[2][3] : 1
literature, : 1
into : 1
pseudonym : 1
children.[2] : 1
imagery : 1
who : 1
influential[3] : 1
characters : 1
Alice's : 1
Dodgson : 1
Adventures : 1
Alice : 2
popular : 1
structure, : 1
1865 : 1
rabbit : 1
English : 1
Lutwidge : 1
hole : 1
Carroll.[1] : 1
with : 2
by : 2
especially : 1
a : 3
both : 1
novel : 1
anthropomorphic : 1
creatures. : 1
world : 1
course : 1
considered : 1
Lewis : 1
Charles : 1
well : 1
It : 2
tale : 1
narrative : 1
Wonderland) : 1
culture : 1
of : 3
Wonderland : 1
the : 5
genre. : 1
logic, : 1
lasting : 1

注意:当我们split(' ')文件时,您会看到有一些“失火”。具体而言,某些单词附有开括号或右括号。你必须在你的文件处理中考虑到这一点..但是,我留下来让你弄明白了!

答案 3 :(得分:0)

这可能就足够了......不完全是你问的,但最终的结果是你想要的(我认为)

interesting_words = ["ipsum","dolor"]

some_text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec viverra consectetur sapien, sed posuere sem rhoncus quis. Mauris sit amet ligula et nulla ultrices commodo sed sit amet odio. Nullam vel lobortis nunc. Donec semper sem ut est convallis posuere adipiscing eros lobortis. Nullam tempus rutrum nulla vitae pretium. Proin ut neque id nisi semper faucibus. Sed sodales magna faucibus lacus tristique ornare.
"""

d = Counter(some_text.split())
final_list = filter(lambda item:item[0] in interesting_words,d.items())

然而它的复杂性并不令人敬畏,因此可能需要一段时间才能找到大文件和/或“interesting_words”的大清单