Question

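I'm new to Python and I'm trying to build a script that imports text_file_1, which contains a body of text. I want the script to read the text and look for certain words that I define in a list called key_words. The list contains words both with a capital first letter (Nation) and in lowercase (nation). After the search, Python should output the words vertically to a new text file called "List of words", along with the number of times each word appears in the text. If I then read text_file_2, which contains another body of text, the script should do the same thing but add the results to the list of words in the original file.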

Example:

List of words

File 1:

God: 5
Nation: 4
creater: 8
USA: 3

File 2:

God: 10
Nation: 14
creater: 2
USA: 1

This is what I have done so far:

from sys import argv
from string import punctuation

script = argv[0]
all_filenames = argv[1:]

print "Text file to import and read: " + all_filenames
print "\nReading file...\n"
text_file = open(all_filenames, 'r')
all_lines = text_file.readlines()
#print all_lines
text_file.close()

for all_filenames in argv[1:]:
   print "I get: " + all_filenames

print "\nFile read finished!"
#print "\nYour file contains the following text information:"
#print "\n" + text_file.read()

#~ for word, count in word_freq.items():
    #~ print word, count

keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
             'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence', 
             'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
             'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
             'constitution', 'Government', 'Citizens', 'citizens']

for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )

output_file = open("List_of_words.txt", "w")

for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )

output_file.close()

Maybe this code could be used somehow?

import fileinput
for line in fileinput.input('List_of_words.txt', inplace = True):
    if line.startswith('Existing file that was read'):
        #if line starts Existing file that was read then do something here
        print "Existing file that was read"
    elif line.startswith('New file that was read'):
        #if line starts with New file that was read then do something here
        print "New file that was read"
    else:
        print line.strip()

Answer 1

This way you can see the results on the screen.

from sys import argv
from collections import Counter
from string import punctuation

script, filename = argv

text_file = open(filename, 'r')

word_freq = Counter([word.strip(punctuation) for line in text_file for word in line.split()])

#~ for word, count in word_freq.items():
    #~ print word, count

key_words = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater',
             'Country', 'country', 'People', 'people', 'Liberty', 'liberty',
             'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage']

for word in key_words:
    if word in word_freq:
        print word, word_freq[word]
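
To see what the Counter line does on a small input, here is a minimal sketch. It is written for Python 3 (the script above uses Python 2 print statements), and sample_lines is made-up text standing in for an opened file. Note that strip(punctuation) only removes punctuation from the ends of a word, so "nation's" is counted as its own word and not as "nation".

```python
from collections import Counter
from string import punctuation

# Hypothetical sample text standing in for the lines of a file.
sample_lines = [
    "God bless the Nation, and the nation's people.\n",
    "One Nation, under God!\n",
]

# Same idea as in the script above: split every line on whitespace,
# then strip punctuation from both ends of every word.
word_freq = Counter(
    word.strip(punctuation) for line in sample_lines for word in line.split()
)

# Only key words that actually occur are printed.
for word in ["God", "Nation", "nation", "USA"]:
    if word in word_freq:
        print(word, word_freq[word])
```

With this sample, only God and Nation are printed, each with a count of 2.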

Now you have to save it to a file.

To handle more files, use

for filename in argv[1:]:
   # do your job

EDIT

With this code (my_script.py)

from sys import argv

for filename in argv[1:]:
    print("I get " + filename)

you can run the script

python my_script.py file1.txt file2.txt file3.txt

and get

I get file1.txt
I get file2.txt
I get file3.txt

You can use this to count words in many files.

-

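readlines() reads all lines into memory, so it needs more memory. For very, very large files this can be a problem.

In the current version, Counter() already counts all the words in all the lines (test it) while using less memory, because the file is read one line at a time. With readlines() you get the same word_freq, but you use more memory.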
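
A small check of that claim, as a sketch for Python 3: io.StringIO is used here as a stand-in for an opened text file, and count_words is a hypothetical helper name, not part of the original script.

```python
import io
from collections import Counter
from string import punctuation

def count_words(lines):
    """Count punctuation-stripped words from any iterable of lines."""
    return Counter(
        word.strip(punctuation) for line in lines for word in line.split()
    )

# io.StringIO stands in for an opened text file (an assumption for this demo).
text = "God and Nation.\nThe Nation, the people.\n"

lazy = count_words(io.StringIO(text))               # iterates line by line
eager = count_words(io.StringIO(text).readlines())  # builds the full list first

print(lazy == eager)
```

Both calls produce the same Counter; the only difference is whether the whole list of lines is held in memory at once.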

-

writelines(list_of_result) does not add "\n" after every line, and it does not add the ':' in "God: 3" either.

It is better to use something like

output_file = open("List_of_words.txt", "w")

for word in key_words:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )

output_file.close()
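
The writelines() behaviour mentioned above can be checked with a short Python 3 sketch. io.StringIO stands in for the output file, and the results list is hypothetical, already formatted data.

```python
import io

results = ["God: 3", "Nation: 4"]  # hypothetical, already formatted entries

# writelines() writes the strings exactly as given, with no separators.
buffer = io.StringIO()  # in-memory stand-in for the output file
buffer.writelines(results)
joined = buffer.getvalue()
print(repr(joined))

# Adding "\n" yourself gives one entry per line.
buffer = io.StringIO()
buffer.writelines(line + "\n" for line in results)
separated = buffer.getvalue()
print(repr(separated))
```

The first call runs the entries together on one line; the second puts each entry on its own line.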

EDIT: new version - it appends the results to the end of List_of_words.txt

from sys import argv
from string import punctuation
from collections import Counter

keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
            'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
            'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
            'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
            'constitution', 'Government', 'Citizens', 'citizens']

for one_filename in argv[1:]:

    print "Text file to import and read:", one_filename
    print "\nReading file...\n"

    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()

    print "\nFile read finished!"

    word_freq = Counter([word.strip(punctuation) for line in all_lines for word in line.split()])

    print "Append result to the end of file: List_of_words.txt"

    output_file = open("List_of_words.txt", "a")

    for word in keyWords:
        if word in word_freq:
            output_file.write( "%s: %d\n" % (word, word_freq[word]) )

    output_file.close()

EDIT: write the sum of the results to one file

from sys import argv
from string import punctuation
from collections import Counter

keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
            'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
            'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
            'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
            'constitution', 'Government', 'Citizens', 'citizens']

word_freq = Counter()

for one_filename in argv[1:]:

    print "Text file to import and read:", one_filename
    print "\nReading file...\n"

    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()

    print "\nFile read finished!"

    word_freq.update( [word.strip(punctuation) for line in all_lines for word in line.split()] )

print "Write sum of results: List_of_words.txt"

output_file = open("List_of_words.txt", "w")

for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )

output_file.close()
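
If you want one labelled section per file, as in the example from the question, something along these lines might work. This is only a Python 3 sketch: count_keywords and format_report are hypothetical helper names, the "File N (name):" header format is an assumption, and the texts are made-up strings in place of real file contents.

```python
from collections import Counter
from string import punctuation

def count_keywords(text, key_words):
    """Return (word, count) pairs for the key words that occur in text."""
    word_freq = Counter(word.strip(punctuation) for word in text.split())
    return [(word, word_freq[word]) for word in key_words if word in word_freq]

def format_report(named_texts, key_words):
    """Build one 'File N (name):' section per (name, text) pair."""
    lines = []
    for number, (name, text) in enumerate(named_texts, start=1):
        lines.append("File %d (%s):" % (number, name))
        for word, count in count_keywords(text, key_words):
            lines.append("%s: %d" % (word, count))
        lines.append("")  # blank line between sections
    return "\n".join(lines)

# Hypothetical contents; in a real script these would come from argv[1:].
texts = [
    ("file1.txt", "God save the Nation. God!"),
    ("file2.txt", "The USA is a Nation."),
]

report = format_report(texts, ["God", "Nation", "USA"])
print(report)
```

To use it with real files, read each name from argv[1:] into a (name, text) pair and write the returned string to List_of_words.txt. Because the text is split on whitespace, a two-word key such as 'United States' will never match with this approach.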

Read a text file and look for certain words from a list of key words
