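I have written a program that reads the words of a txt file, in my case "elquijote.txt", and then uses a dictionary {key: value} to show each word that appears and how many times it occurs.
For example, for a file "test1.txt" containing the following words: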
hello hello hello good bye bye
my program's output is:
hello 3
good 1
bye 2
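The program has another option: it shows only those words that occur more times than a number passed in as an argument.
If we run the command "python readingwords.py test1.txt 2" in the shell, it shows the words in the file "test1.txt" that occur more times than the number entered, in this case 2.
Output: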
hello 3
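Now we can pass a third argument: a file of common words, such as conjunctions, which are so generic that we don't want them shown or stored in the dictionary.
My code works correctly; the problem is that with large files such as "elquijote.txt" it takes a very long time to finish.
I have been thinking that this is because of the auxiliary list I use to eliminate words.
I think the solution would be to avoid putting into my list, in the first place, any word that appears in the txt file (given as an argument) containing the words to discard.
Here is my code: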
import sys

def contar(aux):
    counts = {}
    for palabra in aux:
        palabra = palabra.lower()
        if palabra not in counts:
            counts[palabra] = 0
        counts[palabra] += 1
    return counts

def main():
    characters = '!?¿-.:;-,><=*»¡'
    aux = []
    counts = {}
    with open(sys.argv[1],'r') as f:
        aux = ''.join(c for c in f.read() if c not in characters)
        aux = aux.split()
    if (len(sys.argv)>3):
        with open(sys.argv[3], 'r') as f:
            remove = "".join(c for c in f.read())
            remove = remove.split()
        # Remove the words listed in the file (originally "Borrar del archivo")
        for word in aux:
            if word in remove:
                aux.remove(word)
    counts = contar(aux)
    for word, count in counts.items():
        if count > int(sys.argv[2]):
            print word, count

if __name__ == '__main__':
    main()
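The contar function puts the words into the dictionary.
The main function loads into the list "aux" the words stripped of symbol characters, and then removes from that same list the "forbidden" words loaded from another .txt file.
I think the right solution would be to discard the forbidden words at the same point where I discard the unaccepted symbols, but after trying several approaches I have not managed to do it.
You can test my code online: https://repl.it/Nf3S/54 Thanks.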
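As a rough illustration of the idea in the last paragraph, the forbidden words can be skipped during the same pass that strips the symbols, so no second list ever has to be filtered. The sketch below is written for Python 3 and is only one possible approach: `count_words` and its parameters are hypothetical names, and it assumes the forbidden words are supplied in lowercase.

```python
from collections import Counter

CHARACTERS = '!?¿-.:;-,><=*»¡'  # same symbols as in the question


def count_words(lines, forbidden=(), characters=CHARACTERS):
    """Count lowercased words, skipping symbols and forbidden words in one pass."""
    # Mapping every unwanted character to None makes translate() delete it.
    delete_table = str.maketrans('', '', characters)
    # A set gives constant-time membership tests; a list is scanned linearly.
    forbidden = set(forbidden)
    counts = Counter()
    for line in lines:
        for word in line.lower().translate(delete_table).split():
            if word not in forbidden:
                counts[word] += 1
    return counts


# Made-up sample lines; an open file object could be passed instead.
counts = count_words(["Hello, hello! and HELLO", "good bye; and bye"],
                     forbidden=["and"])
for word, count in counts.items():
    if count > 1:
        print(word, count)
```

Because `lines` can be any iterable of strings, passing an open file reads it one line at a time instead of loading the whole text into memory.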
Answer 0 (score: 2)
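Here are a few optimizations: count with collections.Counter, strip the unwanted characters with a single translate() call on the lowercased text, and drop the ignored words from the counts afterwards instead of filtering them out of the word list.
This speeds things up a bit, but not by an order of magnitude.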
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import os
import collections

def contar(aux):
    return collections.Counter(aux)

def main():
    characters = '!?¿-.:;-,><=*»¡'
    aux = []
    counts = {}
    with open(sys.argv[1],'r') as f:
        text = f.read().lower().translate(None, characters)
        aux = text.split()
    if (len(sys.argv)>3):
        with open(sys.argv[3], 'r') as f:
            remove = set(f.read().strip().split())
    else:
        remove = []
    counts = contar(aux)
    for r in remove:
        counts.pop(r, None)
    for word, count in counts.items():
        if count > int(sys.argv[2]):
            print word, count

if __name__ == '__main__':
    main()
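The `translate(None, characters)` call above is the Python 2 form of `str.translate`. Under Python 3 the deletion table is built with `str.maketrans` instead, and the text is handled as decoded characters, so a symbol such as `¿` is removed as a single character. A roughly equivalent sketch follows; `count_text` and its arguments are illustrative names, not part of the answer.

```python
import collections


def count_text(text, characters, remove=()):
    """Count words in `text`, then drop the ignored words from the result."""
    # Python 3: build a deletion table instead of passing None plus the characters.
    table = str.maketrans('', '', characters)
    counts = collections.Counter(text.lower().translate(table).split())
    # Popping after counting costs one dictionary operation per ignored word,
    # however often that word occurred in the text.
    for word in set(remove):
        counts.pop(word, None)
    return counts


# Made-up sample text, using the same symbol set as the answer.
result = count_text("¡Hola! hola, y adios", '!?¿-.:;-,><=*»¡', remove=["y"])
print(result.most_common())
```

Removing the ignored words after counting keeps the cost proportional to the size of the ignore list rather than to the length of the text.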
Answer 1 (score: 1)
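There are a few inefficiencies here. I have rewritten your code to take advantage of some optimizations. The reason for each change is given in the comments / docstrings: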
# -*- coding: utf-8 -*-
import sys
from collections import Counter

def contar(aux):
    """Here I replaced your hand made solution with the
    built-in Counter which is quite a bit faster.
    There's no real reason to keep this function, I left it to keep your code
    interface intact.
    """
    return Counter(aux)

def replace_special_chars(string, chars, replace_char=" "):
    """Replaces a set of characters by another character, a space by default
    """
    for c in chars:
        string = string.replace(c, replace_char)
    return string

def main():
    characters = '!?¿-.:;-,><=*»¡'
    aux = []
    counts = {}
    with open(sys.argv[1], "r") as f:
        # You were calling lower() once for every `word`. Now we only
        # call it once for the whole file:
        contents = f.read().strip().lower()
        contents = replace_special_chars(contents, characters)
        aux = contents.split()
    # Remove the ignored words (originally "Borrar del archivo")
    if len(sys.argv) > 3:
        with open(sys.argv[3], "r") as f:
            # What you had here was very inefficient:
            #   remove = "".join(c for c in f.read())
            # That would create an array of characters and then join them
            # together as a string. This is a bit silly because it's
            # identical to f.read():
            #   "".join(c for c in f.read()) == f.read()
            # ignore_words is a `set` to allow for very fast
            # inclusion/exclusion checks.
            ignore_words = set(f.read().strip().split())
        aux = (word for word in aux if word not in ignore_words)
    counts = contar(aux)
    for word, count in counts.items():
        if count > int(sys.argv[2]):
            print word, count

if __name__ == '__main__':
    main()
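Replacing the symbols with a space, as `replace_special_chars` does, splits punctuation-joined tokens into separate words, whereas deleting the symbols fuses them into one. The Python 3 sketch below shows the difference on a made-up string and performs the replacement in a single `str.translate` pass rather than one `replace` call per character. `to_spaces` is a hypothetical helper name, and it assumes `replace_char` is exactly one character long.

```python
def to_spaces(text, chars, replace_char=" "):
    """Replace every character in `chars` with `replace_char` in one pass."""
    # Two equal-length strings map each source character to its replacement.
    table = str.maketrans(chars, replace_char * len(chars))
    return text.translate(table)


sample = "adios,amigo"
# Replacing the comma with a space yields two words.
print(to_spaces(sample, ",-").split())
# Deleting the comma instead fuses both words into a single token.
print(sample.translate(str.maketrans("", "", ",-")).split())
```

Which behaviour is preferable depends on the text: for prose where punctuation is not always followed by a space, replacing with a space tends to give cleaner word counts.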
Answer 2 (score: 1)
A few changes and the reasoning behind them:
Parse the command line arguments under __name__ == '__main__': doing this enforces modularity, because the script only asks for command line arguments when it is run directly, not when its functions are imported from another script.
Use a regular expression such as [a-zA-Z0-9]+ to filter out words that are not alphanumeric.
Use a try / except block to pythonically attempt to set the minimum count from sys.argv[2], catching the IndexError exception to default the minimum count to 0.
Python script:
# sys
import sys
# regex
import re

def main(text_file, min_count):
    word_count = {}
    with open(text_file, 'r') as words:
        # Clean words of linebreaks and split
        # by ' ' to get list of words
        words = words.read().strip().split(' ')
        # Filter words that are not alphanum
        # (the range A-z would also match punctuation such as _ and ^)
        pattern = re.compile(r'^[a-zA-Z0-9]+$')
        words = filter(pattern.search, words)
        # Iterate through words and collect
        # count
        for word in words:
            if word in word_count:
                word_count[word] = word_count[word] + 1
            else:
                word_count[word] = 1
    # Iterate for output
    for word, count in word_count.items():
        if count > min_count:
            print('%s %s' % (word, count))

if __name__ == '__main__':
    # Get text file name
    text_file = sys.argv[1]
    # Attempt to get minimum count
    # from command line.
    # Default to 0
    try:
        min_count = int(sys.argv[2])
    except IndexError:
        min_count = 0
    main(text_file, min_count)
Text file:
hello hello hello good bye goodbye !bye bye¶ b?e goodbye
Command:
python script.py text.txt
Output:
bye 1
good 1
hello 3
goodbye 2
Command with a minimum count:
python script.py text.txt 2
Output:
hello 3
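The same steps can be sketched as functions that take the text and the argument list directly, which makes them easy to try out without a file or a shell. This is an illustrative Python 3 variant, not the answer's exact script: `count_alnum_words` and `parse_min_count` are hypothetical names, and the character class is spelled `[A-Za-z0-9]` because a range written as `A-z` would also cover the ASCII punctuation that sits between `Z` and `a`.

```python
import re

WORD = re.compile(r'[A-Za-z0-9]+')


def count_alnum_words(text, min_count=0):
    """Return {word: count} for alphanumeric words seen more than min_count times."""
    word_count = {}
    for word in text.split():
        # fullmatch() succeeds only if the whole token is alphanumeric.
        if WORD.fullmatch(word):
            word_count[word] = word_count.get(word, 0) + 1
    return {w: c for w, c in word_count.items() if c > min_count}


def parse_min_count(argv):
    """Read the optional minimum count from argv, defaulting to 0."""
    try:
        return int(argv[2])
    except IndexError:
        return 0


text = "hello hello hello good bye goodbye !bye bye¶ b?e goodbye"
print(count_alnum_words(text, parse_min_count(["script.py", "text.txt"])))
print(count_alnum_words(text, parse_min_count(["script.py", "text.txt", "2"])))
```

Keeping the argument parsing in its own small function mirrors the answer's point about modularity: the counting logic can be imported and reused without touching `sys.argv`.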