I wrote a program that checks a text document for curse words. I convert the document into a list of words and then pass each word to a website to check whether it is a curse word. The problem is that if the text is large, it runs very slowly. How can I make it faster?
import urllib.request

def read_text():
    quotes = open(r"C:\Self\General\Pooja\Edu_Career\Learning\Python\Code\Udacity_prog_foundn_python\movie_quotes.txt")  # built-in function
    contents_of_file = quotes.read().split()
    #print(contents_of_file)
    quotes.close()
    check_profanity(contents_of_file)

def check_profanity(text_to_check):
    flag = 0
    for word in text_to_check:
        connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + word)
        output = connection.read()
        # print(output)
        if b"true" in output:  # output is bytes (the response is read in bytes mode), so compare bytes to bytes
            flag = flag + 1
        connection.close()
    if flag > 0:
        print("profanity alert")
    else:
        print("the text has no curse words")

read_text()
Answer 0 (score: 1)
The website you are using supports checking multiple words per fetch. So, to make your code faster: A) Break out of the loop as soon as you find the first curse word. B) Send batches of words ("super words") to the website. Thus:
def check_profanity(text_to_check):
    flag = 0
    batch_size = 100  # or the max number of words you can check at the same time
    # Build one "super word" per batch and send it in a single request
    for start in range(0, len(text_to_check), batch_size):
        super_word = " ".join(text_to_check[start:start + batch_size])
        connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + super_word)
        output = connection.read()
        connection.close()
        if b"true" in output:
            flag = flag + 1
            break  # stop at the first batch that contains a curse word
    if flag > 0:
        print("profanity alert")
    else:
        print("the text has no curse words")
Answer 1 (score: 1)
First, as Menno Van Dijk suggests, storing a subset of common known curse words locally would allow fast profanity checks without ever querying the website at all; if a known curse word is found, you can alert immediately, with no further checking needed.

Second, inverting that suggestion, cache at least the first few thousand most common non-curse words locally; there is no reason every text containing the word "is", "the" or "a" should be rechecking those words over and over. Since the vast majority of written English uses mostly the two thousand most common words (and an even larger majority uses almost exclusively the ten thousand most common words), that can save an awful lot of checks.

Third, uniquify your words before checking them; if a word is used repeatedly, it is just as good or bad the second time as it was the first, so checking it twice is wasteful.

Finally, as MTMD suggests, the site allows you to batch your queries.

Between all of these suggestions, you will likely go from a 100,000-word file requiring 100,000 connections to requiring only 1-2. While multithreading might have helped your original code (at the expense of slamming the web service), these fixes should make multithreading pointless; with only 1-2 requests, you can wait the second or two it would take for them to run sequentially.
As a purely stylistic matter, having read_text call check_profanity is odd; those should really be separate behaviors (read_text returns the text, and check_profanity can then be called on it).
Code implementing all of my suggestions (assuming you have files containing one word per line: known bad words in one, known good words in the other):
import itertools  # For islice, useful for batching
import urllib.request

def load_known_words(filename):
    with open(filename) as f:
        return frozenset(map(str.rstrip, f))

known_bad_words = load_known_words(r"C:\path\to\knownbadwords.txt")
known_good_words = load_known_words(r"C:\path\to\knowngoodwords.txt")

def read_text():
    with open(r"C:\Self\General\Pooja\Edu_Career\Learning\Python\Code\Udacity_prog_foundn_python\movie_quotes.txt") as quotes:
        return quotes.read()

def check_profanity(text_to_check):
    # Uniquify contents so words aren't checked repeatedly
    if not isinstance(text_to_check, (set, frozenset)):
        text_to_check = set(text_to_check)
    # Remove words known to be fine from set to check
    text_to_check -= known_good_words
    # Precheck for any known bad words so loop is skipped completely if found
    has_profanity = not known_bad_words.isdisjoint(text_to_check)
    while not has_profanity and text_to_check:
        block_to_check = frozenset(itertools.islice(text_to_check, 100))
        text_to_check -= block_to_check
        with urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + ' '.join(block_to_check)) as connection:
            output = connection.read()
            # print(output)
        has_profanity = b"true" in output
    if has_profanity:
        print("profanity alert")
    else:
        print("the text has no curse words")

text = read_text()
check_profanity(text.split())
Answer 2 (score: 0)
There are a few things you can do:
Answer 3 (score: 0)
Use multithreading.
Read the text in batches.
Assign each batch to a thread and check all the batches separately, as in the sketch below.
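A minimal sketch of that idea using concurrent.futures, assuming the same wdylike endpoint used elsewhere in this thread; the batch size and worker count are illustrative, not limits documented by the service:

import urllib.request
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 100   # illustrative; the service's real per-request limit may differ
MAX_WORKERS = 8    # illustrative thread count

def batch_has_profanity(words):
    # Check one batch of words with a single request
    url = "http://www.wdylike.appspot.com/?q=" + "%20".join(words)
    with urllib.request.urlopen(url) as connection:
        return b"true" in connection.read()

def check_profanity_threaded(words):
    # Split the word list into batches, then check each batch on its own worker thread
    batches = [words[i:i + BATCH_SIZE] for i in range(0, len(words), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        results = list(executor.map(batch_has_profanity, batches))
    if any(results):
        print("profanity alert")
    else:
        print("the text has no curse words")

Note the caveat from answer 1, though: many concurrent threads amount to slamming the web service, and with batching alone the request count is usually small enough that threads buy little.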
Answer 4 (score: 0)
Send multiple words at once. Change number_of_words to the number of words you want to send at a time.
import urllib.request

def read_text():
    quotes = open("test.txt")
    contents_of_file = quotes.read().split()
    quotes.close()
    check_profanity(contents_of_file)

def check_profanity(text):
    number_of_words = 200
    word_lists = [text[x:x+number_of_words] for x in range(0, len(text), number_of_words)]
    flag = False
    for word_list in word_lists:
        connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + "%20".join(word_list))
        output = connection.read()
        if b"true" in output:
            flag = True
            break
        connection.close()
    if flag:
        print("profanity alert")
    else:
        print("the text has no curse words")

read_text()