I wrote a program that checks a text document for curse words. I convert the document into a list of words and then pass each word to a website to check whether it is a curse word. The problem is that if the text is large, it runs very slowly. How can I make it faster?
import urllib.request

def read_text():
    quotes = open(r"C:\Self\General\Pooja\Edu_Career\Learning\Python\Code\Udacity_prog_foundn_python\movie_quotes.txt")  # built-in function
    contents_of_file = quotes.read().split()
    #print(contents_of_file)
    quotes.close()
    check_profanity(contents_of_file)

def check_profanity(text_to_check):
    flag = 0
    for word in text_to_check:
        connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + word)
        output = connection.read()
        # print(output)
        if b"true" in output:  # output is bytes (the response is read in bytes mode), so compare bytes to bytes
            flag = flag + 1
        connection.close()
    if flag > 0:
        print("profanity alert")
    else:
        print("the text has no curse words")

read_text()
Answer 0 (score: 1)
The website you are using supports checking multiple words per fetch. So, to make your code faster: A) Break out of the loop as soon as you find the first curse word. B) Send batches of words ("super words") to the website. Thus:
def check_profanity(text_to_check):
    flag = 0
    batch_size = 100  # or the max number of words you can check at the same time
    # Build one "super word" per batch and send it in a single request
    for start in range(0, len(text_to_check), batch_size):
        super_word = " ".join(text_to_check[start:start + batch_size])
        connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + super_word)
        output = connection.read()
        connection.close()
        if b"true" in output:
            flag = flag + 1
            break  # stop at the first batch that contains a curse word
    if flag > 0:
        print("profanity alert")
    else:
        print("the text has no curse words")
Answer 1 (score: 1)
First, as Menno Van Dijk suggests, storing a subset of common known curse words locally would allow fast profanity checks without ever querying the website at all; if a known curse word is found, you can alert immediately, with no further checking needed.

Second, inverting that suggestion, cache at least the first few thousand most common non-curse words locally; there is no reason every text containing the word "is", "the" or "a" should be rechecking those words over and over. Since the vast majority of written English uses mostly the two thousand most common words (and an even larger majority uses almost exclusively the ten thousand most common words), that can save an awful lot of checks.

Third, uniquify your words before checking them; if a word is used repeatedly, it is just as good or bad the second time as it was the first, so checking it twice is wasteful.

Finally, as MTMD suggests, the site allows you to batch your queries.

Between all of these suggestions, you will likely go from a 100,000-word file requiring 100,000 connections to requiring only 1-2. While multithreading might have helped your original code (at the expense of slamming the web service), these fixes should make multithreading pointless; with only 1-2 requests, you can wait the second or two it would take for them to run sequentially.
As a purely stylistic matter, having read_text call check_profanity is odd; those should really be separate behaviors (read_text returns the text, and check_profanity can then be called on it).
Code implementing all of my suggestions (assuming you have files containing one word per line: known bad words in one, known good words in the other):
import itertools  # For islice, useful for batching
import urllib.request

def load_known_words(filename):
    with open(filename) as f:
        return frozenset(map(str.rstrip, f))

known_bad_words = load_known_words(r"C:\path\to\knownbadwords.txt")
known_good_words = load_known_words(r"C:\path\to\knowngoodwords.txt")

def read_text():
    with open(r"C:\Self\General\Pooja\Edu_Career\Learning\Python\Code\Udacity_prog_foundn_python\movie_quotes.txt") as quotes:
        return quotes.read()

def check_profanity(text_to_check):
    # Uniquify contents so words aren't checked repeatedly
    if not isinstance(text_to_check, (set, frozenset)):
        text_to_check = set(text_to_check)
    # Remove words known to be fine from set to check
    text_to_check -= known_good_words
    # Precheck for any known bad words so loop is skipped completely if found
    has_profanity = not known_bad_words.isdisjoint(text_to_check)
    while not has_profanity and text_to_check:
        block_to_check = frozenset(itertools.islice(text_to_check, 100))
        text_to_check -= block_to_check
        with urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + ' '.join(block_to_check)) as connection:
            output = connection.read()
            # print(output)
        has_profanity = b"true" in output
    if has_profanity:
        print("profanity alert")
    else:
        print("the text has no curse words")

text = read_text()
check_profanity(text.split())
Answer 2 (score: 0)
There are a few things you can do:
Answer 3 (score: 0)
Use multithreading.
Read the text in batches.
Assign each batch to a thread and check all the batches separately, as in the sketch below.
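A minimal sketch of that idea using concurrent.futures, assuming the same wdylike endpoint used elsewhere in this thread; the batch size and worker count are illustrative, not limits documented by the service:

import urllib.request
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 100   # illustrative; the service's real per-request limit may differ
MAX_WORKERS = 8    # illustrative thread count

def batch_has_profanity(words):
    # Check one batch of words with a single request
    url = "http://www.wdylike.appspot.com/?q=" + "%20".join(words)
    with urllib.request.urlopen(url) as connection:
        return b"true" in connection.read()

def check_profanity_threaded(words):
    # Split the word list into batches, then check each batch on its own worker thread
    batches = [words[i:i + BATCH_SIZE] for i in range(0, len(words), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        results = list(executor.map(batch_has_profanity, batches))
    if any(results):
        print("profanity alert")
    else:
        print("the text has no curse words")

Note the caveat from answer 1, though: many concurrent threads amount to slamming the web service, and with batching alone the request count is usually small enough that threads buy little.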
Answer 4 (score: 0)
Send multiple words at once. Change number_of_words to the number of words you want to send at a time.
import urllib.request

def read_text():
    quotes = open("test.txt")
    contents_of_file = quotes.read().split()
    quotes.close()
    check_profanity(contents_of_file)

def check_profanity(text):
    number_of_words = 200
    word_lists = [text[x:x+number_of_words] for x in range(0, len(text), number_of_words)]
    flag = False
    for word_list in word_lists:
        connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + "%20".join(word_list))
        output = connection.read()
        if b"true" in output:
            flag = True
            break
        connection.close()
    if flag:
        print("profanity alert")
    else:
        print("the text has no curse words")

read_text()