Question

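I'm trying to take a regular text file and remove the words (stop words) listed in a separate file, which contains the words to be removed separated by newlines ("\n").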

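Right now I'm converting both files into lists so that the elements of each list can be compared. I got this function to run, but it doesn't remove all of the words I specified in the stopwords file. Any help is greatly appreciated.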

def elimstops(file_str): #takes as input a string for the stopwords file location
  stop_f = open(file_str, 'r')
  stopw = stop_f.read()
  stopw = stopw.split('\n')
  text_file = open('sample.txt') #Opens the file whose stop words will be eliminated
  prime = text_file.read()
  prime = prime.split(' ') #Splits the string into a list separated by a space
  tot_str = "" #total string
  i = 0
  while i < (len(stopw)):
    if stopw[i] in prime:
      prime.remove(stopw[i]) #removes the stopword from the text
    else:
      pass
    i += 1
  # Creates a new string from the compilation of list elements 
  # with the stop words removed
  for v in prime:
    tot_str = tot_str + str(v) + " " 
  return tot_str

Answer 1

Here is an alternative solution using a generator expression.

tot_str = ' '.join(word for word in prime if word not in stopw)

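For efficiency, turn stopw into a set first with stopw = set(stopw), so that each membership test takes constant time on average instead of scanning the whole list.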

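If sample.txt is not just a space-separated file, you may run into problems with your current approach. For example, if it contains normal sentences with punctuation, splitting on spaces leaves the punctuation attached to the words. To fix this, you can use the re module to split the string on whitespace or punctuation: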

import re
prime = re.split(r'\W+', text_file.read())
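Putting the pieces above together, a minimal sketch of the whole operation might look like the following. The name remove_stopwords and its parameters are my own, not from the question; it works on strings rather than file paths and assumes stop words should be matched case-sensitively.

```python
import re

def remove_stopwords(text, stopwords):
    """Return text with every word found in stopwords removed.

    Hypothetical helper for illustration: takes the text as a string
    and the stop words as any iterable of strings.
    """
    stopw = set(stopwords)  # set membership tests are O(1) on average
    # Split on runs of non-word characters, and drop the empty strings
    # re.split produces when the text starts or ends with punctuation.
    words = [w for w in re.split(r'\W+', text) if w]
    return ' '.join(w for w in words if w not in stopw)
```

Note that splitting on \W+ discards the punctuation itself, so the result is a plain space-separated string of the remaining words.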

Answer 2

I don't know Python, but here is a general approach that runs in O(n) + O(m) time, i.e. linear.

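1: Add all the words from the stop-words file to a map.
2: Read the regular text file and try to add its words to a list. While doing #2, check whether the word just read is in the map; if it is, skip it, otherwise add it to the list.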

At the end, the list should contain all the words you need, minus the words you wanted removed.
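In Python, the two steps above could be sketched roughly as follows, with a set standing in for the map. The function name filter_words and the use of in-memory sequences instead of open files are assumptions made for illustration.

```python
def filter_words(stop_lines, text_words):
    """Hypothetical sketch of the two-step approach described above."""
    # Step 1: put every stop word into a hash-based collection (a set here).
    stop_set = set()
    for line in stop_lines:
        word = line.strip()
        if word:  # ignore blank lines in the stop-word list
            stop_set.add(word)

    # Step 2: walk the text once, keeping only words not in the set.
    kept = []
    for word in text_words:
        if word in stop_set:
            continue  # skip stop words
        kept.append(word)
    return kept
```

Building the set costs time proportional to the number of stop words, and the single pass over the text costs time proportional to the number of words, which is where the O(n) + O(m) estimate comes from.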

Answer 3

I think your problem is in these lines:

    if stopw[i] in prime:
      prime.remove(stopw[i]) #removes the stopword from the text

This only removes the first occurrence of stopw[i] from prime. To fix this, you should do this instead:

    while stopw[i] in prime:
      prime.remove(stopw[i]) #removes the stopword from the text

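However, this will run very slowly, because both the in prime test and the prime.remove call have to iterate over prime. This means you end up with a running time that is quadratic in the length of your text. If you use a generator as F.J. suggests (with stopw converted to a set), your running time will be linear, which is much better.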
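To see the difference between the if and while versions, here is a small demonstration on a made-up word list (not the asker's files):

```python
words = ['the', 'cat', 'and', 'the', 'dog']

# list.remove deletes only the first matching element.
once = list(words)
if 'the' in once:
    once.remove('the')

# Looping until no match remains removes every occurrence,
# at the cost of rescanning the list on each pass.
every = list(words)
while 'the' in every:
    every.remove('the')
```

After this runs, once still contains the second 'the', while every contains none.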

Removing words from a file
