I have a Python script that takes '.html' files, removes the stopwords, and returns all the other words in a Python dictionary. But if the same word occurs in multiple files, I want it returned only once, i.e. the output should contain the non-stopwords, each one included exactly once.
import os
import re

# assumed module-level configuration (not shown in the original post)
path = './'
stopwordfile = 'stopwords.txt'

def run():
    filelist = os.listdir(path)
    regex = re.compile(r'.*<div class="body">(.*?)</div>.*', re.DOTALL | re.IGNORECASE)
    reg1 = re.compile(r'</?[ap][^>]*>', re.IGNORECASE)
    quotereg = re.compile(r'"')
    puncreg = re.compile(r'[^\w]')
    with open(stopwordfile) as f:
        stopwords = f.read().lower().split()
    htmlfiles = [name for name in filelist if name.endswith('.html')]
    filewords = {}
    for filename in htmlfiles:
        with open(path + filename) as f:
            words = f.read().lower()
        words = regex.findall(words)[0]  # text inside <div class="body">
        words = quotereg.sub(' ', words)
        words = reg1.sub(' ', words)     # strip <a> and <p> tags
        words = puncreg.sub(' ', words)  # strip punctuation
        words = words.strip().split()
        for w in stopwords:              # drop every stopword occurrence
            while w in words:
                words.remove(w)
        freq = {}
        for w in words:
            freq[w] = freq.get(w, 0) + 1  # per-file word counts
        filewords[filename] = freq
        print(words)

if __name__ == '__main__':
    run()
Answer 0 (score: 6)
Use a set. Just add every word you find to the set; it ignores duplicates.
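For example, a tiny demonstration of that behaviour (the word list here is made up):

seen = set()
for word in ["the", "cat", "sat", "the", "cat"]:
    seen.add(word)  # re-adding an existing element is a no-op
print(seen)         # each word appears exactly once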
Assuming you have an iterator that returns each word in a file (this is for plain text; HTML would be more complicated):
def words(filename):
    with open(filename) as wordfile:
        for line in wordfile:
            for word in line.split():
                yield word
Then turning them into a set is simple:
wordlist = set(words("words.txt"))
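Because words() is a generator, the file is read line by line, so even a very large file never has to be held in memory all at once.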
If you have multiple files, do this:
wordlist = set()
wordfiles = ["words1.txt", "words2.txt", "words3.txt"]
for wordfile in wordfiles:
    wordlist |= set(words(wordfile))
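The |= operator updates wordlist in place with the union of the two sets; an equivalent that avoids building the intermediate set is wordlist.update(words(wordfile)), since update() accepts any iterable.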
You can also use a set for the stopwords. Then you can simply subtract them from the word list after the fact, which will probably be faster than checking whether each word is a stopword before adding it.
stopwords = set(["a", "an", "the"])
wordlist -= stopwords
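Putting the pieces together for the HTML case in the question, here is a minimal sketch; the path, stopword file, and body-extracting regexes are taken from the question's code and are assumptions about the asker's setup:

import os
import re

def unique_words(path, stopwordfile):
    body = re.compile(r'.*<div class="body">(.*?)</div>.*', re.DOTALL | re.IGNORECASE)
    tags = re.compile(r'</?[ap][^>]*>', re.IGNORECASE)
    punc = re.compile(r'[^\w]')
    with open(stopwordfile) as f:
        stopwords = set(f.read().lower().split())
    found = set()
    for filename in os.listdir(path):
        if not filename.endswith('.html'):
            continue
        with open(os.path.join(path, filename)) as f:
            text = f.read().lower()
        match = body.search(text)
        if match is None:
            continue  # skip files without a body div
        text = punc.sub(' ', tags.sub(' ', match.group(1)))
        found |= set(text.split())
    return found - stopwords  # subtract all stopwords once, at the end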