Question

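So, I am looping through a bunch of documents, building a list of all the unique words and sequential word groups they contain (obviously, the strings I'm looking at are pretty short).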

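import glob
import os
import re

globallist = []
for filename in glob.glob(os.path.join(path, '*.html')):
    mystr = "some text I want"
    # Replace every non-word character with a space, then split into words
    stuff = re.sub(r"[^\w]", " ", mystr).split()
    # Every contiguous run of words, joined back into a phrase
    wordlist = [' '.join(stuff[i:j]) for i in range(len(stuff)) for j in range(i+1, len(stuff)+1)]
    globallist = set.union(set(globallist), set(wordlist))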

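I would like to track occurrences in globallist as I go, so that at the end I already have a count of how many documents contain each string in the list. I plan to delete any element that appears in only one document. What's the best way to go about this?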

Answer 1

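Create a set of words for each document, and update a collections.Counter with each file's words. The per-file set avoids counting a word more than once per file, and the Counter sums seamlessly across files. A super simple example that counts individual words (without tracking which file they came from):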

from collections import Counter

totals = Counter()
for file in allfiles:
    with open(file) as f:
        totals.update(set(f.read().split()))
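
The same idea might extend to the question's word groups and to dropping anything seen in only one document. The following is a minimal sketch, not a definitive implementation: the helper names word_groups and document_frequencies and the in-memory docs list are hypothetical stand-ins for reading the real HTML files.

```python
import re
from collections import Counter


def word_groups(text):
    """Return the set of all contiguous word groups (single words included) in text."""
    words = re.sub(r"[^\w]", " ", text).split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }


def document_frequencies(texts):
    """Count, for each word group, how many of the given texts contain it."""
    totals = Counter()
    for text in texts:
        # A set per document means each group counts at most once per document.
        totals.update(word_groups(text))
    return totals


# Hypothetical in-memory documents standing in for the contents of the HTML files.
docs = ["some text I want", "some text here", "other text"]
totals = document_frequencies(docs)

# Keep only the groups that occur in more than one document.
shared = {group: n for group, n in totals.items() if n > 1}
```

Note that the number of word groups grows quadratically with the length of each document, so this only stays practical for short strings like the ones described in the question.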

Answer 2

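The following script should help give you some ideas.

You are trying to parse HTML files, so ideally you should extract only the text from each file, without any of the HTML markup. This can be done with a library such as BeautifulSoup. Next, it is best to lowercase all of the words so that the same word is matched regardless of its case. Python's collections.Counter can then be used to count all of the words, and from that a list of the words with a count of exactly 1 can be built. Finally, your phrases can be counted.

All of this information is stored per file in file_stats, and the results are displayed at the end.

From this, you will be able to see how many documents contain the text you are looking for.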

from bs4 import BeautifulSoup
import collections
import glob
import re
import os

path = r'mypath'
file_stats = []

search_list = ['some text I want', 'some other text']
search_list = [phrase.lower() for phrase in search_list]    # Ensure list is all lowercase

for filename in glob.glob(os.path.join(path, '*.html')):
    with open(filename, 'r') as f_input:
        html = f_input.read()

    soup = BeautifulSoup(html, 'html.parser')

    # Remove style and script sections from the HTML
    for script in soup(["style", "script"]):
        script.extract() 

    # Extract all text
    text = soup.get_text()

    # Create a word list in lowercase
    word_list = [word.lower() for word in re.sub(r"[^\w]", " ", text).split()]

    # Search for matching phrases
    phrase_counts = dict()
    text = ' '.join(word_list)

    for search in search_list:
        phrase_counts[search] = text.count(search)

    # Calculate the word counts
    word_counts = collections.Counter(word_list)

    # Filter unique words
    unique_words = sorted(word for word, count in word_counts.items() if count == 1)

    # Create a list of unique words and phrase matches for each file
    file_stats.append([filename, unique_words, phrase_counts])

# Display the results for all files
for filename, unique_words, phrase_counts in file_stats:
    print('{:30} {}'.format(filename, unique_words))
    for phrase, count in phrase_counts.items():
        print('  {} : {}'.format(phrase, count))
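
To get from the per-file results to the number of documents containing each phrase, one option is to fold the phrase counts into a single Counter. This is only a sketch: sample_stats is made-up data in the same [filename, unique_words, phrase_counts] shape as file_stats, and the file names are hypothetical.

```python
from collections import Counter

# Made-up per-file results in the same shape as the entries of file_stats:
# [filename, unique_words, phrase_counts]
sample_stats = [
    ["a.html", ["want"], {"some text i want": 1, "some other text": 0}],
    ["b.html", ["here"], {"some text i want": 2, "some other text": 1}],
    ["c.html", [], {"some text i want": 0, "some other text": 0}],
]

# Count how many files contain each phrase at least once.
docs_with_phrase = Counter()
for filename, unique_words, phrase_counts in sample_stats:
    docs_with_phrase.update(
        phrase for phrase, count in phrase_counts.items() if count > 0
    )

for phrase, n_docs in sorted(docs_with_phrase.items()):
    print("{} : found in {} file(s)".format(phrase, n_docs))
```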

How to keep track of occurrences using set.union()
