Question

I am learning to analyze documents with the help of a Python reference book. Some of the code in the book confused me. It comes from: Example 6. A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences

This is the part that confuses me:

    s = [...words of sentence...]
    word_idx = []

    # For each word in the word list...
    for w in important_words:
        try:
            # Compute an index for where any important words occur in the sentence.
            word_idx.append(s.index(w))
        except ValueError:  # w not in this particular sentence
            pass

    word_idx.sort()

Why not use this instead:

        for i in range(len(s)):
            w = s[i]
            if w in important_words:
                word_idx.append(i)

The difference between the two is that the former does not count repeated words, while the latter does. For example:

s = [u'fes', u'watch', u'\u2014', u'e-paper', u'watch', u',', u'including', u'strap', u'.']

The former prints [0, 1, 2, 3, 5, 6, 7, 8], while the latter prints [0, 1, 2, 3, 4, 5, 6, 7, 8].
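To make the difference reproducible, here is a small sketch that runs both loops on the sample sentence. It assumes that important_words contains every distinct token of the sentence (the actual list is not shown here), and the two helper names are hypothetical.

```python
# Sample sentence from the question.
s = [u'fes', u'watch', u'\u2014', u'e-paper', u'watch', u',',
     u'including', u'strap', u'.']

# Assumption: every distinct token counts as important.
important_words = set(s)


def first_occurrence_indices(sentence, important):
    """Book-style loop: list.index() only finds the first match per word."""
    word_idx = []
    for w in important:
        try:
            word_idx.append(sentence.index(w))
        except ValueError:  # w not in this particular sentence
            pass
    word_idx.sort()
    return word_idx


def all_occurrence_indices(sentence, important):
    """Position-based loop: every occurrence of an important word is kept."""
    return [i for i, w in enumerate(sentence) if w in important]


book_idx = first_occurrence_indices(s, important_words)
mine_idx = all_occurrence_indices(s, important_words)
print(book_idx)  # [0, 1, 2, 3, 5, 6, 7, 8]
print(mine_idx)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The second occurrence of watch (index 4) is the only position the book-style loop misses.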

When I compute the sentence score, should I count repeated words?

Answer 1

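My feeling is that the first algorithm is inefficient when you have a lot of important_words. Your algorithm is more efficient, but it counts repeated words more than once, which doesn't seem right. Should "watch watch watch watch" score higher than "Bobby has a watch"?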
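As a rough illustration of that question, the sketch below scores both sentences under each counting scheme. It assumes that watch is the only important word, treats each sentence as a single cluster, and reuses the scoring formula from the book's Example 6 (significant words squared, divided by the cluster span). The name cluster_score is hypothetical.

```python
def cluster_score(word_idx):
    """Score one cluster of sorted indices: significant^2 / span."""
    if not word_idx:
        return 0.0
    significant = len(word_idx)
    total = word_idx[-1] - word_idx[0] + 1
    return 1.0 * significant * significant / total


important = {'watch'}  # assumption: the only important word
results = {}

for sentence in (['watch'] * 4, ['bobby', 'has', 'a', 'watch']):
    # Book-style: only the first occurrence of each important word.
    first_only = sorted(sentence.index(w) for w in important if w in sentence)
    # Position-based: every occurrence.
    every_hit = [i for i, w in enumerate(sentence) if w in important]
    key = ' '.join(sentence)
    results[key] = (cluster_score(first_only), cluster_score(every_hit))
    print(key, results[key])
# watch watch watch watch (1.0, 4.0)
# bobby has a watch (1.0, 1.0)
```

Under these assumptions the two sentences tie when repeats are ignored, and the repeated sentence scores four times higher when they are counted.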

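The answer depends on your needs. In natural-language text analysis there is no "best" solution. Google, analyzing HTML web pages, has different needs than, say, an archaeologist.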

So I think it comes down to this: is there a more efficient algorithm that produces the same result as the one in the book?

Yes: use your code and put the words into a set() to remove the duplicates:

s = set(s)

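Depending on your Python version, this step may reorder the words, but I don't think that matters, because the code in the book doesn't use s after the first loop.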

If the order matters, you need to filter the list instead.
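Note that a set cannot be indexed, so s[i] stops working after s = set(s) and the positions needed for word_idx are lost. One possible way to filter the list while keeping the positions is a single pass that remembers which words have already been recorded. This is only a sketch, not the book's code, and first_indices is a hypothetical name.

```python
def first_indices(sentence, important_words):
    """Return the index of the first occurrence of each important word."""
    important = set(important_words)  # fast membership tests
    seen = set()                      # important words already recorded
    word_idx = []
    for i, w in enumerate(sentence):
        if w in important and w not in seen:
            seen.add(w)
            word_idx.append(i)
    return word_idx  # already ascending, no sort needed


s = [u'fes', u'watch', u'\u2014', u'e-paper', u'watch', u',',
     u'including', u'strap', u'.']
print(first_indices(s, s))  # [0, 1, 2, 3, 5, 6, 7, 8]
```

This walks the sentence once instead of calling s.index() for every important word, and it yields the same indices as the book's loop on the sample sentence.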

Answer 2

If you keep reading Example 6. A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences, you will see the following lines:

for c in clusters:
    significant_words_in_cluster = len(c)
    total_words_in_cluster = c[-1] - c[0] + 1
    score = 1.0 * significant_words_in_cluster \
        * significant_words_in_cluster / total_words_in_cluster

Here c is one of the lists you compared:

"The former prints [0, 1, 2, 3, 5, 6, 7, 8], while the latter prints [0, 1, 2, 3, 4, 5, 6, 7, 8]"

total_words_in_cluster is the same for both lists, but significant_words_in_cluster differs. This affects the cluster's score.
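Plugging the two lists into the formula above shows the size of the effect. The sketch treats each list as a single cluster, which is an assumption about how the indices end up being grouped.

```python
# The two index lists compared in the question, each taken as one cluster.
candidates = [
    ('first occurrences only', [0, 1, 2, 3, 5, 6, 7, 8]),
    ('all occurrences', [0, 1, 2, 3, 4, 5, 6, 7, 8]),
]

scores = {}
for label, c in candidates:
    significant_words_in_cluster = len(c)
    total_words_in_cluster = c[-1] - c[0] + 1
    scores[label] = (1.0 * significant_words_in_cluster
                     * significant_words_in_cluster / total_words_in_cluster)
    print(label, significant_words_in_cluster, total_words_in_cluster,
          round(scores[label], 2))
# first occurrences only 8 9 7.11
# all occurrences 9 9 9.0
```

Both clusters span 9 words, so the score moves from 64/9 to 81/9 purely because of the one repeated word.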

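How the author builds the list of clusters is a design choice that feeds the later calculations. I think your approach is legitimate in its own right.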

By the way, the way the clusters are created,

if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
     cluster.append(word_idx[i])

also depends on the values in the word_idx list...
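Since a new cluster starts wherever two consecutive indices are at least CLUSTER_THRESHOLD apart, dropping or keeping a repeated word can change how many clusters a sentence yields. Below is a minimal sketch of that grouping. The threshold of 5 is a value picked for illustration (it is not stated in this thread), and build_clusters is a hypothetical helper wrapped around the condition shown above.

```python
CLUSTER_THRESHOLD = 5  # assumption: illustrative value only


def build_clusters(word_idx, threshold=CLUSTER_THRESHOLD):
    """Group sorted indices; a gap >= threshold starts a new cluster."""
    clusters = []
    if not word_idx:
        return clusters
    cluster = [word_idx[0]]
    for i in range(1, len(word_idx)):
        if word_idx[i] - word_idx[i - 1] < threshold:
            cluster.append(word_idx[i])
        else:
            clusters.append(cluster)
            cluster = [word_idx[i]]
    clusters.append(cluster)  # do not forget the last open cluster
    return clusters


# A repeated word at position 4 bridges the gap between 0 and 8.
print(build_clusters([0, 4, 8]))  # [[0, 4, 8]]
# Without it, the gap of 8 splits the sentence into two clusters.
print(build_clusters([0, 8]))     # [[0], [8]]
```

So counting repeats affects not only significant_words_in_cluster but potentially the cluster boundaries as well.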
