Question

I need to create the contents for a PDF.

Answer 1

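If you have all of the text as a single string in Python (which I assume you do, given your related post), you can use the Natural Language Toolkit (NLTK) for Python. You can download it from here.

Example: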

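import nltk
from nltk import FreqDist

# pdf_text is the full text of the PDF as a single string
tokens = nltk.word_tokenize(pdf_text)  # needs the NLTK tokenizer data (see nltk.download)
fdist = FreqDist(tokens)

# Print the 50 most common words together with their counts.
# fdist.keys() is not guaranteed to be sorted by frequency, so use most_common().
print(fdist.most_common(50))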

For more on the basics, see Chapter 1 of the NLTK book.

Answer 2

Use pdftotext (it ships with xpdf) to dump your PDF file to a text file. You can invoke it from a Python script with subprocess.call.
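
A minimal sketch of that step, assuming pdftotext is installed and on your PATH. The helper names pdftotext_command and pdf_to_text are made up for illustration, and so are any file names you pass in:

```python
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list: pdftotext <input PDF> <output text file>."""
    return ["pdftotext", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    # subprocess.call returns the exit status of the command; 0 means success.
    status = subprocess.call(pdftotext_command(pdf_path, txt_path))
    if status != 0:
        raise RuntimeError("pdftotext exited with status %d" % status)
    # Assumption: the output is readable as UTF-8; undecodable bytes are replaced.
    with open(txt_path, encoding="utf-8", errors="replace") as f:
        return f.read()
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.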

Use collections.Counter.most_common or nltk to find the most common words:

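import collections
# <...> is the path to the text file produced by pdftotext
words = open(<...>).read().split()  # split into words; otherwise Counter counts characters
keywords = collections.Counter(words).most_common(20)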

See also this question.

Answer 3

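You can use collections.Counter to keep track of word counts. I would use a regular expression to capture all of the words on a page, add each word to the counter, and then move on to the next page. You can either keep a lookup index for every word as you go and then filter for the common words (counter[word] > threshold), or go through the document a second time and build an index for the common words only.

a) This will be a bit slow. b) You have to deal with words like 'a', 'the', 'and', etc., to make sure they are not counted.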
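
The first variant (index every word in one pass, then filter) could look roughly like the sketch below. It assumes the PDF text is already available as a list of strings, one per page. The names build_index, STOP_WORDS and WORD_RE are hypothetical, and the stop-word list is only illustrative:

```python
import re
from collections import Counter, defaultdict

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in"}
# Assumption: a "word" is a run of ASCII letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def build_index(pages, threshold=1):
    """Map each common word to the sorted list of pages it appears on."""
    counter = Counter()
    index = defaultdict(set)
    for page_number, page_text in enumerate(pages, start=1):
        for word in WORD_RE.findall(page_text.lower()):
            if word in STOP_WORDS:
                continue  # skip words like 'a', 'the', 'and'
            counter[word] += 1
            index[word].add(page_number)
    # Keep only the words that occur more than `threshold` times overall.
    return {w: sorted(index[w]) for w in counter if counter[w] > threshold}
```

For example, build_index(["The cat and the dog", "A cat sat", "dog and cat"]) keeps "cat" and "dog" with their page numbers and drops "sat", which occurs only once.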
