Thanks in advance. I need to create a term-document matrix using Apache Spark. Can anyone tell me how to do this with Spark's Java MLlib library? Also, can we implement the affinity propagation algorithm with MLlib?
Answer 0 (score: 2)
For more details, take a look at this blog post. Here is a summary:
from collections import Counter

from pyspark import SparkContext
from pyspark.mllib.linalg import SparseVector

sc = SparkContext('local', 'term_doc')

# Toy corpus: one document per RDD element
corpus = sc.parallelize([
    "It is the east, and Juliet is the sun.",
    "A dish fit for the gods.",
    "Brevity is the soul of wit."])

# Tokenize each document on whitespace
tokens = corpus.map(lambda raw_text: raw_text.split()).cache()

# Assign every distinct token a column index and broadcast the vocabulary
local_vocab_map = tokens.flatMap(lambda token_list: token_list).distinct() \
    .zipWithIndex().collectAsMap()
vocab_map = sc.broadcast(local_vocab_map)
vocab_size = sc.broadcast(len(local_vocab_map))

# Count terms per document and encode each row as a SparseVector
term_document_matrix = tokens \
    .map(Counter) \
    .map(lambda counts: {vocab_map.value[token]: float(counts[token])
                         for token in counts}) \
    .map(lambda index_counts: SparseVector(vocab_size.value, index_counts))

for doc in term_document_matrix.collect():
    print(doc)
This produces the following output:
>>> tokens.first()
['It', 'is', 'the', 'east,', 'and', 'Juliet', 'is', 'the', 'sun.']
>>> local_vocab_map
{'and': 0, 'A': 1, 'fit': 14, 'for': 13, 'of': 3, 'is': 4, 'gods.': 7, 'It': 11,
 'Brevity': 10, 'soul': 12, 'sun.': 8, 'dish': 2, 'east,': 9, 'the': 5, 'wit.': 6, 'Juliet': 15}
>>> for doc in term_document_matrix.collect():
...     print(doc)
...
(16,[0,4,5,8,9,11,15],[1.0,2.0,2.0,1.0,1.0,1.0,1.0])
(16,[1,2,5,7,13,14],[1.0,1.0,1.0,1.0,1.0,1.0])
(16,[3,4,5,6,10,12],[1.0,1.0,1.0,1.0,1.0,1.0])
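If you want to feed this matrix into other MLlib routines, the RDD of SparseVectors can be wrapped in a distributed matrix. A minimal sketch, continuing from the variables above (the name mat is just illustrative):

from pyspark.mllib.linalg.distributed import RowMatrix

# Each row is one document, each column one vocabulary term
mat = RowMatrix(term_document_matrix)
print(mat.numRows(), mat.numCols())  # 3 documents x 16 terms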
Answer 1 (score: 1)
For document classification in Spark, take a look at Naive Bayes: http://spark.apache.org/docs/latest/mllib-naive-bayes.html
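Not part of the original answer, but a minimal sketch of what training MLlib's Naive Bayes on term-count vectors like the ones built above could look like; the labels here are made up purely for illustration:

from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

# Attach a (hypothetical) class label to each document vector
labeled = term_document_matrix.zipWithIndex() \
    .map(lambda vec_idx: LabeledPoint(float(vec_idx[1] % 2), vec_idx[0]))

model = NaiveBayes.train(labeled, 1.0)  # 1.0 = additive smoothing
print(model.predict(term_document_matrix.first()))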
You may also want to look at the TF-IDF approach: http://spark.apache.org/docs/latest/mllib-feature-extraction.html
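A minimal sketch of that TF-IDF pipeline, reusing the tokens RDD from answer 0 (any RDD of tokenized term lists works):

from pyspark.mllib.feature import HashingTF, IDF

# Hash each token list into a fixed-size term-frequency vector
hashing_tf = HashingTF()
tf = hashing_tf.transform(tokens)
tf.cache()

# Fit IDF weights on the corpus, then rescale the TF vectors
idf = IDF().fit(tf)
tfidf = idf.transform(tf)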