I have a set of files that I'm reading into a Spark dataframe.

I've tokenized and vectorized the text, and now I want to feed the vectorized data into an MLlib LDA model. The LDA API docs seem to require the data to be:

rdd – RDD of documents, which are tuples of document IDs and term (word) count vectors. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
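For concreteness, a minimal hand-built RDD of that shape (the IDs and counts below are made up) would look something like:

from pyspark.mllib.linalg import Vectors

# two hypothetical documents over a 3-term vocabulary;
# each element is [doc_id, term_count_vector]
doc_rdd = sc.parallelize([
    [0, Vectors.dense([1.0, 0.0, 2.0])],
    [1, Vectors.sparse(3, {1: 1.0, 2: 3.0})],
])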
How do I get from my dataframe to a suitable rdd?
from pyspark.mllib.clustering import LDA
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import CountVectorizer
#read the data
tf = sc.wholeTextFiles("20_newsgroups/*")
#transform into a data frame
df = tf.toDF(schema=['file','text'])
#tokenize
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenized = tokenizer.transform(df)
#vectorize
cv = CountVectorizer(inputCol="words", outputCol="vectors")
model = cv.fit(tokenized)
result = model.transform(tokenized)
#transform into a suitable rdd
myrdd = ?
#LDA
model = LDA.train(myrdd, k=2, seed=1)
PS: I'm using Apache Spark 1.6.3.
Answer 0 (score: 4)
Let's first organize the imports, read the data, remove some simple special characters, and transform it into a DataFrame:

import re  # needed to remove special characters
from pyspark.sql import Row
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.mllib.clustering import LDA
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType
pattern = re.compile(r'[\W_]+')

rdd = sc.wholeTextFiles("./data/20news-bydate/*/*/*") \
    .mapValues(lambda x: pattern.sub(' ', x)).cache()  # ref. https://stackoverflow.com/a/1277047/3415409
df = rdd.toDF(schema=['file', 'text'])
We will need to add an index to each Row. The following snippet is inspired by this question about adding primary keys with Apache Spark:

row_with_index = Row(*["id"] + df.columns)
def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row

f = make_row(df.columns)

indexed = (df.rdd
           .zipWithUniqueId()
           .map(lambda x: f(*x))
           .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
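A shorter alternative, assuming monotonically_increasing_id() is available in your Spark version, also satisfies the contract, since LDA only needs IDs that are unique and >= 0, not consecutive:

# alternative sketch (not used below; the zipWithUniqueId version above is kept):
# monotonically_increasing_id() yields unique, non-negative, non-consecutive IDs
indexed_alt = df.withColumn("id", F.monotonically_increasing_id())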
Once we have added the index, we can proceed with the feature cleansing, extraction, and transformation:
# tokenize
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tokenized = tokenizer.transform(indexed)
# remove stop words
remover = StopWordsRemover(inputCol="tokens", outputCol="words")
cleaned = remover.transform(tokenized)
# vectorize
cv = CountVectorizer(inputCol="words", outputCol="vectors")
count_vectorizer_model = cv.fit(cleaned)
result = count_vectorizer_model.transform(cleaned)
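As the API quote in the question notes, the vocabulary size is the length of each count vector; a quick sanity check along these lines (purely illustrative) confirms the two match:

# each entry in 'vectors' should have length == vocabulary size
vocab_size = len(count_vectorizer_model.vocabulary)
print(vocab_size == result.select('vectors').first()[0].size)  # True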
Now, let's convert the resulting dataframe back into an rdd:
corpus = result.select(F.col('id').cast("long"), 'vectors').rdd \
    .map(lambda x: [x[0], x[1]])
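At this point each element should already have the [doc_id, term_count_vector] shape the LDA API asks for; an optional check (illustrative):

# optional: confirm the first element matches the expected shape
first_id, first_vector = corpus.first()
assert first_id >= 0 and first_vector.size > 0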
Our data is now ready for training:
# training data
lda_model = LDA.train(rdd=corpus, k=10, seed=12, maxIterations=50)
# extracting topics
topics = lda_model.describeTopics(maxTermsPerTopic=10)
# extract the vocabulary
vocabulary = count_vectorizer_model.vocabulary
We can now print the topic descriptions as follows:
for topic in range(len(topics)):
    print("topic {} : ".format(topic))
    words = topics[topic][0]
    scores = topics[topic][1]
    [print(vocabulary[words[word]], "->", scores[word]) for word in range(len(words))]
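If the trained model needs to be reused later, MLlib's LDAModel can be persisted and reloaded; a minimal sketch, assuming a writable path (the path name here is made up):

from pyspark.mllib.clustering import LDAModel

# persist the trained model and load it back (path is illustrative)
lda_model.save(sc, "lda_model_v1")
same_model = LDAModel.load(sc, "lda_model_v1")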
PS: The above code was tested with Spark 1.6.3.