Question

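I want to convert text documents into feature vectors using TF-IDF, and then train a Naive Bayes classifier on them.

I can easily load my text files without the labels, use HashingTF() to convert them into vectors, and then use IDF() to weight the words by importance. But if I do that, I lose the labels, and it seems impossible to recombine the labels with the vectors even though the order is the same.

On the other hand, I can call HashingTF() on each individual document and keep the labels, but then I can't call IDF() on the result, since it requires the whole corpus of documents (and the labels would get in the way).

The Spark documentation for Naive Bayes has only one example, in which the points are already labeled and vectorized, so it isn't much help.

I also had a look at this guide: http://help.mortardata.com/technologies/spark/train_a_machine_learning_model but there the author only applies the hashing function to each document, without IDF.

So my question is whether there is a way to not only vectorize, but also weight the words using IDF, for the Naive Bayes classifier. The main problem seems to be Spark's insistence on accepting only RDDs of LabeledPoint as input to NaiveBayes.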

def parseLine(row):
    label = row[1]  # the label is the 2nd element of each row
    text = row[3]  # the text is the 4th element of each row
    tokens = tokenize(text)
    features = hashingTF.transform(tokens)  # per-document TF only, no IDF
    return LabeledPoint(label, features)

labeledData = data1.map(parseLine)

Answer 1

The standard PySpark approach (split -> transform -> zip) seems to work just fine:

from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes   

training_raw = sc.parallelize([
    {"text": "foo foo foo bar bar protein", "label": 1.0},
    {"text": "foo bar dna for bar", "label": 0.0},
    {"text": "foo bar foo dna foo", "label": 0.0},
    {"text": "bar foo protein foo ", "label": 1.0}])


# Split the data into labels and features, then transform.
# preservesPartitioning is not really required here,
# since map without a partitioner shouldn't trigger repartitioning.
labels = training_raw.map(
    lambda doc: doc["label"],  # Standard Python dict access 
    preservesPartitioning=True # This is obsolete.
)

tf = HashingTF(numFeatures=100).transform(  # Use a much larger number in practice
    training_raw.map(
        lambda doc: doc["text"].split(),
        preservesPartitioning=True))

idf = IDF().fit(tf)
tfidf = idf.transform(tf)

# Combine using zip
training = labels.zip(tfidf).map(lambda x: LabeledPoint(x[0], x[1]))

# Train and check
model = NaiveBayes.train(training)
labels_and_preds = labels.zip(model.predict(tfidf)).map(
    lambda x: {"actual": x[0], "predicted": float(x[1])})
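
The zip step works because RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition, which should hold here since labels and tfidf are both derived from training_raw element by element. To make the weighting itself concrete, here is a minimal plain-Python sketch (no Spark involved) of hashed term frequencies, a corpus-wide IDF, and labels re-attached by position. It is an illustration under assumptions: the helper names hashed_tf and idf_weights are hypothetical, zlib.crc32 is only a deterministic stand-in for the hash Spark uses, and the IDF formula log((m + 1) / (df + 1)) is the one given in the MLlib documentation.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 100  # same toy size as above; use far more in practice


def hashed_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to bucket indices and count occurrences per bucket."""
    # zlib.crc32 is a deterministic stand-in hash, not Spark's hash function.
    return Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens)


def idf_weights(tf_vectors):
    """IDF per bucket: log((m + 1) / (df + 1)), with m = number of documents."""
    m = len(tf_vectors)
    df = Counter()
    for vec in tf_vectors:
        df.update(vec.keys())  # each document counts once per bucket
    return {idx: math.log((m + 1) / (count + 1)) for idx, count in df.items()}


docs = [
    {"text": "foo foo foo bar bar protein", "label": 1.0},
    {"text": "foo bar dna for bar", "label": 0.0},
    {"text": "foo bar foo dna foo", "label": 0.0},
    {"text": "bar foo protein foo ", "label": 1.0},
]

labels = [d["label"] for d in docs]
tf = [hashed_tf(d["text"].split()) for d in docs]
idf = idf_weights(tf)  # needs the whole corpus, but never looks at the labels
tfidf = [{idx: count * idf[idx] for idx, count in vec.items()} for vec in tf]

# Position i of `labels` still describes position i of `tfidf`.
labeled = list(zip(labels, tfidf))
```

Both foo and bar occur in all four documents, so their IDF is log(5/5) = 0, and only the rarer tokens can contribute non-zero weights (barring hash collisions with the foo or bar buckets).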

To get some statistics, you can use MulticlassMetrics:

from pyspark.mllib.evaluation import MulticlassMetrics
from operator import itemgetter

# MulticlassMetrics expects (prediction, label) pairs
metrics = MulticlassMetrics(
    labels_and_preds.map(itemgetter("predicted", "actual")))

metrics.confusionMatrix().toArray()
## array([[ 2.,  0.],
##        [ 0.,  2.]])
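
For intuition about what that array holds, a small plain-Python sketch of a confusion matrix follows. It assumes rows index the actual class and columns the predicted class, with classes in ascending order, which is how the MLlib documentation describes confusionMatrix. The function name confusion_matrix and the sample pairs are hypothetical and are not taken from the run above.

```python
from collections import Counter


def confusion_matrix(prediction_and_label_pairs):
    """Rows are actual classes, columns are predicted classes (ascending)."""
    pairs = list(prediction_and_label_pairs)
    classes = sorted({value for pair in pairs for value in pair})
    counts = Counter((label, prediction) for prediction, label in pairs)
    return [[counts[(actual, predicted)] for predicted in classes]
            for actual in classes]


# Hypothetical (prediction, label) pairs: one class-1.0 document is misclassified.
pairs = [(1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0)]
matrix = confusion_matrix(pairs)
print(matrix)  # [[2, 0], [1, 1]]
```

A perfect classifier on the four training documents gives the diagonal matrix shown above, with every count on the main diagonal.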
