Spark CountVectorizer returns udt instead of vector

Date: 2018-05-28 13:43:37

Tags: apache-spark apache-spark-sql apache-spark-mllib

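I am trying to create token-count vectors for an LDA analysis in Spark 2.3.0. I have followed several tutorials, and in every one of them CountVectorizer easily converts an Array of Strings into a Vector.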

I ran this short example in a Databricks notebook:

import org.apache.spark.ml.feature.CountVectorizer

val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
  ).toDF("id", "filtered")

// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5) 
  .setMinDF(2) 
  .fit(testW)

// Create vector of token counts
val articlesCountVector = vectorizer.transform(testW).select("id", "features")
display(articlesCountVector)

The output is as follows: [image: output]

But in all the tutorials I have read, the type of "features" is vector. Why is it udt in my case?

Did I forget something? Why is it not a vector?

Is it possible to convert it? I cannot create an LDA model with this udt type.

1 Answer:

Answer 0: (score: 1)

There is no problem here. What you are seeing is an implementation detail of the Databricks display function.

Internally, vector types such as o.a.s.mllib.linalg.Vector are not represented natively in the Dataset API and use UserDefinedTypes (UDT). Hence the output.
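One way to check this is to look at the schema instead of the rendered table, and then pass the column straight to LDA. The following is a minimal sketch, not part of the original answer. It assumes a running Spark 2.x session (in a Databricks notebook, `getOrCreate()` returns the existing one), and the LDA settings `k = 2` and `maxIter = 10` are arbitrary values chosen only for illustration.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.SparkSession

// Assumption: a session is already configured (as in a Databricks notebook).
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Same data as in the question.
val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")

// Fit the vectorizer and build the token-count vectors, as in the question.
val articlesCountVector = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)
  .transform(testW)
  .select("id", "features")

// The schema reports the logical column type, independent of how display() renders it.
articlesCountVector.printSchema()
val isVector =
  articlesCountVector.schema("features").dataType == SQLDataTypes.VectorType

// The column can be used as LDA input without any conversion.
// k and maxIter are illustrative values, not recommendations.
val ldaModel = new LDA()
  .setK(2)
  .setMaxIter(10)
  .setFeaturesCol("features")
  .fit(articlesCountVector)
```

If `isVector` is true and the fit succeeds, the udt label in the notebook output is purely a display matter.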

You can find the exact meaning of all the fields in Understanding Output of VectorAssembler --- Spark.
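As a rough illustration of those fields, the struct behind a vector column has four entries: `type` (0 for sparse, 1 for dense), `size`, `indices` and `values`. The sketch below models that struct in plain Scala and expands it into a dense array. `VectorStruct` and `toDenseArray` are hypothetical names made up for this example, not Spark classes, and the sample values are invented.

```scala
// Hypothetical model of the four fields shown in the udt output; not a Spark class.
final case class VectorStruct(
  vectorType: Int,            // 0 = sparse, 1 = dense
  size: Option[Int],          // vector length; only populated for sparse vectors
  indices: Option[Seq[Int]],  // positions of the stored entries; only populated for sparse vectors
  values: Seq[Double]         // stored entries (sparse) or every entry (dense)
)

// Expand either representation into a plain dense array.
def toDenseArray(v: VectorStruct): Array[Double] = v.vectorType match {
  case 1 => v.values.toArray
  case 0 =>
    val out = Array.fill(v.size.getOrElse(0))(0.0)
    v.indices.getOrElse(Seq.empty).zip(v.values).foreach { case (i, x) => out(i) = x }
    out
  case other =>
    throw new IllegalArgumentException(s"unknown vector type: $other")
}

// Invented sample: a sparse vector of length 3 with counts at positions 0 and 1.
val sparse = VectorStruct(0, Some(3), Some(Seq(0, 1)), Seq(1.0, 1.0))
val dense = toDenseArray(sparse)
```

Read this way, a row showing type 0, size 3, indices [0, 1] and values [1.0, 1.0] is just a sparse count vector over a three-term vocabulary.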