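I am trying to create token count vectors for LDA analysis in Spark 2.3.0. I have followed several tutorials, and each of them uses CountVectorizer to easily convert an Array of Strings into a Vector.
I ran this short example in a Databricks notebook: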
import org.apache.spark.ml.feature.CountVectorizer
val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")
// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)
// Create vector of token counts
val articlesCountVector = vectorizer.transform(testW).select("id", "features")
display(articlesCountVector)
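The output is as follows: [image: output]
But in all the tutorials I have read, the type of "features" is vector. Why is it udt in my case?
Am I missing something? Why is it not a vector?
Can it be converted? I cannot create an LDA model with this udt type.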
Answer 0 (score: 1)
There is no problem here. What you are seeing is an implementation detail of the Databricks display function.
Internally, neither o.a.s.ml.linalg.Vector nor o.a.s.mllib.linalg.Vector is natively represented in the Dataset API; both are implemented as UserDefinedTypes (UDT). Hence the output.
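To check this without relying on the notebook display, one can look at the DataFrame schema directly and then hand the column to LDA. The following is a minimal sketch, not a definitive recipe. It assumes a local SparkSession (in a Databricks notebook `spark` already exists, so that line can be dropped), and the names `cvModel`, `counts`, `featuresType`, `ldaModel` and `topics`, as well as `k = 2` and `maxIter = 5`, are illustrative choices that do not come from the question.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.SparkSession

// Assumption: local session for a standalone run; a Databricks notebook already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("udt-check").getOrCreate()
import spark.implicits._

// Same toy data as in the question.
val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")

val cvModel = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)

val counts = cvModel.transform(testW).select("id", "features")

// Print the schema as Spark itself reports it, independent of display().
counts.printSchema()

// Compare the column's data type with the public handle for the ml vector UDT.
val featuresType = counts.schema("features").dataType
println(s"features is an ml vector column: ${featuresType == VectorType}")

// LDA reads the "features" column as-is; no conversion step is involved.
val ldaModel = new LDA()
  .setK(2)        // illustrative value
  .setMaxIter(5)  // illustrative value
  .setFeaturesCol("features")
  .fit(counts)

val topics = ldaModel.describeTopics(3)
topics.show(false)
```

If the comparison with `VectorType` holds, the column is an ordinary ml vector column and the "udt" label is only how `display` renders that type, so the LDA fit should go through without converting anything.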