Question

As far as I know, MLlib only supports integers, so I want to convert strings to integers in Scala. For example, I have many reviewerID and productID values in a text file:

reviewerID    productID
03905X0912    ZXASQWZXAS
0325935ODD    PDLFMBKGMS
...

Answer 1

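StringIndexer is the solution. It fits into an ML pipeline as an estimator and transformer. Essentially, once you set the input column, it computes the frequency of each category and numbers the categories starting from 0 (by default the most frequent one gets 0). If needed, you can add IndexToString at the end of the pipeline to map the indices back to the original strings.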
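To make the numbering rule concrete, here is a minimal plain-Scala sketch of frequency-based indexing. It is not Spark's implementation: `indexByFrequency` is a hypothetical helper name, the sample values extend the question's data with one repeated ID, and the alphabetical tie-break is an assumption chosen only to keep the result deterministic.

```scala
// Hypothetical helper: maps each distinct label to an index,
// most frequent label first, starting at 0.
def indexByFrequency(values: Seq[String]): Map[String, Int] = {
  values
    .groupBy(identity)                                // label -> all its occurrences
    .map { case (label, occurrences) => (label, occurrences.size) }
    .toSeq
    // Highest count first; ties broken alphabetically (assumption of this sketch).
    .sortBy { case (label, count) => (-count, label) }
    .map { case (label, _) => label }
    .zipWithIndex                                     // assign 0, 1, 2, ...
    .toMap
}

// Sample data: the first ID is repeated so that it is the most frequent one.
val reviewerIds = Seq("03905X0912", "0325935ODD", "03905X0912")
val mapping = indexByFrequency(reviewerIds)

// Replace every string by its integer index.
val encoded = reviewerIds.map(mapping)
```

Here "03905X0912" occurs twice and is mapped to 0, while "0325935ODD" is mapped to 1.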

See the ML documentation on "Extracting, transforming and selecting features" for details.

In your case, you would fit one StringIndexer on the reviewerID column and another on the productID column.
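As a rough sketch of what that could look like, assuming Spark ML is on the classpath: the output column names (`reviewerIndex`, `productIndex`), the local SparkSession, and the third sample row are assumptions made for illustration and are not part of the original answer. StringIndexer writes its indices as doubles, so the sketch casts them to integers at the end.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("string-indexer-sketch")
  .getOrCreate()
import spark.implicits._

// First two rows come from the question; the third is made up so that
// some IDs repeat.
val reviews = Seq(
  ("03905X0912", "ZXASQWZXAS"),
  ("0325935ODD", "PDLFMBKGMS"),
  ("03905X0912", "PDLFMBKGMS")
).toDF("reviewerID", "productID")

// One indexer per string column.
val reviewerIndexer = new StringIndexer()
  .setInputCol("reviewerID")
  .setOutputCol("reviewerIndex")
val productIndexer = new StringIndexer()
  .setInputCol("productID")
  .setOutputCol("productIndex")

// Fit both indexers in one pipeline, then apply them to the data.
val pipeline = new Pipeline().setStages(Array(reviewerIndexer, productIndexer))
val indexed = pipeline.fit(reviews).transform(reviews)

// The index columns are doubles; cast them where integer IDs are required.
val withIntIds = indexed
  .withColumn("reviewerIndex", $"reviewerIndex".cast("int"))
  .withColumn("productIndex", $"productIndex".cast("int"))

withIntIds.show()
```

Unlike a per-row unique ID, this gives every distinct reviewerID (and productID) the same integer wherever it appears.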

Answer 2

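You can add a new column that holds a unique ID for each (reviewerID, productID) row. There are several ways to add such a column.

With monotonically_increasing_id (the older monotonicallyIncreasingId spelling is deprecated):

import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  ("123xyx", "ab"),
  ("123xyz", "cd")
)).toDF("reviewerID", "productID")

// IDs are unique and increasing, but not necessarily consecutive
data.withColumn("uniqueReviID", monotonically_increasing_id()).show()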

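Using zipWithUniqueId:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Pair every row with a unique Long and prepend it as the first field
val rows = data.rdd.zipWithUniqueId.map {
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}

// Extend the original schema with the new non-nullable ID column
val finalDf = spark.createDataFrame(rows, StructType(StructField("uniqueRevID", LongType, false) +: data.schema.fields))

finalDf.show()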

You can also do this in SQL with row_number():

import spark.implicits._

// createOrReplaceTempView returns Unit, so its result is not assigned
spark.sparkContext.parallelize(Seq(
  ("123xyx", "ab"),
  ("123xyz", "cd")
)).toDF("reviewerID", "productID").createOrReplaceTempView("review")

// row_number() yields consecutive IDs starting at 1, ordered by reviewerID
val tmpTable1 = spark.sql(
  "select row_number() over (order by reviewerID) as id, reviewerID, productID from review")
tmpTable1.show()

Hope this helps!

In Spark MLlib, how do I convert strings to integers in Scala?
