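How do I find word counts from a column in a DataFrame?
I am trying to find the word count from the Comments column of the DataFrame below: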
CustID - Comments
101 [[Nice one, Nice One,Nice]]
102 [[This was nice, Nice]
This is the code I tried to implement for the use case above:
val result = DF1.withColumn("Count of comments ", DF1("Comments")).map(events => (events,1)).reduce
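Here I cannot apply the `reduceByKey` function to the tuples; only `reduce` is offered in the list of available methods.
This is the expected output I want to achieve:
CustID - Comments - Count of comments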
101 [[Nice one, Nice One,Nice]] Nice one 2, Nice 1
102 [[This was nice, Nice] This was nice 1, Nice
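Can someone help me and suggest the right way to achieve the output above?
Answer 0 (score: 0)
Please find the solution below.
After trimming the brackets, the source data looks like this: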
+------+----------------------+
|CustID|Comments |
+------+----------------------+
|101 |Nice one,Nice One,Nice|
|102 |This was nice, Nice |
+------+----------------------+
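The code is as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StringType

// Appends a "comment -> count" summary of the row's Comments value to the row.
def countElements(row: Row): Row = {
  val str: String = row.getAs[String]("Comments")
  // Split on commas and lower-case so that "Nice one" and "Nice One" match.
  // Entries are not trimmed, so " Nice" keeps its leading space.
  val list = str.split(",").map(_.toLowerCase).toList
  val newCol = list.groupBy(identity).mapValues(_.size).mkString(",")
  Row.merge(row, Row(newCol))
}

// Apply the function to every row, then rebuild the DataFrame with the extra column.
val rdd = df.rdd.map(row => countElements(row))
val newSchema = df.schema.add("Count of comments", StringType, true)
val final_df = spark.createDataFrame(rdd, newSchema)
final_df.show(false)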
The output looks like this:
+------+----------------------+-----------------------------+
|CustID|Comments |Count of comments |
+------+----------------------+-----------------------------+
|101 |Nice one,Nice One,Nice|nice -> 1,nice one -> 2 |
|102 |This was nice, Nice |this was nice -> 1, nice -> 1|
+------+----------------------+-----------------------------+
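The per-row counting step can also be tried on its own, without a Spark session. The following is only a minimal sketch in plain Scala: `CommentCounts`, `countComments` and `render` are hypothetical names that do not appear in the answer above. It assumes that surrounding whitespace should be trimmed (so " Nice" and "Nice" count as the same comment) and that the counts should be rendered as "comment count" pairs, as in the expected output in the question.

```scala
// Hypothetical helper, not part of the answer above.
object CommentCounts {
  // Counts comma-separated comments case-insensitively.
  // Assumption: whitespace around each comment is trimmed and empty entries are dropped.
  def countComments(comments: String): Map[String, Int] =
    comments
      .split(",")
      .map(_.trim.toLowerCase)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (comment, occurrences) => (comment, occurrences.length) }

  // Renders the counts as "comment count" pairs, sorted by comment for a stable order.
  def render(counts: Map[String, Int]): String =
    counts.toSeq.sortBy(_._1).map { case (c, n) => s"$c $n" }.mkString(", ")

  def main(args: Array[String]): Unit = {
    println(render(countComments("Nice one,Nice One,Nice")))  // nice 1, nice one 2
    println(render(countComments("This was nice, Nice")))     // nice 1, this was nice 1
  }
}
```

If this behavior is what you want, the same function could presumably be called from inside the row-mapping step in place of the inline split and group logic.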