Question

I have a CSV file with a type and a description text:

type ; text
  0  ; hello world
  0  ; hello text 2
  1  ; text1
  1  ; text
  2  ; world base
  2  ; Hey you
  2  ; test

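I would like to build a dictionary of the words and produce another CSV file structured as follows, with a single row per type and the frequency of each word in its descriptions: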

type ; hello ; world ; text ; 2 ; text1 ; base ; hey ; you ; test
  0  ;  2    ;  1    ;  1   ; 1 ;   0   ;   0  ;  0  ;  0  ;   0
  1  ;  0    ;  0    ;  1   ; 0 ;   1   ;   0  ;  0  ;  0  ;   0
  2  ;  0    ;  1    ;  0   ; 0 ;   0   ;   1  ;  1  ;  1  ;   1

My real CSV file has many rows with many strings; this is just an example.

I have only just started using Spark and Scala, so any help is appreciated.

Thanks

Answer 1

Try:

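import org.apache.spark.sql.functions._
import spark.implicits._  // enables the $"col" syntax; assumes a SparkSession named `spark`

// Trim and lower-case the text (so "Hey" counts as "hey"), then emit one row per word
df.withColumn("text", explode(split(lower(trim($"text")), "\\s+")))
  .groupBy("type")
  .pivot("text")   // one column per distinct word
  .count()         // occurrences of each word per type
  .na.fill(0)      // words that never occur for a type get 0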
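For completeness, here is a minimal end-to-end sketch of the same idea that also builds the word dictionary explicitly and passes it to pivot. It runs on the sample rows from the question held in memory. The object name WordFrequencyByType, the helpers tokenize and wordFrequencies, and the local[*] master are my own assumptions for illustration, not anything from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical object and helper names, chosen for this sketch only.
object WordFrequencyByType {

  // One row per (type, word). Values are trimmed and lower-cased so that
  // padding around the ';' separator and "Hey" vs "hey" do not create extra columns.
  def tokenize(df: DataFrame): DataFrame =
    df.select(
        trim(col("type")).as("type"),
        explode(split(lower(trim(col("text"))), "\\s+")).as("word"))
      .filter(col("word") =!= "")

  // One row per type, one column per dictionary word, cell = number of occurrences.
  def wordFrequencies(df: DataFrame): DataFrame = {
    val words = tokenize(df)

    // The "dictionary": every distinct word, collected to the driver.
    // Assumes the vocabulary is small enough to fit in driver memory.
    val dictionary: Seq[String] =
      words.select("word").distinct().collect().map(_.getString(0)).sorted.toSeq

    words
      .groupBy("type")
      .pivot("word", dictionary) // explicit values fix the set and order of columns
      .count()
      .na.fill(0)                // words absent from a type get 0 instead of null
      .orderBy("type")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-frequency-by-type")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()
    import spark.implicits._

    // Sample rows from the question, held in memory instead of read from a file.
    val sample = Seq(
      ("0", "hello world"), ("0", "hello text 2"),
      ("1", "text1"), ("1", "text"),
      ("2", "world base"), ("2", "Hey you"), ("2", "test")
    ).toDF("type", "text")

    wordFrequencies(sample).show()
    spark.stop()
  }
}
```

To work with real files, the input could be loaded with spark.read.option("header", "true").option("sep", ";").csv(path).toDF("type", "text") and the result saved with .write.option("header", "true").option("sep", ";").csv(outputDir). Note that the columns here come out in sorted order rather than in order of first appearance; reorder the dictionary sequence if the order matters. Collecting the dictionary first also avoids the extra pass pivot would otherwise make to discover the values, but with a very large vocabulary a wide pivot is likely to become impractical and a long (type, word, count) table may be the better output.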
