Converting a text file with a specific format into a Spark DataFrame using Scala

Posted: 2019-07-08 15:27:06

Tags: scala apache-spark

I am trying to load a conversation into Spark using Scala. The person and their message are separated by a tab character, and each message is on its own line.

The text file looks like this:

alpha   hello,beta! how are you?
beta    I am fine alpha.How about you?
alpha   I am also doing fine...
alpha   Actually, beta, I am bit busy nowadays and sorry I hadn't call U

I need a DataFrame like the following:

------------------------------------
|Person  |  Message
------------------------------------
|1       |  hello,beta! how are you?
|2       |  I am fine alpha.How about you?
|1       |  I am also doing fine...
|1       |  Actually, beta, I am bit busy nowadays and sorry I hadn't call U
-------------------------------------

2 answers:

Answer 0 (score: 1)

First, I created a text file with the provided data and placed it on HDFS at temp/data.txt.

data.txt:

alpha   hello,beta! how are you?
beta    I am fine alpha.How about you?
alpha   I am also doing fine...
alpha   Actually, beta, I am bit busy nowadays and sorry I hadn't call U

Then I created a case class, read the file, and processed it into a DataFrame:

import spark.implicits._  // required for the toDF conversion

case class PersonMessage(Person: String, Message: String)

val df = sc.textFile("temp/data.txt").map { x =>
  val splits = x.split("\t")
  PersonMessage(splits(0), splits(1))
}.toDF("Person", "Message")
df.show
+------+--------------------+
|Person|             Message|
+------+--------------------+
| alpha|hello,beta! how a...|
|  beta|I am fine alpha.H...|
| alpha|I am also doing f...|
| alpha|Actually, beta, I...|
+------+--------------------+
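Note that the output above keeps the original names, while the question asked for numeric person IDs (1 for alpha, 2 for beta). A minimal sketch of that extra mapping step, assuming the two speakers are known in advance (the `personIds` map is a hypothetical helper, not from the original answer):

```scala
// Hypothetical name-to-ID mapping; extend if more speakers appear.
val personIds = Map("alpha" -> "1", "beta" -> "2")

val dfWithIds = sc.textFile("temp/data.txt").map { line =>
  val splits = line.split("\t")
  // Fall back to the raw name if a speaker is not in the map.
  PersonMessage(personIds.getOrElse(splits(0), splits(0)), splits(1))
}.toDF("Person", "Message")
dfWithIds.show
```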

Answer 1 (score: 0)

If you read the text file and parse it:

Example:

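The body of this answer is cut off in the source. A plausible sketch of the same idea, using Spark's built-in CSV reader with a tab delimiter instead of a raw RDD (assumes a `SparkSession` named `spark` and the same temp/data.txt file; `delimiter` is a standard DataFrameReader CSV option):

```scala
// Let the CSV reader do the tab splitting, then name the columns.
val df = spark.read
  .option("delimiter", "\t")
  .csv("temp/data.txt")
  .toDF("Person", "Message")
df.show
```

This avoids hand-splitting each line and handles quoting for you, at the cost of being less flexible if a message itself ever contains a tab.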