I am experimenting with Spark on a virtual machine. I start Spark with ./bin/spark-shell and use Scala. Now I am confused about the key-value pair format in Scala.
I have a txt file in /home/feng/spark/data that looks like this:
panda 0
pink 3
pirate 3
panda 1
pink 4
I use sc.textFile to load this txt file. If I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
then I can use rdd.collect() to show the rdd on the screen:
scala> rdd.collect()
res26: Array[String] = Array(panda 0, pink 3, pirate 3, panda 1, pink 4)
However, if I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7.txt")
with ".txt" at the end this time, then when I use rdd.collect() I get an error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/feng/spark/A.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
......
But I have seen other examples, and all of them have ".txt" at the end. Is there something wrong with my code or my system?
The other thing is what I tried to do:
scala> val rddd = rdd.map(x => (x.split(" ")(0),x))
rddd: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29
scala> rddd.collect()
res0: Array[(String, String)] = Array((panda,panda 0), (pink,pink 3), (pirate,pirate 3), (panda,panda 1), (pink,pink 4))
I intended to pick the first column of the data and use it as the key. But rddd.collect() does not look right, because each word appears twice, which does not seem correct. So I cannot continue with the remaining operations such as mapbykey, reducebykey and so on. Where did I go wrong?
Any help is much appreciated.
Answer 0 (score: 1)
Just as an example, I create a String with your dataset, then split it by line and use SparkContext's parallelize method to create an RDD. Notice that after I create the RDD I use its map method to split the String stored in each record and convert it to a Row.
import org.apache.spark.sql.Row

// Turn the in-memory string into an RDD with one record per line,
// then split each line on the space and wrap the fields in a Row.
val text = "panda 0\npink 3\npirate 3\npanda 1\npink 4"
val rdd = sc.parallelize(text.split("\n")).map(x => Row(x.split(" "):_*))
rdd.take(3)
The output from the take method is:
res4: Array[org.apache.spark.sql.Row] = Array([panda,0], [pink,3], [pirate,3])
About your first question: there is no need for files to have any extension, because in this case they are read as plain text. The path you pass to sc.textFile just has to match the actual file name on disk, which is why adding ".txt" to a file that is not named that way fails with InvalidInputException.
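About your second question: the pairs you built are not actually wrong. In a pair like (panda,panda 0) the key is panda and the value is the whole original line, because you passed x itself as the second element of the tuple. If the intent is to aggregate by the first column, a minimal sketch (assuming the goal is to sum the numbers per key; the names lines, pairs and counts are only illustrative) could look like this:
// Start again from an RDD of plain lines (e.g. from sc.textFile, or the
// parallelized strings above before they are converted to Row).
val lines = sc.parallelize(text.split("\n"))

// Keep the first column as the key and parse the second column as an Int value.
val pairs = lines.map { line =>
  val fields = line.split(" ")
  (fields(0), fields(1).toInt)
}

// Sum the values that share the same key.
val counts = pairs.reduceByKey(_ + _)
counts.collect()
// Expected something like: Array((pirate,3), (panda,1), (pink,7)) -- order may vary
Spark only needs an RDD of two-element tuples to treat it as a pair RDD, so once the data is in (String, Int) form, reduceByKey and the other key-based operations become available.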