How do I convert a ".txt" file to ".parquet" with SparkSQL in Spark 2.1.0?

Time: 2017-07-05 03:43:49

Tags: apache-spark apache-spark-sql spark-dataframe parquet

Look, I tested it with the "spark-shell" command, following the SQL programming guide (https://spark.apache.org/docs/latest/sql-programming-guide.html).


scala> case class IP(country: String) extends Serializable
17/07/05 11:20:09 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.50.3:42868 in memory (size: 33.1 KB, free: 93.3 MB)
17/07/05 11:20:09 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.50.3:40888 in memory (size: 33.1 KB, free: 93.3 MB)
17/07/05 11:20:09 INFO ContextCleaner: Cleaned accumulator 0
17/07/05 11:20:09 INFO ContextCleaner: Cleaned accumulator 1
defined class IP

scala> import spark.implicits._
import spark.implicits._

scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode

scala> val df = spark.sparkContext.textFile("/test/guchao/ip.txt").map(x => x.split("\\|", -1)).map(x => IP(x(0))).toDF()
17/07/05 11:20:36 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 216.5 KB, free 92.9 MB)
17/07/05 11:20:36 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 20.8 KB, free 92.8 MB)
17/07/05 11:20:36 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.50.3:42868 (size: 20.8 KB, free: 93.3 MB)
17/07/05 11:20:36 INFO SparkContext: Created broadcast 2 from textFile at <console>:33
df: org.apache.spark.sql.DataFrame = [country: string]

scala> df.write.mode(SaveMode.Overwrite).save("/test/guchao/ip.parquet")
17/07/05 11:20:44 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
17/07/05 11:20:44 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
17/07/05 11:20:44 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
17/07/05 11:20:44 INFO CodeGenerator: Code generated in 88.405717 ms
17/07/05 11:20:44 INFO FileInputFormat: Total input paths to process : 1
17/07/05 11:20:44 INFO SparkContext: Starting job: save at <console>:36
17/07/05 11:20:44 INFO DAGScheduler: Got job 1 (save at <console>:36) with 2 output partitions
17/07/05 11:20:44 INFO DAGScheduler: Final stage: ResultStage 1 (save at <console>:36)
17/07/05 11:20:44 INFO DAGScheduler: Parents of final stage: List()
17/07/05 11:20:44 INFO DAGScheduler: Missing parents: List()
17/07/05 11:20:44 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[12] at save at <console>:36), which has no missing parents
17/07/05 11:20:44 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 77.3 KB, free 92.8 MB)
17/07/05 11:20:44 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 29.3 KB, free 92.7 MB)
17/07/05 11:20:44 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.50.3:42868 (size: 29.3 KB, free: 93.2 MB)
17/07/05 11:20:44 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996
17/07/05 11:20:44 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[12] at save at <console>:36)
17/07/05 11:20:44 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
17/07/05 11:20:44 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 192.168.50.3, executor 0, partition 0, ANY, 6027 bytes)
17/07/05 11:20:44 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.50.3:40888 (size: 29.3 KB, free: 93.3 MB)
17/07/05 11:20:45 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.50.3:40888 (size: 20.8 KB, free: 93.2 MB)
17/07/05 11:20:45 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 192.168.50.3, executor 0, partition 1, ANY, 6027 bytes)
17/07/05 11:20:45 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 679 ms on 192.168.50.3 (executor 0) (1/2)
17/07/05 11:20:46 INFO DAGScheduler: ResultStage 1 (save at <console>:36) finished in 1.476 s
17/07/05 11:20:46 INFO DAGScheduler: Job 1 finished: save at <console>:36, took 1.597097 s
17/07/05 11:20:46 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 804 ms on 192.168.50.3 (executor 0) (2/2)
17/07/05 11:20:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/07/05 11:20:46 INFO FileFormatWriter: Job null committed.

But the result is:

[root@master ~]# hdfs dfs -ls -h /test/guchao
17/07/05 11:20:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxr-xr-x   - root supergroup      0 2017-07-05 11:20 /test/guchao/ip.parquet
-rw-r--r--   1 root supergroup 23.9 M 2017-07-05 10:05 /test/guchao/ip.txt

Why is the size of "ip.parquet" 0? I don't understand this and am confused.

Thanks!

2 Answers:

Answer 0 (score: 0)

/test/guchao/ip.parquet is a directory. Go into that directory and you should find something like a part-00000 file; that is the file you are looking for.

hadoop fs -ls /test/guchao/ip.parquet
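
For example, to check that the data really landed in that directory, you can read it back in the same spark-shell session (a minimal sketch, reusing the paths from the question):

scala> val parquetDF = spark.read.parquet("/test/guchao/ip.parquet")
scala> parquetDF.printSchema()   // should show a single column: country: string
scala> parquetDF.count()         // number of rows that were written
scala> parquetDF.show(5)         // sample a few records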

Answer 1 (score: 0)

hdfs dfs -ls -h <path> shows the size of files; for a directory it shows 0.

df.write.mode(SaveMode.Overwrite).save("/test/guchao/ip.parquet")

This creates /test/guchao/ip.parquet as a directory containing the part files, which is why it shows a size of 0.

hadoop fs -ls /test/guchao/ip.parquet 

This should show the actual size of the output files.

If you want the total size of the directory, you can use

hadoop fs -du -s /test/guchao/ip.parquet
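
If you would rather end up with a single Parquet part file inside that directory, one option (just a sketch, and only sensible for small data, since coalesce(1) pushes all rows through a single task) is to coalesce before writing:

scala> import org.apache.spark.sql.SaveMode
scala> // collapse to one partition so the writer emits a single part file
scala> df.coalesce(1).write.mode(SaveMode.Overwrite).parquet("/test/guchao/ip.parquet")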

Hope this helps!