Spark SQL - DataFrame - select - Transformation or Action?

Date: 2017-10-05 09:41:46

Tags: java apache-spark

In Spark SQL (using the Java API), I have a DataFrame.

The DataFrame has a select method. I want to know whether it is a transformation or an action.

I just need a confirmation and a good reference that states it clearly.

3 Answers:

Answer 0 (score: 4):

It is a transformation. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.

Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions count, show, or writing data out to file systems.
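As a quick illustration of the difference (a minimal sketch; the object and variable names here are invented, not from the question), select returns a new Dataset immediately without running anything, and only an action such as count triggers the computation:

import org.apache.spark.sql.SparkSession

object SelectIsLazy extends App {
    val sparksession = SparkSession.builder()
        .appName("Select Is Lazy")
        .config("spark.master", "local")
        .getOrCreate()

    val df = sparksession.range(1, 100).toDF("numbers")

    // Transformation: returns a new Dataset instantly; no Spark job runs here.
    val doubled = df.select(df.col("numbers") * 2)

    // Action: this call is what actually triggers the computation.
    println(doubled.count())
}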

Answer 1 (score: 0):

select is a transformation function.

Refer to the Spark documentation.

For more information and explanation, read this.
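One more programmatic hint (a sketch, not part of the original answer; the object name is made up): select returns another DataFrame rather than a materialized value, which is the signature of a transformation:

import org.apache.spark.sql.{DataFrame, SparkSession}

object ReturnTypeCheck extends App {
    val sparksession = SparkSession.builder()
        .appName("Return Type Check")
        .config("spark.master", "local")
        .getOrCreate()

    val df = sparksession.range(1, 10).toDF("numbers")

    // select hands back a DataFrame (a Dataset of Row), not a result value,
    // so nothing has been computed at this point.
    val result: DataFrame = df.select(df.col("numbers"))
    println(result.getClass) // class org.apache.spark.sql.Dataset
}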

Answer 2 (score: -1):

If you execute the following code, you will be able to see the output in the console:

import org.apache.spark.sql.SparkSession

object learnSpark2 extends App {
    val sparksession = SparkSession.builder()
        .appName("Learn Spark")
        .config("spark.master", "local")
        .getOrCreate()

    val range = sparksession.range(1, 500).toDF("numbers")
    // select is a transformation; show(2) is the action that triggers
    // the computation and prints the first two rows.
    range.select(range.col("numbers"), range.col("numbers") + 10).show(2)
}

+-------+--------------+
|numbers|(numbers + 10)|
+-------+--------------+
|      1|            11|
|      2|            12|
+-------+--------------+

If you execute only the select without a show, as in the code below, you will see no output even though the code runs. This means select is just a transformation, not an action, so it will not be evaluated.

import org.apache.spark.sql.SparkSession

object learnSpark2 extends App {
    val sparksession = SparkSession.builder()
        .appName("Learn Spark")
        .config("spark.master", "local")
        .getOrCreate()

    val range = sparksession.range(1, 500).toDF("numbers")
    // Only the select transformation, with no action after it:
    // the plan is built, but nothing is ever computed.
    range.select(range.col("numbers"), range.col("numbers") + 10)
}

In the console:

19/01/03 22:46:25 INFO Utils: Successfully started service 'sparkDriver' on port 55531.
19/01/03 22:46:25 INFO SparkEnv: Registering MapOutputTracker
19/01/03 22:46:25 INFO SparkEnv: Registering BlockManagerMaster
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/01/03 22:46:25 INFO DiskBlockManager: Created local directory at C:\Users\swilliam\AppData\Local\Temp\blockmgr-9abc8a2c-15ee-4e4f-be04-9ef37ace1b7c
19/01/03 22:46:25 INFO MemoryStore: MemoryStore started with capacity 1992.9 MB
19/01/03 22:46:25 INFO SparkEnv: Registering OutputCommitCoordinator
19/01/03 22:46:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/01/03 22:46:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.192.99.214:4040
19/01/03 22:46:26 INFO Executor: Starting executor ID driver on host localhost
19/01/03 22:46:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55540.
19/01/03 22:46:26 INFO NettyBlockTransferService: Server created on 10.192.99.214:55540
19/01/03 22:46:26 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/01/03 22:46:26 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMasterEndpoint: Registering block manager 10.192.99.214:55540 with 1992.9 MB RAM, BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/UDEMY/SparkJob/spark-warehouse/').
19/01/03 22:46:26 INFO SharedState: Warehouse path is 'file:/C:/UDEMY/SparkJob/spark-warehouse/'.
19/01/03 22:46:27 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/01/03 22:46:29 INFO SparkContext: Invoking stop() from shutdown hook
19/01/03 22:46:29 INFO SparkUI: Stopped Spark web UI at http://10.192.99.214:4040
19/01/03 22:46:29 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/01/03 22:46:29 INFO MemoryStore: MemoryStore cleared
19/01/03 22:46:29 INFO BlockManager: BlockManager stopped
19/01/03 22:46:29 INFO BlockManagerMaster: BlockManagerMaster stopped
19/01/03 22:46:29 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/01/03 22:46:29 INFO SparkContext: Successfully stopped SparkContext
19/01/03 22:46:29 INFO ShutdownHookManager: Shutdown hook called
19/01/03 22:46:29 INFO ShutdownHookManager: Deleting directory C:\Users\swilliam\AppData\Local\Temp\spark-c69bfb9b-f351-45af-9947-77950b23dd15
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStore="C:\Program Files\SquirrelSQL\certificates\jssecacerts"
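Another way to convince yourself (a sketch reusing the same setup as the code above; the object name is made up) is to call explain on the result of the select: Spark prints the physical plan it has built for the query, but still runs no job, because explain only inspects the plan:

import org.apache.spark.sql.SparkSession

object ExplainSelect extends App {
    val sparksession = SparkSession.builder()
        .appName("Explain Select")
        .config("spark.master", "local")
        .getOrCreate()

    val range = sparksession.range(1, 500).toDF("numbers")

    // explain() prints the plan to the console without computing any rows.
    range.select(range.col("numbers"), range.col("numbers") + 10).explain()
}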