Question

我正在使用标准hdfs运行amazon emr的spark作业，而不是S3来存储我的文件。我在hdfs：// user / hive / warehouse /中有一个hive表，但是在我的spark作业运行时找不到它。我配置了spark属性spark.sql.warehouse.dir以反映我的hdfs目录，并且当yarn日志确实说：

17/03/28 19:54:05 INFO SharedState: Warehouse path is 'hdfs://user/hive/warehouse/'.

稍后会在日志中说明（页面末尾的完整日志）：

LogType:stdout
Log Upload Time:Tue Mar 28 19:54:15 +0000 2017
LogLength:854
Log Contents:
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    parquet_example(spark)
  File "test.py", line 9, in parquet_example
    tests = spark.read.parquet("test.parquet")
  File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in parquet
  File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/test.parquet;'
End of LogType:stdout

由于道路上存在不匹配，我做错了什么？

这是hive / warehouse的hdfs目录：

hdfs dfs -ls 

/user/hive/warehouse
Found 1 items
drwxrwxrwt   - hadoop hadoop          0 2017-03-28 18:50 /user/hive/warehouse/test

这是/ user / hadoop /给我的：

hdfs dfs -ls /user/hadoop/
Found 2 items
drwxr-xr-x   - hadoop hadoop          0 2017-03-28 16:53 /user/hadoop/.hiveJars
drwxr-xr-x   - hadoop hadoop          0 2017-03-28 19:54 /user/hadoop/.sparkStaging

这是我在python中的火花工作：

from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.sql import Row

def parquet_example(spark):
    tests = spark.read.parquet("test.parquet")
    tests.createOrReplaceTempView("tests")
    tests_result = spark.sql("SELECT * FROM test")
    tests_result.show()

if __name__ == "__main__":
    warehouseLocation = "hdfs://user/hive/warehouse/"
    spark = SparkSession.builder.appName("example").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()

    parquet_example(spark)
    spark.stop()

全纱日志：

Container: container_1490717578939_0012_01_000001 on ip-xxx-xx-xx-xxx.ec2.internal_8041
=========================================================================================
LogType:stderr
Log Upload Time:Tue Mar 28 19:54:15 +0000 2017
LogLength:14054
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/hadoop/filecache/131/__spark_libs__713193244228500015.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/03/28 19:54:01 INFO SignalUtils: Registered signal handler for TERM
17/03/28 19:54:01 INFO SignalUtils: Registered signal handler for HUP
17/03/28 19:54:01 INFO SignalUtils: Registered signal handler for INT
17/03/28 19:54:02 INFO ApplicationMaster: Preparing Local resources
17/03/28 19:54:03 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1490717578939_0012_000001
17/03/28 19:54:03 INFO SecurityManager: Changing view acls to: yarn,hadoop
17/03/28 19:54:03 INFO SecurityManager: Changing modify acls to: yarn,hadoop
17/03/28 19:54:03 INFO SecurityManager: Changing view acls groups to: 
17/03/28 19:54:03 INFO SecurityManager: Changing modify acls groups to: 
17/03/28 19:54:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
17/03/28 19:54:03 INFO ApplicationMaster: Starting the user application in a separate Thread
17/03/28 19:54:03 INFO ApplicationMaster: Waiting for spark context initialization...
17/03/28 19:54:03 INFO SparkContext: Running Spark version 2.1.0
17/03/28 19:54:03 INFO SecurityManager: Changing view acls to: yarn,hadoop
17/03/28 19:54:03 INFO SecurityManager: Changing modify acls to: yarn,hadoop
17/03/28 19:54:03 INFO SecurityManager: Changing view acls groups to: 
17/03/28 19:54:03 INFO SecurityManager: Changing modify acls groups to: 
17/03/28 19:54:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
17/03/28 19:54:03 INFO Utils: Successfully started service 'sparkDriver' on port 33579.
17/03/28 19:54:04 INFO SparkEnv: Registering MapOutputTracker
17/03/28 19:54:04 INFO SparkEnv: Registering BlockManagerMaster
17/03/28 19:54:04 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/03/28 19:54:04 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/03/28 19:54:04 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/blockmgr-f3713d64-91da-4cb5-9b55-d4a18c607a74
17/03/28 19:54:04 INFO DiskBlockManager: Created local directory at /mnt1/yarn/usercache/hadoop/appcache/application_1490717578939_0012/blockmgr-634c7d4b-026c-4df7-abf4-7846bd7fc958
17/03/28 19:54:04 INFO DiskBlockManager: Created local directory at /mnt2/yarn/usercache/hadoop/appcache/application_1490717578939_0012/blockmgr-19f0a265-755a-42f0-9282-1e3d98a57ab1
17/03/28 19:54:04 INFO MemoryStore: MemoryStore started with capacity 414.4 MB
17/03/28 19:54:04 INFO SparkEnv: Registering OutputCommitCoordinator
17/03/28 19:54:04 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
17/03/28 19:54:04 INFO Utils: Successfully started service 'SparkUI' on port 37056.
17/03/28 19:54:04 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://xxx.xx.xx.xxx:37056
17/03/28 19:54:04 INFO YarnClusterScheduler: Created YarnClusterScheduler
17/03/28 19:54:04 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1490717578939_0012 and attemptId Some(appattempt_1490717578939_0012_000001)
17/03/28 19:54:04 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/03/28 19:54:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34414.
17/03/28 19:54:04 INFO NettyBlockTransferService: Server created on xxx.xx.xx.xxx:34414
17/03/28 19:54:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/03/28 19:54:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
17/03/28 19:54:04 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xx.xx.xxx:34414 with 414.4 MB RAM, BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
17/03/28 19:54:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
17/03/28 19:54:04 INFO BlockManager: external shuffle service port = 7337
17/03/28 19:54:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
17/03/28 19:54:05 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1490717578939_0012_1
17/03/28 19:54:05 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/03/28 19:54:05 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
17/03/28 19:54:05 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
17/03/28 19:54:05 INFO YarnClusterScheduler: YarnClusterScheduler.postStartHook done
17/03/28 19:54:05 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM@xxx.xx.xx.xxx:33579)
17/03/28 19:54:05 INFO ApplicationMaster: 
===============================================================================
YARN executor launch context:
  env:
    CLASSPATH -> /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*<CPS>{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>/usr/lib/hadoop-lzo/lib/*<CPS>/usr/share/aws/emr/emrfs/conf<CPS>/usr/share/aws/emr/emrfs/lib/*<CPS>/usr/share/aws/emr/emrfs/auxlib/*<CPS>/usr/share/aws/emr/lib/*<CPS>/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar<CPS>/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar<CPS>/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-api-jdo.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-core.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-rdbms.jar<CPS>/usr/share/aws/emr/cloudwatch-sink/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>/usr/lib/hadoop-lzo/lib/*<CPS>/usr/share/aws/emr/emrfs/conf<CPS>/usr/share/aws/emr/emrfs/lib/*<CPS>/usr/share/aws/emr/emrfs/auxlib/*<CPS>/usr/share/aws/emr/lib/*<CPS>/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar<CPS>/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar<CPS>/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar<CPS>/usr/share/aws/emr/cloudwatch-sink/lib/*
    SPARK_YARN_STAGING_DIR -> hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1490717578939_0012
    SPARK_USER -> hadoop
    SPARK_YARN_MODE -> true
    PYTHONPATH -> {{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip

  command:
    LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH" \ 
      {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx5120m \ 
      '-verbose:gc' \ 
      '-XX:+PrintGCDetails' \ 
      '-XX:+PrintGCDateStamps' \ 
      '-XX:+UseConcMarkSweepGC' \ 
      '-XX:CMSInitiatingOccupancyFraction=70' \ 
      '-XX:MaxHeapFreeRatio=70' \ 
      '-XX:+CMSClassUnloadingEnabled' \ 
      '-XX:OnOutOfMemoryError=kill -9 %p' \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.history.ui.port=18080' \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@xxx.xx.xx.xxx:33579 \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      2 \ 
      --app-id \ 
      application_1490717578939_0012 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    py4j-0.10.4-src.zip -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/py4j-0.10.4-src.zip" } size: 74096 timestamp: 1490730839170 type: FILE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/__spark_conf__.zip" } size: 75741 timestamp: 1490730839402 type: ARCHIVE visibility: PRIVATE
    pyspark.zip -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/pyspark.zip" } size: 452353 timestamp: 1490730838849 type: FILE visibility: PRIVATE
    __spark_libs__ -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/__spark_libs__713193244228500015.zip" } size: 196686961 timestamp: 1490730836856 type: ARCHIVE visibility: PRIVATE
    hive-site.xml -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/hive-site.xml" } size: 2375 timestamp: 1490730837023 type: FILE visibility: PRIVATE

===============================================================================
17/03/28 19:54:05 INFO RMProxy: Connecting to ResourceManager at ip-xxx-xx-xx-xxx.ec2.internal/xxx-xx-xx-xxx:8030
17/03/28 19:54:05 INFO YarnRMClient: Registering the ApplicationMaster
17/03/28 19:54:05 INFO SharedState: Warehouse path is 'hdfs://user/hive/warehouse/'.
17/03/28 19:54:05 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/03/28 19:54:05 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
17/03/28 19:54:05 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/03/28 19:54:06 INFO metastore: Trying to connect to metastore with URI thrift://ip-xxx-xx-xx-xxx.ec2.internal:9083
17/03/28 19:54:06 INFO metastore: Connected to metastore.
17/03/28 19:54:06 INFO SessionState: Created local directory: /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/tmp/yarn
17/03/28 19:54:06 INFO SessionState: Created local directory: /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/tmp/5f653144-e990-45b0-ba73-cdb4d10e9f7a_resources
17/03/28 19:54:06 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/5f653144-e990-45b0-ba73-cdb4d10e9f7a
17/03/28 19:54:06 INFO SessionState: Created local directory: /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/tmp/yarn/5f653144-e990-45b0-ba73-cdb4d10e9f7a
17/03/28 19:54:06 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/5f653144-e990-45b0-ba73-cdb4d10e9f7a/_tmp_space.db
17/03/28 19:54:06 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs://user/hive/warehouse/
17/03/28 19:54:06 ERROR ApplicationMaster: User application exited with status 1
17/03/28 19:54:06 INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
17/03/28 19:54:06 INFO SparkContext: Invoking stop() from shutdown hook
17/03/28 19:54:06 INFO SparkUI: Stopped Spark web UI at http://xxx.xx.xx.xxx:37056
17/03/28 19:54:06 INFO YarnClusterSchedulerBackend: Shutting down all executors
17/03/28 19:54:06 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
17/03/28 19:54:06 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
17/03/28 19:54:06 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/28 19:54:06 INFO MemoryStore: MemoryStore cleared
17/03/28 19:54:06 INFO BlockManager: BlockManager stopped
17/03/28 19:54:06 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/28 19:54:06 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/28 19:54:06 INFO SparkContext: Successfully stopped SparkContext
17/03/28 19:54:06 INFO ShutdownHookManager: Shutdown hook called
17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt1/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-3a6db594-2b44-47fe-8e48-4220b93e789a
17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt2/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-a54516f0-48be-4fdb-899b-bbee998468b1
17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-552e3cae-c119-47a5-9c63-34d4df59d072
17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-552e3cae-c119-47a5-9c63-34d4df59d072/pyspark-a0240093-16c6-43e4-8f2c-dcef309afe97
End of LogType:stderr

LogType:stdout
Log Upload Time:Tue Mar 28 19:54:15 +0000 2017
LogLength:854
Log Contents:
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    parquet_example(spark)
  File "test.py", line 9, in parquet_example
    tests = spark.read.parquet("test.parquet")
  File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in parquet
  File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/test.parquet;'
End of LogType:stdout

Answer 1

问题中的函数parquet_example将从拼花文件test.parquet创建一个DataFrame，并通过创建临时视图从中查询。

来自评论：
由于名为test的Hive表已存在，因此直接使用创建的SparkSession

查询表

warehouseLocation = "hdfs://user/hive/warehouse/"
spark = SparkSession \
    .builder \
    .appName("example") \
    .config("spark.sql.warehouse.dir", warehouseLocation) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("SELECT * FROM test").show()

pyspark.sql.utils.AnalysisException：u'Path不存在

1 个答案: