sqlContext.createDataFrame throws an error

Date: 2017-05-12 00:30:26

Tags: hadoop

I'm new to the Spark environment and I'm trying to import a CSV file into Spark 2.0.2. I'm using pyspark on Windows 10. Here is my code so far:

    from pyspark.sql.types import *
    import csv

    # Read the CSV file as a plain text RDD with 4 partitions
    projectFile = sc.textFile("bankfull.csv", 4)
    schema = StructType([StructField("int_field", IntegerType()), StructField("string_field", StringType())])
    # Remove the header row before parsing
    header = projectFile.first()
    projectHeader = projectFile.filter(lambda l: "age" in l)
    projectNoHeader = projectFile.subtract(projectHeader)
    # Parse each partition with the csv module, then apply the schema
    project_rdd = projectNoHeader.mapPartitions(lambda x: csv.reader(x, delimiter=","))
    project_df = sqlContext.createDataFrame(project_rdd, schema)

At this point, I get the following error message:

An error occurred while calling o23.applySchemaToPythonRDD.
: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
        at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:189)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
        at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
        at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
        at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
        at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
        at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
        at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
        at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
        at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
        at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
        at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
        at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
        at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:666)
        at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:656)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------
        at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)   
        at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
        ... 32 more

How can I fix this? Thanks.

1 Answer:

Answer 0 (score: 0)

When you run Spark on Windows, try to stand in for the Hive scratch database by creating a tmp/hive folder on the C: drive. To do this, you need to place winutils.exe in the bin folder of the $HADOOP_HOME you have set. You can download winutils.exe from this link. If you still run into problems, try granting full access permissions on the /tmp/hive directory, or create the c:/tmp/hive directory with administrator privileges.
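
As a minimal sketch of that fix (assuming HADOOP_HOME is already set and winutils.exe is in %HADOOP_HOME%\bin; the exact paths are an assumption, not something shown in the question), from an elevated Command Prompt:

    REM Create the scratch directory that the Hive client expects (hypothetical location C:\tmp\hive)
    mkdir C:\tmp\hive

    REM Use winutils to grant full permissions on it, so the "---------" permissions error goes away
    %HADOOP_HOME%\bin\winutils.exe chmod -R 777 C:\tmp\hive

After that, restart the pyspark shell so a fresh Hive session picks up the new permissions, then re-run the createDataFrame code above.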