Pyspark: saveAsTable on Windows does not work with Windows paths

Date: 2020-07-07 17:53:30

Tags: pyspark

I am trying to save a CSV file using a Windows path (with "\" instead of "/"). I suspect it fails because of the Windows path.

  1. Is the Windows path really the reason this code does not work?
  2. Is there a way around this?

Code:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row

def init_spark(appname):
  spark = SparkSession.builder.appName(appname).getOrCreate()
  sc = spark.sparkContext
  return spark,sc

def run_on_configs_spark():
  spark,sc = init_spark(appname="bucket_analysis")
  p_configs_RDD = sc.parallelize([1,4,5])
  p_configs_RDD=p_configs_RDD.map(mul)
  schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
  df=spark.createDataFrame(p_configs_RDD,schema)
  df.write.saveAsTable(r"C:\Users\yuvalr\Desktop\example_csv",format="csv")


def mul(x):
  return (x,x**2)

run_on_configs_spark()

Error output:

Traceback (most recent call last):
  File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 426, in <module>
    analysis()
  File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 408, in analysis
    run_CDH()
  File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 420, in run_CDH
    max_prob_for_extension=None, max_base_size_B=4096,OP_arr=[0.2],
  File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 173, in settings_print
    dic=get_map_of_worst_seq(params)
  File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 245, in get_map_of_worst_seq
    run_over_settings_spark_test(info_obj)
  File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 239, in run_over_settings_spark_test
    run_on_configs_spark(configs)
  File "C:\Users\yuvalr\Desktop\Git_folder\algo_sim\Bucket_analysis\Set_multiple_configurations\spark_parallelized_configs.py", line 17, in run_on_configs_spark
    df.write.saveAsTable(r"C:\Users\yuvalr\Desktop\example_csv",format="csv")
  File "C:\Users\yuvalr\Desktop\spark\Spark\python\pyspark\sql\readwriter.py", line 868, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "C:\Users\yuvalr\venv\lib\site-packages\py4j\java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\yuvalr\Desktop\spark\Spark\python\pyspark\sql\utils.py", line 137, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException: 
mismatched input ':' expecting {<EOF>, '.', '-'}(line 1, pos 1)

== SQL ==
C:\Users\yuvalr\Desktop\example_csv
-^^^

2 Answers:

Answer 0 (score: 1):

As I see it, the problem is in your output line: saveAsTable() takes a table name, not a filesystem path, so Spark's SQL parser trips over the ':' in C:.

Try the following instead:

df.write.csv("file:///C:/Users/yuvalr/Desktop/example_csv.csv")
  • Yes, I know you are on Windows and expect backslashes in paths, but that is not how PySpark works
  • Windows is quite picky about file extensions; without .csv you may end up with just a folder named example_csv (see the sketch after this list)
  • You do not need the raw r"" string here
  • The file:/// prefix makes doubly sure it is a local file we are talking about
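One caveat worth adding to this answer: Spark's path-based writers always create a directory of part files at the target path, never a single bare file, regardless of the extension. A minimal sketch, reusing the asker's path for illustration, that coalesces to one partition so the output directory contains a single part-*.csv file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_write_demo").getOrCreate()
df = spark.createDataFrame([(1, 1), (4, 16), (5, 25)], ["a", "b"])

# coalesce(1) so the output directory holds one part file;
# the path-based writer still creates a directory, not a bare file
df.coalesce(1).write.mode("overwrite").csv(
    "file:///C:/Users/yuvalr/Desktop/example_csv", header=True)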

Answer 1 (score: 0):

As you can see, saveAsTable() expects a table name and writes into the directory configured by spark.sql.warehouse.dir.

saveAsTable(name, format=None, mode=None, partitionBy=None, **options)

Parameters

  • name – the table name
  • format – the format used to save
  • mode – one of append, overwrite, error, errorifexists, ignore (default: error)
  • partitionBy – names of partitioning columns
  • options – all other string options

Source: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

Workaround (note the escaped Windows path C:\\):

Set spark.sql.warehouse.dir to point at the target directory, as follows:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row

def init_spark(appname):
  spark = SparkSession.builder\
    .config("spark.sql.warehouse.dir", "C:\\Users\yuvalr\Desktop")\
    .appName(appname).getOrCreate()
  sc = spark.sparkContext
  return spark,sc

def run_on_configs_spark():
  spark,sc = init_spark(appname="bucket_analysis")
  p_configs_RDD = sc.parallelize([1,4,5])
  p_configs_RDD=p_configs_RDD.map(mul)
  schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
  df=spark.createDataFrame(p_configs_RDD,schema)
  df.write.saveAsTable("example_csv",format="csv",mode="overwrite")


def mul(x):
  return (x,x**2)

run_on_configs_spark()
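Once saveAsTable() succeeds, the table is registered in the metastore and can be read back by name rather than by path. A small usage sketch (not part of the original answer), assuming the spark session from the snippet above:

# read the saved table back by name
spark.table("example_csv").show()
# equivalently, via SQL:
spark.sql("SELECT * FROM example_csv").show()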

Edit 1: If it is an external table (the underlying files are stored at an external path), you can use the following:

#df.write.option("path","C:\\Users\yuvalr\Desktop").saveAsTable("example_csv",format="csv",mode="overwrite")


from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row


def init_spark(appname):
  spark = SparkSession.builder\
    .appName(appname).getOrCreate()
  sc = spark.sparkContext
  return spark,sc

def run_on_configs_spark():
  spark,sc = init_spark(appname="bucket_analysis")
  p_configs_RDD = sc.parallelize([1,4,5])
  p_configs_RDD=p_configs_RDD.map(mul)
  schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
  df=spark.createDataFrame(p_configs_RDD,schema)
  df.write.option("path","C:\\Users\yuvalr\Desktop").saveAsTable("example_csv",format="csv",mode="overwrite")


def mul(x):
  return (x,x**2)

run_on_configs_spark()
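With the path option set, Spark creates an external (unmanaged) table: dropping example_csv later removes only the metastore entry and leaves the CSV files on disk. A quick verification sketch, assuming the session from the snippet above is still active:

# show the table's location, provider and type (MANAGED vs EXTERNAL)
spark.sql("DESCRIBE TABLE EXTENDED example_csv").show(truncate=False)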