Question

For some reason, Spark is writing blank files. I'm not sure what I'm doing wrong.

from pyspark.sql import SparkSession, DataFrame, DataFrameWriter, functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType, TimestampType


if __name__ == "__main__":

    print('start')
    spark = SparkSession \
        .builder \
        .appName("testing") \
        .config("spark.ui.enabled", "true") \
        .master("yarn-client") \
        .getOrCreate()


    # Explicit schema: two timestamps, three strings, one integer (all nullable).
    myschema = StructType([
        StructField("field1", TimestampType(), True),
        StructField("field2", TimestampType(), True),
        StructField("field3", StringType(), True),
        StructField("field4", StringType(), True),
        StructField("field5", StringType(), True),
        StructField("field6", IntegerType(), True),
    ])

    df = spark.read.load(
        "s3a://bucket/file.csv",
        format="csv",
        sep=",",
        # inferSchema="true",
        timestampFormat="MM/dd/yyyy HH:mm:ss",
        header="true",
        schema=myschema,
    )

    print(df.count()) #output is 50

    df.write.csv(path="s3a://bucket/folder", header="true")

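The print statement outputs 50, which is correct. However, the output file on S3 contains only a header and no data. Should I add another option to the write call? I'm not sure why no data is being written.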
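One thing that may be worth ruling out (an assumption, not a confirmed cause): in Spark's default PERMISSIVE CSV mode, fields that do not parse against a supplied schema are set to null, and count() still counts those rows. Below is a minimal sketch in plain Python for checking a downloaded sample of the file against the declared schema. The helper name find_schema_mismatches and the sample rows are hypothetical, and Python's strptime is somewhat more lenient than Spark's "MM/dd/yyyy HH:mm:ss" pattern, so treat the result as a rough check only.

```python
import csv
import io
from datetime import datetime

# Python counterpart of the Spark pattern "MM/dd/yyyy HH:mm:ss" from the question.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M:%S"
EXPECTED_COLUMNS = 6  # field1 .. field6 in myschema


def find_schema_mismatches(csv_text):
    """Return (line_number, reason) for each data row that does not fit the schema."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for line_number, row in enumerate(reader, start=2):
        if len(row) != EXPECTED_COLUMNS:
            problems.append((line_number, "expected 6 columns, got %d" % len(row)))
            continue
        # field1 and field2 are declared as timestamps; empty means null.
        for index in (0, 1):
            value = row[index]
            if value == "":
                continue
            try:
                datetime.strptime(value, TIMESTAMP_FORMAT)
            except ValueError:
                problems.append((line_number, "field%d is not a valid timestamp: %r" % (index + 1, value)))
        # field6 is declared as an integer; empty means null.
        if row[5] != "":
            try:
                int(row[5])
            except ValueError:
                problems.append((line_number, "field6 is not an integer: %r" % row[5]))
    return problems


# Hypothetical sample data, not taken from the real file.
sample = (
    "field1,field2,field3,field4,field5,field6\n"
    "01/31/2018 10:00:00,01/31/2018 11:00:00,a,b,c,1\n"
    "2018-01-31 10:00:00,01/31/2018 11:00:00,a,b,c,2\n"
    "01/31/2018 10:00:00,01/31/2018 11:00:00,a,b,c\n"
)

mismatches = find_schema_mismatches(sample)
for line_number, reason in mismatches:
    print(line_number, reason)
```

On the Spark side, passing mode="FAILFAST" to the CSV reader makes it raise on malformed rows instead of nulling their fields, and calling df.show() before the write shows what the DataFrame actually holds.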

Spark is writing blank files

0 answers: