Getting null dates when reading a DataFrame in pyspark?

Asked: 2018-08-28 02:38:14

Tags: python apache-spark pyspark

I have a CSV file containing data in the following format:

02/04/2018,MZE-RM00007(Kg.),29530,14.5,428185
02/04/2018,MZE-RM00007(Kg.),29160,14.5,422820
02/04/2018,MZE-RM00007(Kg.),22500,14.501,326272.5
02/04/2018,MZE-RM00007(Kg.),29490,14.5,427605
02/04/2018,MZE-RM00007(Kg.),19750,14.5,286375
02/04/2018,MZE-RM00007(Kg.),30140,14.5,437030
02/04/2018,MZE-RM00007(Kg.),24730,14.25,352402.5
02/04/2018,MZE-RM00007(Kg.),29520,14.5,428040
03/04/2018,CHOLINE CHLORIDE-MD00027(Kg.),3000,93,279000

I am trying to read it in pyspark as follows:

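from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DateType, StringType, DoubleType

spark =  SparkSession.builder.\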
                appName("Weather_Data_Extraction_To_Delhi_Only_2017").\
                master("local").\
                config("spark.driver.memory", "4g").\
                config("spark.executor.memory", "2g").\
                getOrCreate()

MySchema = StructType([
    StructField("sDate", DateType(), True),        
    StructField("Items", StringType(), True),
    StructField("purchasedQTY", DoubleType(), True),
    StructField("rate", DoubleType(), True),
    StructField("purchasedVolume", DoubleType(), True),
])



linesDataFrame = spark.read.format("csv").schema(MySchema).load("/home/rajnish.kumar/eclipse-workspace/ShivShakti/Data/RMPurchaseData.csv")

print linesDataFrame.printSchema()

The printed schema is:

root
 |-- sDate: date (nullable = true)
 |-- Items: string (nullable = true)
 |-- purchasedQTY: double (nullable = true)
 |-- rate: double (nullable = true)
 |-- purchasedVolume: double (nullable = true)

None

Now, when I query:

linesDataFrame.select("sDate","Items","purchasedQTY","rate","purchasedVolume").show()

I get the following result:

+-----+-----+------------+----+---------------+
|sDate|Items|purchasedQTY|rate|purchasedVolume|
+-----+-----+------------+----+---------------+
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
+-----+-----+------------+----+---------------+
only showing top 20 rows

But when I query without the date column:

linesDataFrame.select("Items","purchasedQTY","rate","purchasedVolume").show()

I get this result:

+--------------------+------------+------+---------------+
|               Items|purchasedQTY|  rate|purchasedVolume|
+--------------------+------------+------+---------------+
|    MZE-RM00007(Kg.)|     29530.0|  14.5|       428185.0|
|    MZE-RM00007(Kg.)|     29160.0|  14.5|       422820.0|
|    MZE-RM00007(Kg.)|     22500.0|14.501|       326272.5|
|    MZE-RM00007(Kg.)|     29490.0|  14.5|       427605.0|
|    MZE-RM00007(Kg.)|     19750.0|  14.5|       286375.0|
|    MZE-RM00007(Kg.)|     30140.0|  14.5|       437030.0|
|    MZE-RM00007(Kg.)|     24730.0| 14.25|       352402.5|
|    MZE-RM00007(Kg.)|     29520.0|  14.5|       428040.0|
|CHOLINE CHLORIDE-...|      3000.0|  93.0|       279000.0|
|    MZE-RM00007(Kg.)|     19790.0|  14.0|       277060.0|
|    MZE-RM00007(Kg.)|     28020.0|  14.5|       406290.0|
|    MZE-RM00007(Kg.)|     26330.0|  14.0|       368620.0|
|    MZE-RM00007(Kg.)|     26430.0|  14.0|       370020.0|
|MOP DRY-MD00183(Kg.)|       300.0| 158.0|        47400.0|
|    mop-MD00094(Kg.)|       500.0| 147.0|        73500.0|
|    MZE-RM00007(Kg.)|     23380.0|  14.0|       327320.0|
|    MZE-RM00007(Kg.)|     31840.0|  14.0|       445760.0|
|    MZE-RM00007(Kg.)|     14370.0|  14.5|       208365.0|
|    MZE-RM00007(Kg.)|     20660.0|  14.5|       299570.0|
|    MZE-RM00007(Kg.)|     20220.0|  13.9|       281058.0|
+--------------------+------------+------+---------------+
only showing top 20 rows

Why does the query that includes "sDate" return null for every column, and how can I fix this?

1 answer:

Answer 0: (score: 0)

One approach is to read the date column as a string type:

StructField("date_column", StringType(), True)

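and then convert the string to a date with the to_date function, passing the pattern the file uses (the format argument is available from Spark 2.2). Note that date_format goes the other way: it renders a date as a string, so it will not produce a DateType column.

Example (use 'dd/MM/yyyy' instead if the day comes first in your data):
df.select(to_date('date_column', 'MM/dd/yyyy'))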