I have a CSV file containing data in the following format:
02/04/2018,MZE-RM00007(Kg.),29530,14.5,428185
02/04/2018,MZE-RM00007(Kg.),29160,14.5,422820
02/04/2018,MZE-RM00007(Kg.),22500,14.501,326272.5
02/04/2018,MZE-RM00007(Kg.),29490,14.5,427605
02/04/2018,MZE-RM00007(Kg.),19750,14.5,286375
02/04/2018,MZE-RM00007(Kg.),30140,14.5,437030
02/04/2018,MZE-RM00007(Kg.),24730,14.25,352402.5
02/04/2018,MZE-RM00007(Kg.),29520,14.5,428040
03/04/2018,CHOLINE CHLORIDE-MD00027(Kg.),3000,93,279000
I am trying to read it in PySpark as follows:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DateType, StringType, DoubleType

spark = SparkSession.builder.\
    appName("Weather_Data_Extraction_To_Delhi_Only_2017").\
    master("local").\
    config("spark.driver.memory", "4g").\
    config("spark.executor.memory", "2g").\
    getOrCreate()

MySchema = StructType([
    StructField("sDate", DateType(), True),
    StructField("Items", StringType(), True),
    StructField("purchasedQTY", DoubleType(), True),
    StructField("rate", DoubleType(), True),
    StructField("purchasedVolume", DoubleType(), True),
])

linesDataFrame = spark.read.format("csv").schema(MySchema).load("/home/rajnish.kumar/eclipse-workspace/ShivShakti/Data/RMPurchaseData.csv")
linesDataFrame.printSchema()
The printed schema is:
root
 |-- sDate: date (nullable = true)
 |-- Items: string (nullable = true)
 |-- purchasedQTY: double (nullable = true)
 |-- rate: double (nullable = true)
 |-- purchasedVolume: double (nullable = true)
Now, when I run this query:
linesDataFrame.select("sDate","Items","purchasedQTY","rate","purchasedVolume").show()
I get the following result:
+-----+-----+------------+----+---------------+
|sDate|Items|purchasedQTY|rate|purchasedVolume|
+-----+-----+------------+----+---------------+
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
+-----+-----+------------+----+---------------+
only showing top 20 rows
But when I run this query instead:
linesDataFrame.select("Items","purchasedQTY","rate","purchasedVolume").show()
I get the following result:
+--------------------+------------+------+---------------+
|               Items|purchasedQTY|  rate|purchasedVolume|
+--------------------+------------+------+---------------+
|    MZE-RM00007(Kg.)|     29530.0|  14.5|       428185.0|
|    MZE-RM00007(Kg.)|     29160.0|  14.5|       422820.0|
|    MZE-RM00007(Kg.)|     22500.0|14.501|       326272.5|
|    MZE-RM00007(Kg.)|     29490.0|  14.5|       427605.0|
|    MZE-RM00007(Kg.)|     19750.0|  14.5|       286375.0|
|    MZE-RM00007(Kg.)|     30140.0|  14.5|       437030.0|
|    MZE-RM00007(Kg.)|     24730.0| 14.25|       352402.5|
|    MZE-RM00007(Kg.)|     29520.0|  14.5|       428040.0|
|CHOLINE CHLORIDE-...|      3000.0|  93.0|       279000.0|
|    MZE-RM00007(Kg.)|     19790.0|  14.0|       277060.0|
|    MZE-RM00007(Kg.)|     28020.0|  14.5|       406290.0|
|    MZE-RM00007(Kg.)|     26330.0|  14.0|       368620.0|
|    MZE-RM00007(Kg.)|     26430.0|  14.0|       370020.0|
|MOP DRY-MD00183(Kg.)|       300.0| 158.0|        47400.0|
|    mop-MD00094(Kg.)|       500.0| 147.0|        73500.0|
|    MZE-RM00007(Kg.)|     23380.0|  14.0|       327320.0|
|    MZE-RM00007(Kg.)|     31840.0|  14.0|       445760.0|
|    MZE-RM00007(Kg.)|     14370.0|  14.5|       208365.0|
|    MZE-RM00007(Kg.)|     20660.0|  14.5|       299570.0|
|    MZE-RM00007(Kg.)|     20220.0|  13.9|       281058.0|
+--------------------+------------+------+---------------+
only showing top 20 rows
Why does the query that includes "sDate" return null for every column, and how can I fix this?
Answer 0 (score: 0)
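One approach is to read the date column as a string type:
StructField("date_column", StringType(), True)
and then convert the string to a date with the to_date function, passing the pattern that matches the file (date_format does the reverse: it formats a date as a string).
Ex:
from pyspark.sql.functions import to_date
df.select(to_date('date_column', 'MM/dd/yyyy'))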