I'm new to PySpark. I'm trying to create a DataFrame, but I get an error.
from pyspark.sql.functions import col, to_date, year
adjustedDf = crime_mongodb_df.withColumn("Reported Date", to_date(col("Reported Date"), "d/MM/yyyy")).withColumn('year', year("Reported Date"))
yearGroup = adjustedDf.groupBy("year").sum("Offence Count")
yearGroup.printSchema()
yearGroup.show()
The schema prints fine:
root
|-- year: integer (nullable = true)
|-- sum(Offence Count): long (nullable = true)
The error occurs when I try to show or otherwise access yearGroup:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-113-c8b6150ea8cc> in <module>
4 yearGroup = adjustedDf.groupBy("year").sum("Offence Count")
5 yearGroup.printSchema()
----> 6 yearGroup.show()
7
8 years = sum(yearGroup.select("year").toPandas().values.tolist(),[])
~/FIT5202/jupyter/lib/python3.6/site-packages/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
378 """
379 if isinstance(truncate, bool) and truncate:
--> 380 print(self._jdf.showString(n, 20, vertical))
381 else:
382 print(self._jdf.showString(n, int(truncate), vertical))
~/FIT5202/jupyter/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
This is strange. My data has 60,000 rows, and if I use only the first 800 rows, it works.
Could anyone help?
Thanks
Answer 0 (score: -1)
Found the solution: the null values need to be removed.
yearGroup = yearGroup.filter(adjustedDf.year.isNotNull())