I have a DataFrame named df:
age height weight
20 178 83
36 182 74
...
Nan 168 Nan
I want to create a new DataFrame that gives the number of NaN values in each column of df.
I expect the resulting DataFrame df1 to look like this:
age 3
height 0
weight 29
To do this, I tried:
df1=spark.createDataFrame(df.columns, "string").toDF("colonnes")
for i in df1.rdd.collect():
    df1['number_missing_values'] = df[df1[i]].isnull().count()
But I get this error:
u"cannot resolve '`opp_intgid_`' given input columns: [colonnes];;\n'Project ['opp_intgid_]\n+- AnalysisBarrier\n +- Project [value#2335577 AS colonnes#2335579]\n +- LogicalRDD [value#2335577], false\n"
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1166, in __getitem__
    return self.select(*item)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1202, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u"cannot resolve '`opp_intgid_`' given input columns: [colonnes];;\n'Project ['opp_intgid_]\n+- AnalysisBarrier\n +- Project [value#2335577 AS colonnes#2335579]\n +- LogicalRDD [value#2335577], false\n"
Any ideas?
Thanks