I am running the following notebook in Zeppelin:
%spark.pyspark
l = [('user1', 33, 1.0, 'chess'), ('user2', 34, 2.0, 'tenis'), ('user3', None, None, ''), ('user4', None, 4.0, ' '), ('user5', None, 5.0, 'ski')]
df = spark.createDataFrame(l, ['name', 'age', 'ratio', 'hobby'])
df.printSchema()
df.show()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- ratio: double (nullable = true)
|-- hobby: string (nullable = true)
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1| 33| 1.0|chess|
|user2| 34| 2.0|tenis|
|user3|null| null| |
|user4|null| 4.0| |
|user5|null| 5.0| ski|
+-----+----+-----+-----+
from pyspark.sql.functions import count
agg_df = df.select(*[(1.0 - (count(c) / count('*'))).alias(c) for c in df.columns])
agg_df.show()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- ratio: double (nullable = true)
|-- hobby: string (nullable = true)
+----+---+-------------------+-----+
|name|age| ratio|hobby|
+----+---+-------------------+-----+
| 0.0|0.6|0.19999999999999996| 0.0|
+----+---+-------------------+-----+
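Now I want to select only the columns of agg_df whose value is < 0.35. In this case, it should return ['name', 'ratio', 'hobby'].
I can't figure out how to do this. Any hints?
Answer 0: (score: 4)
You mean values < 0.35? This should do it: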
>>> [ key for (key,value) in agg_df.collect()[0].asDict().items() if value < 0.35 ]
['hobby', 'ratio', 'name']
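Note that the list above comes back in dictionary order rather than in the DataFrame's column order. A minimal sketch of an order-preserving variant follows. `columns_below` is a hypothetical helper name, and the `fractions` dict is a stand-in for `agg_df.first().asDict()` (values copied from the question's output) so that the snippet runs without Spark.

```python
def columns_below(fractions, columns, threshold=0.35):
    """Return the names in `columns` whose value in `fractions` is below `threshold`.

    Iterating over `columns` (not over the dict) keeps the original column order.
    Columns whose value is None are skipped.
    """
    return [c for c in columns
            if fractions[c] is not None and fractions[c] < threshold]

# Stand-in for agg_df.first().asDict(); deliberately in a shuffled key order.
fractions = {'hobby': 0.0, 'ratio': 0.19999999999999996, 'age': 0.6, 'name': 0.0}
# Stand-in for agg_df.columns.
columns = ['name', 'age', 'ratio', 'hobby']

selected = columns_below(fractions, columns)
print(selected)  # ['name', 'ratio', 'hobby']
```

With a live DataFrame, the equivalent call would presumably be `columns_below(agg_df.first().asDict(), agg_df.columns)`.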
To replace empty strings with null, use the following UDF:
from pyspark.sql.functions import udf
process = udf(lambda x: None if not x else (x if x.strip() else None))
df.withColumn('hobby', process(df.hobby)).show()
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1| 33| 1.0|chess|
|user2| 34| 2.0|tenis|
|user3|null| null| null|
|user4|null| 4.0| null|
|user5|null| 5.0| ski|
+-----+----+-----+-----+
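To see what that lambda does to each kind of input, it can be called as an ordinary Python function before Spark wraps it. The sketch below uses the hypothetical name `blank_to_none` for the same expression, applied to the hobby values from the question plus an explicit None.

```python
def blank_to_none(x):
    # Same expression as the lambda passed to udf above:
    # None and '' are falsy, so they map to None directly;
    # whitespace-only strings are caught by the strip() check.
    return None if not x else (x if x.strip() else None)

hobbies = ['chess', 'tenis', '', ' ', 'ski', None]
cleaned = [blank_to_none(h) for h in hobbies]
print(cleaned)  # ['chess', 'tenis', None, None, 'ski', None]
```

Non-blank values are returned unchanged rather than stripped, so a value such as `' ski '` would keep its surrounding spaces.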
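Answer 1: (score: 0)
Here is the function I was looking for, written following rogue-one's pointers. I am not sure whether it is the fastest or most optimized approach: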
from functools import reduce
from pyspark.sql.functions import udf, count

def filter_columns(df, threshold=0.35):
    # UDF that maps null, empty, and whitespace-only strings to null
    process = udf(lambda x: None if not x else (x if x.strip() else None))
    # names of the string columns
    string_cols = [c for c, dtype in df.dtypes if dtype == 'string']
    # apply the UDF to every string column
    new_df = reduce(lambda acc, c: acc.withColumn(c, process(c)), string_cols, df)
    # fraction of null values per column: 1 - non_null_count / row_count
    agg_df = new_df.select(*[(1.0 - (count(c) / count('*'))).alias(c) for c in new_df.columns])
    # keep only the columns whose null fraction is below the threshold
    cols_match_threshold = [key for (key, value) in agg_df.collect()[0].asDict().items() if value < threshold]
    return new_df.select(cols_match_threshold)

filter_columns(df, 0.35).show()
+-----+-----+
|ratio| name|
+-----+-----+
| 1.0|user1|
| 2.0|user2|
| null|user3|
| 4.0|user4|
| 5.0|user5|
+-----+-----+
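The hobby column is missing from this result because its blank strings are turned into nulls before the fraction is computed. The arithmetic can be checked in plain Python; this is only a sketch, `null_fraction` is a hypothetical helper that mirrors the `1.0 - count(c) / count('*')` expression, and the two lists reproduce the hobby column before and after cleaning.

```python
def null_fraction(values):
    """Fraction of entries that are None, computed as 1 - non_null / total."""
    non_null = sum(1 for v in values if v is not None)
    return 1.0 - non_null / len(values)

hobby_raw = ['chess', 'tenis', '', ' ', 'ski']          # as created in the question
hobby_cleaned = ['chess', 'tenis', None, None, 'ski']   # after blanks become null

raw_fraction = null_fraction(hobby_raw)
cleaned_fraction = null_fraction(hobby_cleaned)

print(raw_fraction)             # 0.0: empty strings still count as non-null
print(cleaned_fraction < 0.35)  # False: 2 of 5 values are null, about 0.4
```

Since roughly 0.4 is not below the 0.35 threshold, hobby is dropped, while name (0.0) and ratio (about 0.2) are kept.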