pyspark: delete rows from a DataFrame where the Row field length == 40000

Time: 2016-12-12 00:18:45

Tags: apache-spark dataframe row pyspark

I have a DataFrame[item: string, true_recoms: map<string,int>]

with the schema:

StructType(List(StructField(item,StringType,true),StructField(true_recoms,MapType(StringType,IntegerType,true),true)))

I want to delete the rows where the length of true_recoms == 40000.

1 answer:

Answer 0: (score: 0)

Not very elegant, but:

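from pyspark.sql.types import IntegerType

# Register a UDF that returns the number of entries in the map column.
sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
# Keep rows with fewer than 40000 entries; "train" must already be registered as a table.
train = sqlContext.sql("SELECT * FROM train WHERE stringLengthInt(true_recoms) < 40000")
sqlContext.registerDataFrameAsTable(train, "train")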
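
The schema marks true_recoms as nullable, and a bare len(x) raises a TypeError when the value is None. A minimal sketch of a null-safe variant follows. The names map_length, register_map_length and mapLength are hypothetical, and counting a NULL map as 0 entries is an assumption, not something the question specifies.

```python
def map_length(m):
    """Return the number of entries in a map value, counting NULL (None) as 0."""
    return 0 if m is None else len(m)


def register_map_length(sqlContext, name="mapLength"):
    """Register map_length as a SQL UDF under a hypothetical name."""
    # Imported here so that map_length itself can be used and tested without Spark.
    from pyspark.sql.types import IntegerType
    sqlContext.udf.register(name, map_length, IntegerType())
```

After calling register_map_length(sqlContext), the function could be used in the query the same way as stringLengthInt.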

Check:

sqlContext.sql("SELECT item, stringLengthInt(true_recoms) AS l FROM train ORDER BY l DESC").collect()
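
A possible alternative that avoids the Python UDF is Spark SQL's built-in size() function, passed to DataFrame.filter as an expression string. The sketch below drops exactly the rows the question asks about (length == 40000), which differs slightly from the < 40000 filter in the answer. The helper names are hypothetical, and how size() treats a NULL map depends on the Spark version and configuration, so rows with a NULL map should be checked separately.

```python
def size_filter_expr(column="true_recoms", length=40000):
    """Build a Spark SQL predicate that is true when the map does NOT have `length` entries."""
    return "size({0}) != {1}".format(column, int(length))


def drop_rows_with_map_size(df, column="true_recoms", length=40000):
    """Return a new DataFrame without the rows whose map column has exactly `length` entries."""
    # DataFrame.filter accepts a SQL expression string, so no UDF registration is needed.
    return df.filter(size_filter_expr(column, length))
```

With the DataFrame from the question this would be called as train = drop_rows_with_map_size(train).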