pyspark dataframe filter or include based on list

Posted: 2016-11-04 11:44:13

Tags: dataframe filter spark

I am trying to filter a dataframe in pyspark using a list. I want to either filter out the records based on the list, or include only those records with a value in the list. My code below does not work:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records whose score is in list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records whose score is in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

This gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

3 Answers:

Answer 0 (score: 41)

What it is saying is that "df.score in l" cannot be evaluated, because df.score gives you a Column and "in" is not defined on that column type. Use "isin" instead.

The code should look something like this:

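A sketch of the isin-based version of the question's two filters, reusing the df and l defined above; the ~ negation for the exclusion case is inferred from the expected output rather than quoted from the original answer:

# filter out records whose score is in list l (note the ~ negation)
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records whose score is in list l
records = df.where(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)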

Answer 1 (score: 14)

Building on @user3133475's answer, it is also possible to call isin() on F.col(), like this:

import pyspark.sql.functions as F


l = [10,18,20]
df.filter(F.col("score").isin(l))

Answer 2 (score: 0)

For large dataframes, I found a join implementation to be significantly faster than where:
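The answer's code block did not survive in this page; below is a minimal sketch of such a join-based filter, assuming a helper that builds a one-column dataframe from the list and inner-joins on it (the function name and the distinct() step are illustrative, not taken from the original answer):

from pyspark.sql import SparkSession

def filter_by_list(df, column_name, values):
    # build a one-column dataframe from the list (deduplicated so the join
    # does not multiply rows), then inner-join to keep only matching rows
    spark = SparkSession.builder.getOrCreate()
    filter_df = spark.createDataFrame([(v,) for v in values], [column_name]).distinct()
    return df.join(filter_df, on=column_name, how="inner")

# usage with the df and l from the question
records = filter_by_list(df, "score", l)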