I am trying to filter a DataFrame in PySpark using a list. I want to either filter out records whose value is in the list, or include only records whose value is in the list. My code does not work:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# define a list of scores
l = [10,18,20]
# filter out records by scores by list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)
# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
It gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Answer 0 (score: 41)
What it is saying is that "df.score in l" cannot be evaluated, because df.score gives you a Column and "in" is not defined on that column type; use "isin" instead.
The code should be something like this:
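(A minimal sketch based on that explanation, reusing df and l from the question; the ~ negation for the exclusion case follows the hint in the error message.)
# include only records with these scores in list l
records = df.where(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
# filter out records whose score is in list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)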
Answer 1 (score: 14)
Building on @user3133475's answer, it is also possible to call the isin() method on F.col(), like this:
import pyspark.sql.functions as F
l = [10,18,20]
df.filter(F.col("score").isin(l))
Answer 2 (score: 0)
For large dataframes, I found the join implementation to be significantly faster than where:
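The code for this answer is not shown above; the sketch below illustrates what such a join-based filter could look like. The helper name filter_df_by_list is made up; it relies on spark.createDataFrame producing a single column named "value" when given a plain Python list and a DataType.
from pyspark.sql import SparkSession

def filter_df_by_list(df, column_name, filter_list):
    """Return the subset of df whose column_name value appears in filter_list."""
    spark = SparkSession.builder.getOrCreate()
    # Build a one-column DataFrame from the list; the column is named "value" by default.
    filter_df = spark.createDataFrame(filter_list, df.schema[column_name].dataType)
    # An inner join keeps only matching rows; drop the helper column afterwards.
    return df.join(filter_df, df[column_name] == filter_df["value"]).drop("value")

# e.g. filter_df_by_list(df, "score", [10, 18, 20])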