I am trying to filter a DataFrame in PySpark using a list. I want to either filter out records whose value is in the list, or include only records whose value is in the list. My code does not work:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# define a list of scores
l = [10,18,20]
# filter out records by scores by list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)
# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
It gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Answer 0 (score: 41)
What it is saying is that "df.score in l" cannot be evaluated, because df.score gives you a Column and "in" is not defined on that column type; use "isin" instead.
The code should be something like this:
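(A minimal sketch based on that explanation, reusing df and l from the question; the ~ negation for the exclusion case follows the hint in the error message.)
# include only records with these scores in list l
records = df.where(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
# filter out records whose score is in list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)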
Answer 1 (score: 14)
Building on @user3133475's answer, it is also possible to call the isin() method on F.col(), like this:
import pyspark.sql.functions as F
l = [10,18,20]
df.filter(F.col("score").isin(l))
Answer 2 (score: 0)
For large dataframes, I found the join implementation to be significantly faster than where:
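The code for this answer is not shown above; the sketch below illustrates what such a join-based filter could look like. The helper name filter_df_by_list is made up; it relies on spark.createDataFrame producing a single column named "value" when given a plain Python list and a DataType.
from pyspark.sql import SparkSession

def filter_df_by_list(df, column_name, filter_list):
    """Return the subset of df whose column_name value appears in filter_list."""
    spark = SparkSession.builder.getOrCreate()
    # Build a one-column DataFrame from the list; the column is named "value" by default.
    filter_df = spark.createDataFrame(filter_list, df.schema[column_name].dataType)
    # An inner join keeps only matching rows; drop the helper column afterwards.
    return df.join(filter_df, df[column_name] == filter_df["value"]).drop("value")

# e.g. filter_df_by_list(df, "score", [10, 18, 20])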