I want to keep only the employees whose department ID is referenced in the second table.
Employee table:

LastName    DepartmentID
--------    ------------
Rafferty    31
Jones       33
Heisenberg  33
Robinson    34
Smith       34

Department table:

DepartmentID
------------
31
33
I have tried the following code, which does not work:
employee = [['Raffery',31], ['Jones',33], ['Heisenberg',33], ['Robinson',34], ['Smith',34]]
department = [31,33]
employee = sc.parallelize(employee)
department = sc.parallelize(department)
employee.filter(lambda e: e[1] in department).collect()
Py4JError: An error occurred while calling o344.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
Any ideas? I am using Spark 1.1.0 with Python, but I would accept an answer in either Scala or Python.
Answer 0 (score: 22)
What you want to achieve here is to filter each partition using the data contained in the department table. This would be the basic solution:
val dept = deptRdd.collect.toSet
val employeesWithValidDeptRdd = employeesRdd.filter{case (employee, d) => dept.contains(d)}
If your department data is large, a broadcast variable will improve performance by shipping the data to all nodes once, instead of serializing it with every task:
val deptBC = sc.broadcast(deptRdd.collect.toSet)
val employeesWithValidDeptRdd = employeesRdd.filter{case (employee, d) => deptBC.value.contains(d)}
Although using a join would work, it is a very expensive solution because it requires a distributed shuffle of the data to perform the join. Given that the requirement is a simple filter, sending the data to each partition (as shown above) will give much better performance.
Answer 1 (score: 10)
I finally implemented a solution using a join. I had to add a 0 value to the department entries to avoid an exception from Spark:
employee = [['Raffery',31], ['Jones',33], ['Heisenberg',33], ['Robinson',34], ['Smith',34]]
department = [31,33]
# invert id and name to get id as the key
employee = sc.parallelize(employee).map(lambda e: (e[1],e[0]))
# add a 0 value to avoid an exception
department = sc.parallelize(department).map(lambda d: (d,0))
employee.join(department).map(lambda e: (e[1][0], e[0])).collect()
output: [('Jones', 33), ('Heisenberg', 33), ('Raffery', 31)]
Answer 2 (score: 0)
Filtering multiple values across multiple columns:

If you are pulling the data from a database (Hive or a SQL-type database in this example) and need to filter on multiple columns, it may be easier to load the table with the first filter and then iterate the remaining filters over the RDD (multiple small iterations is the encouraged way of Spark programming):
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
val first_data_filter = hc.sql("SELECT col1, col2, col3 FROM tableName WHERE col3 IN ('value_1', 'value_2', 'value_3')")
val second_data_filter = first_data_filter.filter(rdd => rdd(1) == "50" || rdd(1) == "20")
val final_filtered_data = second_data_filter.filter(rdd => rdd(0) == "1500")
Of course, you have to know your data a bit in order to filter on the right values, but that is part of the analysis process.
Answer 3 (score: 0)
For the same example as above, I want to keep only the employees whose department ID is referenced in, or contained within, the second table. But it must not be a join operation; I want to see it done with "contains" or "in". What I mean is that 33 is "in" 334 and 335:
employee = [['Raffery',311], ['Jones',334], ['Heisenberg',335], ['Robinson',34], ['Smith',34]]
department = [31,33]
employee = sc.parallelize(employee)
department = sc.parallelize(department)