Comparing each value of two RDDs in pyspark

Time: 2016-12-02 12:23:53

Tags: pyspark

I have two RDDs. For example:

employee =    [(31, ['Raffery', 31, 'a', 'b']),
               (33, ['Jones', 33, '1', 'b']),
               (32, ['Heisenberg', 33, 'a', 'b']),
               (37, ['Robinson', 34, 'c', 'cc']),
               (38, ['Smith', 34, 'a', 'b'])]

department =   [(31, ['Raffery', 31, 'c', 'b']),
                (33, ['Jones', 33, 'a', 'b']),
                (34, ['Heisenberg', 33, 'a', 'b'])]

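For each key, I want to compare the elements of the first RDD with the elements of the second.

The output should look like:

31, and the mismatch is at e[1][2]

33, and the mismatch is at e[1][2]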

1 answer:

Answer 0: (score: 1)

I'm not sure how strict the output format needs to be, but the following gets almost all the way there:

Using pyspark dataframes:

>>> employee = spark.createDataFrame([(31, ['Raffery', 31, 'a', 'b']), (33, ['Jones', 33, '1', 'b']), (32, ['Heisenberg', 33, 'a', 'b'])], ["id_e", "list_e"])
>>> employee.show(truncate=False)
+----+----------------------+
|id_e|list_e                |
+----+----------------------+
|31  |[Raffery, 31, a, b]   |
|33  |[Jones, 33, 1, b]     |
|32  |[Heisenberg, 33, a, b]|
+----+----------------------+

>>> department = spark.createDataFrame([(31, ['Raffery', 31, 'c', 'b']), (33, ['Jones', 33, 'a', 'b']), (34, ['Heisenberg', 33, 'a', 'b'])], ["id_d", "list_d"])
>>> department.show(truncate=False)
+----+----------------------+
|id_d|list_d                |
+----+----------------------+
|31  |[Raffery, 31, c, b]   |
|33  |[Jones, 33, a, b]     |
|34  |[Heisenberg, 33, a, b]|
+----+----------------------+

Join these on what I assume is the user id:

>>> joined = employee.join(department, employee.id_e == department.id_d)
>>> joined.show()
+----+-------------------+----+-------------------+
|id_e|             list_e|id_d|             list_d|
+----+-------------------+----+-------------------+
|  31|[Raffery, 31, a, b]|  31|[Raffery, 31, c, b]|
|  33|  [Jones, 33, 1, b]|  33|  [Jones, 33, a, b]|
+----+-------------------+----+-------------------+

Then map each joined row to the indices of the list elements that differ between the two dataframes.
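A minimal sketch of the mapping step described above, written in plain Python so that it runs without a Spark session. The helper name `diff_indices` and the dictionary-based stand-in for the join are assumptions made for illustration, not code from the original answer.

```python
def diff_indices(left, right):
    """Return the positions at which two lists hold different values."""
    # Compare position by position over the shared length.
    diffs = [i for i, (a, b) in enumerate(zip(left, right)) if a != b]
    # Positions that exist in only one of the lists also count as differences.
    diffs.extend(range(min(len(left), len(right)), max(len(left), len(right))))
    return diffs


# Sample records from the question, keyed by id (hypothetical stand-in for the RDDs).
employee = {
    31: ['Raffery', 31, 'a', 'b'],
    33: ['Jones', 33, '1', 'b'],
    32: ['Heisenberg', 33, 'a', 'b'],
}
department = {
    31: ['Raffery', 31, 'c', 'b'],
    33: ['Jones', 33, 'a', 'b'],
    34: ['Heisenberg', 33, 'a', 'b'],
}

# Inner join on the key: only ids present in both inputs are compared.
mismatches = {
    key: diff_indices(employee[key], department[key])
    for key in sorted(employee.keys() & department.keys())
}
print(mismatches)  # {31: [2], 33: [2]}
```

Under the same assumptions, the helper could be applied to the joined dataframe with `joined.rdd.map(lambda row: (row.id_e, diff_indices(row.list_e, row.list_d)))`, or directly to the two pair RDDs from the question with `employee.join(department).mapValues(lambda pair: diff_indices(*pair))`.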

Hope this helps, good luck.