I have two RDDs. For example:
employee = [(31, ['Raffery', 31, 'a', 'b']),
            (33, ['Jones', 33, '1', 'b']),
            (32, ['Heisenberg', 33, 'a', 'b']),
            (37, ['Robinson', 34, 'c', 'cc']),
            (38, ['Smith', 34, 'a', 'b'])]
department = [(31, ['Raffery', 31, 'c', 'b']),
              (33, ['Jones', 33, 'a', 'b']),
              (34, ['Heisenberg', 33, 'a', 'b'])]
For each key, I want to compare the elements of the first RDD with those of the second.
The output should look like:
31, fault at e[1][2]
33, fault at e[1][2]
Answer 0 (score: 1)
I'm not sure how strict the output format needs to be, but the following gets you most of the way there:
Using PySpark DataFrames:
First, create the two DataFrames:
>>> employee = spark.createDataFrame([(31, ['Raffery', 31, 'a', 'b']), (33, ['Jones', 33, '1', 'b']), (32, ['Heisenberg', 33, 'a', 'b'])], ["id_e", "list_e"])
>>> employee.show()
+----+----------------------+
|id_e|list_e                |
+----+----------------------+
|31  |[Raffery, 31, a, b]   |
|33  |[Jones, 33, 1, b]     |
|32  |[Heisenberg, 33, a, b]|
+----+----------------------+
>>> department = spark.createDataFrame([(31, ['Raffery', 31, 'c', 'b']), (33, ['Jones', 33, 'a', 'b']), (34, ['Heisenberg', 33, 'a', 'b'])], ["id_d", "list_d"])
>>> department.show()
+----+----------------------+
|id_d|list_d                |
+----+----------------------+
|31  |[Raffery, 31, c, b]   |
|33  |[Jones, 33, a, b]     |
|34  |[Heisenberg, 33, a, b]|
+----+----------------------+
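Then join them on the ID column, which I assume is the user ID, so that the list elements not shared between the two DataFrames can be mapped to their indices: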
>>> joined = employee.join(department, employee.id_e == department.id_d)
>>> joined.show()
+----+-------------------+----+-------------------+
|id_e|             list_e|id_d|             list_d|
+----+-------------------+----+-------------------+
|  31|[Raffery, 31, a, b]|  31|[Raffery, 31, c, b]|
|  33|  [Jones, 33, 1, b]|  33|  [Jones, 33, a, b]|
+----+-------------------+----+-------------------+
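The join puts both lists for each shared ID on one row; what remains is finding the positions where they differ. Below is a minimal sketch of that comparison in plain Python, using the sample data from the question. The helper name `diff_indices` is hypothetical, and the dictionaries only stand in for the joined rows so that the logic can be checked without a Spark session.

```python
def diff_indices(left, right):
    """Return the positions at which two lists differ (compared pairwise)."""
    return [i for i, (a, b) in enumerate(zip(left, right)) if a != b]


# Sample data from the question, keyed by ID (stand-ins for the two inputs).
employee = {
    31: ['Raffery', 31, 'a', 'b'],
    33: ['Jones', 33, '1', 'b'],
    32: ['Heisenberg', 33, 'a', 'b'],
}
department = {
    31: ['Raffery', 31, 'c', 'b'],
    33: ['Jones', 33, 'a', 'b'],
    34: ['Heisenberg', 33, 'a', 'b'],
}

# Inner join on the key: only IDs present on both sides are compared.
shared_keys = sorted(employee.keys() & department.keys())
mismatches = {k: diff_indices(employee[k], department[k]) for k in shared_keys}

# Keep only the keys whose lists actually differ somewhere.
mismatches = {k: idx for k, idx in mismatches.items() if idx}

for key, idx in mismatches.items():
    print(key, idx)
# 31 [2]
# 33 [2]
```

With pair RDDs, the same helper could presumably be applied after the join, along the lines of `employee.join(department).mapValues(lambda pair: diff_indices(*pair))`, since `join` yields `(key, (left_value, right_value))` tuples. I have not run that variant against the data here.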
Hope this helps, and good luck.