This is probably simple for most Python users. I have a list of lists:
adv1 = [9999, 'Group1', 12345, 'team1']
adv2 = [8888, 'Group2', 12341, 'team2']
adv3 = [8888, 'Group2', 46563, 'team3']
adv4 = [8888, 'Group2', 23478, 'team4']
all_adv = [adv1, adv2, adv3, adv4] # <- list of lists
There are 2 'groups' and 4 'teams' in the data. I am trying to iterate through the all_adv data to build a dictionary that groups all of the teams under a 'teams' key for their respective group. I can't figure out the logic. I want the output to look something like this:
{
    Group1 : 9999,
    teams: {
        team1 : 12345
    }
},
{
    Group2 : 8888,
    teams: {
        team2 : 12341,
        team3 : 46563,
        team4 : 23478
    }
}
I'm not sure, but I think I need a list of individual group dictionaries, each containing a dictionary of its teams. Any pointers?
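For reference, the list-of-group-dictionaries shape described in the question could be built roughly as follows. This is only a sketch under assumptions: each row is taken to be [group_id, group_name, team_id, team_name], and the helper name `group_as_list` is hypothetical, not something from the question.

```python
def group_as_list(rows):
    """Return one dict per group, each holding the group's ID and its teams."""
    by_name = {}  # group_name -> that group's dict (dicts keep insertion order)
    for group_id, group_name, team_id, team_name in rows:
        if group_name not in by_name:
            # First row for this group: store its ID and an empty teams dict.
            by_name[group_name] = {group_name: group_id, 'teams': {}}
        by_name[group_name]['teams'][team_name] = team_id
    return list(by_name.values())


all_adv = [
    [9999, 'Group1', 12345, 'team1'],
    [8888, 'Group2', 12341, 'team2'],
    [8888, 'Group2', 46563, 'team3'],
    [8888, 'Group2', 23478, 'team4'],
]
groups = group_as_list(all_adv)
```

One caveat with this shape: because the group name is itself used as a key next to 'teams', a group that happened to be named 'teams' would collide with it.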
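Answer 0 (score: 3)
I would suggest using a dictionary of dictionaries (rather than a list of dictionaries), keyed by group name, where each value is another dictionary holding the group's ID and its teams.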
adv1 = [9999, 'Group1', 12345, 'team1']
adv2 = [8888, 'Group2', 12341, 'team2']
adv3 = [8888, 'Group2', 46563, 'team3']
adv4 = [8888, 'Group2', 23478, 'team4']
all_adv = [adv1, adv2, adv3, adv4]
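d = {}
for i, n, s, t in all_adv:  # i: group id, n: group name, s: team id, t: team name
    if n not in d:
        # first time this group is seen: create its entry
        d[n] = {'id':i, 'teams':{}}
    d[n]['teams'][t] = s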
Result:
>>> import pprint
>>> pprint.pprint(d, width=30)
{'Group1': {'id': 9999,
            'teams': {'team1': 12345}},
 'Group2': {'id': 8888,
            'teams': {'team2': 12341,
                      'team3': 46563,
                      'team4': 23478}}}
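The membership check in the loop above can also be folded into `dict.setdefault`. The following is a minimal sketch of that variant, not part of the original answer; the longer variable names are my own choice for readability.

```python
all_adv = [
    [9999, 'Group1', 12345, 'team1'],
    [8888, 'Group2', 12341, 'team2'],
    [8888, 'Group2', 46563, 'team3'],
    [8888, 'Group2', 23478, 'team4'],
]

d = {}
for group_id, group_name, team_id, team_name in all_adv:
    # setdefault returns the existing entry for this group,
    # or stores the given default and returns it.
    entry = d.setdefault(group_name, {'id': group_id, 'teams': {}})
    entry['teams'][team_name] = team_id
```

The trade-off is that the default dictionary is constructed on every iteration even when the group already exists, which is harmless for data of this size.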
Answer 1 (score: 1)
This might also be a solution:
from itertools import groupby
from operator import itemgetter
adv1 = [9999, 'Group1', 12345, 'team1']
adv2 = [8888, 'Group2', 12341, 'team2']
adv3 = [8888, 'Group2', 46563, 'team3']
adv4 = [8888, 'Group2', 23478, 'team4']
all_adv = [adv1, adv2, adv3, adv4]
# Map each group name to its ID
group_id_map = {ii[1]: ii[0] for ii in all_adv}
# groupby only merges adjacent rows, so sort by group name first
all_adv.sort(key=itemgetter(1))
groups = {}
for k, r in groupby(all_adv, key=itemgetter(1)):
    teams = {ii[3]: ii[2] for ii in r}
    group = dict(id=group_id_map[k], team=teams)
    groups[k] = group
Result:
import json
print(json.dumps(groups, indent=4))
{
    "Group1": {
        "id": 9999,
        "team": {
            "team1": 12345
        }
    },
    "Group2": {
        "id": 8888,
        "team": {
            "team2": 12341,
            "team3": 46563,
            "team4": 23478
        }
    }
}
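A note on why the `sort` call matters here: `itertools.groupby` only merges consecutive items that share a key. The sketch below uses a small, deliberately unordered sample (my own, not the question's data) to show what happens without sorting.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    [8888, 'Group2', 12341, 'team2'],
    [9999, 'Group1', 12345, 'team1'],
    [8888, 'Group2', 46563, 'team3'],
]

# Without sorting, 'Group2' shows up as two separate runs.
unsorted_keys = [k for k, _ in groupby(rows, key=itemgetter(1))]

# After sorting by the same key, each group name appears exactly once.
sorted_rows = sorted(rows, key=itemgetter(1))
sorted_keys = [k for k, _ in groupby(sorted_rows, key=itemgetter(1))]

print(unsorted_keys)  # ['Group2', 'Group1', 'Group2']
print(sorted_keys)    # ['Group1', 'Group2']
```

With unsorted input, a loop that assigns `groups[k] = group` would overwrite the first 'Group2' run with the second, silently dropping teams.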
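If I could choose the type of adv1, adv2, ..., I would use a namedtuple instead of a list, because named fields are easier to work with than positional indexes.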
from collections import namedtuple
from itertools import groupby
from operator import attrgetter
Team = namedtuple('Team', 'group_id group_name team_id team_name')
adv1 = [9999, 'Group1', 12345, 'team1']
adv2 = [8888, 'Group2', 12341, 'team2']
adv3 = [8888, 'Group2', 46563, 'team3']
adv4 = [8888, 'Group2', 23478, 'team4']
all_adv = [adv1, adv2, adv3, adv4]
# Convert each row into a Team so fields can be accessed by name
all_adv = [Team(*ii) for ii in all_adv]
group_id_map = {ii.group_name: ii.group_id for ii in all_adv}
# groupby only merges adjacent rows, so sort by group name first
all_adv.sort(key=attrgetter('group_name'))
groups = {}
for k, r in groupby(all_adv, key=attrgetter('group_name')):
    teams = {ii.team_name: ii.team_id for ii in r}
    group = dict(id=group_id_map[k], team=teams)
    groups[k] = group
The result should be the same as with the list-based version.
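To check that claim, the two variants can be wrapped in functions and compared. This is a rough sketch: the function names `group_lists` and `group_tuples` are hypothetical, and both use `sorted` rather than sorting in place so the input list is left untouched.

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter, itemgetter

Team = namedtuple('Team', 'group_id group_name team_id team_name')


def group_lists(rows):
    """Group plain lists by positional index."""
    rows = sorted(rows, key=itemgetter(1))
    ids = {r[1]: r[0] for r in rows}
    return {k: dict(id=ids[k], team={r[3]: r[2] for r in g})
            for k, g in groupby(rows, key=itemgetter(1))}


def group_tuples(rows):
    """Group the same rows after converting them to Team namedtuples."""
    teams = sorted((Team(*r) for r in rows), key=attrgetter('group_name'))
    ids = {t.group_name: t.group_id for t in teams}
    return {k: dict(id=ids[k], team={t.team_name: t.team_id for t in g})
            for k, g in groupby(teams, key=attrgetter('group_name'))}


all_adv = [
    [9999, 'Group1', 12345, 'team1'],
    [8888, 'Group2', 12341, 'team2'],
    [8888, 'Group2', 46563, 'team3'],
    [8888, 'Group2', 23478, 'team4'],
]
same = group_lists(all_adv) == group_tuples(all_adv)
print(same)  # True
```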