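I'm new to Spark and I'm trying to build a dictionary like this:
{4: {'aenr': ['earn', 'rane'], 'aerr': ['rare', 'rear'], 'aenw': ['anew', 'wane', 'wean'], 'derw': ['drew']}}
Basically, this is the structure I want to produce with Spark:
{len(word): {sorted(word): [word1, word2, etc]}}
I have a large file of English words, structured as follows: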
{
"biennials": 0,
"tripolitan": 0,
"oblocutor": 0,
"leucosyenite": 0,
"chilitis": 0,
"fabianist": 0,
"diazeutic": 0,
"alible": 0,
"deciet":0
}
So I want to read the file line by line and create an RDD that holds this:
{len(word): {sorted(word): [word1, word2, etc]}}
This is what I have tried:
r = rdd.map(lambda x: {len(x):sorted(x)})
items = r.flatMap(lambda line: (line.items()))
items.take(items.count())
groupedItems = items.groupByKey().mapValues(list)
groupedItems.take(groupedItems.count())#j = filter2_rdd
d = groupedItems.collectAsMap()
But this prints the following:
[
{1: {u'{': [u'{']}},
{9: {u'abeiilnns': [u' "biennials": 0, ']}},
{10: {u'aiilnoprtt': [u' "tripolitan": 0, ']}},
{9: {u'bclooortu': [u' "oblocutor": 0, ']}},
{12: {u'ceeeilnostuy': [u' "leucosyenite": 0, ']}},
{8: {u'chiiilst': [u' "chilitis": 0, ']}},
{9: {u'aabfiinst': [u' "fabianist": 0, ']}},
{9: {u'acdeiituz': [u' "diazeutic": 0, ']}},
{6: {u'abeill': [u' "alible": 0, ']}},
{6: {u'cdeeit': [u' "deciet":0,']}},
{5: {u'doosw': [u' "woods": 4601, ']}},
{14: {u'adeejmnnoprrtu': [u' "preadjournment": 0, ']}},
{7: {u'deiprss': [u' "spiders": 0, ']}},
{9: {u'aabfiimns': [u' "fabianism": 0, ']}},
{11: {u'cdgilnoostu': [u' "outscolding": 0, ']}},
{10: {u'eeilprrsty': [u' "sperrylite": 0, ']}},
{8: {u'agilnrtw': [u' "trawling": 0, ']}},
{13: {u'acdeimmoprrsu': [u' "cardiospermum": 0, ']}},
{10: {u'gghhiilttt': [u' "lighttight": 0, ']}},
{7: {u'deiprsy': [u' "spidery": 0, ']}}
]
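I need the words grouped by length, with all the words that share the same sorted letters collected into one list.
Answer 0 (score: 0)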
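You can't map() straight to len() and sorted(), because you lose the original value. Here is one way to do it:
1. map to create the key sorted(x)
2. groupByKey on sorted(x)
3. map to create the key len(x)
4. groupByKey on len(x)
5. collectAsMap()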
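To make the two grouping passes concrete, the steps above can be sketched in plain Python without Spark. This is only an illustration of the logic; the function name group_anagrams_by_length is made up for this sketch and is not part of any library.

```python
from collections import defaultdict

def group_anagrams_by_length(words):
    """Plain-Python mirror of the two grouping passes (illustrative only)."""
    # First pass: key each word by its sorted letters and group on that key,
    # keeping the original word as the value so it is not lost.
    by_signature = defaultdict(list)
    for word in words:
        by_signature[''.join(sorted(word))].append(word)

    # Second pass: key each (signature, words) pair by its length and group again.
    by_length = defaultdict(dict)
    for signature, group in by_signature.items():
        by_length[len(signature)][signature] = group

    # Return plain dicts, similar in spirit to what collectAsMap() hands back.
    return {length: dict(groups) for length, groups in by_length.items()}

result = group_anagrams_by_length(['earn', 'rane', 'rare', 'rear', 'drew'])
print(result)
```

The sorted letters have the same length as the word itself, which is why the second pass can take the length from the key.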
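If you want to print the result, you may need to convert the ResultIterable values to a concrete Python type such as list or dict.
For example (assuming you have already parallelized all the words into rdd):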
In []:
(rdd
.map(lambda x: (''.join(sorted(x)), x))
.groupByKey()
.mapValues(lambda x: list(x))
.map(lambda x: (len(x[0]), x))
.groupByKey()
.mapValues(lambda x: dict(x))
.collectAsMap())
Out[]:
{6: {'abeill': ['alible'], 'cdeeit': ['deciet']},
8: {'chiiilst': ['chilitis']},
9: {'aabfiinst': ['fabianist'],
'abeiilnns': ['biennials'],
'acdeiituz': ['diazeutic'],
'bclooortu': ['oblocutor']},
10: {'aiilnoprtt': ['tripolitan']},
12: {'ceeeilnostuy': ['leucosyenite']}}
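The example above assumes rdd already contains bare words. The output in the question suggests the raw lines of the file (including the quotes, the counts and the braces) were being used instead, so the words need to be extracted first. A minimal sketch of one way to do that follows; the helper extract_word, the pattern WORD_LINE and the file name in the comment are assumptions made for illustration, not part of the original code.

```python
import re

# Hypothetical pattern: matches lines such as '  "biennials": 0,' and captures the word.
WORD_LINE = re.compile(r'^\s*"([^"]+)"\s*:')

def extract_word(line):
    """Return the quoted key on a line, or None for lines such as '{' or '}'."""
    match = WORD_LINE.match(line)
    return match.group(1) if match else None

lines = ['{', '  "biennials": 0,', '  "deciet":0', '}']
words = [w for w in map(extract_word, lines) if w is not None]
print(words)

# With Spark, assuming a SparkContext named sc and a file called words.json:
# rdd = sc.textFile('words.json').map(extract_word).filter(lambda w: w is not None)
```

If the file is small enough to fit in memory on the driver, parsing it once with json.load and passing the keys to sc.parallelize would be another option.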