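First of all, I understand the concept of persistent data structures and the immutability of RDDs. "Update" is just the only word I could think of :)
My question is:
Given an RDD of dictionaries (or Row objects), how do I loop/map over it, apply some transformation logic, and get back a new RDD with those transformations applied? For example:
Given an RDD containing dictionaries: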
fbb = sc.parallelize(
    [{'amount_gbp': -43.33,
      'balance_gbp': 57.08,
      'type': 'GED',
      'id': 961690979,
      'settled_jrnl_cr_datetime': u'(null)',
      'virtual_cash_balance': 0,
      'virtual_debt_balance': 0},
     {'amount_gbp': 17.08,
      'balance_gbp': 40.0,
      'type': 'OIP',
      'id': 962182953,
      'settled_jrnl_cr_datetime': u'(null)',
      'virtual_cash_balance': 0,
      'virtual_debt_balance': 0}])
I tried to apply this function:
def update_virtual_cash_balance(x):
    x.update({'virtual_cash_balance': x['amount_gbp'] + x['balance_gbp']}) if x['type'] == 'GED' else x

fbb.map(lambda x: update_virtual_cash_balance(x)).collect()
Expecting:
[{'amount_gbp': -43.33,
  'balance_gbp': 57.08,
  'type': 'GED',
  'id': 961690979,
  'settled_jrnl_cr_datetime': u'(null)',
  'virtual_cash_balance': 13.75,
  'virtual_debt_balance': 0},
 {'amount_gbp': 17.08,
  'balance_gbp': 40.0,
  'type': 'OIP',
  'id': 962182953,
  'settled_jrnl_cr_datetime': u'(null)',
  'virtual_cash_balance': 0,
  'virtual_debt_balance': 0}]
But got:
Out[411]: [None, None]
Any help with what I am misunderstanding would be great.
Answer 0 (score: 1)
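update_virtual_cash_balance does not return anything, so you get None for every element.
The dict update method does not return anything either (it modifies the dictionary in place and returns None), so even if you returned its result from update_virtual_cash_balance you would still get None.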
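A quick way to see the second point in plain Python, without Spark. The dictionary `d` below is a made-up record used only for illustration:

```python
# Hypothetical sample record, for illustration only.
d = {'type': 'GED', 'virtual_cash_balance': 0}

# dict.update mutates d in place; its return value is None.
result = d.update({'virtual_cash_balance': 13.75})

print(result is None)             # the call itself yields None
print(d['virtual_cash_balance'])  # the dict was changed in place
```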
Try:
def update_virtual_cash_balance(x):
    if x['type'] == 'GED':
        z = x.copy()  # shallow copy should be enough here
        z.update({'virtual_cash_balance': x['amount_gbp'] + x['balance_gbp']})
        return z
    return x
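As a rough local check, the same function can be mapped over a plain list. This is only a sketch: the built-in `map` stands in for `fbb.map` (an assumption, so that no SparkContext is needed), and the records are trimmed to the keys the function touches. With the RDD from the question, the equivalent call would presumably be `fbb.map(update_virtual_cash_balance).collect()`.

```python
def update_virtual_cash_balance(x):
    """Return a new dict with the balance set for 'GED' rows; other rows are returned as-is."""
    if x['type'] == 'GED':
        z = x.copy()  # shallow copy, so the input record is not mutated
        z['virtual_cash_balance'] = x['amount_gbp'] + x['balance_gbp']
        return z
    return x

# Trimmed-down copies of the question's records (illustrative subset of keys).
records = [
    {'amount_gbp': -43.33, 'balance_gbp': 57.08, 'type': 'GED', 'virtual_cash_balance': 0},
    {'amount_gbp': 17.08, 'balance_gbp': 40.0, 'type': 'OIP', 'virtual_cash_balance': 0},
]

# Built-in map used here in place of RDD.map (assumption for local testing).
updated = list(map(update_virtual_cash_balance, records))

for row in updated:
    print(row['type'], round(row['virtual_cash_balance'], 2))
```

Returning a copy rather than mutating `x` keeps the function free of side effects, which fits the immutability of RDDs mentioned in the question.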