Question

I have a Spark RDD of JSON objects (productList) in the following format.

{u'name': u'product_id', u'price': 12, u'quantity': 1}

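Now I want to map this to another RDD that contains only 'product_id' and total_amount, where total_amount is price * quantity. The following produces a list of total amounts. But how can I also map the product_id along with the total amount?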

total_amount_list = productList.map(lambda x: x['price']*x['quantity'])

Answer 1

Something like this?

productList = sc.parallelize([
    {u'name': u'product_id', u'price': 12, u'quantity': 1}])

productList.map(
    lambda x: {'name': x['name'],  'total': x['price'] * x['quantity']}
).first()

## {'name': 'product_id', 'total': 12}
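The function passed to map is ordinary Python, so the projection can be checked without a cluster. The sketch below is only an illustration: the names to_total and to_pair and the second sample record are assumptions, not part of the original answer. It shows both the dict form used above and a (name, total) tuple form, which is the shape pair-RDD operations such as reduceByKey expect.

```python
# Hypothetical helper names; the same functions could be passed to productList.map(...).

def to_total(record):
    """Keep only the name and add total = price * quantity."""
    return {'name': record['name'], 'total': record['price'] * record['quantity']}


def to_pair(record):
    """Same projection, as a (name, total) tuple."""
    return (record['name'], record['price'] * record['quantity'])


# Sample records; the second one is made up for illustration.
products = [
    {'name': 'product_id', 'price': 12, 'quantity': 1},
    {'name': 'product_id2', 'price': 5, 'quantity': 3},
]

# Built-in map stands in for RDD.map here.
totals = list(map(to_total, products))
pairs = list(map(to_pair, products))

print(totals)
print(pairs)
```

On an actual RDD, productList.map(to_pair) would give a pair RDD keyed by product name, which is convenient if totals later need to be summed per product.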

If your input data is a JSONL file, you should consider using DataFrames:

from pyspark.sql.functions import col

s = (
    '{"quantity": 1, "name": "product_id", "price": 12}\n'
    '{"quantity": 3, "name": "product_id2", "price": 5}'
)

with open('/tmp/test.jsonl', 'w') as fw:
    fw.write(s)

df = sqlContext.read.json('/tmp/test.jsonl')
df.withColumn('total', col('price') * col('quantity'))
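To make the JSONL input and the derived column concrete, here is a plain-Python sketch of the same computation using only the standard json module. It is not a substitute for the DataFrame code; the variable names jsonl, rows, and with_total are assumptions chosen for illustration.

```python
import json

# Same two JSONL records as in the answer: one JSON object per line.
jsonl = (
    '{"quantity": 1, "name": "product_id", "price": 12}\n'
    '{"quantity": 3, "name": "product_id2", "price": 5}'
)

# Parse each non-empty line into a dict.
rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]

# Add a 'total' field to a copy of each row, mirroring what
# withColumn('total', price * quantity) adds to each DataFrame row.
with_total = [dict(row, total=row['price'] * row['quantity']) for row in rows]

for row in with_total:
    print(row)
```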

How to map a Spark RDD of JSON objects to another RDD containing objects with only a selected set of attributes
