Looking up values in a PySpark pair RDD using keys from another RDD

Time: 2019-05-14 13:07:41

Tags: pyspark key-value

I'm new to PySpark and would like to do the following.

Consider this code:

import numpy as np
b =np.array([[1,2,100],[3,4,200],[5,6, 300],[7,8, 400]])
a = np.array([[1,2],[3,4],[11,6],[7,8], [1, 2], [7,8]])
RDDa = sc.parallelize(a)
RDDb = sc.parallelize(b)
dsmRDD = RDDb.map(lambda x: (list(x[:2]), x[2]))

I want to use each row of RDDa as a key into dsmRDD and get the value associated with it (0 where the key is missing), i.e.

result = [100, 200, 0, 400, 100, 400] 
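
In plain Python terms, the requested lookup amounts to a dictionary `get` with a default of 0. The sketch below is only a reference for the expected semantics, not a Spark solution; the names `table` and `expected` are made up for illustration. It also uses tuples as keys, since lists such as `list(x[:2])` are not hashable and so are likely to fail in key-based operations like joins.

```python
# Plain-Python reference for the requested lookup (no Spark involved).
a = [[1, 2], [3, 4], [11, 6], [7, 8], [1, 2], [7, 8]]
b = [[1, 2, 100], [3, 4, 200], [5, 6, 300], [7, 8, 400]]

# Tuples are hashable, so they can serve as dictionary keys (lists cannot).
table = {(k1, k2): value for k1, k2, value in b}

# Look up each row of `a`; fall back to 0 when the key is absent.
expected = [table.get(tuple(row), 0) for row in a]
print(expected)  # [100, 200, 0, 400, 100, 400]
```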

Thank you very much.

2 Answers:

Answer 0 (score: 0)

If the data is not too large, you can convert the RDDs to DataFrames and join them.



Answer 1 (score: 0)

As the other answer suggests, you can convert to DataFrames and join. If you would rather stay with RDDs, you can do it like this:

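import numpy as np

# `sc` is the SparkContext, e.g. the one the pyspark shell provides.
a = np.array([[1, 2], [3, 4], [11, 6], [7, 8], [1, 2], [7, 8]])
b = np.array([[1, 2, 100], [3, 4, 200], [5, 6, 300], [7, 8, 400]])

RDDa = sc.parallelize(a)
RDDb = sc.parallelize(b)

# Key both RDDs by a hashable tuple. Each RDDa row carries (default, index):
# 0 is the value used for missing keys, and the index restores the order later.
dsmRDD = RDDa.zipWithIndex()\
         .map(lambda x: (tuple(x[0].tolist()), (0, x[1])))\
         .leftOuterJoin(RDDb.map(lambda x: (tuple(x[:2].tolist()), int(x[2]))))\
         .map(lambda x: (x[1][0][1], x[1][1]) if x[1][1] is not None else (x[1][0][1], x[1][0][0]))

# Sort by the original index and keep only the looked-up values.
output = [value for _, value in sorted(dsmRDD.collect())]
print(output)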

which gives you the output:

[100, 200, 0, 400, 100, 400]