Question

I have an RDD of strings in Scala. Each string is an id. It would look something like this.

I have another RDD of (id, name) pairs.

(1, Name1)
(2, Name2)
(3, Name3)
(4, Name4)
(5, Name5)
(6, Name6)

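Now I want to get the names for all the ids in the first RDD. How can I do that?

I realize that if the first RDD were a pair RDD, I could join the two RDDs. So why is the join operation only available on pair RDDs?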

Answer 1

Try this:

rdd1.map(x => (x, null)).join(rdd2).mapValues(x => x._2)
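
For context, `join` matches records by key, which is why the bare ids first have to be turned into (key, value) pairs with a placeholder value. Below is a minimal self-contained sketch of the same idea, assuming a local `SparkContext`, string ids on both sides, and made-up sample ids. The names `JoinSketch`, `namesForIds`, and `runLocal` are hypothetical and not from the original answer, and `()` is used as the placeholder instead of `null`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JoinSketch {
  // Hypothetical helper: returns the (id, name) pairs whose id occurs in `ids`.
  def namesForIds(ids: RDD[String],
                  names: RDD[(String, String)]): RDD[(String, String)] =
    ids
      .map(id => (id, ()))   // turn each id into a pair so that join is available
      .join(names)           // RDD[(id, ((), name))]
      .mapValues(_._2)       // drop the placeholder, keep the name

  // Runs the helper on a local SparkContext with assumed sample data.
  def runLocal(): List[(String, String)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("join-sketch")
    val sc = new SparkContext(conf)
    try {
      val ids = sc.parallelize(Seq("1", "3", "5")) // assumed ids, not from the question
      val names = sc.parallelize(Seq(
        "1" -> "Name1", "2" -> "Name2", "3" -> "Name3",
        "4" -> "Name4", "5" -> "Name5", "6" -> "Name6"))
      namesForIds(ids, names).collect().toList.sortBy(_._1)
    } finally {
      sc.stop()
    }
  }
}
```

Note that if the same id appears more than once in the first RDD, the join emits one output pair per occurrence; calling `distinct()` on the ids first would avoid that.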

Answer 2

Based on your comment to CafeFeeds, if the ids RDD is small enough, you could consider a "broadcast join".

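The advantage of this is that the names RDD does not need to be shuffled, so if it is significantly larger than the ids RDD, you reduce the amount of work that has to be done considerably. Other than that, the simple join approach is best.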
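
A broadcast join along these lines might look like the following sketch. It is not the answerer's code: it assumes the ids fit in memory on the driver and on every executor, that ids are strings on both sides, and it uses made-up sample ids. The names `BroadcastJoinSketch`, `namesForIds`, and `runLocal` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastJoinSketch {
  // Hypothetical helper: filters `names` against a broadcast set of ids.
  def namesForIds(ids: RDD[String],
                  names: RDD[(String, String)]): RDD[(String, String)] = {
    // Assumption: the id set is small enough to collect to the driver.
    val idSet = ids.sparkContext.broadcast(ids.collect().toSet)
    // Each partition of `names` is filtered locally, so `names` is not shuffled.
    names.filter { case (id, _) => idSet.value.contains(id) }
  }

  // Runs the helper on a local SparkContext with assumed sample data.
  def runLocal(): List[(String, String)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-join-sketch")
    val sc = new SparkContext(conf)
    try {
      val ids = sc.parallelize(Seq("1", "3", "5")) // assumed ids, not from the question
      val names = sc.parallelize(Seq(
        "1" -> "Name1", "2" -> "Name2", "3" -> "Name3",
        "4" -> "Name4", "5" -> "Name5", "6" -> "Name6"))
      namesForIds(ids, names).collect().toList.sortBy(_._1)
    } finally {
      sc.stop()
    }
  }
}
```

Because the ids are collected into a set, duplicate ids in the first RDD do not produce duplicate output here, which differs slightly from the behaviour of a regular `join`.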

String RDD join operation