Question

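This is a variant of this question that was asked for Pandas. My situation is similar, except that I am using spark-shell or pyspark.

I have a data frame containing a list of domains (the vertices):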

index            domain
0            airbnb.com
1          facebook.com
2                st.org
3              index.co
4        crunchbase.com
5               avc.com
6        techcrunch.com
7            google.com

I have another data frame containing the connections between these domains (the edges):

           source_domain    destination_domain
              airbnb.com            google.com
            facebook.com            google.com
                  st.org          facebook.com
                  st.org            airbnb.com
                  st.org        crunchbase.com
                index.co        techcrunch.com
          crunchbase.com        techcrunch.com
          crunchbase.com            airbnb.com
                 avc.com        techcrunch.com
          techcrunch.com                st.org
          techcrunch.com            google.com
          techcrunch.com          facebook.com

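How can I replace each cell in the edges data frame with the corresponding index from the domains (i.e. vertices) data frame? For example, the facebook.com to google.com row of the edges data frame should end up looking like this: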

###### Before: ##################### 
           facebook.com google.com   
###### After:  #####################   
           1            7

The data frames will grow to at least a few hundred GB.

How can I do this in Spark?

Answer 1

TL;DR Save the two datasets as separate CSV files, vertices.csv and edges.csv, then read them and join.

// load the datasets
val vertices = spark.read.option("header", true).csv("vertices.csv")
val edges = spark.read.option("header", true).csv("edges.csv")

// indexify the source_domain
val sources = edges.
  join(vertices).
  where(edges("source_domain") === vertices("domain")).
  withColumnRenamed("index", "source_index")

// indexify the destination_domain
val destinations = edges.
  join(vertices).
  where(edges("destination_domain") === vertices("domain")).
  withColumnRenamed("index", "destination_index")

val result = sources.
  join(destinations, Seq("source_domain", "destination_domain")).
  select("source_index", "destination_index")

scala> result.show
+------------+-----------------+
|source_index|destination_index|
+------------+-----------------+
|           0|                7|
|           1|                7|
|           2|                1|
|           2|                0|
|           2|                4|
|           3|                6|
|           4|                6|
|           4|                0|
|           5|                6|
|           6|                2|
|           6|                7|
|           6|                1|
+------------+-----------------+
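
As a possible variation (a sketch, not part of the original answer): the final join of sources and destinations on both domain columns adds a third join, and if edges.csv ever contained the same pair more than once, that pair would be multiplied in the result. Chaining the two lookups directly on edges avoids both. The in-memory rows, the local SparkSession, and the names src and dst below are assumptions for illustration only; in spark-shell the spark value already exists and the data would come from the CSV files as above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("indexify-edges").getOrCreate()
import spark.implicits._

// Tiny in-memory stand-ins for vertices.csv and edges.csv (assumed sample rows).
val vertices = Seq((0, "airbnb.com"), (1, "facebook.com"), (7, "google.com")).toDF("index", "domain")
val edges = Seq(("airbnb.com", "google.com"), ("facebook.com", "google.com")).toDF("source_domain", "destination_domain")

// Two renamed copies of the lookup table, one per edge column,
// so each join can use a plain column-name key without ambiguity.
val src = vertices.select(col("index").as("source_index"), col("domain").as("source_domain"))
val dst = vertices.select(col("index").as("destination_index"), col("domain").as("destination_domain"))

// Look up the source index, then the destination index, on the same edge rows.
val result = edges.
  join(src, Seq("source_domain")).
  join(dst, Seq("destination_domain")).
  select("source_index", "destination_index")

result.show()
```

Both joins here are inner joins, so an edge whose domain is missing from the vertices table is silently dropped, just as in the version above. If the vertices table is small enough to fit in executor memory, wrapping src and dst in broadcast(...) from org.apache.spark.sql.functions may avoid shuffling the large edges table; whether that holds at a few hundred GB depends on how many distinct domains there are.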
