Question

I am new to PySpark. I have 2 tables: (1) an index table and (2) a value table, as shown in the image.

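I would like to know an efficient way to:

Run a scan on table 1 and get the index
Run a scan on table 2 and get the value that corresponds to the given index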

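I also have several such (Key - Index) tables and (Index - Value) tables. Please let me know the most efficient, idiomatic PySpark way to run these scans. One approach I know of is: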

rdd1 = scan 'table1' {FILTER => key = 'some value'} # Will get Index values
rdd2 = scan 'table2', {STARTROW => The Results of table 1}

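So if rdd1 returns 10 rows, the values in the Index field of those 10 rows are used to scan table2 and fetch the values from it. That means I run 10 scans on table2 one after another, which ends up taking a lot of time. I would like to know a way to parallelize the scans on table2. Trying rdd1.map(lambda x: scan table2 gave me an error, because I end up running a scan inside a scan, which is not allowed. If you think another approach would be more efficient, please suggest it. Thanks.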

Answer 1

An efficient and simple approach is to use DataFrames instead of RDDs.

Suppose you have data like this:

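from pyspark.sql import SparkSession

# reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

table1 = [(1,'A'),(2,'B'),(3,'C'),(4,'B')]
table2 = [('A',10),('B',20),('D',30),('E',40)]

# create the DataFrames from the data
df1 = spark.createDataFrame(table1,schema=['k1','v1'])
df2 = spark.createDataFrame(table2,schema=['k2','v2'])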

df1.show()
+---+---+
| k1| v1|
+---+---+
|  1|  A|
|  2|  B|
|  3|  C|
|  4|  B|
+---+---+

df2.show()
+---+---+
| k2| v2|
+---+---+
|  A| 10|
|  B| 20|
|  D| 30|
|  E| 40|
+---+---+

# do a simple inner join and select only the df2 columns;
# dropDuplicates() removes the repeated row caused by index 'B' appearing twice in df1
df2\
.join(df1, df1.v1==df2.k2)\
.select(df2.columns)\
.dropDuplicates()\
.show()

+---+---+
| k2| v2|
+---+---+
|  B| 20|
|  A| 10|
+---+---+

Using only RDDs:

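sc = spark.sparkContext  # already defined as `sc` in the pyspark shell

rdd1 = sc.parallelize(table1)
rdd2 = sc.parallelize(table2)

# swap rdd1 to (index, key) so both RDDs are keyed by the index, then join;
# keep only the value from table2 and drop duplicate rows
rdd2\
.join(rdd1.map(lambda x : (x[1],x[0])))\
.mapValues(lambda x: x[0])\
.distinct()\
.collect()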

Efficiently scanning a table based on the index from another table in PySpark and HBase
