Question

This question is about code design: how can I iterate over one part of an RDD today, and then continue with the next part tomorrow?

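I have built an RDD (or Spark DataFrame) with 20,000,000 rows. I want to call an API from lbs.amap.com for each row, but the API can only be accessed 300,000 times per day.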

def gd_address(line):
    # GET REST API, return a list of values
    ...
# use these values to add columns to my RDD; the API is called once per row and its first three values are unpacked
df.rdd.map(lambda line: (line[0], line[1], *gd_address(line)[:3]), True)

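As written, this iterates over the whole RDD. How should I write the program so that it processes 300,000 rows and stops, then processes the next 300,000 rows the following day and stops again? Any ideas would be greatly appreciated.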

Answer 1

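As already discussed with @Glennie, the key is to use a unique, incremental row ID. New data gets ever-larger IDs, while the ID of existing data must stay the same. In other words, we need to make sure that a given record maps to the same ID on every job execution. To create such an ID you can use zipWithIndex from the RDD API. Unlike monotonically_increasing_id, zipWithIndex guarantees consecutive row ID values. This matters for performance because, as we will see below, it lets us cut down the number of rows that need to be processed. Here is the zipWithIndex approach: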

# Python 3 lambdas cannot unpack tuples, so index into the (line, row_id) pair.
# Only the row ID is attached here; the API call is deferred until the daily chunk has been selected.
df.rdd.zipWithIndex() \
    .map(lambda pair: (pair[1], pair[0][0], pair[0][1])) \
    .toDF(["row_id", "c1", "c2"])
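To make the difference between the two ID schemes concrete, here is a small pure-Python sketch that mimics them without Spark. The helper names are made up for illustration. The bit layout used for monotonically_increasing_id (partition index in the upper bits, record number within the partition in the lower 33 bits) is what the Spark documentation describes for the current implementation, so treat it as an assumption that may change.

```python
def zip_with_index_ids(partitions):
    """Consecutive IDs across all partitions, in the spirit of RDD.zipWithIndex."""
    ids, next_id = [], 0
    for part in partitions:
        ids.append(list(range(next_id, next_id + len(part))))
        next_id += len(part)
    return ids


def monotonic_ids(partitions):
    """IDs that only increase: partition index shifted left by 33 bits plus the record number."""
    return [[(p << 33) + i for i in range(len(part))]
            for p, part in enumerate(partitions)]


partitions = [["a", "b"], ["c", "d", "e"]]
consecutive = zip_with_index_ids(partitions)
sparse = monotonic_ids(partitions)

print(consecutive)  # [[0, 1], [2, 3, 4]]
print(sparse)       # [[0, 1], [8589934592, 8589934593, 8589934594]]
```

With consecutive IDs, "the next 300,000 rows" is simply an ID range. With the sparse IDs there is a large gap between partitions, so a range of 300,000 IDs says nothing about how many rows it contains.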

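The second prerequisite, besides a unique row ID, is to sort the data by the row ID we just created. The expected range is then 0-299,999 on the first day, 300,000-599,999 on the second day, 600,000-899,999 on the third day, and so on. After retrieving and processing each chunk you need to store the last processed row ID. You can save it on the local file system or in HDFS. One way to write it to HDFS is
df.selectExpr("cast(max(row_id) as string)").write.text("hdfs://cluster/user/hdfs/last_row_id.txt"), and you can read it back with spark.read.text("hdfs://cluster/user/hdfs/last_row_id.txt"). Note that write.text needs a single string column, hence the cast.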
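The bookkeeping described here can be tried out without a cluster. The sketch below is only an illustration under a few assumptions: the checkpoint lives in a local temporary file rather than HDFS, the function names are hypothetical, and a missing checkpoint is treated as "last processed ID is -1" so that the first chunk starts at row 0.

```python
import os
import tempfile

CHUNK_SIZE = 300_000  # daily API quota


def load_last_row_id(path):
    """Return the last processed row ID, or -1 if nothing has been processed yet."""
    if not os.path.exists(path):
        return -1
    with open(path) as f:
        return int(f.read().strip())


def save_last_row_id(path, row_id):
    """Overwrite the checkpoint with the last processed row ID."""
    with open(path, "w") as f:
        f.write(str(row_id))


def next_chunk(last_row_id, chunk_size=CHUNK_SIZE):
    """Inclusive (first, last) row IDs of the chunk that follows last_row_id."""
    return last_row_id + 1, last_row_id + chunk_size


# Simulate three daily runs against the same checkpoint file.
checkpoint = os.path.join(tempfile.mkdtemp(), "last_row_id.txt")
ranges = []
for _day in range(3):
    first, last = next_chunk(load_last_row_id(checkpoint))
    ranges.append((first, last))
    save_last_row_id(checkpoint, last)

print(ranges)  # [(0, 299999), (300000, 599999), (600000, 899999)]
```

At 300,000 rows per day, the 20,000,000 rows from the question take 67 daily runs, the last one being a partial chunk.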

Here is the complete code:

def callAmapAPI(data):
    for row in data:
        # make HTTP call here
        pass

# assign row id to df; only the ID is attached here, the API is called
# later for the filtered chunk so the daily quota is not spent on all rows
df = df.rdd.zipWithIndex() \
    .map(lambda pair: (pair[1], pair[0][0], pair[0][1])) \
    .toDF(["row_id", "c1", "c2"])

# retrieve saved row id (it is stored as text, so convert it to int)
lower_bound_rowid = int(spark.read.text("hdfs://cluster/user/hdfs/last_row_id.txt").first()[0])

chunk_size = 300000
upper_bound_rowid = lower_bound_rowid + chunk_size
partition_num = 8

# we restrict the number of rows to 300000 based on upper and lower bound;
# each comparison needs its own parentheses because & binds tighter than > and <=.
# The repartition is optional: it allows more control over simultaneous calls
# to the amap API, i.e. 8 concurrent HTTP calls. sortWithinPartitions keeps
# those 8 partitions, whereas a global orderBy would shuffle the data again.
filtered_df = df.where((df["row_id"] > lower_bound_rowid) & (df["row_id"] <= upper_bound_rowid)) \
    .repartition(partition_num, "row_id") \
    .sortWithinPartitions("row_id")

# call API for each partition
filtered_df.foreachPartition(callAmapAPI)

# save max row id for next day (write.text needs a single string column)
filtered_df.selectExpr("cast(max(row_id) as string)") \
    .write.mode('overwrite') \
    .text("hdfs://cluster/user/hdfs/last_row_id.txt")
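The body of callAmapAPI is left open in the code above. One possible shape for it, sketched here in plain Python, is a generator that calls the API exactly once per row and reuses the returned values. The name enrich_partition, the injected fetch callable, and the row layout (two input fields, three returned values) are assumptions for illustration; fake_fetch stands in for the real HTTP call, whose endpoint and parameters are not given in the question.

```python
def enrich_partition(rows, fetch):
    """Yield each row extended with the first three values returned by fetch(row)."""
    for row in rows:
        values = fetch(row)  # exactly one API call per row
        yield (row[0], row[1], values[0], values[1], values[2])


# A stand-in for the real API call that also records how often it is invoked.
calls = []


def fake_fetch(row):
    calls.append(row)
    return ["v1", "v2", "v3"]


rows = [("id-1", "addr-1"), ("id-2", "addr-2")]
enriched = list(enrich_partition(rows, fake_fetch))

print(enriched)
print(len(calls))  # 2
```

In Spark, a function like this could be applied with something along the lines of filtered_df.rdd.mapPartitions(lambda rows: enrich_partition(rows, gd_address)), with the row positions adjusted to the actual columns. Using mapPartitions instead of foreachPartition keeps the API results so they can be written out.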

Second approach using monotonically_increasing_id (not recommended)

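As for the monotonically_increasing_id approach, I believe it is only viable when your dataset stays the same (no new rows). Otherwise there is no guarantee that the generated ID stays the same for each row, and you will not be able to keep track of the last processed record (Spark may produce a different ID for the same record). If that is the case and df does not change, though, you can call monotonically_increasing_id() just once and save the df together with the new IDs. In that case you need to make the following two changes. First, change the definition of df to: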

from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn("row_id", monotonically_increasing_id())
df.write.csv(...) # or some other storage

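In contrast to the previous approach, which computes and assigns the row IDs on every job execution, the snippet above must be executed only once.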

Then change the definition of filtered_df to:

df = spark.read.csv(...) # retrieve the dataset with monotonically_increasing_id

filtered_df = df.where(df["row_id"] > lower_bound_rowid) \
.orderBy("row_id") \
.limit(chunk_size) \
.repartition(partition_num, "row_id")

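There are two things to note here. First, we do not know upper_bound_rowid (monotonically_increasing_id generates arbitrary IDs for each partition), so upper_bound_rowid is not used in the where clause. Second, orderBy must come before limit, otherwise we cannot be sure to get the top N rows. Because orderBy runs on the larger dataset, this approach also performs worse.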
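The ordering point can be checked with a few made-up IDs in plain Python. The values below are arbitrary and only imitate the sparse IDs that monotonically_increasing_id tends to produce; list slicing stands in for limit and sorted for orderBy.

```python
# Row IDs in the arbitrary order in which they might be read back.
row_ids = [8589934593, 1, 8589934592, 0, 17179869184]
lower_bound, chunk_size = 0, 2

# Rows that have not been processed yet (the where clause).
candidates = [r for r in row_ids if r > lower_bound]

sorted_then_limited = sorted(candidates)[:chunk_size]   # orderBy, then limit
limited_then_sorted = sorted(candidates[:chunk_size])   # limit, then orderBy

print(sorted_then_limited)  # [1, 8589934592]
print(limited_then_sorted)  # [1, 8589934593]

# Only the first variant gives a safe checkpoint for the next run.
next_lower_bound = max(sorted_then_limited)
```

With limit applied first, ID 8589934592 is skipped, and because the saved checkpoint would be 8589934593, that row would never be processed on a later day.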

How can I iterate over only a part of an RDD each day?
