Let's say I have a huge dataframe that can be split into smaller subsets by a given key/id, and I need to pass each subset to a function that does my work/calculations.

Right now, this works:
def my_func_v1(id, df=df):
    mask = df.id_column == id
    subset = df.loc[mask, :]
    # make my calculations
    ....
    return my_result

sc = spark.sparkContext
list_of_ids = df.id_column.unique().tolist()
rdd = sc.parallelize(list_of_ids)
all_results = rdd.map(my_func_v1).collect()
But is this the best strategy? It seems the original huge dataframe gets shipped to every worker in the cluster. What I would like to do instead is pass only the smaller subsets to the Spark map (see my_func_v2 in the update below); that makes much more sense to me. The problem is: it does not work, the job cannot even start. What am I missing? Am I looking at this problem the right way?
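For completeness, one variant I have been considering is to broadcast the dataframe so that each executor keeps a single read-only copy instead of receiving it with every task. This is only a minimal sketch of the idea, assuming df is a pandas DataFrame, spark is an existing SparkSession, and my_func_v3 / value_column are placeholder names of mine:

# Sketch only: broadcast the pandas DataFrame once per executor instead of
# shipping it inside every task's closure.
sc = spark.sparkContext
bc_df = sc.broadcast(df)

def my_func_v3(id):
    local_df = bc_df.value  # read-only copy available on the worker
    subset = local_df.loc[local_df.id_column == id, :]
    # placeholder calculation on a hypothetical value_column
    return float(subset.value_column.sum())

list_of_ids = df.id_column.unique().tolist()
all_results = sc.parallelize(list_of_ids).map(my_func_v3).collect()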
--------- UPDATE -----------
Today the cluster is in good shape and my_func_v2 works as well, but it takes twice as long to create the RDD:
# create a list of subsets through a generator expression
list_of_subsets = (df[df.id_column == id] for id in list_of_ids)

def my_func_v2(subset):
    # make my calculations
    ....
    return my_result

rdd = sc.parallelize(list_of_subsets)
...and twice as long to map it:

all_results = rdd.map(my_func_v2).collect()
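My guess is that parallelizing list_of_subsets forces every subset to be built and pickled on the driver before anything is shipped, but I am not sure. A rough driver-side check I can run to compare how much data has to be serialized in the two variants (only the driver-side pickling cost, not the full picture):

import pickle

# Rough check: bytes pickled for the whole df once vs. one pandas subset per id.
whole_df_bytes = len(pickle.dumps(df))
subset_bytes = sum(len(pickle.dumps(df[df.id_column == id])) for id in list_of_ids)
print(whole_df_bytes, subset_bytes)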