Question

I have a use case where I am joining two DataFrames in Spark, A and B.

Two questions:

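How can I reduce the network shuffle? The Spark UI shows approximately 30 GB of shuffle.
The number of tasks is also very large, approximately 1,000,000. Are there any tricks to reduce it?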

A -> huge DataFrame, approx. size: 100 TB
B -> smaller DataFrame, approx. size: 100 MB

I tried caching the DataFrame, but surprisingly that only made the job slower. Any help would be greatly appreciated.

Answer 1

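You can try increasing autoBroadcastJoinThreshold (the full key is spark.sql.autoBroadcastJoinThreshold) to 100 MB to trigger a map-side join or, if that does not help, explicitly broadcast the B (smaller) DataFrame: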

val result = dfA.join(broadcast(dfB),...

That should completely eliminate the shuffle associated with the join.
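
Both options can be sketched in one small program. This is only an illustration under stated assumptions: the join column name `key`, the `label` column, the tiny stand-in DataFrames, and the local master are all hypothetical, since the question does not show the real schemas or the join condition. The threshold is given in bytes, and Spark compares it against its own size estimate of B, so the value may need to be somewhat larger than B's nominal size.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would run on a cluster.
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Option 1: raise the automatic broadcast threshold (value in bytes).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

    // Hypothetical stand-ins for the huge DataFrame A and the small DataFrame B.
    val dfA = spark.range(0, 1000000).withColumn("key", $"id" % 3)
    val dfB = Seq((0L, "zero"), (1L, "one"), (2L, "two")).toDF("key", "label")

    // Option 2: explicitly hint that B should be broadcast to every executor.
    val result = dfA.join(broadcast(dfB), Seq("key"))

    // Print the physical plan; with the hint it should show a broadcast hash
    // join rather than a sort-merge join with exchanges on both sides.
    result.explain()

    spark.stop()
  }
}
```

Checking the plan with `explain()` before running the full job is a cheap way to confirm that the broadcast actually took effect.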

Tricks to reduce network shuffle