Question

I have done a lot of research on this topic. I have a dataset that is 3 TB in size. Here is the schema of the table:

root
 |-- user: string (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)

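Every day I get a list of users whose attributes I need. I would like to know whether I can write the above schema to Parquet files partitioned by the first two letters of the user name. For example,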

Omkar | [a,b,c,d,e]
Mac   | [a,b,c,d,e]
Zee   | [a,b,c,d,e]
Kim   | [a,b,c,d,e]
Kelly | [a,b,c,d,e]

For the dataset above, could I do something like this:

spark.write.mode("overwrite").partitionBy("user".substr(0,2)).parquet("path/to/location")

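My thinking is that the next time I join against the users, very little data would be loaded into memory, because only the relevant partitions would be read.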

Has anyone implemented something like this? Any comments would be appreciated.

Thanks!

Answer 1

Yes, you can. Just replace your code with the following:

def quicksort(arg1):
   ...
   return result

def heapsort(arg1):
   ...
   return result
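
For the partitioning question itself, here is a minimal sketch, assuming PySpark and a DataFrame with the `user` / `attributes` schema from the question. The column name `user_prefix` and the helper names `user_prefix_set`, `write_partitioned` and `read_for_users` are made up for illustration and are not part of any library. The key point is that `partitionBy` only accepts column names, so the two-letter prefix has to be added as a real column before writing. When reading, filtering on that same column is what lets Spark skip the other partition directories.

```python
def user_prefix_set(users, length=2):
    """Distinct leading `length` characters of the requested user names."""
    return sorted({u[:length] for u in users if u})


def write_partitioned(df, path, length=2):
    """Write `df` as Parquet, partitioned by a derived prefix column."""
    # Imported here so user_prefix_set can be used without PySpark installed.
    from pyspark.sql import functions as F

    # partitionBy takes column names, not expressions, so materialise the
    # prefix as its own column first. substring() positions are 1-based.
    (df.withColumn("user_prefix", F.substring(F.col("user"), 1, length))
       .write.mode("overwrite")
       .partitionBy("user_prefix")
       .parquet(path))


def read_for_users(spark, path, users, length=2):
    """Load only the partitions that can contain the requested users."""
    from pyspark.sql import functions as F

    wanted = list(users)
    return (spark.read.parquet(path)
            # Filter on the partition column so Spark can prune directories.
            .where(F.col("user_prefix").isin(user_prefix_set(wanted, length)))
            # Then narrow down to the exact users inside those partitions.
            .where(F.col("user").isin(wanted)))
```

With the sample rows from the question, `write_partitioned(df, "path/to/location")` would produce directories such as `user_prefix=Om` and `user_prefix=Ki`, and `read_for_users(spark, "path/to/location", ["Kim", "Kelly"])` would only need to touch the `Ki` and `Ke` directories. Whether two letters gives a good partition size for 3 TB depends on how the user names are distributed, so it is worth checking the resulting file sizes.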