Question

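I have a very large table stored in SQL Server that I am trying to run some operations on. The table is stored as a clustered columnstore index and is roughly 0.23 TB in size. Nobody has been able to work with it in SQL Server, so I am trying to see whether it is usable through our Spark cluster, which runs on YARN with 4 nodes, 32 CPUs and 40G of RAM.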

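When I try to cache the result of a very simple query, the run time is very long (about 1 day), yet my resources do not appear to be fully utilized. Any ideas about what I am doing wrong, or suggestions for getting a better run time?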

My code so far is:

library(sparklyr)
library(dplyr)

config <- spark_config()
config$spark.driver.memory <- "4G"
config$spark.executor.memory <- "4G"
config$spark.executor.cores <- 2
config$spark.executor.instances <- 9
# this is a wrapper for spark_connect that connects to the yarn cluster
sc <- bigly_connect(config)
# this results in 7(?) yarn containers with total memory usage of 31G

# this is a simple wrapper for spark_read_jdbc
# note I'm partitioning the read into 280 pieces
full <- spark_read_sql(
  sc,
  "<table>",
  "<db>",
  "<server>",
  "<user>",
  "<pass>",
  memory = FALSE,
  options = list(
    lowerBound = "20150101",
    upperBound = "20180801",
    numPartitions = "280",
    partitionColumn = "date"
  )
)

parts <- full %>%
  distinct(prem_id)

# I've specified the kryo serializer in bigly_connect()
sdf_persist(parts, storage.level = "MEMORY_ONLY_SER")
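
As a rough check on the 280-piece partitioning above, here is a minimal sketch of how the numeric range between `lowerBound` and `upperBound` might be cut into equal-width strides. It rests on two assumptions that the post does not confirm: that `date` is an integer column in yyyymmdd form (as the bounds suggest), and that Spark's JDBC reader splits the range into equal strides roughly as shown. It approximates that behaviour and is not Spark's exact implementation. The names `stride`, `keys` and `part` are illustrative only.

```r
# Bounds and partition count copied from the read options above.
lower <- 20150101
upper <- 20180801
num_partitions <- 280

# Assumption: the range is split into equal-width strides.
stride <- upper %/% num_partitions - lower %/% num_partitions

# Assumption: `date` holds integers such as 20150101 (yyyymmdd).
days <- seq(as.Date("2015-01-01"), as.Date("2018-08-01"), by = "day")
keys <- as.integer(format(days, "%Y%m%d"))

# 0-based index of the stride each real calendar date would fall into,
# clamped to the valid partition range.
part <- pmin(pmax((keys - lower) %/% stride, 0), num_partitions - 1)

# How many of the partitions would receive at least one valid date.
non_empty <- length(unique(part))
cat("stride:", stride, "\n")
cat("partitions with at least one valid date:", non_empty,
    "of", num_partitions, "\n")
```

If the assumptions hold, the count printed at the end shows how evenly the rows could be spread across the 280 read tasks. Integers such as 20150132 through 20150200 are not calendar dates, so any stride that covers only such values would read no rows.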

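While this is running, I typically see CPU, memory and storage utilization between 0% and 18%. What is going on here? How should I rethink the analysis to get a better run time?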

Thanks in advance for reading and helping me out :)

sdf_persist takes a very long time in sparklyr

0 answers: