How to process a small part of a big data file in Spark?

Date: 2017-03-15 05:05:22

标签: scala apache-spark random rdd sample-data

I have loaded a big data file in Spark, but I would like to work on only a small slice of it to run my analysis. Is there a way to do that? I tried repartitioning, but it caused a lot of shuffling. Is there a way to process just a single small chunk of the big file loaded in Spark?

2 answers:

Answer 0 (score: 3)

In short

You can use the sample() or randomSplit() transformations on the RDD.

sample()

/**
  * Return a sampled subset of this RDD.
  *
  * @param withReplacement can elements be sampled multiple times
  * @param fraction expected size of the sample as a fraction of this RDD's size
  *  without replacement: probability that each element is chosen; fraction must be [0, 1]
  *  with replacement: expected number of times each element is chosen; fraction must be 
  *  greater than or equal to 0
  * @param seed seed for the random number generator
  *
  * @note This is NOT guaranteed to provide exactly the fraction of the count
  * of the given [[RDD]].
  */
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T]

Example:

val sampleWithoutReplacement = rdd.sample(false, 0.2, 2)
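A minimal end-to-end sketch of this approach (the SparkContext setup and the input path are assumptions for illustration; in a real job they would come from your own cluster config and data):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for demonstration only; replace with your cluster setup.
val conf = new SparkConf().setAppName("sample-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// Hypothetical input path; substitute your actual big file.
val bigRdd = sc.textFile("/path/to/big/file.txt")

// Keep roughly 20% of the records, without replacement, with a fixed seed
// so the same sample is drawn on every run.
val smallRdd = bigRdd.sample(withReplacement = false, fraction = 0.2, seed = 2L)

// Run the analysis on the small slice only, e.g. a simple count.
println(smallRdd.count())
```

Note that sample() is a transformation, so nothing is read or shuffled until an action (count(), collect(), etc.) is called on the sampled RDD.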

randomSplit()

/**
  * Randomly splits this RDD with the provided weights.
  *
  * @param weights weights for splits, will be normalized if they don't sum to 1
  * @param seed random seed
  *
  * @return split RDDs in an array
  */
 def randomSplit(
   weights: Array[Double],
   seed: Long = Utils.random.nextLong): Array[RDD[T]]

Example:

val rddParts = rdd.randomSplit(Array(0.8, 0.2)) // splits the RDD in an 80-20 ratio
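A short sketch of how the split is typically consumed (assuming rdd is the already-loaded big RDD; the variable names are illustrative):

```scala
// Destructure the returned array into the two parts. Per the scaladoc,
// the weights are normalized if they don't sum to 1, so Array(0.8, 0.2)
// and Array(8, 2) behave the same.
val Array(bigPart, smallPart) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)

// Work with the 20% slice only; the 80% part is never materialized
// unless an action is called on it.
smallPart.cache()
println(smallPart.count())
```

Passing an explicit seed makes the split reproducible, which is useful when the analysis on the small slice has to be rerun against the same subset.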

Answer 1 (score: 2)

You can use any of the following RDD APIs:

  1. yourRDD.filter(on some condition)
  2. yourRDD.sample(<with replacement>, <fraction of data>, <random seed>)
  3. e.g.: yourRDD.sample(false, 0.3, System.currentTimeMillis().toInt)

If you want random data, I would suggest the second method. Alternatively, if you need the part of the data that satisfies some condition, use the first one.
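The two options above can be sketched side by side (yourRDD is assumed to be the loaded big RDD of text lines; the "ERROR" condition is a hypothetical example):

```scala
// Option 1: a deterministic subset defined by a condition,
// e.g. keep only log lines containing "ERROR".
val errors = yourRDD.filter(line => line.contains("ERROR"))

// Option 2: a ~30% random sample, seeded from the clock.
// Note: a clock-based seed makes the sample different on every run;
// pass a fixed seed instead if you need reproducibility.
val slice = yourRDD.sample(false, 0.3, System.currentTimeMillis().toInt)
```

Both are narrow transformations, so unlike repartition() they do not trigger a shuffle, which addresses the concern raised in the question.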