Is there a way to skip processing RDD partitions with very few elements in Spark?

Asked: 2019-06-12 09:00:46

Tags: scala apache-spark

I have an RDD on which I need to run a computation on each partition (using .mapPartitions), but only if the current partition contains more than X elements.

Example: the number of elements in each partition of the RDD is:

80, 9, 0, 0, 0, 3, 60

I only want to process the partitions that contain more than 50 elements.

Is this possible?
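
For reference, per-partition counts like the ones above can be inspected with a short snippet. This is a minimal sketch, assuming a SparkContext named sc is available (as in spark-shell); the sample RDD is a placeholder:

// Minimal sketch: print the number of elements in each partition.
// `sc` and the sample data are assumptions for illustration.
val rdd = sc.parallelize(1 to 100, 7) // hypothetical RDD with 7 partitions
val sizes = rdd.mapPartitions(it => Iterator(it.size)).collect()
sizes.foreach(println)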

1 Answer:

Answer 0 (score: 1)

It can also be done lazily, without pre-computing the partition sizes. In this example, we filter down to the partitions that contain at least two elements:


import org.apache.spark.Partitioner

// Custom partitioner that routes each integer key to the partition with the same index.
object DemoPartitioner extends Partitioner {
  override def numPartitions: Int = 3
  override def getPartition(key: Any): Int = key match {
    case num: Int => num
  }
}

sc.parallelize(Seq((0, "a"), (0, "a"), (0, "a"), (1, "b"), (2, "c"), (2, "c")))
  .partitionBy(DemoPartitioner) // create 3 partitions of sizes 3, 1, 2
  .mapPartitions { it =>
    // Peek at the first two elements only; the rest of the iterator stays lazy.
    val firstElements = it.take(2).toList
    if (firstElements.size < 2) {
      Iterator.empty // fewer than two elements: skip this partition entirely
    } else {
      firstElements.iterator ++ it // re-attach the peeked elements and process the rest
    }
  }.foreach(println)

So partition 1, which holds only a single element, is skipped, while the other partitions are processed in full.
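
The same lazy trick generalizes to the 50-element threshold from the question: buffer at most X + 1 elements, and only proceed when the buffer is full. A minimal sketch, where mapPartitionsIfMoreThan is an illustrative helper (not a Spark API) and f stands for whatever per-partition computation is needed:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: applies `f` to a partition only if it holds more than `x`
// elements, buffering just the first x + 1 elements instead of counting them all.
def mapPartitionsIfMoreThan[T, U: ClassTag](rdd: RDD[T], x: Int)(
    f: Iterator[T] => Iterator[U]): RDD[U] =
  rdd.mapPartitions { it =>
    val head = it.take(x + 1).toList   // materialize at most x + 1 elements
    if (head.size <= x) Iterator.empty // partition too small: skip it
    else f(head.iterator ++ it)        // large enough: process every element
  }

// Usage with the threshold from the question, e.g.:
// val result = mapPartitionsIfMoreThan(myRdd, 50)(it => it.map(doSomething))

Here myRdd and doSomething are placeholders for the asker's actual RDD and computation.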