I have an RDD, and I need to run a computation on each partition (using .mapPartitions), but only when the number of elements in the current partition exceeds X.
Example: the element counts of the RDD's partitions are:
80, 9, 0, 0, 0, 3, 60
I only want to process the partitions that contain more than 50 elements.
Is this possible?
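For reference, a minimal sketch of how per-partition element counts like those above can be inspected up front (the RDD name rdd is illustrative, not from the question). Note that this counts every partition eagerly; the answer below avoids that pre-count.

// emit one element count per partition, then collect the counts to the driver
val partitionSizes: Array[Int] = rdd
  .mapPartitions(it => Iterator(it.size))
  .collect()
partitionSizes.foreach(println) // e.g. 80, 9, 0, 0, 0, 3, 60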
Answer 0 (score: 1)
This can also be done lazily, without pre-computing partition sizes. In this example we filter down to the partitions that contain at least two elements:
import org.apache.spark.Partitioner

// Custom partitioner: an Int key n goes straight to partition n
object DemoPartitioner extends Partitioner {
  override def numPartitions: Int = 3
  override def getPartition(key: Any): Int = key match {
    case num: Int => num
  }
}

sc.parallelize(Seq((0, "a"), (0, "a"), (0, "a"), (1, "b"), (2, "c"), (2, "c")))
  .partitionBy(DemoPartitioner) // creates 3 partitions of sizes 3, 1 and 2
  .mapPartitions { it =>
    // peek at (up to) the first two elements without counting the whole partition
    val firstElements = it.take(2).toSeq
    if (firstElements.size < 2) {
      Iterator.empty // fewer than two elements: skip this partition
    } else {
      // put the peeked elements back in front of the remaining iterator
      firstElements.iterator ++ it
    }
  }.foreach(println)
So partition 1, which contains only a single element, was skipped.
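The same lazy peek generalizes to any threshold, e.g. the 50 from the question. A minimal sketch, assuming the RDD is called rdd and using a hypothetical processPartition function as a stand-in for the real per-partition computation:

// hypothetical stand-in for whatever should run on a large-enough partition
def processPartition[T](it: Iterator[T]): Iterator[T] = it

val threshold = 50
rdd.mapPartitions { it =>
  val head = it.take(threshold + 1).toSeq // lazily buffer at most threshold + 1 elements
  if (head.size <= threshold) Iterator.empty     // at most threshold elements: skip this partition
  else processPartition(head.iterator ++ it)     // otherwise process the whole partition
}

Because take is lazy, each partition buffers at most threshold + 1 elements before the decision is made, so no full pre-count is needed.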