Kafka Connect S3接收器连接器具有自定义分区程序的奇怪行为

时间:2020-07-09 10:34:57

标签: java amazon-s3 apache-kafka apache-kafka-connect


我计划使用自定义的基于Field和TimeBased的分区程序对s3中的数据进行分区,如下所示: / part_ <字段名称> = <字段值> / part_date = YYYY-MM-dd / part_hour = HH / .... parquet。

我的分区程序工作正常,一切都在我的S3存储桶中。

The problem is linked to the performance of the sink
我的输入主题中有400kB / s / broker =〜1.2MB / s,接收器可以处理尖峰并提交少量记录。

如果我使用经典的TimeBasedPartitioner,enter image description here

所以我的问题似乎出在我的自定义分区程序中。这是代码:

package test;
import ...;

public final class FieldAndTimeBasedPartitioner<T> extends TimeBasedPartitioner<T> {

private static final Logger log = LoggerFactory.getLogger(FieldAndTimeBasedPartitioner.class);
private static final String FIELD_SUFFIX = "part_";
private static final String FIELD_SEP = "=";
private long partitionDurationMs;
private DateTimeFormatter formatter;
private TimestampExtractor timestampExtractor;
private PartitionFieldExtractor partitionFieldExtractor;

protected void init(long partitionDurationMs, String pathFormat, Locale locale, DateTimeZone timeZone, Map<String, Object> config) {

    this.delim = (String)config.get("directory.delim");
    this.partitionDurationMs = partitionDurationMs;

    try {
        this.formatter = getDateTimeFormatter(pathFormat, timeZone).withLocale(locale);
        this.timestampExtractor = this.newTimestampExtractor((String)config.get("timestamp.extractor"));
        this.timestampExtractor.configure(config);
        this.partitionFieldExtractor = new PartitionFieldExtractor((String)config.get("partition.field"));
    } catch (IllegalArgumentException e) {
        ConfigException ce = new ConfigException("path.format", pathFormat, e.getMessage());
        ce.initCause(e);
        throw ce;
    }
}

private static DateTimeFormatter getDateTimeFormatter(String str, DateTimeZone timeZone) {
    return DateTimeFormat.forPattern(str).withZone(timeZone);
}

public static long getPartition(long timeGranularityMs, long timestamp, DateTimeZone timeZone) {
    long adjustedTimestamp = timeZone.convertUTCToLocal(timestamp);
    long partitionedTime = adjustedTimestamp / timeGranularityMs * timeGranularityMs;
    return timeZone.convertLocalToUTC(partitionedTime, false);
}

public String encodePartition(SinkRecord sinkRecord, long nowInMillis) {
    final Long timestamp = this.timestampExtractor.extract(sinkRecord, nowInMillis);
    final String partitionField = this.partitionFieldExtractor.extract(sinkRecord);
    return this.encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionField);
}

public String encodePartition(SinkRecord sinkRecord) {
    final Long timestamp = this.timestampExtractor.extract(sinkRecord);
    final String partitionFieldValue = this.partitionFieldExtractor.extract(sinkRecord);
    return encodedPartitionForFieldAndTime(sinkRecord, timestamp, partitionFieldValue);
}

private String encodedPartitionForFieldAndTime(SinkRecord sinkRecord, Long timestamp, String partitionField) {

    if (timestamp == null) {
        String msg = "Unable to determine timestamp using timestamp.extractor " + this.timestampExtractor.getClass().getName() + " for record: " + sinkRecord;
        log.error(msg);
        throw new ConnectException(msg);
    } else if (partitionField == null) {
        String msg = "Unable to determine partition field using partition.field '" + partitionField  + "' for record: " + sinkRecord;
        log.error(msg);
        throw new ConnectException(msg);
    }  else {
        DateTime recordTime = new DateTime(getPartition(this.partitionDurationMs, timestamp.longValue(), this.formatter.getZone()));
        return this.FIELD_SUFFIX
                + config.get("partition.field")
                + this.FIELD_SEP
                + partitionField
                + this.delim
                + recordTime.toString(this.formatter);
    }
}

static class PartitionFieldExtractor {

    private final String fieldName;

    PartitionFieldExtractor(String fieldName) {
        this.fieldName = fieldName;
    }

    String extract(ConnectRecord<?> record) {
        Object value = record.value();
        if (value instanceof Struct) {
            Struct struct = (Struct)value;
            return (String) struct.get(fieldName);
        } else {
            FieldAndTimeBasedPartitioner.log.error("Value is not of Struct !");
            throw new PartitionException("Error encoding partition.");
        }
    }
}

public long getPartitionDurationMs() {
    return partitionDurationMs;
}

public TimestampExtractor getTimestampExtractor() {
    return timestampExtractor;
}
}

它或多或少是FieldPartitioner和TimeBasedPartitioner的合并。

关于为什么我在接收邮件时表现不好的任何线索? 使用记录中的字段进行分区时,反序列化和从消息中提取数据会导致此问题? 由于我有大约80个不同的字段值,这可能是内存问题,因为它将在堆中维护80倍以上的缓冲区?

感谢您的帮助。

1 个答案:

答案 0 :(得分:0)

仅供参考,问题在于分区程序本身。我的分区程序需要解码整个消息并获取信息。 由于我收到很多消息,因此处理所有这些事件都需要时间。