I have a copy pipeline that uses Data Factory to copy a handful of files from a daily folder in an S3 bucket into a data lake in Azure, and I've run into a very strange problem.

Say there are three files in the S3 bucket: one is 30 MB, another is 50 MB, and the last is 70 MB. If the 30 MB file sorts "first" (call it test0.tsv), Data Factory claims it successfully copied all three files to ADLS, but the second and third files come out truncated to 30 MB. The data in each file is correct, it is just cut off. If I put the 70 MB file first instead, all three are copied correctly. So it seems to treat the length of the first file as a maximum file size and truncates every subsequent file that is longer. What also worries me is that Azure Data Factory reports the copies as successful.

Here is my pipeline:
{
    "name": "[redacted]Pipeline",
    "properties": {
        "description": "[redacted]",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "FileSystemSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "AzureDataLakeStoreSink",
                        "copyBehavior": "PreserveHierarchy",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset"
                    }
                ],
                "policy": {
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "[redacted]"
            }
        ],
        "start": "2018-07-06T04:00:00Z",
        "end": "2018-07-30T04:00:00Z",
        "isPaused": false,
        "hubName": "[redacted]",
        "pipelineMode": "Scheduled"
    }
}
Here is my input dataset:
{
    "name": "InputDataset",
    "properties": {
        "published": false,
        "type": "AmazonS3",
        "linkedServiceName": "[redacted]",
        "typeProperties": {
            "bucketName": "[redacted]",
            "key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}
Here is my output dataset:
{
    "name": "OutputDataset",
    "properties": {
        "published": false,
        "type": "AzureDataLakeStore",
        "linkedServiceName": "[redacted]",
        "typeProperties": {
            "folderPath": "[redacted]/{Year}/{Month}/{Day}",
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                },
                {
                    "name": "Day",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "dd"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
I have already removed the format fields from both the input and output datasets, because I thought treating it as a binary copy might fix the problem, but that made no difference.
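For context, by "format field" I mean the format block under typeProperties in the ADF v1 dataset JSON; for a .tsv file it would typically look something like the snippet below (a generic example, not my exact settings):

    "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t",
        "rowDelimiter": "\n"
    }

With that block absent from both datasets, the copy should behave as a straight binary copy, but the files still come out truncated.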