I have a copy pipeline that uses Data Factory to copy a handful of files from a daily folder in an S3 bucket into a data lake in Azure, and I've run into a very strange problem.

Say there are three files in the S3 bucket: one is 30 MB, another is 50 MB, and the last is 70 MB. If the 30 MB file sorts "first" (call it test0.tsv), Data Factory claims it successfully copied all three files to ADLS, but the second and third files come out truncated to 30 MB. The data in each file is correct, it is just cut off. If I put the 70 MB file first instead, all three are copied correctly. So it seems to treat the length of the first file as a maximum file size and truncates every subsequent file that is longer. What also worries me is that Azure Data Factory reports the copies as successful.

Here is my pipeline:
{
    "name": "[redacted]Pipeline",
    "properties": {
        "description": "[redacted]",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "FileSystemSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "AzureDataLakeStoreSink",
                        "copyBehavior": "PreserveHierarchy",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset"
                    }
                ],
                "policy": {
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "[redacted]"
            }
        ],
        "start": "2018-07-06T04:00:00Z",
        "end": "2018-07-30T04:00:00Z",
        "isPaused": false,
        "hubName": "[redacted]",
        "pipelineMode": "Scheduled"
    }
}
Here is my input dataset:
{
    "name": "InputDataset",
    "properties": {
        "published": false,
        "type": "AmazonS3",
        "linkedServiceName": "[redacted]",
        "typeProperties": {
            "bucketName": "[redacted]",
            "key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}
Here is my output dataset:
{
    "name": "OutputDataset",
    "properties": {
        "published": false,
        "type": "AzureDataLakeStore",
        "linkedServiceName": "[redacted]",
        "typeProperties": {
            "folderPath": "[redacted]/{Year}/{Month}/{Day}",
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                },
                {
                    "name": "Day",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "dd"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
I have already removed the format fields from both the input and output datasets, because I thought treating it as a binary copy might fix the problem, but that made no difference.
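For context, by "format field" I mean the format block under typeProperties in the ADF v1 dataset JSON; for a .tsv file it would typically look something like the snippet below (a generic example, not my exact settings):

    "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t",
        "rowDelimiter": "\n"
    }

With that block absent from both datasets, the copy should behave as a straight binary copy, but the files still come out truncated.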