I am using Azure Data Factory to copy data from a REST API to Azure Data Lake Store. Here is the JSON of my copy activity:
{
    "name": "CopyDataFromGraphAPI",
    "type": "Copy",
    "policy": {
        "timeout": "7.00:00:00",
        "retry": 0,
        "retryIntervalInSeconds": 30,
        "secureOutput": false
    },
    "typeProperties": {
        "source": {
            "type": "HttpSource",
            "httpRequestTimeout": "00:30:40"
        },
        "sink": {
            "type": "AzureDataLakeStoreSink"
        },
        "enableStaging": false,
        "cloudDataMovementUnits": 0,
        "translator": {
            "type": "TabularTranslator",
            "columnMappings": "id: id, name: name, email: email, administrator: administrator"
        }
    },
    "inputs": [
        {
            "referenceName": "MembersHttpFile",
            "type": "DatasetReference"
        }
    ],
    "outputs": [
        {
            "referenceName": "MembersDataLakeSink",
            "type": "DatasetReference"
        }
    ]
}
The REST API was created by me. For testing purposes I initially returned only 2,500 rows, and the pipeline worked fine: it copied the data returned by the REST API call into Azure Data Lake Store.
After testing I updated the REST API so that it now returns 125,000 rows. I tested the API in a REST client and it works fine. However, the copy activity in Azure Data Factory now fails with the following error while copying the data to Azure Data Lake Store:
{
    "errorCode": "2200",
    "message": "Failure happened on 'Sink' side. ErrorCode=UserErrorFailedToReadHttpFile,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to read data from http source file.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Net.WebException,Message=The remote server returned an error: (500) Internal Server Error.,Source=System,'",
    "failureType": "UserError",
    "target": "CopyDataFromGraphAPI"
}
The sink side is Azure Data Lake Store. Is there a limit on the size of the content I can copy from a REST call to Azure Data Lake Store?
I also re-tested the pipeline by changing the REST API back to returning 2,500 rows, and it works fine. As soon as I update the API to return 125,000 rows again, the pipeline starts failing with the same error above.
The source dataset of my copy activity is:
{
    "name": "MembersHttpFile",
    "properties": {
        "linkedServiceName": {
            "referenceName": "WM_GBS_LinikedService",
            "type": "LinkedServiceReference"
        },
        "type": "HttpFile",
        "structure": [
            {
                "name": "id",
                "type": "String"
            },
            {
                "name": "name",
                "type": "String"
            },
            {
                "name": "email",
                "type": "String"
            },
            {
                "name": "administrator",
                "type": "Boolean"
            }
        ],
        "typeProperties": {
            "format": {
                "type": "JsonFormat",
                "filePattern": "arrayOfObjects",
                "jsonPathDefinition": {
                    "id": "$.['id']",
                    "name": "$.['name']",
                    "email": "$.['email']",
                    "administrator": "$.['administrator']"
                }
            },
            "relativeUrl": "api/workplace/members",
            "requestMethod": "Get"
        }
    }
}
The sink dataset:
{
    "name": "MembersDataLakeSink",
    "properties": {
        "linkedServiceName": {
            "referenceName": "DataLakeLinkService",
            "type": "LinkedServiceReference"
        },
        "type": "AzureDataLakeStoreFile",
        "structure": [
            {
                "name": "id",
                "type": "String"
            },
            {
                "name": "name",
                "type": "String"
            },
            {
                "name": "email",
                "type": "String"
            },
            {
                "name": "administrator",
                "type": "Boolean"
            }
        ],
        "typeProperties": {
            "format": {
                "type": "JsonFormat",
                "filePattern": "arrayOfObjects",
                "jsonPathDefinition": {
                    "id": "$.['id']",
                    "name": "$.['name']",
                    "email": "$.['email']",
                    "administrator": "$.['administrator']"
                }
            },
            "fileName": "WorkplaceMembers.json",
            "folderPath": "rawSources"
        }
    }
}
Answer 0 (score: 0)
As far as I know, there is no limit on file size. I have a 10 GB CSV with millions of rows and Data Lake doesn't care.
What I can see is that, although the error says the failure happened on the 'Sink' side, the error code is UserErrorFailedToReadHttpFile. So I think the problem may be solved if you change the httpRequestTimeout on the source, which is currently "00:30:40"; the row transfer may be getting cut off because of it. 30 minutes is plenty of time for 2,500 rows, but maybe 125,000 don't fit in that window.
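If the timeout is indeed the cause, a minimal sketch of the change would be to raise httpRequestTimeout in the copy activity's source. The "02:00:00" value below is only an illustrative placeholder, not a recommended setting; pick whatever comfortably covers how long your API takes to return the full 125,000-row payload:

"typeProperties": {
    "source": {
        "type": "HttpSource",
        "httpRequestTimeout": "02:00:00"
    },
    "sink": {
        "type": "AzureDataLakeStoreSink"
    }
}

It is also worth watching the API's own logs while the copy runs, since the inner exception reports a 500 coming back from the remote server rather than from Data Lake.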
Hope this helps!