I have a copy pipeline set up using Data Factory to copy some files from daily folders in an S3 bucket to a data lake in Azure. I've run into a very strange problem.

Suppose there are three files in the S3 bucket: one is 30 MB, another is 50 MB, and the last is 70 MB. If I put the 30 MB file "first" (naming it test0.tsv), the run claims to have successfully copied all three files to ADLS, but the second and third files are truncated to 30 MB. The data in each file is correct up to that point; it is simply cut off. If I put the 70 MB file first, all of them are copied correctly. So it appears to use the first file's length as a maximum file size and truncates every subsequent, longer file. What worries me most is that Azure Data Factory claims it copied them all successfully.

Here is my pipeline:

{
"name": "[redacted]Pipeline",
"properties": {
    "description": "[redacted]",
    "activities": [
        {
            "type": "Copy",
            "typeProperties": {
                "source": {
                    "type": "FileSystemSource",
                    "recursive": true
                },
                "sink": {
                    "type": "AzureDataLakeStoreSink",
                    "copyBehavior": "PreserveHierarchy",
                    "writeBatchSize": 0,
                    "writeBatchTimeout": "00:00:00"
                }
            },
            "inputs": [
                {
                    "name": "InputDataset"
                }
            ],
            "outputs": [
                {
                    "name": "OutputDataset"
                }
            ],
            "policy": {
                "retry": 3
            },
            "scheduler": {
                "frequency": "Day",
                "interval": 1
            },
            "name": "[redacted]"
        }
    ],
    "start": "2018-07-06T04:00:00Z",
    "end": "2018-07-30T04:00:00Z",
    "isPaused": false,
    "hubName": "[redacted]",
    "pipelineMode": "Scheduled"
}

}

Here is my input dataset:

{
"name": "InputDataset",
"properties": {
    "published": false,
    "type": "AmazonS3",
    "linkedServiceName": "[redacted]",
    "typeProperties": {
        "bucketName": "[redacted]",
        "key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
    },
    "availability": {
        "frequency": "Day",
        "interval": 1
    },
    "external": true,
    "policy": {}
}

}
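
For reference, with SliceStart = 2018-07-06 the key expression above resolves to the prefix 2018/07/06/, so each daily slice picks up every object under that day's folder.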

Here is my output dataset:

{
"name": "OutputDataset",
"properties": {
    "published": false,
    "type": "AzureDataLakeStore",
    "linkedServiceName": "[redacted]",
    "typeProperties": {
        "folderPath": "[redacted]/{Year}/{Month}/{Day}",
        "partitionedBy": [
            {
                "name": "Year",
                "value": {
                    "type": "DateTime",
                    "date": "SliceStart",
                    "format": "yyyy"
                }
            },
            {
                "name": "Month",
                "value": {
                    "type": "DateTime",
                    "date": "SliceStart",
                    "format": "MM"
                }
            },
            {
                "name": "Day",
                "value": {
                    "type": "DateTime",
                    "date": "SliceStart",
                    "format": "dd"
                }
            }
        ]
    },
    "availability": {
        "frequency": "Day",
        "interval": 1
    }
}

}

I have already removed the format fields from the input and output datasets, because I thought a binary copy might fix it, but that didn't work.
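
For reference, this is roughly what the format block in the input dataset's typeProperties looked like before I removed it (illustrative rather than the exact original; the tab delimiter matches the .tsv files). With the format block gone from both datasets, Data Factory treats the copy as binary:

"typeProperties": {
    "bucketName": "[redacted]",
    "key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t"
    }
}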