I have a copy pipeline set up that uses Data Factory to copy files from daily folders in an S3 bucket into a data lake in Azure. I've run into a very strange problem.
Suppose there are three files in the S3 bucket: one is 30 MB, another is 50 MB, and the last is 70 MB. If I put the 30 MB file "first" (by naming it test0.tsv), the pipeline claims it successfully copied all three files to ADLS. However, the second and third files are truncated to 30 MB. The data in each file is correct, it's just cut short. If I put the 70 MB file first, they are all copied correctly. So it seems to use the length of the first file as a maximum file size and truncates all subsequent, longer files. What worries me most is that Azure Data Factory claims it copied them successfully.
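To detect this kind of silent truncation, one option is to compare file sizes between the source and the sink after each run. A minimal sketch, assuming the two size listings have already been fetched (e.g. via boto3 for S3 and the Azure SDK for ADLS); the function name and the sample file names/sizes below are illustrative, not part of my pipeline:

```python
# Sketch: compare source and sink file sizes to catch silent truncation.
# The two dicts map file name -> size in bytes; in practice they would come
# from listing the S3 bucket and the ADLS folder. Values here are made up.

def find_truncated(source_sizes, sink_sizes):
    """Return {name: (source_bytes, sink_bytes)} for files whose copy is shorter."""
    return {
        name: (src, sink_sizes[name])
        for name, src in source_sizes.items()
        if name in sink_sizes and sink_sizes[name] < src
    }

source = {"test0.tsv": 30_000_000, "test1.tsv": 50_000_000, "test2.tsv": 70_000_000}
sink = {"test0.tsv": 30_000_000, "test1.tsv": 30_000_000, "test2.tsv": 30_000_000}
print(find_truncated(source, sink))
```

With the sizes above, this reports test1.tsv and test2.tsv as truncated to the length of the first file, which matches the behavior I'm seeing.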
Here is my pipeline:
{
    "name": "[redacted]Pipeline",
    "properties": {
        "description": "[redacted]",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "FileSystemSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "AzureDataLakeStoreSink",
                        "copyBehavior": "PreserveHierarchy",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset"
                    }
                ],
                "policy": {
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "[redacted]"
            }
        ],
        "start": "2018-07-06T04:00:00Z",
        "end": "2018-07-30T04:00:00Z",
        "isPaused": false,
        "hubName": "[redacted]",
        "pipelineMode": "Scheduled"
    }
}
Here is my input dataset:
{
    "name": "InputDataset",
    "properties": {
        "published": false,
        "type": "AmazonS3",
        "linkedServiceName": "[redacted]",
        "typeProperties": {
            "bucketName": "[redacted]",
            "key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}
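The key expression above resolves SliceStart into a daily folder prefix. For anyone unfamiliar with the $$Text.Format syntax, a minimal Python equivalent of the same formatting (the function name is mine, for illustration only):

```python
from datetime import datetime

def daily_key(slice_start):
    # Mirrors $$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart):
    # four-digit year, zero-padded month and day, trailing slash.
    return slice_start.strftime("%Y/%m/%d/")

print(daily_key(datetime(2018, 7, 6)))  # 2018/07/06/
```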
Here is my output dataset:
{
    "name": "OutputDataset",
    "properties": {
        "published": false,
        "type": "AzureDataLakeStore",
        "linkedServiceName": "[redacted]",
        "typeProperties": {
            "folderPath": "[redacted]/{Year}/{Month}/{Day}",
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                },
                {
                    "name": "Day",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "dd"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
I have already removed the format field from both the input and output datasets because I thought a binary copy might fix it, but that didn't work.
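For context, in Data Factory v1 omitting the format section from both datasets is what triggers a byte-for-byte binary copy. Before I removed it, the output dataset's typeProperties carried a format block along these lines (the delimiter value here is illustrative for .tsv files):

```json
"typeProperties": {
    "folderPath": "[redacted]/{Year}/{Month}/{Day}",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t"
    }
}
```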