I have a simple file, FD_GROUP.TXT, with the following content:
~0100~^~Dairy and Egg Products~
~0200~^~Spices and Herbs~
~0300~^~Baby Foods~
~0400~^~Fats and Oils~
~0500~^~Poultry Products~
I am trying to bulk-import files like this (some containing 700,000 rows) into an Azure SQL database using Azure Data Factory.
The strategy is to first split the columns on ^, then replace the tildes (~) with empty strings so that the tildes are stripped out, and then perform the insert.
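That split-then-strip step can be sketched in a few lines of Python (a hypothetical `parse_line` helper, not part of either solution below, just to make the intended transformation concrete):

```python
# Minimal sketch of the intended preprocessing, assuming the
# FD_GROUP.TXT layout shown above: fields separated by '^',
# each value wrapped in tildes ('~').
def parse_line(line: str) -> list[str]:
    # Split on the field terminator, then strip the tilde wrappers.
    return [field.replace("~", "") for field in line.rstrip("\n").split("^")]

row = parse_line("~0100~^~Dairy and Egg Products~\n")
# row == ["0100", "Dairy and Egg Products"]
```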
1. SQL solution:
DECLARE @CsvFilePath NVARCHAR(1000) = 'D:\CodePurehope\Dev\NutrientData\FD_GROUP.txt';

CREATE TABLE #TempTable
(
    [FoodGroupCode] VARCHAR(666) NOT NULL,
    [FoodGroupDescription] VARCHAR(60) NOT NULL
);

DECLARE @sql NVARCHAR(4000) = 'BULK INSERT #TempTable FROM ''' + @CsvFilePath + ''' WITH ( FIELDTERMINATOR = ''^'', ROWTERMINATOR = ''\n'' )';
EXEC(@sql);

UPDATE #TempTable
SET [FoodGroupCode] = REPLACE([FoodGroupCode], '~', ''),
    [FoodGroupDescription] = REPLACE([FoodGroupDescription], '~', '');
GO

INSERT INTO [dbo].[FoodGroupDescriptions]
(
    [FoodGroupCode],
    [FoodGroupDescription]
)
SELECT
    [FoodGroupCode],
    [FoodGroupDescription]
FROM #TempTable;
GO

DROP TABLE #TempTable;
2. SSIS ETL package solution:
A Flat File Source split on ^, plus a Derived Column transformation to replace the unwanted tildes (~) seen in the screenshot above.
How do you do it with Microsoft Azure Data Factory?
I have uploaded FD_GROUP.TXT to Azure Blob Storage as the input and prepared the target table on Azure SQL Server as the output.
I have:
- 2 linked services: AzureStorage and AzureSQL.
- 2 datasets: the blob as input and the SQL table as output.
- 1 pipeline.
FoodGroupDescriptionsAzureBlob settings:
{
    "name": "FoodGroupDescriptionsAzureBlob",
    "properties": {
        "structure": [
            { "name": "FoodGroupCode", "type": "Int32" },
            { "name": "FoodGroupDescription", "type": "String" }
        ],
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "fileName": "FD_GROUP.txt",
            "folderPath": "nutrition-data/NutrientData/",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "^"
            }
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        }
    }
}
FoodGroupDescriptionsSQLAzure settings:
{
    "name": "FoodGroupDescriptionsSQLAzure",
    "properties": {
        "structure": [
            { "name": "FoodGroupCode", "type": "Int32" },
            { "name": "FoodGroupDescription", "type": "String" }
        ],
        "published": false,
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": {
            "tableName": "FoodGroupDescriptions"
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        }
    }
}
FoodGroupDescriptionsPipeline settings:
{
    "name": "FoodGroupDescriptionsPipeline",
    "properties": {
        "description": "Copy data from a blob to Azure SQL table",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60.00:00:00"
                    }
                },
                "inputs": [
                    { "name": "FoodGroupDescriptionsAzureBlob" }
                ],
                "outputs": [
                    { "name": "FoodGroupDescriptionsSQLAzure" }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst"
                },
                "scheduler": {
                    "frequency": "Minute",
                    "interval": 15
                },
                "name": "CopyFromBlobToSQL",
                "description": "Bulk Import FoodGroupDescriptions"
            }
        ],
        "start": "2015-07-13T00:00:00Z",
        "end": "2015-07-14T00:00:00Z",
        "isPaused": false,
        "hubName": "gymappdatafactory_hub",
        "pipelineMode": "Scheduled"
    }
}
This does not work in Azure Data Factory, and I don't know how to apply a replace in this context. Any help is appreciated.
1 Answer
I took your code and was able to get it working by doing the following:
In your FoodGroupDescriptionsAzureBlob JSON definition, you need to add "external": true to the properties node. The blob input file was created by an external source, not by an Azure Data Factory pipeline; setting this to true lets Azure Data Factory know that this input should be ready for use.
Also, in the blob input definition, add "quoteChar": "~" to the "format" node, since the data appears to be wrapped in "~". This strips those characters from the data, so the INTs you defined will insert correctly into your SQL table.
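To see what quoteChar does to each field, Python's csv module can emulate the same parse (the delimiter/quotechar pairing here mirrors the dataset's columnDelimiter and the suggested quoteChar; this is an illustration of the parsing behaviour, not ADF's actual parser):

```python
import csv

# Parse one FD_GROUP.TXT line the way ADF's TextFormat would with
# columnDelimiter "^" and quoteChar "~": the tildes are treated as
# quoting characters and stripped from the field values.
line = "~0100~^~Dairy and Egg Products~"
reader = csv.reader([line], delimiter="^", quotechar="~")
row = next(reader)
print(row)  # ['0100', 'Dairy and Egg Products']
```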
Full blob definition:
{
    "name": "FoodGroupDescriptionsAzureBlob",
    "properties": {
        "structure": [
            { "name": "FoodGroupCode", "type": "Int32" },
            { "name": "FoodGroupDescription", "type": "String" }
        ],
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "fileName": "FD_GROUP.txt",
            "folderPath": "nutrition-data/NutrientData/",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "^",
                "quoteChar": "~"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Minute",
            "interval": 15
        }
    }
}
Since you set the availability to a 15-minute interval, and the pipeline start and end dates span a full day, a slice will be created every 15 minutes for the entire duration of the pipeline run. Since you only need it to run once between the start and end, change the availability (and the pipeline's scheduler) to a daily frequency:
"availability": {
    "frequency": "Day",
    "interval": 1
}
This will create 1 slice.
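The slice arithmetic behind that recommendation can be checked directly (assuming, as in ADF v1, one slice per availability interval across the pipeline's active window):

```python
from datetime import timedelta

# Pipeline window: 2015-07-13T00:00Z to 2015-07-14T00:00Z = one day.
window = timedelta(days=1)

# 15-minute availability: one slice every 15 minutes over the window.
print(window // timedelta(minutes=15))  # 96 slices

# Daily availability: a single slice covers the whole window.
print(window // timedelta(days=1))      # 1 slice
```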
Hope this helps.