首页 文章

U-SQL根据文件中的不同值将CSV文件拆分为多个文件

提问于
浏览
1

我在Azure Data Lake Store中有数据,我使用U-SQL处理Azure Data Analytic Job中的数据 . 我有几个包含空间数据的CSV文件,类似于:

File_20170301.csv

longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-01   | 01   | 20
    45.121    | 21.123    | 2017-03-01   | 02   | 10
    45.121    | 21.123    | 2017-03-01   | 03   | 50
    48.121    | 35.123    | 2017-03-01   | 01   | 60
    48.121    | 35.123    | 2017-03-01   | 02   | 15
    48.121    | 35.123    | 2017-03-01   | 03   | 80

File_20170302.csv

longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-02   | 01   | 20
    45.121    | 21.123    | 2017-03-02   | 02   | 10
    45.121    | 21.123    | 2017-03-02   | 03   | 50
    48.121    | 35.123    | 2017-03-02   | 01   | 60
    48.121    | 35.123    | 2017-03-02   | 02   | 15
    48.121    | 35.123    | 2017-03-02   | 03   | 80

每个文件包含不同日期和所有经度 - 纬度组合的数据 .

我想合并我拥有的所有文件并拆分数据,这样我就可以得到每个longtitude-lattitude组合的一个文件 .

因此,在遍历我的文件夹中的所有文件并追加所有日期的所有数据之后,我最终会得到以下结果:

File_45_21.csv

longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-01   | 01   | 20
    45.121    | 21.123    | 2017-03-01   | 02   | 10
    45.121    | 21.123    | 2017-03-01   | 03   | 50
    45.121    | 21.123    | 2017-03-02   | 01   | 20
    45.121    | 21.123    | 2017-03-02   | 02   | 10
    45.121    | 21.123    | 2017-03-02   | 03   | 50

File_48_35.csv

longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    48.121    | 35.123    | 2017-03-01   | 01   | 60
    48.121    | 35.123    | 2017-03-01   | 02   | 15
    48.121    | 35.123    | 2017-03-01   | 03   | 80
    48.121    | 35.123    | 2017-03-02   | 01   | 60
    48.121    | 35.123    | 2017-03-02   | 02   | 15
    48.121    | 35.123    | 2017-03-02   | 03   | 80

理论上应该发生以下情况:

  • 为数据中的longtitude和lattitude组合查找不同的值

  • 获取上述不同值数组并为每个组合创建一个文件,并根据两个参数(longtitude和lattitude)从源文件中提取数据

我正在努力的是如何开始循环和基于源中的两个参数提取数据,以及如何通过参数组合的不同值“分区”数据源 .

2 回答

  • 1

    我建议查看文件集的概念来操作许多文件(参见https://msdn.microsoft.com/en-us/library/azure/mt771650.aspx)以及一些基于值进行动态输出的建议,直到该功能可用为止(参见How do I partition a large file into files/directories using only U-SQL and certain fields in the file?作为示例) .

  • 0

    U-SQL不直接支持动态U-SQL,但可以使用“脚本脚本”技术来创建输出 . 然后,您可以手动运行此输出,或使用Powershell或Azure Data Factory等操作来运行它 .

    我基于您的测试数据创建了一个简单的示例,部分基于here中的示例 .

    // Get the initial fileset
    @input =
        EXTRACT longtitude float,
                lattitude float,
                date string,
                hour int,
                value1 int,
                filename string
        FROM "/input/File_201703{filename}"
        USING Extractors.Csv();
    
    
    // Add int version of the long and lat columns for grouping on
    @working =
        SELECT *,
               (int) longtitude AS int_long,
               (int) lattitude AS int_lat
        FROM @input;
    
    
    // Work out the filenames
    @filenames =
        SELECT String.Format("File {0}_{1}.csv", int_long, int_lat) AS outputFilename,
               int_long,
               int_lat
        FROM
        (
            SELECT int_long,
                   int_lat
            FROM @working
            GROUP BY int_long,
                     int_lat
        ) AS x;
    
    
    // Construct the dynamic usql and output it
    @output =
        SELECT x.xsort, "@input = EXTRACT longtitude float, lattitude float, date string, hour int, value1 int, filename string FROM \"input/File_201703{filename}\" USING Extractors.Csv();" AS usql
       FROM ( VALUES ( 10 ) ) AS x(xsort)
    
        UNION ALL
    
        SELECT x.xsort, "@working = SELECT *, (int) longtitude AS int_long, (int) lattitude AS int_lat FROM @input;" AS usql
        FROM ( VALUES ( 20 ) ) AS x(xsort)
    
        UNION ALL
    
        SELECT 30 AS xsort, String.Format("OUTPUT (SELECT * FROM @working WHERE int_long == {0} AND int_lat == {1}) TO \"/output/{2}\" USING Outputters.Csv();", int_long, int_lat, outputFilename) AS usql
        FROM @filenames;
    
    
    // Select only the usql column and sort the output
    @output =
        SELECT usql
        FROM @output
    ORDER BY xsort
    FETCH 100;
    
    
    OUTPUT @output
    TO "/output/dynamic.usql"
    USING Outputters.Text(delimiter : ' ', quoting : false);
    

相关问题