首页 文章

在其间创建U-SQL JOIN以创建月度结果

提问于
浏览
1

编辑:我一直在研究所需的结果 . 让我更好地解释一下:我试图达到的目的是为每个SomeData创建一个行,每个SomeData在它的开始和结束日期之间 .

例如SomeData“88888888888888888881”,startDate“2005-12-06 00:00:00.000”和EndDate 2006-03-13 00:00:000“ . 我希望行像:

88888888888888888881, 200512
88888888888888888881, 200601
88888888888888888881, 200602
88888888888888888881, 200603

我知道这可能会将结果“爆炸”成一个巨大的文件 .

在我的帖子下面:

我试图在U-SQL中重写一些我们之前在T-SQL中所做的事情 .

问题是U-SQL不允许在连接期间发生间隔 .

T-SQL JOIN看起来像这样:

SELECT rf.SomeData AS SomeData,
       rd.YearMonth AS YearMonth,
      (rf.SomeData + '-' + rd.YearMonth.ToString()) AS MonthlyKey,
       rf.SomeKey AS SomeKey
FROM MyTable rf
    INNER JOIN dbo.DimDate rd
    ON rd.Date >= rf.StartDate
    AND rd.Date <= (CASE WHEN rf.EndDate IS NULL THEN GETDATE() ELSE rf.EndDate END)

在U-SQL中,我就是这样开始的,但是我现在应该怎么写JOIN?

@EditedTable =
    SELECT rf.SomeData AS SomeData,
           rd.YearMonth AS YearMonth,
           (rf.SomeData + "-" + rd.YearMonth.ToString()) AS MonthlyKey,
           rf.SomeKey AS SomeKey
    FROM @MyTable AS rf
         INNER JOIN
             @date AS rd
         ON

重要的是我们在开始日期和结束日期之间获取所有数据,并创建每月密钥,以便稍后可以将“SomeData”与另一个表连接起来 .

我尝试过使用交叉连接,但在运行时,它会停留在80%并且似乎永远不会结束 . 它一直在一个顶点写GB . 此外,我不确定这会产生相同的结果 .

@EditedTableCROSS =
    SELECT rfj.SomeData AS SomeData,
           rfj.StartDate AS StartDate,
           rfj.EndDate AS EndDate,
           (rfj.SomeData + "-" + dtj.YearMonth.ToString()) AS MonthlyKey, 
           rfj.SomeKey AS SomeKey

        FROM
    (
        SELECT SomeData AS SomeData,
               StartDate AS StartDate,
               EndDate AS EndDate,
               SomeKey AS SomeKey
        FROM @TableA
        WHERE SomeData != ""
    ) AS rfj

    CROSS JOIN

    (
    SELECT DISTINCT
           dt.Date AS Date,
           dt.YearMonth AS YearMonth,
           dt.Month AS Month,
           rf.StartDate AS StartDate
    FROM @date AS dt INNER JOIN @TableA AS rf ON rf.StartDate == dt.Date
    WHERE rf.StartDate >= dt.Date AND
          dt.Date <= DateTime.Now
          ) AS dtj

    WHERE rfj.StartDate <= dtj.Date AND
          rfj.EndDate >= dtj.Date;

上面代码的问题是“INNER JOIN @TableA AS rf ON rf.StartDate == dt.Date”不在唯一键上,某些日期不止一次 . 所以我怀疑这是方式....

请分享您的想法?

编辑:人们要求提供样本数据,结束日期可能包含:

2006-03-13 10:27:13.000
2016-03-02 18:48:11.000
2016-03-02 18:42:57.000
NULL
2013-09-12 09:19:05.000
NULL
2016-03-02 18:59:37.000
NULL
NULL

开始日期:

2005-12-06 00:00:00.000
2011-03-29 20:57:51.000
2007-11-01 00:00:00.000
2007-11-01 00:00:00.000
2007-11-01 00:00:00.000
2011-02-28 00:00:00.000
2011-02-28 00:00:00.000
2011-02-28 00:00:00.000
2008-01-17 00:00:00.000

DimDate包含日期从2000年到2018年的日期 .

SomeDate和SomeKey看起来像:

88888888888888888881
88888888888888888882
88888888888888888883
88888888888888888884
88888888888888888885
88888888888888888886
88888888888888888887
88888888888888888888
88888888888888888889

2 回答

  • 4

    我得到了这个脚本来处理我生成的一些示例数据 .

    @dateDim =
        EXTRACT xdate DateTime,
                yearMonth string
        FROM "/input/dbo.DimDate.tsv"
        USING Extractors.Tsv();
    
    @data =
        EXTRACT 
                someKey int,
                someData string,
                startDate DateTime,
                endDate DateTime?
        FROM "/input/dbo.MyTable.tsv"
        USING Extractors.Tsv();
    
    /*
    // Use U-SQL ISNULL conditional operator which is ?
    @working =
        SELECT COUNT( * ) AS records
        FROM
        (
            SELECT *
            FROM @dateDim AS dd
                 CROSS JOIN
                     @data AS d
            WHERE dd.xdate BETWEEN d.startDate AND (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
        ) AS x;
    */
    
    @working =
        SELECT COUNT( * ) AS records
        FROM
        (
            SELECT *
            FROM @dateDim AS dd
                 CROSS JOIN
                     @data AS d
            WHERE dd.xdate >= d.startDate
              AND dd.xdate <= (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
    ) AS x;
    
  • 3

    U-SQL在谓词中不支持BETWEEN的原因是没有适用于非等值连接的横向扩展连接算法 . 即使我们在语法上允许它,它仍然会在计划中进入CROSS JOIN .

    你想要做的是获得一个可以分区的连接 . 一种方法是,如果您可以在分区键上具有相等连接,然后在该分区中具有交叉连接 .

    在你的情况下,我认为你真的不需要加入 . 我想你想要做的是在开始和结束日期之间每天生成一行 .

    我会用没有比例限制的CROSS APPLY EXPLODE做到这一点 . 这是一个例子:

    @MyTable = 
      SELECT * 
      FROM (VALUES
            (81,81,(DateTime?) DateTime.Parse("2005-12-06 00:00:00.000"),(DateTime?) DateTime.Parse("2006-03-13 10:27:13.000")),
            (82,82,(DateTime?) DateTime.Parse("2011-03-29 20:57:51.000"),(DateTime?) DateTime.Parse("2016-03-02 18:48:11.000")),
            (83,83,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) DateTime.Parse("2016-03-02 18:42:57.000")),
            (84,84,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) null),
            (85,85,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) DateTime.Parse("2013-09-12 09:19:05.000")),
            (86,86,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) null),
            (87,87,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) DateTime.Parse("2016-03-02 18:59:37.000")),
            (88,88,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) null),
            (89,89,(DateTime?) DateTime.Parse("2008-01-17 00:00:00.000"),(DateTime?) null)
        ) AS T(SomeKey, SomeData, StartDate, EndDate);
    
    @res = 
      SELECT SomeKey, SomeData, StartDate, EndDate, DailyDate 
      FROM @MyTable 
           CROSS APPLY EXPLODE 
             (Enumerable.Range(0, 
                   1 + (EndDate == (DateTime?) null ? DateTime.Now 
                                                    : EndDate.Value).Subtract(StartDate.Value).Days)
               .Select(offset => StartDate.Value.AddDays(offset))
              ) AS T(DailyDate);
    
    OUTPUT @res
    TO "/output/test.csv"
    USING Outputters.Csv(outputHeader : true);
    

    这是一个典型的例子,在这里问问题基于场景而不是要求翻译会更容易回答:) .

相关问题