首页 文章

日期之间的SQL / Oracle聚合桶

提问于
浏览
3

我有一个SQL相关的问题,我会喜欢一些帮助,因为一个合适的答案一直困扰我 .

Background

我正在使用一个供应商产品,该产品具有作为后端的Oracle数据库 . 我有能力编写任何特殊的SQL来查询底层表,但我不能对它们的底层结构(或数据模型本身)进行任何更改 . 我感兴趣的表目前有大约1M行,基本上跟踪用户会话 . 它有4列感兴趣: session_id (每个会话是主键和唯一), user_namestart_date (跟踪会话开始的日期)和 stop_date (跟踪会话结束的日期) . 我的目标是根据月,日和小时执行活动会话的数据聚合,给出设置的开始日期和结束日期 . 我需要创建一个视图(或3个单独的视图),它可以自己执行聚合,也可以作为中间对象,然后我可以从中查询和执行聚合 . 我理解最终的SQL /视图实际上可能需要3个不同的视图(一个用于一个月,一个用于一天,一个用于一小时),但在我看来,无论时间段如何,概念(一旦实现)应该是相同的 .

Current table example

表名= web_session

| Session_id | user_name | start_date            | stop_date    
----------------------------------------------------------------------------
|      1     |    joe    | 4/20/2017 10:42:10 PM | 4/21/2017 2:42:10 AM  |
|      2     |   matt    | 4/20/2017 5:43:10 PM  | 4/20/2017 5:59:10 PM  |
|      3     |   matt    | 4/20/2017 3:42:10 PM  | 4/20/2017 5:42:10 PM  |
|      4     |    joe    | 4/20/2017 11:20:10 AM | 4/20/2017 4:42:10 PM  |
|      5     |   john    | 4/20/2017 8:42:10 AM  | 4/20/2017 11:42:10 AM |
|      6     |   matt    | 4/20/2017 7:42:10 AM  | 4/20/2017 11:42:10 PM | 
|      7     |    joe    | 4/19/2017 11:20:10 PM | 4/20/2017 1:42:10 AM  |

Ideal Output For Hour View

例如,-12:00可以是0或24

| Date           | HR   | active_sessions | distinct_users |
------------------------------------------------------------
| 4/21/2017      | 2    | 1               | 1              |
| 4/21/2017      | 1    | 1               | 1              |
| 4/20/2017      | 0    | 1               | 1              |
| 4/20/2017      | 23   | 1               | 1              |
| 4/20/2017      | 22   | 1               | 1              |
| 4/20/2017      | 17   | 2               | 1              |
| 4/20/2017      | 16   | 2               | 2              |
| 4/20/2017      | 15   | 2               | 2              |
| 4/20/2017      | 14   | 1               | 1              |
| 4/20/2017      | 13   | 1               | 1              |
| 4/20/2017      | 12   | 1               | 1              |
| 4/20/2017      | 11   | 3               | 3              |
| 4/20/2017      | 10   | 2               | 2              |
| 4/20/2017      | 9    | 2               | 2              |
| 4/20/2017      | 8    | 2               | 2              |
| 4/20/2017      | 7    | 1               | 1              |
| 4/20/2017      | 1    | 1               | 1              |
| 4/20/2017      | 0    | 1               | 1              |
| 4/19/2017      | 23   | 1               | 1              |

End Goal and Other Options

我最终尝试使用此输出实现的是填充折线图,该折线图显示两个日期之间的月,日或小时(在示例输出中使用)的活动会话数 . 在小时示例中,将沿着X轴使用与HR组合的日期,并且沿Y轴使用活动会话 . 如果用户悬停在图表上的点上,则可以使用不同的用户计数 . FYI活动会话是在间隔期间的任何时间点打开的会话总数 . 不同用户是间隔期间不同用户的总数 . 如果我在同一小时内登录和关闭两次,那么它将是2个活动会话,但只有1个不同的用户 .

Alternative Solutions

这似乎是一个问题,必须提前几次,但从我的所有谷歌搜索和堆栈溢出研究,我似乎无法找到正确的方法 . 如果我正在考虑查询或理想输出不正确,我会打开替换建议,这使我可以获得所需的输出,以在前端适当地填充图表 .

Some SQL I Have Tried (Good Faith Effort)

我尝试了很多查询,但我会从这个开始,因为它是我得到的最接近但是非常慢(不可用),它仍然不会产生我需要的结果 .

Select * FROM (

  SELECT 
  u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt,
  COUNT(Distinct u.session_id) as unique_sessions,
  COUNT(Distinct u.user_name) as unique_users,
  LISTAGG(u.user_name, ', ') WITHIN GROUP (ORDER BY u.user_name ASC) as users
  FROM
  (SELECT  EXTRACT(year FROM l.start_date) as YearDt,
              EXTRACT(month FROM l.start_date) as MonthDt,
              EXTRACT(day FROM l.start_date) as DayDt,
              EXTRACT(HOUR FROM CAST(l.start_date AS TIMESTAMP)) as HourDt,
              EXTRACT(MINUTE FROM CAST(l.start_date AS TIMESTAMP)) as MinDt,
              l.session_id,
              l.user_name,
              l.start_date as act_date,
              1 as is_start
  FROM web_session l
  UNION ALL
  SELECT  EXTRACT(year FROM l.stop_date) as YearDt,
              EXTRACT(month FROM l.stop_date) as MonthDt,
              EXTRACT(day FROM l.stop_date) as DayDt,
              EXTRACT(HOUR FROM CAST(l.stop_date AS TIMESTAMP)) as HourDt,
              EXTRACT(MINUTE FROM CAST(l.stop_date AS TIMESTAMP)) as MinDt,
              l.session_id,
              l.user_name,
              l.stop_date as act_date,
              0 as is_start
  FROM web_session l
  ) u
  GROUP BY CUBE ( u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt)
) c

3 回答

  • 1

    您可以使用CTE(查询1)或相关的分层查询(查询2)来生成时间范围内的小时数,然后进行聚合 . 这只需要一次表扫描:

    SQL Fiddle

    Oracle 11g R2 Schema Setup

    CREATE TABLE Web_Session ( Session_id, user_name, start_date, stop_date ) AS
    SELECT 1, 'joe',  CAST( TIMESTAMP '2017-04-20 22:42:10' AS DATE ), CAST( TIMESTAMP '2017-04-21 02:42:10' AS DATE ) FROM DUAL UNION ALL
    SELECT 2, 'matt', TIMESTAMP '2017-04-20 17:43:10', TIMESTAMP '2017-04-20 17:59:10' FROM DUAL UNION ALL
    SELECT 3, 'matt', TIMESTAMP '2017-04-20 15:42:10', TIMESTAMP '2017-04-20 17:42:10' FROM DUAL UNION ALL
    SELECT 4, 'joe',  TIMESTAMP '2017-04-20 11:20:10', TIMESTAMP '2017-04-20 16:42:10' FROM DUAL UNION ALL
    SELECT 5, 'john', TIMESTAMP '2017-04-20 08:42:10', TIMESTAMP '2017-04-20 11:42:10' FROM DUAL UNION ALL
    SELECT 6, 'matt', TIMESTAMP '2017-04-20 07:42:10', TIMESTAMP '2017-04-20 23:42:10' FROM DUAL UNION ALL
    SELECT 7, 'joe',  TIMESTAMP '2017-04-19 23:20:10', TIMESTAMP '2017-04-20 01:42:10' FROM DUAL;
    

    Query 1

    WITH hours ( session_id, user_name, hour, duration ) AS (
      SELECT session_id,
             user_name,
             CAST( TRUNC( start_date, 'HH24' ) AS DATE ),
             ( TRUNC( stop_date, 'HH24' ) - TRUNC( start_date, 'HH24' ) ) * 24
      FROM   web_session
    UNION ALL
      SELECT session_id,
             user_name,
             hour + INTERVAL '1' HOUR, -- There is a bug in SQLFiddle that subtracts
                                       -- hours instead of adding so -1 is used there.
             duration - 1
      FROM   hours
      WHERE  duration > 0
    )
    SELECT hour,
           COUNT( session_id ) AS active_sessions,
           COUNT( DISTINCT user_name ) AS distinct_users
    FROM   hours
    GROUP BY hour
    ORDER BY hour
    

    Results

    |                 HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
    |----------------------|-----------------|----------------|
    | 2017-04-19T23:00:00Z |               1 |              1 |
    | 2017-04-20T00:00:00Z |               1 |              1 |
    | 2017-04-20T01:00:00Z |               1 |              1 |
    | 2017-04-20T07:00:00Z |               1 |              1 |
    | 2017-04-20T08:00:00Z |               2 |              2 |
    | 2017-04-20T09:00:00Z |               2 |              2 |
    | 2017-04-20T10:00:00Z |               2 |              2 |
    | 2017-04-20T11:00:00Z |               3 |              3 |
    | 2017-04-20T12:00:00Z |               2 |              2 |
    | 2017-04-20T13:00:00Z |               2 |              2 |
    | 2017-04-20T14:00:00Z |               2 |              2 |
    | 2017-04-20T15:00:00Z |               3 |              2 |
    | 2017-04-20T16:00:00Z |               3 |              2 |
    | 2017-04-20T17:00:00Z |               3 |              1 |
    | 2017-04-20T18:00:00Z |               1 |              1 |
    | 2017-04-20T19:00:00Z |               1 |              1 |
    | 2017-04-20T20:00:00Z |               1 |              1 |
    | 2017-04-20T21:00:00Z |               1 |              1 |
    | 2017-04-20T22:00:00Z |               2 |              2 |
    | 2017-04-20T23:00:00Z |               2 |              2 |
    | 2017-04-21T00:00:00Z |               1 |              1 |
    | 2017-04-21T01:00:00Z |               1 |              1 |
    | 2017-04-21T02:00:00Z |               1 |              1 |
    

    Execution Plan

    -------------------------------------------------------------------------------------------------------
    | Id  | Operation                                      | Name        | Rows | Bytes | Cost | Time     |
    -------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                               |             |   14 |   364 |    7 | 00:00:01 |
    |   1 |   SORT GROUP BY                                |             |   14 |   364 |    7 | 00:00:01 |
    |   2 |    VIEW                                        | VW_DAG_0    |   14 |   364 |    7 | 00:00:01 |
    |   3 |     HASH GROUP BY                              |             |   14 |   364 |    7 | 00:00:01 |
    |   4 |      VIEW                                      |             |   14 |   364 |    6 | 00:00:01 |
    |   5 |       UNION ALL (RECURSIVE WITH) BREADTH FIRST |             |      |       |      |          |
    |   6 |        TABLE ACCESS FULL                       | WEB_SESSION |    7 |   245 |    3 | 00:00:01 |
    | * 7 |        RECURSIVE WITH PUMP                     |             |      |       |      |          |
    -------------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ------------------------------------------
    * 7 - filter("DURATION">0)
    
    Note
    -----
    - dynamic sampling used for this statement
    

    Query 2

    SELECT t.COLUMN_VALUE AS hour,
           COUNT( session_id ) AS active_sessions,
           COUNT( DISTINCT user_name ) AS distinct_users
    FROM   web_session w
           CROSS JOIN
           TABLE(
             CAST(
               MULTISET(
                 SELECT TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24
                 FROM   DUAL
                 CONNECT BY TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24 < w.stop_date
               ) AS SYS.ODCIDATELIST
             )
           ) t
    GROUP BY t.COLUMN_VALUE
    ORDER BY hour
    

    Results

    |                 HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
    |----------------------|-----------------|----------------|
    | 2017-04-19T23:00:00Z |               1 |              1 |
    | 2017-04-20T00:00:00Z |               1 |              1 |
    | 2017-04-20T01:00:00Z |               1 |              1 |
    | 2017-04-20T07:00:00Z |               1 |              1 |
    | 2017-04-20T08:00:00Z |               2 |              2 |
    | 2017-04-20T09:00:00Z |               2 |              2 |
    | 2017-04-20T10:00:00Z |               2 |              2 |
    | 2017-04-20T11:00:00Z |               3 |              3 |
    | 2017-04-20T12:00:00Z |               2 |              2 |
    | 2017-04-20T13:00:00Z |               2 |              2 |
    | 2017-04-20T14:00:00Z |               2 |              2 |
    | 2017-04-20T15:00:00Z |               3 |              2 |
    | 2017-04-20T16:00:00Z |               3 |              2 |
    | 2017-04-20T17:00:00Z |               3 |              1 |
    | 2017-04-20T18:00:00Z |               1 |              1 |
    | 2017-04-20T19:00:00Z |               1 |              1 |
    | 2017-04-20T20:00:00Z |               1 |              1 |
    | 2017-04-20T21:00:00Z |               1 |              1 |
    | 2017-04-20T22:00:00Z |               2 |              2 |
    | 2017-04-20T23:00:00Z |               2 |              2 |
    | 2017-04-21T00:00:00Z |               1 |              1 |
    | 2017-04-21T01:00:00Z |               1 |              1 |
    | 2017-04-21T02:00:00Z |               1 |              1 |
    

    Execution Plan

    --------------------------------------------------------------------------------------------------
    | Id  | Operation                              | Name        | Rows  | Bytes   | Cost | Time     |
    --------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                       |             | 57176 | 2115512 |  200 | 00:00:03 |
    |   1 |   SORT GROUP BY                        |             | 57176 | 2115512 |  200 | 00:00:03 |
    |   2 |    NESTED LOOPS                        |             | 57176 | 2115512 |  195 | 00:00:03 |
    |   3 |     TABLE ACCESS FULL                  | WEB_SESSION |     7 |     245 |    3 | 00:00:01 |
    |   4 |     COLLECTION ITERATOR SUBQUERY FETCH |             |  8168 |   16336 |   27 | 00:00:01 |
    | * 5 |      CONNECT BY WITHOUT FILTERING      |             |       |         |      |          |
    |   6 |       FAST DUAL                        |             |     1 |         |    2 | 00:00:01 |
    --------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ------------------------------------------
    * 5 - filter(TRUNC(:B1,'fmhh24')+(LEVEL-1)/24<:B2)
    
    Note
    -----
    - dynamic sampling used for this statement
    
  • 2

    马特,

    您需要做的是生成时间维,作为静态表或在运行时动态生成:

    create table time_dim (
      ts date primary key,
      year number not null,
      month number not null,
      day number not null,
      wday number not null,
      dy varchar2(3) not null,
      hr number not null
    );
    
    insert into time_dim (ts, year, month, day, wday, dy, hr)
    select ts
         , extract(year from ts) year
         , extract(month from ts) month
         , extract(day from ts) day
         , to_char(ts,'d') wday
         , to_char(ts,'dy') dy
         , to_number(to_char(ts,'HH24')) hr
      from (
    select DATE '2017-01-01' + (level - 1)/24 ts
      FROM DUAL connect by level <= 365*24) a;
    

    然后外部连接到您的 web_sessions 表:

    select t.ts, t.year, t.month, t.wday, t.dy, t.hr
         , count(session_id) sessions
         , count(distinct user_name) users
      from time_dim t
      left join web_session w
        on t.ts between trunc(w.start_date, 'hh24') and w.stop_date
     where trunc(t.ts) between date '2017-04-19' and date '2017-04-21'
     group by rollup (t.year, t.month, (t.wday, t.dy), (t.hr, t.ts));
    

    您可以更改group by子句以获取您感兴趣的各种聚合 .

    在上面的代码中,我将start_date截断为 ON 子句中的小时,以便开始时间将包含在结果中,否则在该小时内不会完全计算在一小时内没有开始的会话 .

  • 1

    我觉得这样的事情会起作用:

    WITH ct ( active_dt ) AS (
        -- Build the query for the "table" of hours
        SELECT DATE'2018-04-19' + (LEVEL-1)/24 AS active_dt FROM dual
       CONNECT BY DATE'2018-04-19' + (LEVEL-1)/24 < DATE'2018-04-22'
    )
    SELECT active_dt AS "Date", active_hr AS "HR"
         , COUNT(session_id) AS active_sessions
         , COUNT(DISTINCT user_name) AS distinct_users
      FROM (
        SELECT TRUNC(ct.active_dt) AS active_dt
             , TO_CHAR(ct.active_dt, 'HH24') AS active_hr
             , ws.session_id, ws.user_name
          FROM ct LEFT JOIN web_session ws
            ON ct.active_dt + 1/24 >= ws.start_dt
           AND ct.active_dt < ws.stop_dt
    ) GROUP BY active_dt, active_hr
     ORDER BY active_dt DESC, active_hr DESC;
    

    我可能没有 LEFT JOIN 100%正确的条件 .

    希望这可以帮助 .

相关问题