Calculations within subsets of a data frame [R]

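I'm having trouble with calculations on subsets. I can use ave, tapply, or ddply to get overall per-customer (factor) statistics such as the average purchase, but I can't work out how to compute visit-by-visit statistics for each customer. The simplified data below illustrates my data and the desired result.

Current data frame (note that visit #1 is the most recent visit):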

customer  visit      date    purchase_amt
    sarah          2    2013-08-09      5
    sarah          3    2013-07-21      8
    sarah          4    2013-06-23      9
    sarah          5    2013-06-02      1
    sarah          1    2013-08-20      8
    henry          1    2013-07-04      4
    che            1    2013-08-27      2
    che            2    2013-07-27      1
    che            3    2013-07-05      8
    che            4    2013-06-14      3
    dt             3    2013-04-05      9
    dt             2    2013-06-07      1
    dt             1    2013-07-11      6

These are the results I'm looking for:

customer  visit  date        purchase_amt  days_since  amt_diff
sarah         2  2013-08-09             5          19        -3
sarah         3  2013-07-21             8          28        -1
sarah         4  2013-06-23             9          21         8
sarah         5  2013-06-02             1          NA        NA
sarah         1  2013-08-20             8          11         3
henry         1  2013-07-04             4          NA        NA
che           1  2013-08-27             2          31         1
che           2  2013-07-27             1          22        -7
che           3  2013-07-05             8          21         5
che           4  2013-06-14             3          NA        NA
dt            3  2013-04-05             9          NA        NA
dt            2  2013-06-07             1          63        -8
dt            1  2013-07-11             6          34         5

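In short, for each visit I want to take that visit and its attributes, find the same customer's next most recent (i.e. previous) visit, and compute various statistics between the two. When there is no earlier visit, the result should be NA.

3 Answers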

  • 5

    This solution uses only base R and preserves the original order of the input:

    # Sort, calculate differences and unsort.
    # r is row indexes to use, order.by is ordering vector, col is vector to difference
    
    diffs <- function(r, order.by, col) {
        order.by <- order.by[r]
        col <- col[r]
        o <- order(order.by)
        replace(r, o, c(NA, diff(col[o])))
    }
    
    # fun specialized to arguments after first, i.e. subsequent arguments curried
    
    curry <- function (fun, ...) function(r) fun(r, ...)
    
    ix <- 1:nrow(DF)
    transform(DF, 
        days_since = ave(ix, customer, FUN = curry(diffs, date, date)),
        amt_diff = ave(ix, customer, FUN = curry(diffs, date, purchase_amt))
    )
    

    The result is:

    customer visit       date purchase_amt days_since amt_diff
    1     sarah     2 2013-08-09            5         19       -3
    2     sarah     3 2013-07-21            8         28       -1
    3     sarah     4 2013-06-23            9         21        8
    4     sarah     5 2013-06-02            1         NA       NA
    5     sarah     1 2013-08-20            8         11        3
    6     henry     1 2013-07-04            4         NA       NA
    7       che     1 2013-08-27            2         31        1
    8       che     2 2013-07-27            1         22       -7
    9       che     3 2013-07-05            8         21        5
    10      che     4 2013-06-14            3         NA       NA
    11       dt     3 2013-04-05            9         NA       NA
    12       dt     2 2013-06-07            1         63       -8
    13       dt     1 2013-07-11            6         34        5
    

    Update: minor improvements to the code.
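
    To make the sort, difference, and unsort idea easier to follow, here is a minimal sketch for a single customer (sarah's rows from the question). The names `o`, `d`, and `amt_diff` are illustrative only, and the vectors are typed in by hand rather than taken from `DF`.

    ```r
    # sarah's rows, in the (unsorted) order they appear in the data frame
    date <- as.Date(c("2013-08-09", "2013-07-21", "2013-06-23",
                      "2013-06-02", "2013-08-20"))
    purchase_amt <- c(5, 8, 9, 1, 8)

    # positions of the rows from oldest to newest visit
    o <- order(date)

    # differences between consecutive visits in chronological order,
    # padded with a leading NA for the oldest visit
    d <- c(NA, diff(purchase_amt[o]))

    # "unsort": write each difference back to the row it belongs to
    amt_diff <- replace(numeric(length(o)), o, d)
    amt_diff
    ```

    The result lines up with the original rows, which is why the `ave()` call above can return the new columns without reordering the data frame.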

  • 7

    Something like this? Assuming your data is called df:

    library(plyr)
    
    # convert dates to class 'Date'
    df$date <- as.Date(df$date)
    
    # order by customer and date
    df <- df[order(df$customer, df$date), ]
    # or since plyr is loaded anyway:
    df <- arrange(df, customer, date) 
    
    # per customer, calculate differences in date and purchase, between consecutive visits
    # pad differences with a leading NA
    df2 <- ddply(.data = df, .variables = .(customer), mutate,
          days_since = c(NA, diff(date)),
          amt_diff = c(NA, diff(purchase_amt)))
    
    df2
    # customer visit       date purchase_amt days_since amt_diff
    # 1       che     4 2013-06-14            3         NA       NA
    # 2       che     3 2013-07-05            8         21        5
    # 3       che     2 2013-07-27            1         22       -7
    # 4       che     1 2013-08-27            2         31        1
    # 5        dt     3 2013-04-05            9         NA       NA
    # 6        dt     2 2013-06-07            1         63       -8
    # 7        dt     1 2013-07-11            6         34        5
    # 8     henry     1 2013-07-04            4         NA       NA
    # 9     sarah     5 2013-06-02            1         NA       NA
    # 10    sarah     4 2013-06-23            9         21        8
    # 11    sarah     3 2013-07-21            8         28       -1
    # 12    sarah     2 2013-08-09            5         19       -3
    # 13    sarah     1 2013-08-20            8         11        3
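
    The result above comes back sorted by customer and date rather than in the original row order. If the original order matters, one possible approach is to tag each row with an index before sorting and restore the order afterwards. The sketch below uses base R only, a small hand-typed subset of the data, and a hypothetical helper column `row_id`.

    ```r
    df <- data.frame(
      customer = c("sarah", "sarah", "sarah", "henry", "dt", "dt"),
      date = as.Date(c("2013-08-09", "2013-07-21", "2013-08-20",
                       "2013-07-04", "2013-04-05", "2013-06-07")),
      purchase_amt = c(5, 8, 8, 4, 9, 1)
    )

    # remember the original row order (row_id is a hypothetical helper column)
    df$row_id <- seq_len(nrow(df))

    # sort by customer and date, then difference within each customer
    sorted <- df[order(df$customer, df$date), ]
    sorted$days_since <- ave(as.numeric(sorted$date), sorted$customer,
                             FUN = function(x) c(NA, diff(x)))
    sorted$amt_diff <- ave(sorted$purchase_amt, sorted$customer,
                           FUN = function(x) c(NA, diff(x)))

    # restore the original order and drop the helper column
    restored <- sorted[order(sorted$row_id), ]
    restored$row_id <- NULL
    restored
    ```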
    
  • 6

    Here is a data.table solution along the same lines as @Henrik's:

    df<-structure(list(customer = structure(c(4L, 4L, 4L, 4L, 4L, 3L, 
    1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("che", "dt", "henry", 
    "sarah"), class = "factor"), visit = c(2L, 3L, 4L, 5L, 1L, 1L, 
    1L, 2L, 3L, 4L, 3L, 2L, 1L), date = structure(c(15926, 15907, 
    15879, 15858, 15937, 15890, 15944, 15913, 15891, 15870, 15800, 
    15863, 15897), class = "Date"), purchase_amt = c(5L, 8L, 9L, 
    1L, 8L, 4L, 2L, 1L, 8L, 3L, 9L, 1L, 6L)), .Names = c("customer", 
    "visit", "date", "purchase_amt"), row.names = c(NA, -13L), class =  
    "data.frame")
    
    library(data.table)
    df <- data.table(df)
    
    # sort by customer and date first; otherwise diff() compares rows in their
    # original order and gives wrong (e.g. negative) day differences
    setkey(df, customer, date)
    
    df[, list(visit = visit, date = date, purchase_amt = purchase_amt,
              days_since = c(NA, diff(date)),
              amt_diff = c(NA, diff(purchase_amt))),
       by = customer]
        customer visit       date purchase_amt days_since amt_diff
     1:      che     4 2013-06-14            3         NA       NA
     2:      che     3 2013-07-05            8         21        5
     3:      che     2 2013-07-27            1         22       -7
     4:      che     1 2013-08-27            2         31        1
     5:       dt     3 2013-04-05            9         NA       NA
     6:       dt     2 2013-06-07            1         63       -8
     7:       dt     1 2013-07-11            6         34        5
     8:    henry     1 2013-07-04            4         NA       NA
     9:    sarah     5 2013-06-02            1         NA       NA
    10:    sarah     4 2013-06-23            9         21        8
    11:    sarah     3 2013-07-21            8         28       -1
    12:    sarah     2 2013-08-09            5         19       -3
    13:    sarah     1 2013-08-20            8         11        3
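
    As an alternative sketch (not part of the answer above), data.table can also add the two columns by reference with `:=` while leaving the rows in their original order: passing `order(date)` as the row selector makes each group get processed chronologically. The table name `dt_visits` and the hand-typed subset of the data are assumptions for illustration.

    ```r
    library(data.table)

    dt_visits <- data.table(
      customer = c("sarah", "sarah", "sarah", "henry", "dt", "dt"),
      date = as.Date(c("2013-08-09", "2013-07-21", "2013-08-20",
                       "2013-07-04", "2013-04-05", "2013-06-07")),
      purchase_amt = c(5, 8, 8, 4, 9, 1)
    )

    # i = order(date): visit rows oldest to newest within each customer;
    # `:=` writes the results back to the matching rows, so the table itself
    # keeps its original row order
    dt_visits[order(date),
              `:=`(days_since = c(NA, as.numeric(diff(date))),
                   amt_diff   = c(NA, diff(purchase_amt))),
              by = customer]
    dt_visits
    ```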
    
