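The time series data consists of:
Product (categorical); ProductGroup (categorical); Country (categorical); YearSinceProductLaunch (numeric); SalesAtLaunchYear (numeric)
Only "SalesAtLaunchYear" has missing values that need to be imputed.
For some products the data is complete, i.e. sales at year 1, year 2, and so on up to the present.
Other products, however, are missing sales data for the early years after launch. The products have different ages, so sometimes 2 years since launch are missing and sometimes 10, depending on the product.
I am looking for a model in R that can impute these gaps in the time series. I tried MICE with the method for "SalesAtLaunchYear" set to random forest, but I still get some very high sales values, particularly right after product launch. I made sure that sales are 0 at year 0 for every product to avoid negative values. The data frame has 20000 rows and 300 unique products.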
testdf = tibble::tribble(
~Country, ~ProductGroup, ~Product, ~YearSinceProductLaunch, ~SalesAtLaunchYear,
"CA", "ProductGroup1", "Product1", 0L, 0,
"CA", "ProductGroup1", "Product1", 1L, NA,
"CA", "ProductGroup1", "Product1", 2L, NA,
"CA", "ProductGroup1", "Product1", 3L, NA,
"CA", "ProductGroup1", "Product1", 4L, NA,
"CA", "ProductGroup1", "Product1", 5L, 206034.9814,
"CA", "ProductGroup1", "Product1", 6L, 170143.2623,
"CA", "ProductGroup1", "Product1", 7L, 212541.9306,
"CA", "ProductGroup1", "Product1", 8L, 270663.199,
"CA", "ProductGroup1", "Product1", 9L, 736738.3755,
"CA", "ProductGroup1", "Product1", 10L, 2579723.981,
"CA", "ProductGroup1", "Product1", 11L, 4964319.496,
"CA", "ProductGroup1", "Product1", 12L, 6864985.16,
"CA", "ProductGroup1", "Product1", 13L, 8793292.386,
"CA", "ProductGroup1", "Product1", 14L, 11416033.38,
"IT", "ProductGroup2", "Product2", 0L, 0,
"IT", "ProductGroup2", "Product2", 1L, NA,
"IT", "ProductGroup2", "Product2", 2L, NA,
"IT", "ProductGroup2", "Product2", 3L, NA,
"IT", "ProductGroup2", "Product2", 4L, NA,
"IT", "ProductGroup2", "Product2", 5L, NA,
"IT", "ProductGroup2", "Product2", 6L, NA,
"IT", "ProductGroup2", "Product2", 7L, NA,
"IT", "ProductGroup2", "Product2", 8L, NA,
"IT", "ProductGroup2", "Product2", 9L, NA,
"IT", "ProductGroup2", "Product2", 10L, NA,
"IT", "ProductGroup2", "Product2", 11L, NA,
"IT", "ProductGroup2", "Product2", 12L, NA,
"IT", "ProductGroup2", "Product2", 13L, 30806222.96,
"IT", "ProductGroup2", "Product2", 14L, 31456272,
"IT", "ProductGroup2", "Product2", 15L, 31853476.78,
"IT", "ProductGroup2", "Product2", 16L, 30379818,
"IT", "ProductGroup2", "Product2", 17L, 29765448.87,
"IT", "ProductGroup2", "Product2", 18L, 31376234,
"IT", "ProductGroup2", "Product2", 19L, 32628514.81,
"IT", "ProductGroup2", "Product2", 20L, 32732196,
"IT", "ProductGroup2", "Product2", 21L, 33503784.25,
"IT", "ProductGroup2", "Product2", 22L, 35163372,
"DE", "ProductGroup3", "Product3", 0L, 0,
"DE", "ProductGroup3", "Product3", 1L, 161884.081,
"DE", "ProductGroup3", "Product3", 2L, 7876925.474,
"DE", "ProductGroup3", "Product3", 3L, 12948209.55,
"DE", "ProductGroup3", "Product3", 4L, 13304401.76
)
testdf$Country = as.factor(testdf$Country)
testdf$ProductGroup = as.factor(testdf$ProductGroup)
testdf$Product = as.factor(testdf$Product)
1 Answer
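Using mice probably won't give you the results you want, because it relies mainly on correlations between variables. What you are looking for is correlation over time.
My suggestion for this specific example is to split the dataset by Country, ProductGroup and Product, and to impute the resulting series with a time series imputation package.
Looking at your data, I think something like the function na.interpolation from the imputeTS package would already do quite well.
It takes a single time series (a numeric vector containing the NAs) and returns it with the gaps filled in.
You have to call it once for each time series you create from Country, ProductGroup and Product.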
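The per-series approach described above could look roughly like the following. This is only a sketch under a few assumptions: dplyr is used for the grouping (the answer does not name a grouping tool), the function is called as `na_interpolation`, which is the name newer imputeTS releases use for `na.interpolation`, and the small `demo` data frame and the `SalesImputed` column are hypothetical stand-ins for the question's data.

```r
library(dplyr)     # assumption: used here only for the grouping
library(imputeTS)

# Hypothetical cut-down version of the question's data: two products,
# each starting at 0 and missing some of the following years
demo <- data.frame(
  Country = c(rep("CA", 5), rep("IT", 5)),
  ProductGroup = c(rep("ProductGroup1", 5), rep("ProductGroup2", 5)),
  Product = c(rep("Product1", 5), rep("Product2", 5)),
  YearSinceProductLaunch = rep(0:4, times = 2),
  SalesAtLaunchYear = c(0, NA, NA, NA, 400, 0, NA, 1000, NA, 3000)
)

imputed <- demo %>%
  # make sure every series is in time order before interpolating
  arrange(Country, ProductGroup, Product, YearSinceProductLaunch) %>%
  group_by(Country, ProductGroup, Product) %>%
  # linear interpolation (the default option), applied to one series at a time
  mutate(SalesImputed = na_interpolation(SalesAtLaunchYear)) %>%
  ungroup()
```

Note that the interpolation needs at least two observed values in a series, so a product with only one known sales figure would have to be handled separately.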
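You could also just run na.interpolation on your whole dataset, which is easier. For the example you showed, this would also be fine. (It could lead to problems if the rest of the data is structured differently, or if you use a different algorithm from the imputeTS package.)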
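To illustrate the whole-dataset shortcut, here is a small sketch. It assumes the newer function name `na_interpolation` and a hypothetical two-column `demo` data frame in which two series are simply stacked on top of each other.

```r
library(imputeTS)

# Hypothetical data: two series stacked in one data frame, each starting at 0
demo <- data.frame(
  YearSinceProductLaunch = rep(0:4, times = 2),
  SalesAtLaunchYear = c(0, NA, NA, NA, 400, 0, NA, 1000, NA, 3000)
)

# Every column with NAs is interpolated as one long series,
# without any knowledge of where one product ends and the next begins
imputed_all <- na_interpolation(demo)
```

This gives the same values as a per-product call only because each series here starts and ends with an observed value, so no gap spans two products. If a product ended with missing years, the interpolation would run into the next product's data, which is one way the structure of the remaining rows can cause problems.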