
Simulating multiple regression data with a fixed R²: how to incorporate correlated predictors?


I want to simulate data for a multiple linear regression with four predictor variables, in which I can freely specify

  • the overall explained variance of the model,

  • the size of all standardized regression coefficients, and

  • the degree to which the predictors are correlated with each other.

I found a solution that satisfies the first two points, but it relies on the assumption that all predictors are uncorrelated (see the code below). To obtain standardized regression coefficients, I sample the predictors from a population with mean = 0 and variance = 1.

# Specify the population variance/covariance matrix of the four predictor variables to sample from
sigma.1 <- matrix(c(1,0,0,0,   
                    0,1,0,0,   
                    0,0,1,0,    
                    0,0,0,1),nrow=4,ncol=4)
# Specify the population means of the four predictor variables to sample from
mu.1 <- rep(0,4) 

# Specify sample size, true regression coefficients, and explained variance
n.obs <- 50000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30

# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))

# Add error variable based on desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))

# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 + 
beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon

# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))

Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0564 -0.6310 -0.0048  0.6339  3.7119 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.496063   0.004175  118.82   <2e-16 ***
V1          0.402588   0.004189   96.11   <2e-16 ***
V2          0.291636   0.004178   69.81   <2e-16 ***
V3          0.247347   0.004171   59.30   <2e-16 ***
V4          0.253810   0.004175   60.79   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9335 on 49995 degrees of freedom
Multiple R-squared:  0.299, Adjusted R-squared:  0.299 
F-statistic:  5332 on 4 and 49995 DF,  p-value: < 2.2e-16
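
For reference, the var.epsilon line above follows directly from the definition of R², assuming standardized predictors that are mutually uncorrelated (so that the variance of the linear predictor is just the sum of squared coefficients):

```latex
R^2 = \frac{\operatorname{Var}(X\beta)}{\operatorname{Var}(X\beta) + \sigma_\varepsilon^2}
\quad\Longrightarrow\quad
\sigma_\varepsilon^2 = \operatorname{Var}(X\beta)\,\frac{1 - R^2}{R^2}
= \Bigl(\sum_{i=1}^{4}\beta_i^2\Bigr)\frac{1 - R^2}{R^2}
```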

Problem: if my predictors are correlated, i.e. if I specify a variance/covariance matrix with nonzero off-diagonal elements, then the resulting r2 and regression coefficients deviate substantially from the values I intended, for example when using

sigma.1 <- matrix(c(1,0.25,0.25,0.25,   
                    0.25,1,0.25,0.25,   
                    0.25,0.25,1,0.25,    
                    0.25,0.25,0.25,1),nrow=4,ncol=4)
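
To see the size of the drift: with standardized predictors, the variance explained by the linear predictor is the quadratic form βᵀΣβ, not just the sum of squared betas, so keeping var.epsilon at its uncorrelated value inflates R². A rough pure-Python check (values taken from the code above; this is a sketch of the arithmetic, not part of the R script):

```python
# Signal variance with correlated standardized predictors is beta' Sigma beta.
beta = [0.4, 0.3, 0.25, 0.25]
rho = 0.25  # common pairwise correlation from sigma.1 above
sigma = [[1.0 if i == j else rho for j in range(4)] for i in range(4)]

var_signal = sum(beta[i] * sigma[i][j] * beta[j]
                 for i in range(4) for j in range(4))   # 0.64125, not 0.375

# var.epsilon from the uncorrelated formula stays at sum(beta^2)*(1-r2)/r2
r2_target = 0.30
var_epsilon = sum(b * b for b in beta) * (1 - r2_target) / r2_target  # 0.875

r2_actual = var_signal / (var_signal + var_epsilon)     # ~0.423 instead of 0.30
```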

Any suggestions? Thanks!

1 Answer


    After giving my question some more thought, I found the answer.

    The code above first samples the predictor variables with the specified degree of intercorrelation. Then a column for the error term is added based on the desired r2 value. Finally, the y column is assembled from all of these pieces.

    So far, the line creating the error term was simply

    var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
    sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
    

    It therefore assumes that each beta coefficient contributes 100% of its explained variance to y (= no intercorrelation among the predictors). But if the x variables are correlated, each beta no longer contributes 100%(!). This means the variance of the error term has to be larger, because the predictors share part of their variance with one another.

    How much larger? Simply adjust the line creating the error term as follows:

    var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
    

    So by adding cor(sample1$V1, sample1$V2), the degree to which the predictors are intercorrelated is added to the error variance. With an intercorrelation of 0.25, for example obtained via

    sigma.1 <- matrix(c(1,0.25,0.25,0.25,   
                    0.25,1,0.25,0.25,   
                    0.25,0.25,1,0.25,    
                    0.25,0.25,0.25,1),nrow=4,ncol=4)
    

    cor(sample1$V1, sample1$V2) is approximately 0.25, and this value is added to the variance of the error term.

    This way, any degree of intercorrelation among the predictors can be specified, along with the true standardized regression coefficients and the desired R².

    Demonstration:

    sigma.1 <- matrix(c(1,0.35,0.35,0.35,   
                        0.35,1,0.35,0.35,   
                        0.35,0.35,1,0.35,    
                        0.35,0.35,0.35,1),nrow=4,ncol=4)
    # Specify the population means of the four predictor variables to sample from
    mu.1 <- rep(0,4) 
    
    # Specify sample size, true regression coefficients, and explained variance
    n.obs <- 500000 # to avoid sampling error problems
    intercept <- 0.5
    beta <- c(0.4, 0.3, 0.25, 0.25)
    r2 <- 0.15
    
    # Create sample with four predictor variables
    library(MASS)
    sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
    
    # Add error variable based on desired r2
    var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
    sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
    
    # Add y variable based on true coefficients and desired r2
    sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 + 
      beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon
    
    # Inspect model
    summary(lm(y~V1+V2+V3+V4, data=sample1))
    
    
    Call:
    lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -10.7250  -1.3696   0.0017   1.3650   9.0460 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 0.499554   0.002869  174.14   <2e-16 ***
    V1          0.406360   0.003236  125.56   <2e-16 ***
    V2          0.298892   0.003233   92.45   <2e-16 ***
    V3          0.247581   0.003240   76.42   <2e-16 ***
    V4          0.253510   0.003241   78.23   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 2.028 on 499995 degrees of freedom
    Multiple R-squared:  0.1558,    Adjusted R-squared:  0.1557 
    F-statistic: 2.306e+04 on 4 and 499995 DF,  p-value: < 2.2e-16
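
A caveat on the answer above: adding a single pairwise correlation is only an approximation. The exact variance of the linear predictor is the full quadratic form βᵀΣβ = Σᵢβᵢ² + Σᵢ≠ⱼ βᵢβⱼρᵢⱼ, which in this demonstration exceeds the approximated value, and this is presumably why the fitted Multiple R-squared lands slightly above the 0.15 target. A pure-Python sketch of the exact calculation, using the values from the demonstration above:

```python
# Exact var.epsilon for a target R^2 under any predictor correlation matrix:
# Var(X beta) = beta' Sigma beta, then var_epsilon = Var(X beta) * (1 - r2)/r2.
beta = [0.4, 0.3, 0.25, 0.25]
rho = 0.35  # common pairwise correlation from sigma.1 in the demonstration
sigma = [[1.0 if i == j else rho for j in range(4)] for i in range(4)]

var_signal = sum(beta[i] * sigma[i][j] * beta[j]
                 for i in range(4) for j in range(4))   # 0.74775

r2 = 0.15
var_epsilon = var_signal * (1 - r2) / r2

# The implied population R^2 now hits the target exactly:
implied_r2 = var_signal / (var_signal + var_epsilon)    # 0.15
```

In R, the equivalent one-liner would be something like var.epsilon <- as.numeric(t(beta) %*% sigma.1 %*% beta) * (1 - r2)/r2 (untested sketch), which generalizes to unequal betas and unequal pairwise correlations.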
    
