teengamb数据集进行回归分析

回归分析

在 faraway 包中，包含一个 47 行 5 列的 teengamb 数据集（加载 faraway包后，可通过代码“head(teengamb)”查看数据的前 5 行，通过“?teengamb”查看每个变量的具体意义），该数据是研究关于青少年赌博情况的数据集。针对该数据集，请回答以下问题：
Sex:性别，0=男性，1=女性。
Status：基于父母职业的社会经济状况评分
Income：每周的收入，英镑
Verbal:正确定义的 12 各单词的口头评分
Gamle:每年赌博的开支，英镑。
（1）如果只考虑 sex、income、verbal 三个变量作为自变量，预测因变量 gamble
时，可以使用哪些回归模型进行预测？要求建立的回归模型数量不少于 3 个，
并对为什么要建立这样的回归模型进行解释；
先进行相关系数分析：

library(corrplot) library(faraway) library(ggcorrplot) library(tidyr) library(GGally) data(teengamb) head(teengamb) teengamb<-teengamb ?teengamb teen<-data.frame(teengamb$sex,teengamb$income,teengamb$gamble,teengamb$verbal) voice_cor <- cor(teen) corrplot.mixed(voice_cor,tl.col="black",tl.pos = "lt", tl.cex = 2,number.cex = 1)

结果如下：

> head(teengamb) sex status income verbal gamble 1 1 51 2.00 8 0.0 2 1 28 2.50 8 0.0 3 1 37 2.00 6 0.0 4 1 28 7.00 4 7.3 5 1 65 2.00 8 19.6 6 1 61 3.47 6 0.1

在这里插入图片描述
可发现，income 和 gamble 收入相关性达到 0.62，较强相关，gamble 与 sex 相关系数为-0.41，成一定相关性，说明与性别有关系。

再进行多元线性回归：

 多元线型回归 lm1 <- lm(gamble~sex+income+verbal,data = teengamb) summary(lm1)

结果如下：

> summary(lm1) Call: lm(formula = gamble ~ sex + income + verbal, data = teengamb) Residuals: Min 1Q Median 3Q Max -50.639 -11.765 -1.594 9.305 93.867 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 24.1390 14.7686 1.634 0.1095 sex -22.9602 6.7706 -3.391 0.0015 ** income 4.8981 0.9551 5.128 6.64e-06 *** verbal -2.7468 1.8253 -1.505 0.1397 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 22.43 on 43 degrees of freedom Multiple R-squared: 0.5263, Adjusted R-squared: 0.4933 F-statistic: 15.93 on 3 and 43 DF, p-value: 4.148e-07

经多元线性回归，系数检验发现，verbal 检验的 p 值为 0.1397>0.05,不显著，故可考虑剔除 verbal 做多元线性回归。

考虑剔除verbal ：

#剔除verbal lm2 <- lm(gamble~sex+income,data = teengamb) summary(lm2) library(broom)  可视化回归模型的图像 par(mfrow = c(2,2)) plot(lm2)

结果如下：

> summary(lm2) Call: lm(formula = gamble ~ sex + income, data = teengamb) Residuals: Min 1Q Median 3Q Max -49.757 -11.649 0.844 8.659 100.243 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.041 6.394 0.632 0.53070 sex -21.634 6.809 -3.177 0.00272 ** income 5.172 0.951 5.438 2.24e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 22.75 on 44 degrees of freedom Multiple R-squared: 0.5014, Adjusted R-squared: 0.4787 F-statistic: 22.12 on 2 and 44 DF, p-value: 2.243e-07

可发现，参数的检验 sex,income 都较为显著，而 Adjusted R-squared=0.4787
可得到回归方程：
gamble = 4.041 − 21.634 ∗ sex + 5.172 ∗ income
在这里插入图片描述
再进行逐步线性回归：

Enblm <- lm(gamble~sex+income+verbal,data = teengamb) summary(Enblm)  Coefficients: (1 not defined because of singularities)  因为奇异性问题，有一个变量没有计算系数  判断模型的多重共线性问题 kappa(Enblm,exact=TRUE) #exact=TRUE表示精确计算条件数； alias(Enblm)  逐步回归 Enbstep <- step(Enblm,direction = "both") summary(Enbstep)  判断模型的多重共线性问题 kappa(Enbstep,exact=TRUE) vif(Enbstep)

结果如下：

> Enbstep <- step(Enblm,direction = "both") Start: AIC=296.21 gamble ~ sex + income + verbal Df Sum of Sq RSS AIC <none> 21642 296.21 - verbal 1 1139.8 22781 296.63 - sex 1 5787.9 27429 305.35 - income 1 13236.1 34878 316.64 > summary(Enbstep) Call: lm(formula = gamble ~ sex + income + verbal, data = teengamb) Residuals: Min 1Q Median 3Q Max -50.639 -11.765 -1.594 9.305 93.867 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 24.1390 14.7686 1.634 0.1095 sex -22.9602 6.7706 -3.391 0.0015 ** income 4.8981 0.9551 5.128 6.64e-06 *** verbal -2.7468 1.8253 -1.505 0.1397 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 22.43 on 43 degrees of freedom Multiple R-squared: 0.5263, Adjusted R-squared: 0.4933 F-statistic: 15.93 on 3 and 43 DF, p-value: 4.148e-07 > kappa(Enbstep,exact=TRUE) [1] 39.20124 > vif(Enbstep) sex income verbal 1.030968 1.051585 1.049578

可发现，不存在多重共线性，故结果将与多元线性回归一致。

（2）使用所有的变量预测因变量 gamble，并且使用 step()函数对模型进行逐步回归，分析逐步回归后的结果；

Enblm <- lm(gamble~sex+income+verbal+status,data = teengamb) summary(Enblm)  Coefficients: (1 not defined because of singularities)  因为奇异性问题，有一个变量没有计算系数  判断模型的多重共线性问题 kappa(Enblm,exact=TRUE) #exact=TRUE表示精确计算条件数； alias(Enblm)  逐步回归 Enbstep <- step(Enblm,direction = "both") summary(Enbstep)  判断模型的多重共线性问题 kappa(Enbstep,exact=TRUE) vif(Enbstep)

结果如下：

#原始状态，未剔除变量 > summary(Enblm) Call: lm(formula = gamble ~ sex + income + verbal + status, data = teengamb) Residuals: Min 1Q Median 3Q Max -51.082 -11.320 -1.451 9.452 94.252 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 22.55565 17.19680 1.312 0.1968 sex -22.11833 8.21111 -2.694 0.0101 * income 4.96198 1.02539 4.839 1.79e-05 *** verbal -2.95949 2.17215 -1.362 0.1803 status 0.05223 0.28111 0.186 0.8535 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 22.69 on 42 degrees of freedom Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816 F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06 > kappa(Enblm,exact=TRUE) #exact=TRUE表示精确计算条件数； [1] 263.8049 > alias(Enblm) Model : gamble ~ sex + income + verbal + status #此时存在多重共线性，条件数为263，较大 #逐步回归后： > Enbstep <- step(Enblm,direction = "both") Start: AIC=298.18 gamble ~ sex + income + verbal + status Df Sum of Sq RSS AIC - status 1 17.8 21642 296.21 <none> 21624 298.18 - verbal 1 955.7 22580 298.21 - sex 1 3735.8 25360 303.67 - income 1 12056.2 33680 317.00 Step: AIC=296.21 gamble ~ sex + income + verbal Df Sum of Sq RSS AIC <none> 21642 296.21 - verbal 1 1139.8 22781 296.63 + status 1 17.8 21624 298.18 - sex 1 5787.9 27429 305.35 - income 1 13236.1 34878 316.64 > summary(Enbstep) Call: lm(formula = gamble ~ sex + income + verbal, data = teengamb) Residuals: Min 1Q Median 3Q Max -50.639 -11.765 -1.594 9.305 93.867 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 24.1390 14.7686 1.634 0.1095 sex -22.9602 6.7706 -3.391 0.0015 ** income 4.8981 0.9551 5.128 6.64e-06 *** verbal -2.7468 1.8253 -1.505 0.1397 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 22.43 on 43 degrees of freedom Multiple R-squared: 0.5263, Adjusted R-squared: 0.4933 F-statistic: 15.93 on 3 and 43 DF, p-value: 4.148e-07 #R方为0.4933，P值<0.05,模型检验通过。 > kappa(Enbstep,exact=TRUE) [1] 39.20124 > vif(Enbstep) sex income verbal 1.030968 1.051585 1.049578

逐步回归之后，回归模型的条件数变为 39.20124，此时剔除了 status 变量。

（3）如果以性别为因变量，能够根据其他的几个数据特征准确地预测出性别吗？如果可以，那么预测的准确率是多少？如果不可以，请说明为什么？

利用逻辑斯特回归预测：

library(caret) library(Metrics) library(dplyr) voicelm <- glm(sex~.,data = teengamb,family = "binomial")#利用逻辑斯特回归预测 summary(voicelm) label<-predict(voicelm,teengamb[,2:5],type = "response") label <- as.factor(ifelse(label > 0.5,1,0))#将数据规范为0，1 table(teengamb$sex,label) sprintf("逻辑回归模型的精度为：%f",accuracy(teengamb$sex,label))

结果如下：

> summary(voicelm) Call: glm(formula = sex ~ ., family = "binomial", data = teengamb) Deviance Residuals: Min 1Q Median 3Q Max -1.50499 -0.57882 -0.09388 0.59949 2.58612 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.63905 1.90352 1.912 0.0559 . status -0.10108 0.04033 -2.507 0.0122 * income 0.10653 0.18900 0.564 0.5730 verbal 0.13822 0.25711 0.538 0.5909 gamble -0.08651 0.04247 -2.037 0.0417 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 63.422 on 46 degrees of freedom Residual deviance: 36.140 on 42 degrees of freedom AIC: 46.14 Number of Fisher Scoring iterations: 7 > table(teengamb$sex,label) label 0 1 0 23 5 1 4 15 > sprintf("逻辑回归模型的精度为：%f",accuracy(teengamb$sex,label)) [1] "逻辑回归模型的精度为：0."

精度为80%，较低，可尝试使用深度学习方法和支持向量机等机器学习方法。详细可参加另一篇文章《对于teengamb数据集进行神经网络分类》

到此这篇teengamb数据集进行回归分析的文章就介绍到这了,更多相关内容请继续浏览下面的相关推荐文章，希望大家都能在编程的领域有一番成就！

上一篇：微信聊天记录删除过了还能恢复吗？试试这2款超好用的数据恢复软件

下一篇： 2024年5款值得推荐的图表数据可视化工具推荐，不会Excel做表必备！

版权声明：
本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。
如若内容造成侵权、违法违规、事实不符，请将相关资料发送至xkadmin@xkablog.com进行投诉反馈，一经查实，立即处理！

转载请注明出处，原文链接：https://www.xkablog.com/sjkxydsj/10470.html

回归分析

相关文章：