我已使用以下代码进行回归:
res <- lm (c241 ~ x + matchcode , data = df)
summary(res)
matchcode
是一个变量,它是Iso3c代码和年份的组合。例如,在俄罗斯2005年,它是RUS 2005
(请参阅小标题的第一个变量)。想法是将此匹配代码用作固定效果,例如上面的lm
。完美地应用上面的lm
由于我拥有庞大的数据集(总共超过4000个变量):
# A tibble: 450 x 546
matchcode idstd year country wt region income industry sector ownership exporter c201 c202 c203a c203b c203c c203d c2041 c2042 c205a c205b1 c205b2 c205b3 c205b4 c205b5 c205b6 c205b7 c205b8 c205b9 c205b10 c205c c205d c206a c206b c2071
<chr+lbl> <dbl> <dbl> <chr+l> <dbl> <dbl+> <dbl+> <dbl+lb> <dbl+> <dbl+lbl> <dbl+lb> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl>
1 "BGD 200~ 2128 2002 Bangla~ NA 6 1 8 1 2 2 1988 4 100 0 0 NA 2 NA NA NA NA NA NA NA NA NA NA NA NA 1 1 1 NA 2
2 "BGD 200~ 2926 2002 Bangla~ NA 6 1 1 1 2 2 2000 1 100 0 0 NA 2 NA NA NA NA NA NA NA NA NA NA NA NA 1 1 NA NA 1
3 "BGD 200~ 2931 2002 Bangla~ NA 6 1 1 1 2 1 1993 4 100 0 0 NA 2 NA NA NA NA NA NA NA NA NA NA NA NA 1 1 NA NA 2
4 "BRA 200~ 15303 2003 Brazil~ NA 4 2 9 1 2 2 1946 2 100 0 0 0 2 NA 18.72 1 NA NA NA NA NA NA NA NA NA 2 2 1 2 5
5 "BRA 200~ 14917 2003 Brazil~ NA 4 2 8 1 2 2 1984 2 100 0 0 0 2 NA 50.00 1 NA NA NA NA NA NA NA NA 1 1 1 1 2 3
6 "BRA 200~ 14212 2003 Brazil~ NA 4 2 11 1 2 2 1998 2 100 0 0 0 2 NA 50.00 1 NA NA NA NA NA NA NA NA 1 1 1 1 2 2
7 "KHM 200~ 16067 2003 Cambod~ NA 2 1 23 2 1 1 1993 4 50 50 0 0 2 NA 100.00 1 NA 1 1 NA NA NA NA NA NA 1 1 1 1 1
8 "KHM 200~ 16233 2003 Cambod~ NA 2 1 10 4 2 2 1989 4 100 0 0 0 2 NA 100.00 1 NA NA NA NA NA NA NA NA NA 1 1 1 2 3
9 "KHM 200~ 16002 2003 Cambod~ NA 2 1 3 1 1 1 1990 5 0 100 0 0 2 NA 50.00 1 NA NA NA NA NA NA NA NA NA 1 1 1 1 1
10 "CHN 200~ 17987 2002 China2~ NA 2 2 8 1 1 2 1993 6 55 45 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
我想遍历变量LINK;
dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(y~ x, na.action=na.omit)))), .SDcols = -1]
但是,这会在data.table中为matchcode
变量(以及任何其他字符变量)创建一个data.table NULL。
当我尝试将matchcode
添加到回归中时:
dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(c241~ x + matchcode, na.action=na.omit)))), .SDcols = -1]
或者我在matchcode
中使用lapply:
df <- lapply( df[,-1], function(x) summary(lm(df$c241~ x + df$matchcode)) )
它给出以下“著名”错误:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
尽管我已经读过这个错误可能意味着任何事情,但我的因素确实只有一个级别,但这在单次回归中似乎可以正常工作(也可以添加其他非字符变量,也可以)。当使用data.table循环或lapply时,不会。我的问题有两个:
1)为什么变量变量matchcode
在第一种情况(res <- lm (c241 ~ x + matchcode , data = df
)下而不在第二种dfoutput<- data.table(df)[, .(nm = names(.SD),dffits= lapply(.SD, function(x) if(is.numeric(x)) summary(lm(c241~ x + matchcode, na.action=na.omit)))), .SDcols = -1]
下工作?
2)我该怎么做才能避免这种情况?因为变量对于模型至关重要。
也许我可以转换字符变量,或者可以某种方式对其进行重新编码?
更新:我已使用此链接中的代码:LINK将字符转换为一个级别的因子,最终导致相同的错误。
ES1sample <- dput(head(ES1sample[, ],10))
structure(list(matchcode = structure(c("BGD 2002 ", "BRA 2003 ",
"KHM 2003 ", "CHN 2002 ", "CHN 2003 ", "ECU 2003 ", "ERI 2002 ",
"ETH 2002 ", "GTM 2003 ", "HND 2003 "), label = "", class = c("labelled",
"character")), idstd = structure(c(2760, 14273, 16104, 17039,
19095, 22207, 23063, 24046, 25420, 26212), label = "WEB STD FIRMID", format.stata = "%5.0f", class = c("labelled",
"numeric")), year = structure(c(2002, 2003, 2003, 2002, 2003,
2003, 2002, 2002, 2003, 2003), format.stata = "%9.0g", label = "", class = c("labelled",
"numeric")), country = structure(c("Bangladesh2002", "Brazil2003",
"Cambodia2003", "China2002", "China2003", "Ecuador2003", "Eritrea2002",
"Ethiopia2002", "Guatemala2003", "Honduras2003"), label = "Country", format.stata = "%21s", class = c("labelled",
"character")), wt = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
), label = "locations and sectors weights", format.stata = "%9.0g", class = c("labelled",
"numeric")), region = structure(c(6, 4, 2, 2, 2, 4, 1, 1, 4,
4), label = "", class = c("labelled", "numeric")), income = structure(c(1,
2, 1, 2, 2, 2, 1, 1, 2, 2), label = "income grouping for survey year", class = c("labelled",
"numeric")), industry = structure(c(1, 1, 20, 3, 20, 12, 7, 3,
NA, 3), label = "Industry", class = c("labelled", "numeric")),
sector = structure(c(1, 1, 2, 1, 2, 1, 1, 1, 2, 1), label = "Sector", class = c("labelled",
"numeric")), ownership = structure(c(2, 2, 2, 2, 2, 2, 2,
2, 1, 1), label = "Ownership", class = c("labelled", "numeric"
)), exporter = structure(c(2, 2, 2, 2, 2, 1, 2, 2, 2, 1), label = "Export", class = c("labelled",
"numeric")), c201 = structure(c(1991, 1993, 1999, 1979, 1998,
1997, 1996, 1998, 1990, 1998), label = "Year firm began operations in this country", format.stata = "%4.0f", class = c("labelled",
"numeric")), c202 = structure(c(2, 2, 4, 6, 6, 2, 6, NA,
NA, NA), label = "Current legal status of firm", class = c("labelled",
"numeric")), c203a = structure(c(100, 100, 100, 100, 0, 100,
0, 100, 0, 0), label = "Percentage of firm owned by domestic private sector", format.stata = "%9.2f", class = c("labelled",
"numeric")), c203b = structure(c(0, 0, 0, 0, 0, 0, 0, 0,
100, 100), label = "Percentage of firm owned by foreign private sector", format.stata = "%9.2f", class = c("labelled",
"numeric")), c203c = structure(c(0, 0, 0, 0, 100, 0, 0, 0,
0, 0), label = "Percentage of firm owned by government/state", format.stata = "%9.2f", class = c("labelled",
"numeric")), c203d = structure(c(NA, 0, 0, NA, NA, 0, 100,
0, 0, 0), label = "Percentage of firm owned by other types of owner", format.stata = "%8.2f", class = c("labelled",
"numeric")), c2041 = structure(c(2, 2, 2, NA, NA, 2, 2, 2,
2, 2), label = "Firm previously owned by government?", class = c("labelled",
"numeric")), c2042 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "Year of privatization", format.stata = "%4.0f", class = c("labelled",
"numeric")), c205a = structure(c(NA, 25, 100, NA, 100, 100,
NA, NA, 100, 100), label = "Percentage owned by largest shareholder", format.stata = "%9.2f", class = c("labelled",
"numeric")), c205b1 = structure(c(NA, NA, NA, NA, NA, NA,
NA, NA, 1, 1), label = "Largest shareholder: individual", class = c("labelled",
"numeric")), c205b2 = structure(c(NA, 1, 1, NA, NA, 1, NA,
NA, NA, NA), label = "Largest shareholder: family", class = c("labelled",
"numeric")), c205b3 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "Largest shareholder: domestic company", class = c("labelled",
"numeric")), c205b4 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "Largest shareholder: foreign company", class = c("labelled",
"numeric")), c205b5 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "Largest shareholder: bank", class = c("labelled",
"numeric")), c205b6 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "Largest shareholder: investment fund", class = c("labelled",
"numeric")), c205b7 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "Largest shareholder: firm managers", class = c("labelled",
"numeric")), c205b8 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "Largest shareholder: firm employees", class = c("labelled",
"numeric")), c205b9 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "Largest shareholder: government", class = c("labelled",
"numeric")), c205b10 = structure(c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), label = "Largest shareholder: other", class = c("labelled",
"numeric")), c205c = structure(c(1, 1, 1, NA, 2, 1, 1, 1,
1, 1), label = "Is principal shareholder also the manager/director?", class = c("labelled",
"numeric")), c205d = structure(c(1, 1, 1, NA, NA, 2, 1, 1,
1, 1), label = "Is the principal owner male?", class = c("labelled",
"numeric")), c206a = structure(c(1, 1, 1, NA, NA, 1, 0, 1,
2, 3), label = "Number of separate operating facilities in this country", format.stata = "%4.0f", class = c("labelled",
"numeric")), c206b = structure(c(NA, 2, 2, NA, NA, 2, 2,
2, 2, 1), label = "Operations in other countries?", class = c("labelled",
"numeric")), c2071 = structure(c(1, 5, 1, 1, 2, 1, 5, 1,
1, 3), label = "Location of establishment", class = c("labelled",
"numeric")), c2072 = structure(c(NA, NA, NA, NA, NA, 1, 5,
NA, 1, 3), label = "Location of headquarters", class = c("labelled",
"numeric")), c208 = structure(c(NA, 1723, NA, NA, NA, NA,
NA, NA, NA, 1810), label = "Main product line", format.stata = "%10.0g", class = c("labelled",
"numeric")), c209a = structure(c(NA, 1, 2, NA, NA, NA, NA,
NA, NA, NA), label = "Other income generating activities", class = c("labelled",
"numeric")), c209ba = structure(c(NA, 0, NA, NA, NA, 100,
NA, NA, NA, NA), label = "Percent workers' time: manufacturing", format.stata = "%9.0g", class = c("labelled",
"numeric")), c209bb = structure(c(NA, 0, NA, NA, NA, 0, NA,
NA, NA, NA), label = "Percent workers' time: services", format.stata = "%9.0g", class = c("labelled",
"numeric")), c209bc = structure(c(NA, 0, NA, NA, NA, NA,
NA, NA, NA, NA), label = "Percent workers' time: commerce (retail/wholesale trade)", format.stata = "%9.0g", class = c("labelled",
"numeric")), c209bd = structure(c(NA, 0, NA, NA, NA, NA,
NA, NA, NA, NA), label = "Percent workers' time: construction", format.stata = "%9.0g", class = c("labelled",
"numeric")), c209be = structure(c(NA, 1, NA, NA, NA, 0, NA,
NA, NA, NA), label = "Percent workers' time: other", format.stata = "%9.0g", class = c("labelled",
"numeric")), c210a = structure(c(NA, 50, 1, NA, NA, NA, NA,
NA, 65, 0), label = "Main product line: firm's share of local market", format.stata = "%9.0g", class = c("labelled",
"numeric")), c210b = structure(c(12, 25, 1, 6, 44.6399993896484,
50, NA, 5, 45, 0), label = "Main product line: firm's share of national market", format.stata = "%9.0g", class = c("labelled",
"numeric")), c211a1 = structure(c(100, 100, 100, 99, 100,
90, 100, 100, 95, 0), label = "Percent of sales sold domestically", format.stata = "%9.2f", class = c("labelled",
"numeric")), c211a2 = structure(c(0, 0, 0, 0, 0, 10, 0, 0,
5, 100), label = "Percent of sales exported directly", format.stata = "%9.2f", class = c("labelled",
"numeric")), c211a3 = structure(c(0, 0, 0, 1, 0, 0, NA, 0,
0, 0), label = "Percent of sales exported indirectly", format.stata = "%9.2f", class = c("labelled",
"numeric")), c211b1 = structure(c(NA, 0, 0, 0, 0, 0, NA,
NA, NA, NA), label = "Percentage of domestic sales to government", format.stata = "%9.2f", class = c("labelled",
"numeric")), c282a2y = structure(c(451645, NA, NA, 49609,
1061449, 21, 38966.6171875, 43.0904998779297, NA, NA), label = "Total liabilities 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled",
"numeric")), c282b2y = structure(c(NA, NA, NA, 393, 81012,
1, 0, NA, NA, NA), label = "Long-term liabilities (>1 year) 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled",
"numeric")), c282c2y = structure(c(NA, NA, NA, 55193, 980437,
20, 7687.90576171875, NA, NA, NA), label = "Short-term liabilities (<1 year) 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled",
"numeric")), c282d2y = structure(c(NA, NA, 5500, 55193, 19194,
NA, 0, NA, NA, NA), label = "Payable short-term liabilities 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled",
"numeric")), c282e2y = structure(c(10000, NA, NA, 5000, 20329,
12, 29826.701171875, 40, NA, NA), label = "Equity–share capital 2 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled",
"numeric")), c282f2y = structure(c(305436, NA, 5500, 916,
NA, NA, 1452.00903320312, 6.09049987792969, NA, NA), label = "Retained earnings 2 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled",
"numeric")), c282a3y = structure(c(456618, NA, NA, NA, 1063217,
16, 39676.3984375, 43.7501029968262, NA, NA), label = "Total liabilities 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled",
"numeric")), c282b3y = structure(c(NA, NA, NA, NA, 152156,
5, 0, NA, NA, NA), label = "Long-term liabilities (>1 year) 3 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled",
"numeric")), c282c3y = structure(c(NA, NA, NA, NA, 911061,
11, 6299.255859375, NA, NA, NA), label = "Short-term liabilities (<1 year) 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled",
"numeric")), c282d3y = structure(c(NA, NA, NA, NA, 20964,
NA, 0, NA, NA, NA), label = "Payable short-term liabilities 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled",
"numeric")), c282e3y = structure(c(10000, NA, NA, NA, 124531,
9, 32368.564453125, 40, NA, NA), label = "Equity–share capital 3 years ago (thousands LCU)", format.stata = "%10.0g", class = c("labelled",
"numeric")), c282f3y = structure(c(282840, NA, NA, NA, NA,
NA, 1008.58001708984, 6.75010013580322, NA, NA), label = "Retained earnings 3 years ago (thousands LCU)", format.stata = "%9.0g", class = c("labelled",
"numeric")), gni = structure(c(370, 2760, 300, 970, 1100,
1830, 150, 100, 1910, 960), label = "Gross National Income per capita, Atlas Method (current ), World Development Ind", format.stata = "%9.0g", class = c("labelled",
"numeric")), pop = structure(c(135683664, 176596256, 13403644,
1280400000, 1288400000, 13007942, 4296700, 67217840, 12307091,
6968512), label = "Population, Total, in 2005 (World Development Indicators)", format.stata = "%9.0g", class = c("labelled",
"numeric")), country_proper = structure(c("Bangladesh", "Brazil",
"Cambodia", "China", "China", "Ecuador", "Eritrea", "Ethiopia",
"Guatemala", "Honduras"), format.stata = "%22s", label = "", class = c("labelled",
"character")), c_abbr = structure(c("BGD", "BRA", "KHM",
"CHN", "CHN", "ECU", "ERI", "ETH", "GTM", "HND"), format.stata = "%9s", label = "", class = c("labelled",
"character")), countryyear = structure(c("Bangladesh2002",
"Brazil2003", "Cambodia2003", "China2002", "China2003", "Ecuador2003",
"Eritrea2002", "Ethiopia2002", "Guatemala2003", "Honduras2003"
), label = "Country", format.stata = "%21s", class = c("labelled",
"character")), iso3c = structure(c("BGD", "BRA", "KHM", "CHN",
"CHN", "ECU", "ERI", "ETH", "GTM", "HND"), label = "", class = c("labelled",
"character")), cname = structure(c("Bangladesh", "Brazil",
"Cambodia", "China", "China", "Ecuador", "Eritrea", "Ethiopia",
"Guatemala", "Honduras"), label = "", class = c("labelled",
"character")), cyear = structure(c("2002", "2003", "2003",
"2002", "2003", "2003", "2002", "2002", "2003", "2003"), label = "", class = c("labelled",
"character"))), .Names = c("matchcode", "idstd", "year",
"country", "wt", "region", "income", "industry", "sector", "ownership",
"exporter", "c201", "c202", "c203a", "c203b", "c203c", "c203d",
"c2041", "c2042", "c205a", "c205b1", "c205b2", "c205b3", "c205b4",
"c205b5", "c205b6", "c205b7", "c205b8", "c205b9", "c205b10",
"c205c", "c205d", "c206a", "c206b", "c2071", "c2072", "c208",
"c209a", "c209ba", "c209bb", "c209bc", "c209bd", "c209be", "c210a",
"c210b", "c211a1", "c211a2", "c211a3", "c211b1", "c282a2y", "c282b2y",
"c282c2y", "c282d2y", "c282e2y", "c282f2y", "c282a3y", "c282b3y",
"c282c3y", "c282d3y", "c282e3y", "c282f3y", "gni", "pop", "country_proper",
"c_abbr", "countryyear", "iso3c", "cname", "cyear"), class = c("data.table",
"data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x0000000002570788>)
答案 0 :(得分:1)
据我了解,您已经尝试进行功能选择,但是您选择了一种复杂的方法,因此输出为20180905180349_InitialCreate.cs
。我同意,您有这么多特征,您需要在回归之前进行特征选择。
有非常突出的功能选择方法,例如 Random Forest 。随机森林可帮助您检测最佳预测指标。
鉴于我有兴趣预测植物的种类,但是我不知道哪个功能可以最好地预测它(NULL
,Sepal.Length
,Sepal.Width
,Petal.Length
)。因此,以下代码指定了最佳谓词:
Petal.Width
library(party)
colnames(iris)
cf1 <- cforest(Species ~ . , data= iris, control=cforest_unbiased(mtry=2,ntree=50))
varimp(cf1)
的结果为您提供:
“准确度平均下降”重要性分数的向量。换句话说,分数越高,预测值越好。
在示例中:
varimp()