Question

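I am working with a large (but not huge) database of 1.1 million observations x 41 variables. The data are arranged as an unbalanced panel. Using these variables, I specified three different models and ran each of them as 1) a fixed-effects, 2) a random-effects and 3) a pooled OLS regression.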

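The original .RData file containing only the database is about 15 MB. The .RData file containing the database and the regression results (9 regressions in total) weighs about 650 MB. I do realize (from the documentation) that a plm call returns: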

An object of class c("plm","panelmodel").

A "plm" object has the following elements :

coefficients   the vector of coefficients,
vcov           the covariance matrix of the coefficients,
residuals      the vector of residuals,
df.residual    degrees of freedom of the residuals,
formula        an object of class ’pFormula’ describing the model,
model          a data.frame of class ’pdata.frame’ containing the variables used for the estimation: the response is in first position and the two indexes in the last positions,
ercomp         an object of class ’ercomp’ providing the estimation of the components of the errors (for random effects models only),
call           the call

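Even so, I cannot understand why these files should be so huge. To avoid memory overload when handling the plm objects, I saved them in three different files (each now weighs about 200 MB). I called summary an hour ago to see the results of the fixed-effects model, but it has not shown me any results yet. My question is simple: do you think this is normal behavior? What can I do to reduce the size of the plm objects and speed up retrieval of the results?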
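A minimal sketch for checking which of the elements listed above dominates the size of a fitted object. It assumes the plm package is installed and uses its bundled Grunfeld data purely as a stand-in; the model and variable names here are illustrative and are not the ones from my database.

```r
library(plm)

# Stand-in data shipped with plm (not the database described above)
data("Grunfeld", package = "plm")

fit <- plm(inv ~ value + capital, data = Grunfeld,
           model = "within", index = c("firm", "year"))

# Size in bytes of each top-level element of the fitted object
sizes <- vapply(unclass(fit),
                function(el) as.numeric(object.size(el)),
                numeric(1))

# Largest elements first
print(sort(sizes, decreasing = TRUE))

# Total size of the object, for comparison with the input data
print(object.size(fit), units = "Kb")
print(object.size(Grunfeld), units = "Kb")
```

Because the model element stores a copy of the estimation data together with its panel index, it is a natural first place to look when the fitted object is much larger than expected.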

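Here are some things you might want to know:

The database I am using is in data.table format
The formula is built beforehand and wrapped in an as.formula() call before being passed to plm, as suggested here. For example: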

form<-y~x1+x2+x3+...+xn

mod.fe<-plm(as.formula(form), regr, effect="individual", model="within", index=c("id", "year"))

Please let me know if there is any other information I can provide that you might need to answer this question.

Edit

I managed to build a small-scale database with characteristics similar to the one I am working on. Here it is:

structure(list(id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 
5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 
10L, 10L, 11L, 11L), year = structure(c(1L, 2L, 1L, 2L, 3L, 4L, 
1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L), .Label = c("2000", "2001", "2002", 
"2003"), class = "factor"), study = c(3.37354618925767, 4.18364332422208, 
5.32950777181536, 4.17953161588198, 5.48742905242849, 5.73832470512922, 
6.57578135165349, 5.69461161284364, 6.3787594194582, 4.7853001128225, 
7.98380973690105, 8.9438362106853, 9.07456498336519, 7.01064830413663, 
10.6198257478947, 9.943871260471, 9.84420449329467, 8.52924761610073, 
3.52184994489138, 4.4179415601997, 5.35867955152904, 3.897212272657, 
5.38767161155937, 4.9461949594171, 3.62294044317139, 4.58500543670032, 
7.10002537198388, 6.76317574845754, 6.83547640374641, 6.74663831986349
), ethnic = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 
2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
1L, 1L, 2L, 2L), .Label = c("hispanic", "black", "chinese"), class = "factor"), 
    sport = c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0), health = structure(c(1L, 
    1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L, 
    3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("none", 
    "drink", "both", "smoke"), class = "factor"), gradec = c(2.72806403942929, 
    3.10067738633308, 4.04728186632456, 2.19701362539883, 1.73115878111307, 
    5.35879931359977, 5.79613840739381, 5.07050219214859, 4.26224490644077, 
    3.53554192927934, 6.10515669475491, 7.18032957183198, 6.73191149590581, 
    6.49512764543435, 6.4783689354808, 6.19974636196512, 5.54014977312232, 
    6.72545652880344, 1.00223129492982, 1.08994269214495, 3.06702680106689, 
    1.70103126320561, 4.82973481729635, 3.14010240687364, 3.8068435242348, 
    5.01254268106181, 5.66497772013949, 4.16303452633342, 4.2751229553617, 
    3.05652055248093), event = c(1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 
    0), evm3 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0), evm2 = c(0, 
    0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 1, 0, 0, 1, 1, 0, 0, 0, 0), evm1 = c(0, 1, 0, 1, 1, 1, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 
    1, 0, 0, 0, 0), evp1 = c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1), 
    evp2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1), evp3 = c(0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 1, 0), ndm3 = c(1, 1, 1, 1, 1, 0, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 
    1, 1, 1, 1), ndm2 = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1), ndm1 = c(1, 
    0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 
    0, 0, 1, 0, 0, 0, 1, 0, 1, 0), ndp1 = c(0, 1, 0, 0, 0, 1, 
    0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 
    1, 0, 1, 0, 0), ndp2 = c(1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0), 
    ndp3 = c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 
    1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1)), .Names = c("id", 
"year", "study", "ethnic", "sport", "health", "gradec", "event", 
"evm3", "evm2", "evm1", "evp1", "evp2", "evp3", "ndm3", "ndm2", 
"ndm1", "ndp1", "ndp2", "ndp3"), class = "data.frame", row.names = c(NA, 
30L))

The formula and plm call I used are:

form<-gradec~year+study+ethnic+sport+health+event+evm3+evm2+evm1+evp1+evp2+evp3+ndm3+ndm2+ndm1+ndp1+ndp2+ndp3

plm.f<-plm(as.formula(form), data, effect="individual", model="within", index=c("id", "year"))

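Using object.size(), as suggested by @BenBolker, I found that the call produced a plm object weighing 64.5 KB, while the original data frame is 6.9 KB, which means the result is roughly 10 times larger than the input. I then set the options suggested by @zx8754 below, but unfortunately they had no effect. When I finally called summary(plm.f), I received the error message: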

Error in crossprod(t(X), beta) : non-conformable arguments

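I eventually got the same error with my large database as well, but only after several hours of computation. It is suggested here that the problem may be due to the coefficient matrix being singular. However, testing for singularity with is.singular.matrix() from the matrixcalc package showed that this is not the case.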

A couple more things you might want to know:

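year, ethnic and health are factors
The variables in the formula are more or less self-explanatory, except for the last ones. event is a hypothetical traumatic event that occurs at some point in time. It is coded 1 in the year the event occurs and 0 otherwise. The variable evm1 equals 1 if an event occurred in the previous year (minus 1) and 0 otherwise. Similarly, evp1 is 1 if an event occurs in the following year (plus 1) and 0 otherwise. The variables ndm. and ndp. work the same way, but they are coded 1 when that distance is not observable (because an individual's time span is too short) and 0 otherwise. The presence of such closely related variables raises the suspicion of perfect collinearity. However, as mentioned above, the test showed the matrix to be non-singular.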
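To make the coding scheme concrete, here is a rough base R sketch for the one-year lag and lead only. It is not the code used to build the actual data; the toy data frame is hypothetical, and it assumes rows are sorted by id and year with no gaps between consecutive years of an individual.

```r
# Hypothetical toy panel: two individuals, sorted by id and year
toy <- data.frame(
  id    = c(1, 1, 2, 2, 2, 2),
  year  = c(2000, 2001, 2000, 2001, 2002, 2003),
  event = c(1, 0, 1, 1, 1, 0)
)

# evm1: event in the previous year (0 when no previous year is observed)
toy$evm1 <- ave(toy$event, toy$id,
                FUN = function(v) c(0, head(v, -1)))

# ndm1: 1 when the previous year is not observable for this individual
toy$ndm1 <- ave(toy$event, toy$id,
                FUN = function(v) c(1, rep(0, length(v) - 1)))

# evp1: event in the following year (0 when no following year is observed)
toy$evp1 <- ave(toy$event, toy$id,
                FUN = function(v) c(tail(v, -1), 0))

# ndp1: 1 when the following year is not observable for this individual
toy$ndp1 <- ave(toy$event, toy$id,
                FUN = function(v) c(rep(0, length(v) - 1), 1))

print(toy)
```
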

Let me say again that I would be very grateful if anyone could answer this question.

Answer 1

Regarding the error message Error in crossprod(t(X), beta) : non-conformable arguments:

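This is probably due to a singularity in the model matrix, as has been suggested. Keep in mind that the model matrix of a fixed-effects model is the transformed data (the data frame after the within transformation).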

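So you need to check the transformed data for singularity. Even if the original data are not linearly dependent, the fixed-effects transformation can produce linear dependence (singularity)! The plm package has fairly good documentation of this issue in ?detect_lin_dep, part of which I repeat here (one example only):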

### Example 1 ###
# prepare the data
library(plm)
data("Cigar", package = "plm")
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)

# setup a pFormula and a model frame
pform <- pFormula(price ~ 0 + cpi + fact1 + fact2)
mf <- model.frame(pform, data = Cigar.p)

# no linear dependence in the pooling model's model matrix
# (with intercept in the formula, there would be linear dependence)
detect_lin_dep(model.matrix(pform, data = mf, model = "pooling"))

# linear dependence present in the FE transformed model matrix
modmat_FE <- model.matrix(pform, data = mf, model = "within")
detect_lin_dep(modmat_FE)
mod_FE <- plm(pform, data = Cigar.p, model = "within")
detect_lin_dep(mod_FE) 
alias(mod_FE) # => fact1 == -1*fact2
plm(pform, data = mf, model = "within")$aliased # "fact2" indicated as aliased

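So you should run a function that detects linear dependence on the transformed data of your model, obtained via model.matrix(your_model). You can use the functions provided by plm (detect_lin_dep, alias) or any function that works on matrices.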
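As an illustration of why the check has to be done on the transformed data, here is a small base R sketch that applies the within (demeaning) transformation by hand and compares matrix ranks with qr(). The toy data and the variable names d1 and d2 are made up for this example; this is not how plm computes things internally.

```r
# Hypothetical toy panel: 3 individuals observed over 4 periods
set.seed(1)
toy <- data.frame(
  id = rep(1:3, each = 4),
  x  = rnorm(12),
  d1 = rep(c(0, 1), times = 6)
)
toy$d2 <- 1 - toy$d1  # d1 + d2 equals 1 in every row

# Untransformed regressors (no intercept): full column rank
X <- as.matrix(toy[, c("x", "d1", "d2")])
print(qr(X)$rank)         # 3

# Within transformation: subtract each individual's mean from every column
X_within <- apply(X, 2, function(col) col - ave(col, toy$id))
print(qr(X_within)$rank)  # 2, because demeaned d2 equals -1 * demeaned d1
```

The untransformed columns are linearly independent, but once the individual means are removed the two dummies become exact negatives of each other, which is the same pattern as fact1 and fact2 in the documentation example.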

You can also look at your plm model object: your_model$aliased shows whether some variables were dropped in the estimation.
