Question

I am trying to loop over all the column names of a data.frame and use them as predictor variables in a linear regression.

What I currently have is:

for (i in 1:11){
for (j in 1:11){
if (i != j ){
  var1 = names(newData)[i]
  var2 = names(newData)[j]
  glm.fit = glm(re78 ~  as.name(var1):as.name(var2), data=newData)
  summary(glm.fit)
  cv.glm(newData, glm.fit, K = 10)$delta[1]
  }
 }
}

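newData is my data.frame, with 11 columns in total. This code gives the following error:

Error in model.frame.default(formula = re78 ~ as.name(var1), data = newData, :
  invalid type (symbol) for variable 'as.name(var1)'

How can I fix this and get it to work?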

Answer 1

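It looks like you want models for every combination of two variables. Here is another approach, illustrated with the built-in mtcars data frame and using mpg as the outcome variable.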

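We use combn to get all combinations of two variables (excluding the outcome variable, mpg in this case). With simplify=FALSE, combn returns a list in which each element is a vector containing one pair of variable names. We then use map (from the purrr package) to fit a model for each pair of variables and store the results in a list.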

We use reformulate to build the model formula. .x refers to the vector of variable names (each element of vars). To see what reformulate is doing, you can run, for example, reformulate(paste(c("cyl", "disp"),collapse="*"), "mpg").

library(purrr)

# Get all combinations of two variables
vars = combn(names(mtcars)[-grep("mpg", names(mtcars))], 2, simplify=FALSE)
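To make the two building blocks concrete, here is a minimal base R sketch that inspects the list of pairs and the formula reformulate produces for the first pair. The helper names predictors, f and f_text are illustrative and not part of the original answer; setdiff is used here in place of the grep call only for readability.

```r
# Rebuild the pairs: every 2-element combination of the predictor columns
predictors <- setdiff(names(mtcars), "mpg")   # illustrative name; 10 columns once mpg is dropped
vars <- combn(predictors, 2, simplify = FALSE)

length(vars)   # choose(10, 2) = 45 pairs
vars[[1]]      # "cyl" "disp"

# reformulate() turns term labels plus a response name into a formula object
f <- reformulate(paste(vars[[1]], collapse = "*"), response = "mpg")
f_text <- deparse(f)   # "mpg ~ cyl * disp"
```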

Now we want to run a regression model on every pair of variables and store the results in a list:

# No interaction
models = map(vars, ~ glm(reformulate(.x, "mpg"), data=mtcars))

# Interaction only (no main effects)
models = map(vars, ~ glm(reformulate(paste(.x, collapse=":"), "mpg"), data=mtcars))

# Interaction and main effects
models = map(vars, ~ glm(reformulate(paste(.x, collapse="*"), "mpg"), data=mtcars))
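The three variants differ only in which terms end up in the model. As a small sketch for a single pair (cyl and disp, chosen arbitrarily; the object names pair and fit_* are illustrative), comparing the coefficient names shows the difference:

```r
# One example pair of predictors
pair <- c("cyl", "disp")

# mpg ~ cyl + disp: main effects only
fit_additive <- glm(reformulate(pair, "mpg"), data = mtcars)
# mpg ~ cyl:disp: the interaction term without main effects
fit_interaction <- glm(reformulate(paste(pair, collapse = ":"), "mpg"), data = mtcars)
# mpg ~ cyl * disp: main effects plus their interaction
fit_full <- glm(reformulate(paste(pair, collapse = "*"), "mpg"), data = mtcars)

names(coef(fit_additive))     # "(Intercept)" "cyl" "disp"
names(coef(fit_interaction))  # "(Intercept)" "cyl:disp"
names(coef(fit_full))         # "(Intercept)" "cyl" "disp" "cyl:disp"
```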

Name each list element with the formula of its model:

names(models) = map(models, ~ .x[["terms"]])

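To create the model formula with paste instead of reformulate, you can do the following (change + to : or *, depending on which combination of interactions and main effects you want to include):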

models = map(vars, ~ glm(paste("mpg ~", paste(.x, collapse=" + ")), data=mtcars))

To see how paste is being used here, you can run:

paste("mpg ~", paste(c("cyl", "disp"), collapse=" * "))
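That call returns an ordinary character string, which glm then coerces to a formula. A short sketch of the same step with the conversion made explicit through as.formula, checked against the reformulate route (all variable names below are illustrative):

```r
# Build the formula as text first
f_string <- paste("mpg ~", paste(c("cyl", "disp"), collapse = " * "))
f_string   # "mpg ~ cyl * disp"

# Convert the text explicitly, and build the same formula from term labels
f_from_string <- as.formula(f_string)
f_from_terms  <- reformulate("cyl*disp", response = "mpg")

# Both routes describe the same model
same_model <- identical(deparse(f_from_string), deparse(f_from_terms))
```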

Here is what the first two models look like when the models include both main effects and the interaction:

models[1:2]

$`mpg ~ cyl * disp`

Call:  glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"), 
    data = mtcars)

Coefficients:
(Intercept)          cyl         disp     cyl:disp  
   49.03721     -3.40524     -0.14553      0.01585  

Degrees of Freedom: 31 Total (i.e. Null);  28 Residual
Null Deviance:        1126 
Residual Deviance: 198.1  AIC: 159.1

$`mpg ~ cyl * hp`

Call:  glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"), 
    data = mtcars)

Coefficients:
(Intercept)          cyl           hp       cyl:hp  
   50.75121     -4.11914     -0.17068      0.01974  

Degrees of Freedom: 31 Total (i.e. Null);  28 Residual
Null Deviance:        1126 
Residual Deviance: 247.6  AIC: 166.3

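To evaluate the model output, you can use functions from the broom package. The code below returns two data frames, one with the coefficients of each model and one with its performance statistics.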

library(broom)

model_coefs = map_df(models, tidy, .id="Model")
model_performance = map_df(models, glance, .id="Model")

Here are the coefficient results for the models with both main effects and the interaction:

head(model_coefs, 8)

             Model        term    estimate   std.error statistic      p.value
1 mpg ~ cyl * disp (Intercept) 49.03721186 5.004636297  9.798357 1.506091e-10
2 mpg ~ cyl * disp         cyl -3.40524372 0.840189015 -4.052950 3.645320e-04
3 mpg ~ cyl * disp        disp -0.14552575 0.040002465 -3.637919 1.099280e-03
4 mpg ~ cyl * disp    cyl:disp  0.01585388 0.004947824  3.204212 3.369023e-03
5   mpg ~ cyl * hp (Intercept) 50.75120716 6.511685614  7.793866 1.724224e-08
6   mpg ~ cyl * hp         cyl -4.11913952 0.988229081 -4.168203 2.672495e-04
7   mpg ~ cyl * hp          hp -0.17068010 0.069101555 -2.469989 1.987035e-02
8   mpg ~ cyl * hp      cyl:hp  0.01973741 0.008810871  2.240120 3.320219e-02
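If broom is not installed, a rough base R equivalent of the performance table can be assembled by hand. The sketch below is an assumption about what such a table might contain (AIC and residual deviance only), not a reproduction of what glance returns; it uses lapply and vapply in place of purrr's map functions.

```r
# Fit the main-effects-plus-interaction model for every pair, using base R only
predictors <- setdiff(names(mtcars), "mpg")
vars <- combn(predictors, 2, simplify = FALSE)
models <- lapply(vars, function(v) {
  glm(reformulate(paste(v, collapse = "*"), "mpg"), data = mtcars)
})
# Label each model with its formula text
names(models) <- vapply(models, function(m) deparse(formula(m)), character(1))

# One row per model, sorted so the lowest AIC comes first
performance <- data.frame(
  Model     = names(models),
  AIC       = vapply(models, AIC, numeric(1)),
  deviance  = vapply(models, deviance, numeric(1)),
  row.names = NULL
)
performance <- performance[order(performance$AIC), ]
head(performance)
```

Sorting by AIC is only one possible ranking; the cross-validated error from cv.glm in the question would be another.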

Answer 2

You can use fit <- glm(as.formula(paste0("re78 ~ ", var1)), data=newData), as @akrun suggested. Also, you probably don't want to name your object glm.fit, because there is already a function with that name.

Caveat: I don't know why you have the double loop and the : interaction in the formula. Don't you want a regression on a single covariate? I'm not sure what you are trying to achieve.
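As a sketch of the single-covariate version, the loop below fits one model per predictor using as.formula. Since newData is not available here, mtcars and mpg stand in for newData and re78 (an assumption made purely for illustration), and the names predictors and fits are hypothetical.

```r
# mtcars and mpg stand in for the asker's newData and re78 (illustrative assumption)
predictors <- setdiff(names(mtcars), "mpg")

# Pre-allocate a named list; "fits" avoids masking the existing glm.fit function
fits <- vector("list", length(predictors))
names(fits) <- predictors

for (var1 in predictors) {
  # Build the formula from the column name as text, then convert it
  f <- as.formula(paste0("mpg ~ ", var1))
  fits[[var1]] <- glm(f, data = mtcars)
}

coef(fits[["wt"]])   # intercept and slope for mpg ~ wt
```

Storing the fits in a named list keeps every model available after the loop, whereas a bare summary() call inside a for loop prints nothing.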

Using the column names of a data frame as predictor variables in a linear regression
