Question

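I have a dataset with about 39k rows; an excerpt is shown below.

'Country', 'Group', 'Item' and 'Year' are categorical.
'Production' and 'Waste' are numeric.

'LF' is also numeric, but it is the result of 'Waste'/'Production'.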

Region  Country   Group    Item    Year  Production  Waste  LF
Europe  Bulgaria  Cereals  Wheat   1961  2040        274    0.134313725
Europe  Bulgaria  Cereals  Wheat   1962  2090        262    0.125358852
Europe  Bulgaria  Cereals  Wheat   1963  1894        277    0.14625132
Europe  Bulgaria  Cereals  Wheat   1964  2121        286    0.134842056
Europe  Bulgaria  Cereals  Wheat   1965  2923        341    0.116660965
Europe  Bulgaria  Cereals  Wheat   1966  3193        385    0.120576261
Europe  Bulgaria  Cereals  Barley  1961  612         15     0.024509804
Europe  Bulgaria  Cereals  Barley  1962  599         16     0.026711185
Europe  Bulgaria  Cereals  Barley  1963  618         16     0.025889968
Europe  Bulgaria  Cereals  Barley  1964  764         21     0.027486911
Europe  Bulgaria  Cereals  Barley  1965  876         22     0.025114155
Europe  Bulgaria  Cereals  Barley  1966  1064        24     0.022556391

I used the following code to generate 991 different means, one per Country-Item combination:

df2 <- aggregate(LF ~ Country + Item, data=df1, FUN='mean')

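The result of this function looks fine.

I want to test whether the respective LF means in df2 differ from the underlying yearly observations in df1 for each Country-Item combination (i.e., if FALSE, then LF is really just a static ratio; if TRUE, then 'Waste' is independent of 'Production').

What is the best way to do this? That seems to be 991 tests for this dataset alone, and I don't know how to combine the apply and t.test functions in this way.

Thanks!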

Answer 1

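t.test needs two groups to compare on a numeric/proportion outcome variable. Here, it seems to me that for each combination of Country and Item you want to compare the means across all the different years. In other words, you are trying to investigate whether Year affects the mean LF for each combination of Country and Item.

The simplest approach is to fit a linear model (LF ~ Year) for each combination of Country and Item and interpret the coefficient and p-value of the Year variable.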

library(dplyr)
library(broom)

set.seed(115)

# example dataset
dt = data.frame(Country = rep("country1",12),
                Item = c(rep("item1",6), rep("item2",6)),
                Year = rep(1961:1966,2),
                LF = runif(12,0,1))

# general means by country and item
dt %>% group_by(Country,Item) %>% summarise(Mean_LF = mean(LF))

# yearly means by country and item (one observation per year in this example)
dt %>% group_by(Country,Item,Year) %>% summarise(Mean_LF = mean(LF))

# does year influence the means for each country and item?
dt %>% group_by(Country,Item) %>% do(tidy(lm(LF~Year, data=.)))
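
Since the question mentions the apply family, here is a rough base R sketch of the same idea: split the data by Country-Item combination, fit LF ~ Year in each piece, and keep only the Year coefficient and its p-value. The names groups and year_effects are made up for illustration, and the sketch assumes every combination has at least three yearly observations (with fewer, lm cannot produce a usable p-value for Year).

```r
# Toy data with the same columns as the example above (values are random)
set.seed(115)
dt <- data.frame(Country = rep("country1", 12),
                 Item = c(rep("item1", 6), rep("item2", 6)),
                 Year = rep(1961:1966, 2),
                 LF = runif(12, 0, 1))

# One data frame per Country-Item combination; drop = TRUE skips empty combinations
groups <- split(dt, list(dt$Country, dt$Item), drop = TRUE)

# Fit LF ~ Year within each group and keep only the Year row of the coefficient table
year_effects <- do.call(rbind, lapply(names(groups), function(g) {
  fit <- lm(LF ~ Year, data = groups[[g]])
  year_row <- summary(fit)$coefficients["Year", ]
  data.frame(group = g,
             estimate = year_row[["Estimate"]],   # slope of LF over Year
             p_value = year_row[["Pr(>|t|)"]])    # p-value for that slope
}))

year_effects
```

Each row of year_effects corresponds to one Country-Item combination, so on the full dataset this should give one slope and one p-value per combination.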

Hope this helps. Let me know if I missed something and I will update my code.

R t-test of means versus observations across multiple factor levels
