Question

I have a dataset with the following structure:

> data("household", package="HSAUR2")
> household[c(1,5,10,30,40),]
   housing food goods service gender total
1      820  114   183     154 female  1271
5      721   83   176     104 female  1084
10     845   64  1935     414 female  3258
30    1641  440  6471    2063   male 10615
40    1524  964  1739    1410   male  5637

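The column "total" is the sum of the first four columns. It is a household's expenditure, split into four categories.

Now, if I want a conditional density plot of gender against total expenditure, I can run: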

cdplot(gender ~ total, data=household)

and I get this picture:

[Image: conditional density plot of gender against total]

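I'd like the "same picture", with expenditure on the x-axis, but with the conditional distribution of the four categories (housing, food, goods, service) on the y-axis. I can only think of a very dirty hack: I generate a factor and, for the first data row, repeat "housing" 820 times, then "food" 114 times, and so on.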
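For concreteness, the hack described above might look like the following minimal sketch. It types in only the five rows printed earlier so that it runs without HSAUR2; the names `cats`, `amounts`, and `long` are introduced here for illustration and are not part of the original post.

```r
# The five rows printed above, entered by hand (assumption: enough to illustrate the idea)
household <- data.frame(
  housing = c(820, 721, 845, 1641, 1524),
  food    = c(114, 83, 64, 440, 964),
  goods   = c(183, 176, 1935, 6471, 1739),
  service = c(154, 104, 414, 2063, 1410),
  gender  = factor(c("female", "female", "female", "male", "male")),
  total   = c(1271, 1084, 3258, 10615, 5637)
)
cats <- c("housing", "food", "goods", "service")

# Sanity check: total is the sum of the four category columns
stopifnot(all(rowSums(household[, cats]) == household$total))

# One row per unit spent: each category name is repeated by its amount,
# and each household's total is repeated once for every unit it spent
amounts <- as.matrix(household[, cats])
long <- data.frame(
  total    = rep(household$total, times = household$total),
  category = factor(
    rep(rep(cats, times = nrow(household)), times = as.vector(t(amounts))),
    levels = cats
  )
)
```

Passing the result to `cdplot(category ~ total, data = long)` would then draw the kind of picture described, at the cost of a data frame with one row per unit spent (21865 rows for just these five households).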

There must be an easier way, right?

Answer 1

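As I said, you're using the wrong tool for what you want. You're envisioning a plot that can't be obtained directly from your data (see below).

Instead, you need to model your data. Specifically, you want to predict the expected share of expenditure in each category as a function of total expenditure. The plot you have in mind then shows the fitted values from that model (i.e., the predicted share of spending in each category) as a function of total expenditure. Here is some code that does this using loess curves. I plot both the raw data and the fitted values to show what is going on.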

# setup the data
data("household", package = "HSAUR2")
household$total <- rowSums(household[,1:4])
household <- within(household, {
    housing <- housing/total
    food <- food/total
    goods <- goods/total
    service <- service/total
})

# estimate loess curves
l_list <-
list(loess(housing ~ total, data = household),
     loess(food ~ total, data = household),
     loess(goods ~ total, data = household),
     loess(service ~ total, data = household))

# stack fitted curves on top of one another
ndat <- data.frame(total = seq(min(household$total), max(household$total), 100))
p <- lapply(l_list, predict, newdata = ndat)
for(i in 2:length(l_list))
    p[[i]] <- p[[i]] + p[[i-1]]

# plot
plot(NA, xlim=range(household$total), ylim = c(0,1), xlab='Total', ylab='Percent', las=1, xaxs='i')
# plot dots
with(household, points(total, housing, pch = 20, col = palette()[1]))
with(household, points(total, housing + food, pch = 20, col = palette()[2]))
with(household, points(total, housing + food + goods, pch = 20, col = palette()[3]))
with(household, points(total, housing + food + goods + service, pch = 20, col = palette()[4]))
# plot fitted lines
for(i in 1:length(p))
    lines(ndat$total, p[[i]], type = 'l', lwd = 2, col = palette()[i])

Result:

[Image: raw expenditure shares as points, with stacked loess curves of the cumulative shares against total expenditure]
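To get closer to the filled look of `cdplot()`, the stacked fitted curves can also be drawn as shaded bands. The sketch below is one possible way to do that, not the only one. It uses made-up synthetic data in place of `household` so that it runs on its own, and `draw_stacked_areas()` is a hypothetical helper defined here, not a base R function.

```r
# Synthetic stand-in for the household data: made-up numbers, for illustration only
set.seed(1)
n <- 40
raw <- data.frame(
  housing = runif(n, 500, 2000),
  food    = runif(n, 50, 1000),
  goods   = runif(n, 100, 6000),
  service = runif(n, 100, 2000)
)
cats <- names(raw)

# Shares of total expenditure per household (each row sums to 1)
shares <- as.data.frame(prop.table(as.matrix(raw), margin = 1))
shares$total <- rowSums(raw)

# One loess fit per category, evaluated on a common grid of totals
grid <- data.frame(total = seq(min(shares$total), max(shares$total), length.out = 50))
fits <- sapply(cats, function(k) {
  fit <- loess(reformulate("total", response = k), data = shares)
  predict(fit, newdata = grid)
})

# Cumulative sums across categories give the upper edge of each band;
# the lower edge of a band is the upper edge of the band beneath it
upper <- t(apply(fits, 1, cumsum))
lower <- cbind(0, upper[, -ncol(upper), drop = FALSE])

# Hypothetical helper: draws one filled polygon per category
draw_stacked_areas <- function(x, lower, upper, labels) {
  cols <- gray.colors(ncol(upper))
  plot(NA, xlim = range(x), ylim = c(0, 1), xlab = "Total",
       ylab = "Fitted share of total", las = 1, xaxs = "i", yaxs = "i")
  for (k in seq_len(ncol(upper)))
    polygon(c(x, rev(x)), c(lower[, k], rev(upper[, k])), col = cols[k], border = NA)
  legend("topright", legend = labels, fill = cols, bg = "white")
}

if (interactive()) draw_stacked_areas(grid$total, lower, upper, cats)
```

Because every loess fit here uses the same predictor and the same default settings, the fitted shares should add up to 1 at each grid point, so the top band ends at 1. An individual fitted curve is not guaranteed to stay inside [0, 1], which is worth checking before reading the bands as proportions.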

If you try to create such a plot from the raw data, it will look somewhat odd, but maybe that's what you want:

plot(NA, xlim=range(household$total), ylim = c(0,1), xlab='Total', ylab='Percent', las=1, xaxs='i')
with(household, lines(total[order(total)], housing[order(total)], pch = 20, col = palette()[1]))
with(household, lines(total[order(total)], (housing + food)[order(total)], pch = 20, col = palette()[2]))
with(household, lines(total[order(total)], (housing + food + goods)[order(total)], pch = 20, col = palette()[3]))
with(household, lines(total[order(total)], (housing + food + goods + service)[order(total)], pch = 20, col = palette()[4]))

Result:

[Image: cumulative raw expenditure shares drawn as lines against total expenditure]

R plot of a conditional distribution; cdplot() doesn't seem to do this
