Cross tabulation in R

Time: 2014-09-08 18:15:46

Tags: r pivot-table crosstab

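I have been struggling to create a simple cross-tabulation/pivot table to display some of my data. It is a huge data frame, with about 11,000 observations and more than 100 variables, so for simplicity I have created a subset here to use as an example (see the structure in the last code block).

Each observation represents an individual household in a household survey. The two variables I am trying to cross-tabulate are 'householdSize' and 'buildingRef'. I want the cross-tab to show the number of households of a given size against the number of households in a particular building, as identified by the building reference.

I have been playing around with count (from plyr) and dcast (from reshape2), and have got as far as a cross-tab with 'householdSize' as my rows and 'freq' as my columns (where freq = the count of buildingRef) using: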

library(plyr)      # count()
library(reshape2)  # dcast()
hhsize_counted <- count(hh_table_short, c("householdSize","buildingRef"))
hhsize_counted <- dcast(hhsize_counted, householdSize~freq)

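However, I seem to lose about 25 observations from the original data frame, and I am not entirely sure why (hh_table_short has 200 obs; hhsize_counted has 175).

Can anyone help me understand why this happens? Or, crucially, can anyone point me towards another way of achieving the same goal?

I have searched the web for a solution, to no avail, so any help is much appreciated!

Thanks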

玛蒂

    structure(list(buildingRef = structure(c(1L, 2L, 2L, 2L, 3L, 
4L, 5L, 6L, 6L, 6L, 7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 12L, 
13L, 14L, 15L, 15L, 16L, 16L, 16L, 17L, 18L, 19L, 19L, 20L, 21L, 
22L, 22L, 23L, 23L, 23L, 23L, 23L, 23L, 24L, 24L, 24L, 25L, 26L, 
26L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 32L, 33L, 34L, 35L, 36L, 
37L, 38L, 39L, 40L, 41L, 42L, 43L, 43L, 44L, 45L, 46L, 46L, 47L, 
48L, 49L, 50L, 50L, 51L, 52L, 53L, 53L, 53L, 54L, 55L, 56L, 57L, 
58L, 59L, 60L, 60L, 61L, 62L, 62L, 63L, 64L, 64L, 64L, 65L, 66L, 
67L, 68L, 69L, 69L, 69L, 70L, 70L, 70L, 71L, 71L, 72L, 72L, 73L, 
74L, 74L, 75L, 75L, 75L, 75L, 76L, 77L, 78L, 78L, 79L, 79L, 79L, 
79L, 79L, 80L, 80L, 81L, 81L, 82L, 82L, 83L, 83L, 84L, 85L, 85L, 
85L, 85L, 85L, 85L, 86L, 87L, 87L, 88L, 88L, 88L, 88L, 89L, 90L, 
91L, 91L, 91L, 91L, 91L, 92L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 
99L, 100L, 101L, 102L, 103L, 104L, 105L, 106L, 106L, 106L, 107L, 
108L, 108L, 108L, 109L, 110L, 111L, 111L, 111L, 111L, 111L, 111L, 
112L, 113L, 114L, 114L, 114L, 114L, 114L, 114L, 115L, 116L, 117L, 
118L, 119L, 119L, 119L, 120L), .Label = c("1001001031", "1002001029", 
"1002001060", "1002003013", "1002005026", "1002005060", "1002005088", 
"1002005111", "1002005135", "1002006021", "1002007024", "1004001030", 
"1005001032", "1005002011", "1005003008", "1005003036", "1005005005", 
"1005005030", "1005006068", "1005007012", "1005007043", "1005008019", 
"1005009005", "1005009032", "1005009057", "1005010012", "1005011010", 
"1005013052", "1005015012", "1005016024", "1005017002", "1005017042", 
"1005017077", "1006001008", "1006002010", "1006002039", "1006002063", 
"1006004001", "1006004028", "1006005015", "1006005035", "1006006012", 
"1006007015", "1006007040", "1006008012", "1006008035", "1006009024", 
"1006009052", "1006010015", "1006010044", "1006011032", "1006012001", 
"1006012029", "1006014004", "1006014055", "1006017009", "1007001038", 
"1007003019", "1007003043", "1007004007", "1007004057", "1007005008", 
"1008001030", "1008004006", "1008005024", "1008006014", "1008007019", 
"1009001008", "1009002017", "1009003001", "1009003031", "1009003055", 
"1009003080", "1009004008", "1009004034", "1009004057", "1009005024", 
"1009005053", "1009005077", "1010001005", "1010001046", "1010002011", 
"1010002034", "1010002056", "1010002083", "1010003036", "1010004001", 
"1010005003", "1010006005", "1010007005", "1011001011", "1011002003", 
"1011004012", "1011007002", "1011008016", "1012003036", "1012004008", 
"1012005003", "1012009015", "1013002002", "1013006004", "1013009002", 
"1014001013", "1014003010", "1014004025", "1014006010", "1014008003", 
"1014010011", "1015003013", "1015004018", "1015005009", "1015006015", 
"1015006042", "1015007025", "1015010002", "1015010030", "1015012014", 
"1016001002", "1016002004", "1016003019"), class = "factor"), 
    householdSize = c(5L, 5L, 5L, 3L, 4L, 5L, 4L, 4L, 4L, 7L, 
    5L, 4L, 4L, 2L, 1L, 0L, 4L, 0L, 6L, 7L, 12L, 3L, 2L, 4L, 
    3L, 4L, 4L, 9L, 6L, 4L, 6L, 6L, 3L, 2L, 4L, 3L, 5L, 4L, 3L, 
    2L, 2L, 1L, 1L, 7L, 5L, 7L, 4L, 2L, 6L, 7L, 2L, 5L, 3L, 2L, 
    6L, 12L, 5L, 4L, 9L, 10L, 8L, 7L, 6L, 5L, 2L, 0L, 2L, 4L, 
    5L, 3L, 3L, 2L, 4L, 2L, 1L, 4L, 5L, 10L, 1L, 1L, 4L, 4L, 
    4L, 4L, 7L, 23L, 4L, 6L, 1L, 5L, 4L, 4L, 2L, 1L, 0L, 2L, 
    8L, 9L, 7L, 7L, 7L, 6L, 4L, 4L, 4L, 7L, 3L, 15L, 6L, 6L, 
    3L, 5L, 8L, 5L, 4L, 4L, 11L, 4L, 7L, 1L, 1L, 1L, 3L, 3L, 
    6L, 2L, 1L, 9L, 4L, 15L, 1L, 5L, 1L, 1L, 2L, 10L, 11L, 2L, 
    8L, 15L, 9L, 7L, 2L, 9L, 4L, 4L, 0L, 2L, 6L, 5L, 2L, 2L, 
    2L, 3L, 4L, 5L, 9L, 26L, 6L, 7L, 3L, 3L, 4L, 9L, 1L, 6L, 
    4L, 4L, 4L, 3L, 5L, 3L, 5L, 4L, 2L, 0L, 5L, 7L, 2L, 6L, 2L, 
    0L, 6L, 5L, 8L, 12L, 7L, 5L, 6L, 5L, 5L, 5L, 4L, 2L, 3L, 
    6L, 3L, 3L, 3L, 2L)), .Names = c("buildingRef", "householdSize"
), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 11L, 12L, 
13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 
26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 
39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 
52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 
65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 
78L, 79L, 80L, 81L, 82L, 83L, 84L, 85L, 87L, 88L, 89L, 90L, 91L, 
92L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 100L, 101L, 102L, 103L, 
104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 
115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 
126L, 127L, 128L, 129L, 130L, 131L, 132L, 133L, 134L, 135L, 136L, 
137L, 138L, 139L, 140L, 141L, 142L, 143L, 144L, 145L, 146L, 147L, 
148L, 149L, 150L, 151L, 152L, 153L, 154L, 155L, 156L, 157L, 158L, 
159L, 160L, 161L, 162L, 163L, 165L, 166L, 167L, 168L, 169L, 170L, 
171L, 172L, 173L, 174L, 175L, 176L, 177L, 178L, 179L, 181L, 182L, 
183L, 184L, 185L, 186L, 187L, 188L, 189L, 190L, 191L, 192L, 193L, 
194L, 195L, 196L, 197L, 198L, 199L, 200L, 201L, 202L, 203L, 205L
), class = "data.frame")

1 answer:

Answer 0 (score: 1)

You are very close. Your first line, where you count the rows, is fine:

hhsize_counted <- count(hh_table_short, c("householdSize","buildingRef"))
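
To see what that line produces, here is a minimal sketch on a made-up data frame (toy and counted are hypothetical names, and aggregate from base R stands in for plyr's count). Collapsing to one row per distinct householdSize/buildingRef pair can leave fewer rows than the input, while the freq column still adds up to the original number of observations:

```r
# Hypothetical toy data standing in for hh_table_short (not the asker's data)
toy <- data.frame(
  buildingRef   = c("A", "A", "A", "B", "B", "C"),
  householdSize = c(4L, 4L, 2L, 4L, 4L, 2L)
)

# Base R stand-in for count(): one row per distinct pair, freq = multiplicity
counted <- aggregate(freq ~ householdSize + buildingRef,
                     data = transform(toy, freq = 1L), FUN = sum)

nrow(toy)          # number of households in the input
nrow(counted)      # number of distinct (householdSize, buildingRef) pairs
sum(counted$freq)  # adds back up to nrow(toy)
```

So a smaller row count after the counting step does not by itself mean that households were dropped.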

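But when you call dcast, you need to set fun.aggregate to sum (rather than the default, length). Otherwise it treats each row of your hhsize_counted dataset as if its frequency were always 1.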

hhsize_counted <- dcast(hhsize_counted, householdSize ~ freq, value.var = "freq", fun.aggregate = sum)
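
A small sketch of the difference, again on made-up data (toy, counted, by_length and by_sum are hypothetical names, and it assumes plyr and reshape2 are installed). With the default aggregation the cells count rows of the counted table, whereas with sum they add up freq:

```r
library(plyr)      # count()
library(reshape2)  # dcast()

# Hypothetical toy data standing in for hh_table_short (not the asker's data)
toy <- data.frame(
  buildingRef   = factor(c("A", "A", "A", "B", "B", "C")),
  householdSize = c(4L, 4L, 2L, 4L, 4L, 2L)
)

# One row per distinct (householdSize, buildingRef) pair, plus a freq column
counted <- count(toy, c("householdSize", "buildingRef"))

# No fun.aggregate given: dcast falls back to length, so each cell is the
# number of rows of `counted` that land in it
by_length <- dcast(counted, householdSize ~ freq, value.var = "freq")

# fun.aggregate = sum: each cell adds up the freq values instead
by_sum <- dcast(counted, householdSize ~ freq, value.var = "freq",
                fun.aggregate = sum)

sum(as.matrix(by_length[, -1]))  # equals nrow(counted)
sum(as.matrix(by_sum[, -1]))     # equals nrow(toy)
```

The grand total of by_sum matches the number of rows in toy, whereas the total of by_length only matches the number of rows in counted, which is the kind of shortfall described in the question.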