How to import multidimensional data from an Excel file into a long data frame

Asked: 2017-09-13 17:45:11

Tags: r dplyr data-manipulation

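One of the most tedious exercises in data manipulation with R is reading data from sources that were not compiled for that purpose, and I always struggle when I run into this kind of problem. I would welcome any help, and I have put together an example that illustrates some of the issues I encounter regularly.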

Suppose I have the following Excel file: [image: screenshot of the Excel file]

and want to read it into a long data frame like this:

   year country  region    group count
1  2015 Austria capital students  4747
2  2015 Austria capital toddlers  1781
3  2015 Austria capital  workers  1443
4  2015 Austria  cities students  3245
5  2015 Austria  cities toddlers   404
6  2015 Austria  cities  workers   213
7  2015 Austria   total students  7992
8  2015 Austria   total toddlers  2185
9  2015 Austria   total  workers  1656
10 2015 Denmark capital students   289
11 2015 Denmark capital toddlers  3699
12 2015 Denmark capital  workers  1518
13 2015 Denmark  cities students  4659
14 2015 Denmark  cities toddlers  2476
15 2015 Denmark  cities  workers  2495
16 2015 Denmark   total students  4948
17 2015 Denmark   total toddlers  6175
18 2015 Denmark   total  workers  4013
19 2016 Austria capital students  1836
20 2016 Austria capital toddlers  1130
21 2016 Austria capital  workers  1981
22 2016 Austria  cities students   809
23 2016 Austria  cities toddlers  2126
24 2016 Austria  cities  workers  2267
25 2016 Austria   total students  2645
26 2016 Austria   total toddlers  3256
27 2016 Austria   total  workers  4248
28 2016 Denmark capital students  2251
29 2016 Denmark capital toddlers  2555
30 2016 Denmark capital  workers  1829
31 2016 Denmark  cities students  4722
32 2016 Denmark  cities toddlers  2165
33 2016 Denmark  cities  workers  4373
34 2016 Denmark   total students  6973
35 2016 Denmark   total toddlers  4720
36 2016 Denmark   total  workers  6202
37 2017 Austria capital students  1181
38 2017 Austria capital toddlers   710
39 2017 Austria capital  workers  3876
40 2017 Austria  cities students   895
41 2017 Austria  cities toddlers   994
42 2017 Austria  cities  workers  3199
43 2017 Austria   total students  2076
44 2017 Austria   total toddlers  1704
45 2017 Austria   total  workers  7075
46 2017 Denmark capital students  1155
47 2017 Denmark capital toddlers  4455
48 2017 Denmark capital  workers  3292
49 2017 Denmark  cities students   683
50 2017 Denmark  cities toddlers  3565
51 2017 Denmark  cities  workers   561
52 2017 Denmark   total students  1838
53 2017 Denmark   total toddlers  8020
54 2017 Denmark   total  workers  3853

The main challenges are:

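  • The "headers" span several rows. The headers can be read in separately, as described here, but that offers no straightforward way to attach them to the data unless one is willing to paste them together into a single value. The same applies to the column names.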

  • Many of the merged cells come in as NA, or empty, when read with openxlsx. tidyr::fill could solve that, but it requires a suitable data structure for the headers first.

  • Some headers are relevant categories, while others are not ("non-workers" is redundant).

  • Some headers are not stated explicitly in the source and have to be added manually ("total", "region", "group").
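The tidyr::fill idea from the list above can be sketched on a small, made-up header block. The data frame `header` and its columns `col`, `year` and `group` are hypothetical stand-ins for whatever structure the header rows end up in; only the first cell of each merged range carries a value, as it would after reading merged cells.

```r
library(tidyr)

# Hypothetical header block: one row per data column. Merged cells leave
# NA in every position except the first cell of the merged range.
header <- data.frame(
  col   = 1:6,
  year  = c(2015, NA, NA, 2016, NA, NA),
  group = c("workers", "students", "toddlers",
            "workers", "students", "toddlers")
)

# Carry each year forward over the gaps left by the merged cells.
header_filled <- fill(header, year, .direction = "down")

header_filled$year
```

This only works once the headers are stored one row per data column, which is exactly the data structure the second point says is missing.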

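An alternative way to import this data is to read.xlsx only columns 3:11 and rows 4:12, melt the result to long format with reshape2, and add the other variables manually. That means specifying a cascade of rep() calls and hoping to get the permutation right so that every field is labelled correctly: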

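library(openxlsx)
library(reshape2)
library(dplyr)  # also provides %>%

# Read only columns 3:11 and rows 4:12, without column names
t <- read.xlsx("h:/example.xlsx", cols=3:11, rows=4:12, colNames=FALSE) %>%
  melt() %>%                      # stack all columns into one value column
  transmute(count = value) %>%    # keep only the counts
  # Rebuild the labels by hand; every vector below has 54 entries
  mutate(country = c("Austria","Denmark") %>% rep(each=3) %>% rep(times=9),
         region = c("total","capital","cities") %>% rep(times=54/3),
         year = c(2015:2017) %>% rep(each=6*3),
         group = c("workers","students","toddlers") %>% rep(each=6) %>% rep(times=3))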
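Because the rep() cascade is easy to get wrong, a cheap consistency check may help. In the target table above, every "total" count equals the sum of the matching "capital" and "cities" counts, so mislabelled rows would usually break that relation. The sketch below runs the check on the first nine rows of the target table (2015, Austria); the names `check` and `mismatches` are made up for illustration.

```r
library(dplyr)

# First nine rows of the target table (2015, Austria).
check <- data.frame(
  region = rep(c("capital", "cities", "total"), each = 3),
  group  = rep(c("students", "toddlers", "workers"), times = 3),
  count  = c(4747, 1781, 1443, 3245, 404, 213, 7992, 2185, 1656)
)

# For each group, compare capital + cities against the reported total.
mismatches <- check %>%
  group_by(group) %>%
  summarise(parts = sum(count[region != "total"]),
            total = sum(count[region == "total"])) %>%
  filter(parts != total)

nrow(mismatches)  # 0 when the labels are consistent
```

For the full data the grouping would also need year and country. The check catches swapped labels only if the swap breaks the sums, so it is a sanity test, not a proof.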

Is there an elegant way to read this kind of data into R automatically?
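One possible direction, offered as a sketch and not as a tested solution for the file above: keep the labels in two small lookup tables, one keyed by column position and one keyed by row position, stack the numeric body together with explicit row and column indices, and join. No header values have to be pasted together, and no rep() permutation has to be guessed. Everything below (`body`, `col_key`, `row_key`, `long`, `result`) is hypothetical and uses a 2 x 4 excerpt with counts taken from the target table; the assumed layout (years and groups across columns, countries and regions down rows) is inferred from the rep() calls in the question.

```r
library(dplyr)

# Hypothetical stand-in for the numeric block read with colNames = FALSE.
body <- data.frame(
  X1 = c(1656, 1443),  # 2015 workers
  X2 = c(7992, 4747),  # 2015 students
  X3 = c(4248, 1981),  # 2016 workers
  X4 = c(2645, 1836)   # 2016 students
)

# One lookup row per data column and one per data row (assumed layout).
col_key <- data.frame(col   = 1:4,
                      year  = c(2015, 2015, 2016, 2016),
                      group = c("workers", "students", "workers", "students"))
row_key <- data.frame(row     = 1:2,
                      country = "Austria",
                      region  = c("total", "capital"))

# Stack the body column by column, keeping each cell's position.
long <- data.frame(
  row   = rep(seq_len(nrow(body)), times = ncol(body)),
  col   = rep(seq_len(ncol(body)), each  = nrow(body)),
  count = unlist(body, use.names = FALSE)
)

# Attach the labels by position instead of by hand-built rep() vectors.
result <- long %>%
  inner_join(row_key, by = "row") %>%
  inner_join(col_key, by = "col") %>%
  select(year, country, region, group, count) %>%
  arrange(year, country, region, group)

result
```

In a real import the two lookup tables would come from the header rows and label columns of the sheet (after filling the merged cells), so the joins stay the same when the sheet grows.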

0 answers:

No answers yet