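One of the most tedious exercises in data manipulation with R is reading in data from sources that were never compiled for that purpose, and I struggle every time I run into this kind of problem. I would welcome any help, and I have put together an example that illustrates some of the issues I regularly encounter.
Suppose I have the following Excel file:
and I want to read it in as a long data frame that looks like this: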
year country region group count
1 2015 Austria capital students 4747
2 2015 Austria capital toddlers 1781
3 2015 Austria capital workers 1443
4 2015 Austria cities students 3245
5 2015 Austria cities toddlers 404
6 2015 Austria cities workers 213
7 2015 Austria total students 7992
8 2015 Austria total toddlers 2185
9 2015 Austria total workers 1656
10 2015 Denmark capital students 289
11 2015 Denmark capital toddlers 3699
12 2015 Denmark capital workers 1518
13 2015 Denmark cities students 4659
14 2015 Denmark cities toddlers 2476
15 2015 Denmark cities workers 2495
16 2015 Denmark total students 4948
17 2015 Denmark total toddlers 6175
18 2015 Denmark total workers 4013
19 2016 Austria capital students 1836
20 2016 Austria capital toddlers 1130
21 2016 Austria capital workers 1981
22 2016 Austria cities students 809
23 2016 Austria cities toddlers 2126
24 2016 Austria cities workers 2267
25 2016 Austria total students 2645
26 2016 Austria total toddlers 3256
27 2016 Austria total workers 4248
28 2016 Denmark capital students 2251
29 2016 Denmark capital toddlers 2555
30 2016 Denmark capital workers 1829
31 2016 Denmark cities students 4722
32 2016 Denmark cities toddlers 2165
33 2016 Denmark cities workers 4373
34 2016 Denmark total students 6973
35 2016 Denmark total toddlers 4720
36 2016 Denmark total workers 6202
37 2017 Austria capital students 1181
38 2017 Austria capital toddlers 710
39 2017 Austria capital workers 3876
40 2017 Austria cities students 895
41 2017 Austria cities toddlers 994
42 2017 Austria cities workers 3199
43 2017 Austria total students 2076
44 2017 Austria total toddlers 1704
45 2017 Austria total workers 7075
46 2017 Denmark capital students 1155
47 2017 Denmark capital toddlers 4455
48 2017 Denmark capital workers 3292
49 2017 Denmark cities students 683
50 2017 Denmark cities toddlers 3565
51 2017 Denmark cities workers 561
52 2017 Denmark total students 1838
53 2017 Denmark total toddlers 8020
54 2017 Denmark total workers 3853
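The main challenges are:
The "header" spans several rows. It is possible to read the header in separately, as described here, but that offers no easy way to attach it to the data unless one plans to paste it into a single value. The same applies to the column names.
There are many merged cells, which will be NA or empty when read in via openxlsx. This could be addressed with tidyr::fill, but that first requires an appropriate data structure for the headers.
Some headers are relevant categories while others are not ("non-workers" is redundant).
Some headers are not stated explicitly in the source and have to be added manually ("total", "region", "group").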
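To make the merged-cell problem concrete, here is a minimal sketch of how tidyr::fill could carry a merged header value forward. The header layout below is made up for illustration (it is an assumption, not the actual sheet), and the names header_rows, header and header_filled are hypothetical. Because fill() works along columns, the header rows are transposed first.

```r
library(tidyr)

# Hypothetical two-row header: a merged "year" cell shows up once,
# followed by NA for the cells it spans (assumed layout, not the real file).
header_rows <- rbind(
  year  = c("2015", NA, NA, "2016", NA, NA),
  group = c("workers", "students", "toddlers",
            "workers", "students", "toddlers")
)

# Transpose so that each header row becomes a column of a data frame.
header <- as.data.frame(t(header_rows), stringsAsFactors = FALSE)

# fill() replaces each NA with the last non-missing value above it.
header_filled <- fill(header, year)
print(header_filled)
```

This only covers the header; attaching the filled labels to the counts is the part that still needs a proper structure.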
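An alternative way to import this data is to read.xlsx only columns 3:11 and rows 4:12, reshape the result to long format, and add the other variables manually, i.e. specify a cascade of rep() calls and hope that the permutation is right so that every field ends up with the correct label: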
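library(openxlsx)
library(reshape2)
library(dplyr)

# Read only the block of counts, without header rows or label columns
t <- read.xlsx("h:/example.xlsx", cols = 3:11, rows = 4:12, colNames = FALSE) %>%
  melt() %>%                       # stack all columns into one long "value" column
  transmute(count = value) %>%
  # Rebuild the labels by hand; their order must match the column-wise stacking
  mutate(country = c("Austria", "Denmark") %>% rep(each = 3) %>% rep(times = 9),
         region  = c("total", "capital", "cities") %>% rep(times = 54 / 3),
         year    = c(2015:2017) %>% rep(each = 6 * 3),
         group   = c("workers", "students", "toddlers") %>% rep(each = 6) %>% rep(times = 3))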
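Since a rep() cascade like this is easy to get wrong, one possible sanity check is to test whether "capital" plus "cities" equals "total" within each group. The sketch below uses the 2015 Austria rows from the desired output shown earlier; the names d, m and labels_ok are hypothetical, and the check assumes that the totals in the sheet really are sums of the two regions.

```r
# The nine 2015 Austria rows from the desired long data frame.
d <- data.frame(
  region = rep(c("capital", "cities", "total"), each = 3),
  group  = rep(c("students", "toddlers", "workers"), times = 3),
  count  = c(4747, 1781, 1443, 3245, 404, 213, 7992, 2185, 1656)
)

# Cross-tabulate counts: one row per group, one column per region.
m <- xtabs(count ~ group + region, data = d)

# If the labels are right, capital + cities matches total for every group.
labels_ok <- m[, "capital"] + m[, "cities"] == m[, "total"]
print(all(labels_ok))
```

A mislabelled region or group would usually break this identity for at least one row, although a passing check does not prove that country and year are labelled correctly.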
Is there an elegant way to read this kind of data into R automatically?