I have a list like this:
d <- list(
structure(
list(
time = structure(
1452841800,
class = c("POSIXct", "POSIXt")
),
latitude = 34.0128987,
longitude = -84.7879747,
location = structure(
list(),
.Names = character(0)
),
day = "FRIDAY"
),
.Names = c("time", "latitude", "longitude", "location", "day")
),
structure(
list(
time = structure(
1456875240,
class = c("POSIXct", "POSIXt")
),
latitude = 35.85285882,
longitude = -78.69758511,
location = structure(
list(
postcode = "27612"
),
.Names = "postcode"
),
day = "TUESDAY"
),
.Names = c("time", "latitude", "longitude", "location", "day")
),
structure(
list(
time = structure(
1456621440,
class = c("POSIXct", "POSIXt")
),
latitude = 33.81418132,
longitude = -84.73134873,
location = structure(
list(
postcode = "30127"
),
.Names = "postcode"
),
day = "SATURDAY"
),
.Names = c("time", "latitude", "longitude", "location", "day")
),
structure(
list(
time = structure(
1451953320,
class = c("POSIXct", "POSIXt")
),
latitude = 33.6678031,
longitude = -86.5398931,
location = structure(
list(
postcode = "35173"
),
.Names = "postcode"
),
day = "MONDAY"
),
.Names = c("time", "latitude", "longitude", "location", "day")
),
structure(
list(
time = structure(
1452966960,
class = c("POSIXct", "POSIXt")
),
latitude = 33.8458767,
longitude = -84.0986578,
location = structure(
list(
postcode = "30047"
),
.Names = "postcode"
),
day = "SATURDAY"
),
.Names = c("time", "latitude", "longitude", "location", "day")
),
structure(
list(
time = structure(
1455584160,
class = c("POSIXct", "POSIXt")
),
latitude = 36.4001153,
longitude = -105.5727933,
location = structure(
list(
postcode = "87571"
),
.Names = "postcode"
),
day = "MONDAY"
),
.Names = c("time", "latitude", "longitude", "location", "day")
)
)
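I'd like to turn it into a data frame. I'm almost there, but I've run into a few challenges. When I drop the list elements that aren't "numeric", I get a nice data frame with numeric columns, like so: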
df <- as.data.frame(
do.call(rbind, lapply(d, function(x) unlist(x[-c(4, 5)]))),
stringsAsFactors = FALSE
)
str(df)
'data.frame': 6 obs. of 3 variables:
$ time : num 1.45e+09 1.46e+09 1.46e+09 1.45e+09 1.45e+09 ...
$ latitude : num 34 35.9 33.8 33.7 33.8 ...
$ longitude: num -84.8 -78.7 -84.7 -86.5 -84.1 ...
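So far so good...
Now, as soon as the list contains a character item, every column is coerced to character. That's not what I want. Of course I could convert the columns back afterwards. But...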
df <- as.data.frame(
do.call(rbind, lapply(d, function(x) unlist(x[-4]))),
stringsAsFactors = FALSE
)
str(df)
'data.frame': 6 obs. of 4 variables:
$ time : chr "1452841800" "1456875240" "1456621440" "1451953320" ...
$ latitude : chr "34.0128987" "35.85285882" "33.81418132" "33.6678031" ...
$ longitude: chr "-84.7879747" "-78.69758511" "-84.73134873" "-86.5398931" ...
$ day : chr "FRIDAY" "TUESDAY" "SATURDAY" "MONDAY" ...
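The character columns follow from how unlist() builds its result: an atomic vector can hold only one type, so a single character element forces every value to character. A minimal sketch of that coercion, using a made-up two-field record (`rec` is hypothetical and not part of the data above):

```r
# Hypothetical record mixing a number and a string
rec <- list(latitude = 34.0128987, day = "FRIDAY")

flat <- unlist(rec)      # atomic vectors hold a single type
class(flat)              # "character"

# Without the character field the result stays numeric
num_only <- unlist(rec["latitude"])
class(num_only)          # "numeric"
```

rbind() on such vectors then produces a character matrix, which is why as.data.frame() ends up with only chr columns.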
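Finally, because the location$postcode field is an empty list in one record, the whole mechanism can't even give me a proper data frame. I'm working around it by extracting that field separately and binding it on as a column, as follows: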
postcode <- sapply(d, function(x) if (length(x$location)) unlist(x$location) else NA)
df$postcode <- postcode
df
time latitude longitude day postcode
1 1452841800 34.0128987 -84.7879747 FRIDAY <NA>
2 1456875240 35.85285882 -78.69758511 TUESDAY 27612
3 1456621440 33.81418132 -84.73134873 SATURDAY 30127
4 1451953320 33.6678031 -86.5398931 MONDAY 35173
5 1452966960 33.8458767 -84.0986578 SATURDAY 30047
6 1455584160 36.4001153 -105.5727933 MONDAY 87571
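A type-stable variant of this workaround, sketched on a two-record stand-in for d (`d_small` and `get_postcode` are illustrative names, not part of the original data): vapply() with NA_character_ guarantees a character vector of the right length, even if every location happens to be empty.

```r
# Stand-in for d: one empty location, one with a postcode
d_small <- list(
  list(location = list()),
  list(location = list(postcode = "27612"))
)

# Return the postcode, or a character NA when the location is empty
get_postcode <- function(x) {
  pc <- x$location$postcode        # NULL when location is an empty list
  if (length(pc)) pc else NA_character_
}

postcode <- vapply(d_small, get_postcode, character(1))
postcode
```

Unlike sapply(), vapply() stops with an error if any record yields something other than a single string, so malformed records surface early.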
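Three questions:
1) How can I preserve the classes when converting the list to a data frame?
2) Is there a better way to deal with an empty list item inside the list (my postcode field)?
3) If there is no other way for #2, is there a more efficient way to do what I'm doing than looping through the data a second time? I suppose I could combine the empty-list check on the postcode field with the concatenation inside the lapply I pass to do.call(rbind, ...).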
Edit: for clarity, these are the classes of the named elements in my list:
sapply(d[[1]], class)
$time
[1] "POSIXct" "POSIXt"
$latitude
[1] "numeric"
$longitude
[1] "numeric"
$location
[1] "list"
$day
[1] "character"
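In a way, the "first" case only works in the sense that it keeps the values numeric, i.e. my POSIXct element time survives, but converted to a plain number. I'd like it to stay intact. :)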
Answer 0 (score: 3)
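Handle $location separately (which gets around the empty-list problem), then call as.data.frame on each list item (which gets around the everything-becomes-character problem).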
d2 <- lapply(d, function(df) {
as.data.frame(within(df, location <- if (length(location) > 0) location$postcode else NA),
stringsAsFactors = FALSE)
})
str(do.call(rbind, d2))
# 'data.frame': 6 obs. of 5 variables:
# $ time : POSIXct, format: "2016-01-14 23:10:00" "2016-03-01 15:34:00" "2016-02-27 17:04:00" ...
# $ latitude : num 34 35.9 33.8 33.7 33.8 ...
# $ longitude: num -84.8 -78.7 -84.7 -86.5 -84.1 ...
# $ location : chr NA "27612" "30127" "35173" ...
# $ day : chr "FRIDAY" "TUESDAY" "SATURDAY" "MONDAY" ...
Edit: as pointed out in the comments, the performance of the above is rather dismal. It can be improved:
d3 <- lapply(d, function(df) {
within(df, location <- if (length(location) > 0) location$postcode else NA)
})
str(do.call(rbind.data.frame, c(d3, list(stringsAsFactors = FALSE))))
# 'data.frame': 6 obs. of 5 variables:
# $ time : num 1.45e+09 1.46e+09 1.46e+09 1.45e+09 1.45e+09 ...
# $ latitude : num 34 35.9 33.8 33.7 33.8 ...
# $ longitude: num -84.8 -78.7 -84.7 -86.5 -84.1 ...
# $ location : chr NA "27612" "30127" "35173" ...
# $ day : chr "FRIDAY" "TUESDAY" "SATURDAY" "MONDAY" ...
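(Unfortunately, the POSIXct class is lost in the process. It can be restored by calling as.POSIXct on the time column; on R versions before 4.3 a numeric input also needs origin = "1970-01-01".)
The latter technique performs somewhat better: it is roughly 3-4 times faster.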
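A sketch of that class-restoring step, assuming the numeric column holds seconds since 1970-01-01 UTC, as the values in the question do. `df_num` is a hypothetical stand-in for the numeric result, and tz = "UTC" is an assumption made only so the printed time does not depend on the local time zone:

```r
# Stand-in for the data frame whose time column came back as numeric
df_num <- data.frame(time = c(1452841800, 1456875240))

# Convert seconds since the epoch back to POSIXct
df_num$time <- as.POSIXct(df_num$time, origin = "1970-01-01", tz = "UTC")

class(df_num$time)                                        # "POSIXct" "POSIXt"
first_time <- format(df_num$time[1], "%Y-%m-%d %H:%M:%S")
first_time                                                # "2016-01-15 07:10:00"
```

Omitting tz displays the same instants in the session's local time zone, which is why the printed times in the two answers differ from each other.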
Answer 1 (score: 0)
First, the data in the $location field need to be changed appropriately:
for(i in seq_along(d))
d[[i]]$location = if(length((tmp <- d[[i]]$location$postcode))) tmp else NA_character_
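Then concatenate, in parallel, each sub-element of each element of d using .mapply, a convenient version of the form Map(c, d[[1]], d[[2]], ...). In addition, at least for the implicit/explicit classes in this example, a c method is available, so the "class" attribute is not lost: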
ans = .mapply(c, d, NULL)
And convert to a "data.frame" with the appropriate "names":
ans = structure(ans,
class = "data.frame",
row.names = .set_row_names(length(ans[[1]])),
names = names(d[[1]]))
str(ans)
#'data.frame': 6 obs. of 5 variables:
# $ time : POSIXct, format: "2016-01-15 09:10:00" "2016-03-02 01:34:00" "2016-02-28 03:04:00" "2016-01-05 02:22:00" ...
# $ latitude : num 34 35.9 33.8 33.7 33.8 ...
# $ longitude: num -84.8 -78.7 -84.7 -86.5 -84.1 ...
# $ location : chr NA "27612" "30127" "35173" ...
# $ day : chr "FRIDAY" "TUESDAY" "SATURDAY" "MONDAY" ...
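To see why the classes survive, here is a reduced sketch of the same idea on two hypothetical records (`recs` is made up for illustration and has only two fields). c() dispatches on its first argument, so c.POSIXct keeps the date-time class while plain strings are concatenated as usual:

```r
# Two made-up records with a POSIXct field and a character field
recs <- list(
  list(time = as.POSIXct(1452841800, origin = "1970-01-01", tz = "UTC"),
       day = "FRIDAY"),
  list(time = as.POSIXct(1456875240, origin = "1970-01-01", tz = "UTC"),
       day = "TUESDAY")
)

# Combine field by field: c(time1, time2), c(day1, day2)
cols <- .mapply(c, recs, NULL)
names(cols) <- names(recs[[1]])

# First class of each combined column
col_classes <- sapply(cols, function(col) class(col)[1])
col_classes
```

Because each column is built with a single c() call, this approach also avoids creating one intermediate data frame per record.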