Converting a list of mixed-class elements to a data frame while preserving classes

Asked: 2016-06-17 23:21:48

Tags: r data-manipulation

I have a list like this (called d in the code further down):

d <- list(
    structure(
        list(
            time = structure(
                1452841800, 
                class = c("POSIXct", "POSIXt")
            ), 
            latitude = 34.0128987, 
            longitude = -84.7879747, 
            location = structure(
                list(), 
                .Names = character(0)
            ),
            day = "FRIDAY"
        ), 
        .Names = c("time", "latitude", "longitude", "location", "day")
    ), 
    structure(
        list(
            time = structure(
                1456875240, 
                class = c("POSIXct", "POSIXt")
            ), 
            latitude = 35.85285882, 
            longitude = -78.69758511, 
            location = structure(
                list(
                    postcode = "27612"
                ), 
                .Names = "postcode"
            ), 
            day = "TUESDAY"
        ), 
        .Names = c("time", "latitude", "longitude", "location", "day")
    ), 
    structure(
        list(
            time = structure(
                1456621440, 
                class = c("POSIXct", "POSIXt")
            ), 
            latitude = 33.81418132, 
            longitude = -84.73134873, 
            location = structure(
                list(
                    postcode = "30127"
                ), 
                .Names = "postcode"
            ), 
            day = "SATURDAY"
        ), 
        .Names = c("time", "latitude", "longitude", "location", "day")
    ), 
    structure(
        list(
            time = structure(
                1451953320, 
                class = c("POSIXct", "POSIXt")
            ), 
            latitude = 33.6678031, 
            longitude = -86.5398931, 
            location = structure(
                list(
                    postcode = "35173"
                ), 
                .Names = "postcode"
            ), 
            day = "MONDAY"
        ), 
        .Names = c("time", "latitude", "longitude", "location", "day")
    ), 
    structure(
        list(
            time = structure(
                1452966960, 
                class = c("POSIXct", "POSIXt")
            ), 
            latitude = 33.8458767, 
            longitude = -84.0986578, 
            location = structure(
                list(
                    postcode = "30047"
                ), 
                .Names = "postcode"
            ), 
            day = "SATURDAY"
        ), 
        .Names = c("time", "latitude", "longitude", "location", "day")
    ), 
    structure(
        list(
            time = structure(
                1455584160, 
                class = c("POSIXct", "POSIXt")
            ), 
            latitude = 36.4001153, 
            longitude = -105.5727933, 
            location = structure(
                list(
                    postcode = "87571"
                ), 
                .Names = "postcode"
            ), 
            day = "MONDAY"
        ), 
        .Names = c("time", "latitude", "longitude", "location", "day")
    )
)

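I want to turn it into a data frame. I am almost there, but I have run into a few challenges. When I drop the list elements that are not "numeric", I get a nice data frame with numeric columns, like this: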

df <- as.data.frame(
    do.call(rbind, lapply(d, function(x) unlist(x[-c(4, 5)]))),
    stringsAsFactors = FALSE
)
str(df)
'data.frame':   6 obs. of  3 variables:
 $ time     : num  1.45e+09 1.46e+09 1.46e+09 1.45e+09 1.45e+09 ...
 $ latitude : num  34 35.9 33.8 33.7 33.8 ...
 $ longitude: num  -84.8 -78.7 -84.7 -86.5 -84.1 ...

So far so good...

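Now, as soon as the list contains a character item, all columns are coerced to character. Not what I want. Sure, I could convert them back afterwards. But...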

df <- as.data.frame(
        do.call(rbind, lapply(d, function(x) unlist(x[-4]))),
        stringsAsFactors = FALSE
        )
str(df)
'data.frame':   6 obs. of  4 variables:
 $ time     : chr  "1452841800" "1456875240" "1456621440" "1451953320" ...
 $ latitude : chr  "34.0128987" "35.85285882" "33.81418132" "33.6678031" ...
 $ longitude: chr  "-84.7879747" "-78.69758511" "-84.73134873" "-86.5398931" ...
 $ day      : chr  "FRIDAY" "TUESDAY" "SATURDAY" "MONDAY" ...
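
The coercion comes from unlist() and rbind(), which both have to produce a single atomic type. A minimal sketch with a hypothetical two-field record (rec, flat, and m are illustration names, not part of the original data):

```r
# A toy record mixing a numeric and a character field (hypothetical values).
rec <- list(latitude = 34.01, day = "FRIDAY")

# unlist() must return one atomic vector, so the number is turned into a string.
flat <- unlist(rec)
class(flat)   # "character"

# rbind() on such vectors builds a character matrix, so every column of a
# data frame made from it is character as well.
m <- rbind(flat, flat)
typeof(m)     # "character"
```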

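Finally, because the location$postcode field is an empty list in one of the records, the whole mechanism cannot even give me a proper data frame. I am working around that by extracting the field separately and binding it on as a column: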

postcode <- sapply(d, function(x) if (length(x$location)) unlist(x$location) else NA)
df$postcode <- postcode
df
        time    latitude    longitude      day postcode
1 1452841800  34.0128987  -84.7879747   FRIDAY     <NA>
2 1456875240 35.85285882 -78.69758511  TUESDAY    27612
3 1456621440 33.81418132 -84.73134873 SATURDAY    30127
4 1451953320  33.6678031  -86.5398931   MONDAY    35173
5 1452966960  33.8458767  -84.0986578 SATURDAY    30047
6 1455584160  36.4001153 -105.5727933   MONDAY    87571
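
One possible tightening of the extraction step, sketched on two hypothetical records (recs is an illustration name): vapply() with NA_character_ makes the result type explicit, whereas sapply() with a bare NA only ends up character because the other elements are strings.

```r
# Two hypothetical records: one with an empty location, one with a postcode.
recs <- list(
    list(location = list()),
    list(location = list(postcode = "27612"))
)

# vapply() enforces a length-one character result for every element, and
# NA_character_ keeps the missing value typed as character.
postcode <- vapply(
    recs,
    function(x) if (length(x$location)) x$location$postcode else NA_character_,
    character(1)
)
postcode   # NA "27612"
```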

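Three questions:

1) How can I preserve the classes when converting the list to a data frame?

2) Is there a better way to handle the empty list items in the list (my postcode field)?

3) If there is no other way for #2, is there a more efficient way to do what I am doing than looping over the data a second time? I suppose I could combine the empty-list check on the postcode field with the lapply I use inside do.call(rbind, ...).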

Edit: for clarity, these are the classes of the named elements in my list:

sapply(d[[1]], class)
$time
[1] "POSIXct" "POSIXt" 

$latitude
[1] "numeric"

$longitude
[1] "numeric"

$location
[1] "list"

$day
[1] "character"

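In a way, the "first" case works insofar as it preserves numeric values, but only after my POSIXct element time has been converted to numeric. I would like it to stay intact. :)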
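
To illustrate where the class goes, here is a small sketch using two of the timestamps from the list (t1 and t2 are illustration names): unlist() drops attributes, while c() dispatches to the POSIXct method and keeps them.

```r
t1 <- structure(1452841800, class = c("POSIXct", "POSIXt"))
t2 <- structure(1456875240, class = c("POSIXct", "POSIXt"))

# unlist() strips attributes, including the class, leaving bare numbers.
class(unlist(list(t1, t2)))   # "numeric"

# c() dispatches to c.POSIXct, so the class survives.
class(c(t1, t2))              # "POSIXct" "POSIXt"
```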

2 Answers:

Answer 0: (score: 3)

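Handle $location separately (which gets around the empty-list problem), then call as.data.frame on each list item (which gets around the everything-becomes-character problem).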

d2 <- lapply(d, function(df) {
          as.data.frame(within(df, location <- if (length(location) > 0) location$postcode else NA),
                        stringsAsFactors = FALSE)
      })
str(do.call(rbind, d2))
# 'data.frame': 6 obs. of  5 variables:
#  $ time     : POSIXct, format: "2016-01-14 23:10:00" "2016-03-01 15:34:00" "2016-02-27 17:04:00" ...
#  $ latitude : num  34 35.9 33.8 33.7 33.8 ...
#  $ longitude: num  -84.8 -78.7 -84.7 -86.5 -84.1 ...
#  $ location : chr  NA "27612" "30127" "35173" ...
#  $ day      : Factor w/ 4 levels "FRIDAY","TUESDAY",..: 1 2 3 4 3 4

Edit: as pointed out in the comments, the performance of the above is rather dismal. It can be improved:

d3 <- lapply(d, function(df) {
             within(df, location <- if (length(location) > 0) location$postcode else NA)
      })
str(do.call(rbind.data.frame, c(d3, list(stringsAsFactors = FALSE))))
# 'data.frame': 6 obs. of  5 variables:
#  $ time     : num  1.45e+09 1.46e+09 1.46e+09 1.45e+09 1.45e+09 ...
#  $ latitude : num  34 35.9 33.8 33.7 33.8 ...
#  $ longitude: num  -84.8 -78.7 -84.7 -86.5 -84.1 ...
#  $ location : chr  NA "27612" "30127" "35173" ...
#  $ day      : chr  "FRIDAY" "TUESDAY" "SATURDAY" "MONDAY" ...

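(Unfortunately, the POSIXct class is lost along the way. It can be restored by calling as.POSIXct on the time column; on R versions before 4.3 that call needs origin = "1970-01-01" for numeric input.)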

The latter technique performs somewhat better, roughly 3-4 times faster.
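
A minimal sketch of restoring the class afterwards, using a hypothetical stand-in (res) for the numeric result. The tz = "UTC" argument is an assumption; substitute the time zone the data were actually recorded in.

```r
# Hypothetical stand-in for a result whose time column came back as numeric.
res <- data.frame(time = c(1452841800, 1456875240),
                  day = c("FRIDAY", "TUESDAY"),
                  stringsAsFactors = FALSE)

# The values are seconds since the Unix epoch. An explicit origin is required
# on R versions before 4.3. tz = "UTC" is an assumption, not from the question.
res$time <- as.POSIXct(res$time, origin = "1970-01-01", tz = "UTC")
str(res)
```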

Answer 1: (score: 0)

First, the data in the $location field need to be reshaped into a length-one value:

for(i in seq_along(d)) 
    d[[i]]$location = if(length((tmp <- d[[i]]$location$postcode))) tmp else NA_character_

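Then concatenate each sub-element across all elements in parallel, in the form Map(c, d[[1]], d[[2]], ...), using .mapply, which conveniently accepts d directly. Also, at least for the implicit/explicit classes in this example, a c method is available, so the class attribute is not lost: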

ans = .mapply(c, d, NULL)

And convert to a "data.frame" with the appropriate "names":

ans = structure(ans, 
                class = "data.frame", 
                row.names = .set_row_names(length(ans[[1]])), 
                names = names(d[[1]]))
str(ans)
#'data.frame':   6 obs. of  5 variables:
# $ time     : POSIXct, format: "2016-01-15 09:10:00" "2016-03-02 01:34:00" "2016-02-28 03:04:00" "2016-01-05 02:22:00" ...
# $ latitude : num  34 35.9 33.8 33.7 33.8 ...
# $ longitude: num  -84.8 -78.7 -84.7 -86.5 -84.1 ...
# $ location : chr  NA "27612" "30127" "35173" ...
# $ day      : chr  "FRIDAY" "TUESDAY" "SATURDAY" "MONDAY" ...
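
For anyone who prefers to avoid the internal .mapply and the hand-built structure() call, the same idea can be sketched with Map() and data.frame(). This is a variation on the approach above rather than the exact code from this answer; recs, cols, and out are illustration names, and the two records are a cut-down copy of the question's data.

```r
# Two records with the same shape as the question's data (longitude omitted).
recs <- list(
    list(time = structure(1452841800, class = c("POSIXct", "POSIXt")),
         latitude = 34.0128987, location = list(), day = "FRIDAY"),
    list(time = structure(1456875240, class = c("POSIXct", "POSIXt")),
         latitude = 35.85285882, location = list(postcode = "27612"),
         day = "TUESDAY")
)

# Step 1: flatten location to a length-one character value (NA when empty).
recs <- lapply(recs, function(x) {
    pc <- x$location$postcode
    x$location <- if (length(pc)) pc else NA_character_
    x
})

# Step 2: combine matching fields across records with c(), which keeps the
# POSIXct class, then let data.frame() assemble the named columns.
cols <- do.call(Map, c(f = c, recs))
out <- data.frame(cols, stringsAsFactors = FALSE)
str(out)
```

Map() takes its names from the first record, so this assumes every record has the same fields in the same order.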