Fastest way to convert a data frame into a list of comma-separated strings

Time: 2016-04-13 01:30:57

Tags: arrays r performance dataframe type-conversion

I have a data frame containing municipality names and state names. It looks like this:

 my.df <- structure(list(Location = c("Abatiá", "Adrianópolis", "Agudos do Sul", 
"Almirante Tamandaré", "Altamira do Paraná", "Altônia"), State = c("PR", 
"PR", "PR", "PR", "PR", "PR")), .Names = c("Location", "State"
), row.names = 0:5, class = "data.frame")

What I need to do is convert this data frame into an array. The expected output would be:

my.array$PR
Abatiá, PR
Adrianópolis, PR
Agudos do Sul, PR
...

my.array$RS
Vitória das Missões, RS
Westfalia, RS
Xangri-lá, RS
...

And so on.

How can I get there?

My actual dataset has about 10k rows, so a fast solution is probably preferable to a clear one. Thanks!

3 answers:

Answer 0 (score: 4)

The following gives you what you need.

df = data.frame("location" = c("a", "b", "c", "d", "e", "f"), "state" = c("pr", "pr", "pr", "rs", "rs", "rs"), stringsAsFactors=FALSE)
my.array = lapply(unique(df$state), function(x) paste(df$location[df$state == x], df$state[df$state == x], sep=", "))
names(my.array) = unique(df$state)
my.array$pr
# [1] "a, pr" "b, pr" "c, pr"

I simplified the values in df, but the idea stays the same.
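For the column names used in the question (Location, State), the same idea might look like the sketch below. The three-row sample data is made up for illustration, and computing the row mask once per state is a small variation on the code above, not something the answer itself does.

```r
# Hypothetical sample data using the question's column names
my.df <- data.frame(
  Location = c("Agudos do Sul", "Altamira do Paraná", "Westfalia"),
  State    = c("PR", "PR", "RS"),
  stringsAsFactors = FALSE
)

states <- unique(my.df$State)

# For each state, select its rows once and paste "Location, State"
my.array <- lapply(states, function(s) {
  rows <- my.df$State == s
  paste(my.df$Location[rows], s, sep = ", ")
})
names(my.array) <- states

my.array$RS
```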

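Answer 1 (score: 4)

Since you want a list-like result (i.e. one you can index with $), use split on State. It naturally produces a list named by State.

One approach is to split first: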

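split_df <- split(my.df, my.df$State)
# simplify = FALSE keeps the result a list even when all groups have the same size
my.array <- sapply(names(split_df), function(name) 
                               paste(split_df[[name]][["Location"]],
                                     ", ", name, sep=""), 
                    simplify = FALSE, USE.NAMES = TRUE)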

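A second way to use split (which, having thought about your question, seems more elegant) is to build the location, state pairs first and split afterwards: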

# First, create a new vector (array) of location, state pairs
# use apply(X, 1, FUN) which works row-wise along X
# and for each row, paste it together
location_state <- apply(my.df,
                        1,
                        function(r) paste(r["Location"],
                                          r["State"],
                                          sep=', '))
#Second, split that vector, using State
split(location_state, my.df$State)
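Because paste() is vectorized, the row-wise apply() step can probably be dropped altogether. The following is a minimal sketch of that variation, not part of the original answer; the sample data is hypothetical and only mirrors the question's column names.

```r
# Hypothetical sample data using the question's column names
my.df <- data.frame(
  Location = c("Agudos do Sul", "Altamira do Paraná", "Westfalia"),
  State    = c("PR", "PR", "RS"),
  stringsAsFactors = FALSE
)

# paste() works element-wise on whole columns, so all
# "Location, State" strings are built in a single call
location_state <- paste(my.df$Location, my.df$State, sep = ", ")

# split() groups the strings by state and returns a list named by state
my.array <- split(location_state, my.df$State)

my.array$RS
```

Skipping apply() also avoids converting the data frame to a character matrix row by row, which is likely to matter more as the number of rows grows.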

Example data

states <- sapply(1:100, function(pass) paste0(sample(LETTERS, 2), collapse=""))
my.df <- data.frame(State=sample(states, 10000, replace=TRUE),
                 Location=sapply(1:1e4, function(pass) paste0(sample(letters, 5),
                                                         collapse="")), 
                    stringsAsFactors=FALSE)

Answer 2 (score: 2)

How about using Reduce?

Reduce(function(...) paste(..., sep=", "), my.df)

Edit: benchmark updated with @thelatemail's suggestion

# for your benchmarking, using 1 million rows
library(rbenchmark)
df <- data.frame(X=rnorm(1e6), Y=rnorm(1e6))
benchmark(M1=Reduce(function(...) paste(..., sep=", "), df),
    M2=do.call(paste, c(df, sep=", ")))

##   test replications elapsed relative user.self sys.self user.child sys.child
## 1   M1           10   68.60    1.219     68.55     0.00         NA        NA
## 2   M2           10   56.28    1.000     56.22     0.07         NA        NA

do.call(paste, c(df, sep=", ")) is definitely faster!
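Both calls above return one flat character vector rather than the per-state list the question asks for. One way to get there, sketched here under the assumption that the data has the question's Location and State columns (the sample rows are hypothetical), is to pass the pasted vector to split():

```r
# Hypothetical sample data using the question's column names
my.df <- data.frame(
  Location = c("Agudos do Sul", "Altamira do Paraná", "Westfalia"),
  State    = c("PR", "PR", "RS"),
  stringsAsFactors = FALSE
)

# Select the columns explicitly so their order in the output is fixed,
# then paste them element-wise in a single call
pairs <- do.call(paste, c(my.df[c("Location", "State")], sep = ", "))

# Group the flat vector by state to get a list indexable with $
my.array <- split(pairs, my.df$State)

my.array$PR
```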