Pad a variable with leading zeros across a list of data frames (ideally w/ stringr)

Time: 2019-02-14 22:00:45

Tags: r lapply number-formatting stringr pad

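I am working with a list of data frames. In each data frame I would like to pad a single ID variable with leading zeros. The ID variable is a character vector and is always the first variable in the data frame. However, the length of the ID variable differs from one data frame to the next. For example:

df1_id ranges over 1:20, so I need to pad with at most one zero; df2_id ranges over 1:100, so I need to pad with at most two zeros; and so on.

My question is: how can I pad each data frame without writing a separate line of code for every data frame in the list?

As mentioned above, I can solve this by calling str_pad on each data frame separately. For example, see the code below: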

#Load stringr package
library(stringr)

#Create sample data frames
df1 <- data.frame("x" = as.character(1:20), "y" = rnorm(20, 10, 1), 
stringsAsFactors = FALSE)

df2 <- data.frame("v" = as.character(1:100), "y" = rnorm(100, 10, 1), 
stringsAsFactors = FALSE)

df3 <- data.frame("z" = as.character(1:1000), "y" = rnorm(1000, 10, 1), 
stringsAsFactors = FALSE)

#Combine data frames into list
dfl <- list(df1, df2, df3)

#Pad ID variables with leading zeros
dfl[[1]]$x <- str_pad(dfl[[1]]$x, width = 2, pad = "0")
dfl[[2]]$v <- str_pad(dfl[[2]]$v, width = 3, pad = "0")
dfl[[3]]$z <- str_pad(dfl[[3]]$z, width = 4, pad = "0")

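While this solution works well enough for a short list, it becomes unwieldy as the number of data frames grows.

I would love a way to feed some kind of "sequence" vector into the width argument of str_pad. Something like this: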

dfl <- lapply(dfl, function(x) {x[,1] <- str_pad(x[,1], width = SEQ, pad = 
"0")})

where SEQ is a vector of varying widths. For the example above it would look like:

seq <- c(2,3,4)

Thanks in advance, and let me know if you have any questions.

〜kj

1 answer:

Answer 0: (score: 0)

You could use Map here. It is designed to apply a function "to the first elements of each ... argument, the second elements, the third elements"; see ?mapply for details.

library(stringr)
vec <- c(2,3,4) # this is the vector of 'widths', don't name it seq

Map(function(i, y) {
  dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
  dfl[[i]] # this gets returned
}, 
# you iterate over these two vectors in parallel
i = 1:length(dfl), 
y = vec) 

Output

#[[1]]
#   x         y
#1 01  9.373546
#2 02 10.183643
#3 03  9.164371
#
#[[2]]
#    v         y
#1 001 11.595281
#2 002 10.329508
#3 003  9.179532
#4 004 10.487429
#
#[[3]]
#     z         y
#1 0001 10.738325
#2 0002 10.575781
#3 0003  9.694612
#4 0004 11.511781
#5 0005 10.389843

explanation

The function that we pass to Map is an anonymous function, which more or less you provided in your question:

function(i, y) {
  dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
  dfl[[i]] # this gets returned
}

The function takes two arguments, i and y (choose other names if you like, such as df and width), and for each data frame in your list it modifies the first column, dfl[[i]][, 1] <- ... . It does this by applying str_pad to the first column of each data frame

... <- str_pad(dfl[[i]][, 1], width = y, pad = "0")

but note that the width argument receives y rather than a fixed value.

Coming back to Map. Map now applies str_pad to the first dataframe, with argument width = 2, it applies str_pad to the second dataframe, with argument width = 3 and - you probably guessed it - it applies str_pad to the third dataframe in your list, with argument width = 4.

The arguments are specified in the last two lines of the code as

i = 1:length(dfl), 
y = vec) 
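
Because Map walks over its arguments in parallel, the list of data frames can also be passed directly instead of an index. The following is only a sketch of that variation, not the code used above: pad_first_col is a hypothetical helper name, and the small dfl is rebuilt here so the snippet runs on its own.

library(stringr)

# Small example list, mirroring the data section of this answer
dfl <- list(
  data.frame(x = as.character(1:3), y = rnorm(3, 10, 1), stringsAsFactors = FALSE),
  data.frame(v = as.character(1:4), y = rnorm(4, 10, 1), stringsAsFactors = FALSE),
  data.frame(z = as.character(1:5), y = rnorm(5, 10, 1), stringsAsFactors = FALSE)
)
vec <- c(2, 3, 4) # one width per data frame, in list order

# Hypothetical helper: pad the first column of one data frame to a given width
pad_first_col <- function(df, width) {
  df[[1]] <- str_pad(df[[1]], width = width, pad = "0")
  df # return the modified data frame
}

# Map pairs dfl[[1]] with vec[1], dfl[[2]] with vec[2], and so on
padded <- Map(pad_first_col, dfl, vec)
padded[[2]]$v

This version does not reach into dfl from inside the function, so the same helper can be reused on any list of data frames as long as the width vector has one entry per data frame.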

I hope this helps.
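
If typing the widths by hand is also something you want to avoid, one possible approach is to derive each width from the data. This sketch assumes the target width is simply the length of the longest ID in each data frame, which matches the 1:20 and 1:100 examples in the question but may not hold for your real data.

library(stringr)

# Example list with IDs of different lengths (y columns omitted for brevity)
dfl <- list(
  data.frame(x = as.character(1:20), stringsAsFactors = FALSE),
  data.frame(v = as.character(1:100), stringsAsFactors = FALSE),
  data.frame(z = as.character(1:1000), stringsAsFactors = FALSE)
)

padded <- lapply(dfl, function(df) {
  # Assumption: the longest ID in this data frame sets the width
  width <- max(nchar(df[[1]]))
  df[[1]] <- str_pad(df[[1]], width = width, pad = "0")
  df # return the modified data frame
})

# First ID of each data frame after padding
vapply(padded, function(df) df[[1]][1], character(1))

With this approach a plain lapply is enough, because nothing has to be iterated in parallel with the list.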


data

(consider creating a minimal example next time, since the number of rows in the data frames is not relevant to the problem)

set.seed(1)
df1 <- data.frame("x" = as.character(1:3), "y" = rnorm(3, 10, 1), 
                  stringsAsFactors = FALSE)

df2 <- data.frame("v" = as.character(1:4), "y" = rnorm(4, 10, 1), 
                  stringsAsFactors = FALSE)

df3 <- data.frame("z" = as.character(1:5), "y" = rnorm(5, 10, 1), 
                  stringsAsFactors = FALSE)

#Combine data frames into list
dfl <- list(df1, df2, df3)