Question

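I'm trying to replace some (but not all) of the column names in a data frame with more descriptive labels. I have a vector of long names that need to be matched against the relevant current column names and substituted for them.

Details:

I have a data frame that contains both text and numeric columns. For example: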

df<-data.frame(text1=c("nnnn","uuuu","ooo"),
               text2=c("b","t","eee"),
               a1=c(1,2,3),
               a2=c(45,43,23),
               b1=c(43,6,2),
               text3=c("gg","ll","jj"))

So it looks like this:

df
  text1 text2 a1 a2 b1 text3
1  nnnn     b  1 45 43    gg
2  uuuu     t  2 43  6    ll
3   ooo   eee  3 23  2    jj

For some of the column labels, I also have a vector of longer labels:

longnames=c("a1 age","a2 gender","b1 postcode")

Wherever there is a matching long name, I want it to replace the corresponding short name in df entirely. So my desired output is:

  text1 text2 a1 age a2 gender b1 postcode text3
1  nnnn     b      1        45          43    gg
2  uuuu     t      2        43           6    ll
3   ooo   eee      3        23           2    jj

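Every short label that needs replacing uniquely matches the start of its long label. In other words, the short label "a2" needs to be replaced with the long label "a2 gender", and that long label is the only one that starts with "a2".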

Answer 1

dplyr::rename can rename a subset of columns in one go, but it needs the new names supplied as a named vector.

library("tidyverse")

df <- data.frame(
  text1 = c("nnnn", "uuuu", "ooo"),
  text2 = c("b", "t", "eee"),
  a1 = c(1, 2, 3),
  a2 = c(45, 43, 23),
  b1 = c(43, 6, 2),
  text3 = c("gg", "ll", "jj")
)

longnames <- c("a1 age", "a2 gender", "b1 postcode")
shortnames <- str_extract(longnames, "^(\\w+)")

# named vector specifying how to rename
names(shortnames) <- longnames
shortnames
#>      a1 age   a2 gender b1 postcode 
#>        "a1"        "a2"        "b1"

df %>%
  rename(!!shortnames)
#>   text1 text2 a1 age a2 gender b1 postcode text3
#> 1  nnnn     b      1        45          43    gg
#> 2  uuuu     t      2        43           6    ll
#> 3   ooo   eee      3        23           2    jj

# In this case `!!shortnames` achieves this:

df %>%
  rename("a1 age" = "a1",
         "a2 gender" = "a2",
         "b1 postcode" = "b1")
#>   text1 text2 a1 age a2 gender b1 postcode text3
#> 1  nnnn     b      1        45          43    gg
#> 2  uuuu     t      2        43           6    ll
#> 3   ooo   eee      3        23           2    jj

Created on 2019-03-28 by the reprex package (v0.2.1)

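Specifying the new names programmatically is useful because the renaming specification becomes easier and cleaner to change. For readability, though, you can start with the explicit specification; it just takes more typing.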
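For comparison, the same short-to-long lookup can be applied without dplyr. This is only a minimal base R sketch, assuming each long label consists of the short name, a space, and a description; the variable `pos` is introduced here for illustration.

```r
df <- data.frame(
  text1 = c("nnnn", "uuuu", "ooo"),
  text2 = c("b", "t", "eee"),
  a1 = c(1, 2, 3),
  a2 = c(45, 43, 23),
  b1 = c(43, 6, 2),
  text3 = c("gg", "ll", "jj")
)
longnames <- c("a1 age", "a2 gender", "b1 postcode")

# Short name = everything before the first space (assumed label format)
shortnames <- sub(" .*$", "", longnames)

# Position of each short name among the current column names (NA if absent)
pos <- match(shortnames, names(df))

# Rename only the columns that were found; all other columns stay untouched
names(df)[pos[!is.na(pos)]] <- longnames[!is.na(pos)]
names(df)
```

Because `match` works on whole names, a column is renamed only when its name equals the extracted short name exactly.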

Answer 2

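One way is to use sapply. The same thing can be done with a for loop using almost identical code. seq.int(colnames(df)) produces the sequence 1:ncol(df). For each column name of df, grep looks it up in longnames and returns the index of any matching long name. The if condition then checks whether the length of that index vector is > 0 (which it should be whenever the column has a match), and if so the replacement is made.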

## sapply (can be replaced with lapply)
sapply(seq.int(colnames(df)), function(x) {
  index <- grep(colnames(df)[x], longnames)
  if (length(index) > 0) colnames(df)[x] <<- longnames[index]
})

OR

## for loop (note the difference in <<-)
for (x in seq.int(colnames(df))) {
  index <- grep(colnames(df)[x], longnames)
  if (length(index) > 0) colnames(df)[x] <- longnames[index]
}
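Note that grep treats the column name as a pattern and matches it anywhere in the long label. Since the question says that short labels match the start of the long labels, a stricter variant can be sketched with base R's startsWith. This is a sketch under the assumption that the short name is always followed by a space in the long label; `hit` is a name introduced for illustration.

```r
df <- data.frame(
  text1 = c("nnnn", "uuuu", "ooo"),
  text2 = c("b", "t", "eee"),
  a1 = c(1, 2, 3),
  a2 = c(45, 43, 23),
  b1 = c(43, 6, 2),
  text3 = c("gg", "ll", "jj")
)
longnames <- c("a1 age", "a2 gender", "b1 postcode")

for (i in seq_along(df)) {
  # Long labels that begin with this column name followed by a space
  hit <- which(startsWith(longnames, paste0(names(df)[i], " ")))
  # Rename only when exactly one long label matches
  if (length(hit) == 1) names(df)[i] <- longnames[hit]
}
names(df)
```

Requiring exactly one hit leaves a column unchanged when the match is missing or ambiguous, instead of assigning several labels to one name.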

Answer 3

m1 = sapply(names(df), function(snm) sapply(longnames, function(lnm) grepl(snm, lnm)))
df1 = setNames(df, replace(names(df), colSums(m1) == 1, longnames[rowSums(m1) == 1]))
df1
#  text1 text2 a1 age a2 gender b1 postcode text3
#1  nnnn     b      1        45          43    gg
#2  uuuu     t      2        43           6    ll
#3   ooo   eee      3        23           2    jj

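m1 is a matrix showing the matches between the column names of df and longnames. colSums(m1) == 1 identifies the column names that have a match. rowSums(m1) == 1 identifies the corresponding matching longnames.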
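To make the shape of m1 concrete, the sketch below rebuilds it from the question's names alone. The vector `short` is an illustrative stand-in for names(df).

```r
short <- c("text1", "text2", "a1", "a2", "b1", "text3")
longnames <- c("a1 age", "a2 gender", "b1 postcode")

# One row per long name, one column per short name
m1 <- sapply(short, function(snm) sapply(longnames, function(lnm) grepl(snm, lnm)))

dim(m1)       # rows = long names, columns = short names
colSums(m1)   # how many long names each short name matches
rowSums(m1)   # how many short names each long name matches
```

A column sum of 1 marks a short name with exactly one long label, which is the condition the replace() call relies on.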

Or use partial matching:

inds = pmatch(colnames(df), longnames)
df1 = setNames(df, replace(longnames[inds], is.na(inds), colnames(df)[is.na(inds)]))
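It may help to look at what pmatch returns before it is used for the replacement. The sketch below uses a plain character vector `cols` (an illustrative stand-in for colnames(df)) and an ifelse in place of replace(); both are choices made here, not part of the answer above.

```r
cols <- c("text1", "text2", "a1", "a2", "b1", "text3")
longnames <- c("a1 age", "a2 gender", "b1 postcode")

# Index of the long name that each column name partially matches, NA if none
inds <- pmatch(cols, longnames)
inds

# Keep the original name where there is no match, else take the long name
newnames <- ifelse(is.na(inds), cols, longnames[inds])
newnames
```

pmatch matches only from the start of each candidate string, which fits the question's guarantee that every short label is a unique prefix of its long label.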

Answer 4

You can use adist, which is already vectorized:

a = which(!attr(adist(names(df), longnames, counts = TRUE), 'counts')[, , 'sub'], arr.ind = TRUE)

names(df)[a[, 'row']] = longnames    #longnames[a[,'col']]

df
  text1 text2 a1 age a2 gender b1 postcode text3
1  nnnn     b      1        45          43    gg
2  uuuu     t      2        43           6    ll
3   ooo   eee      3        23           2    jj
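The idea behind the 'sub' filter is that turning a short name into its long label needs only insertions, whereas unrelated names need at least one substitution. A small sketch on two single pairs, with `hit` and `miss` as illustrative names:

```r
# Edit operations needed to turn "a1" into "a1 age"
d_hit <- adist("a1", "a1 age", counts = TRUE)
hit <- attr(d_hit, "counts")[1, 1, ]   # named vector: ins, del, sub
hit

# Edit operations needed to turn "text1" into "a1 age"
d_miss <- adist("text1", "a1 age", counts = TRUE)
miss <- attr(d_miss, "counts")[1, 1, ]
miss
```

A zero in the "sub" slot is what the which(..., arr.ind = TRUE) call picks up as a match. The rule is a heuristic: it depends on non-matching names needing substitutions, which holds for this data but is not guaranteed in general.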
