Matching and pasting in a data frame in R

Time: 2015-06-03 13:03:25

Tags: regex r dataframe pattern-matching

I have a match-and-paste problem. I have a data frame like this:
df
#     X1   X2   X3   X4   X5   X6
#t1 <NA> <NA>   AU   78 <NA> <NA>
#t2   dA   AK <NA> <NA>    5 <NA>
#t3   ip <NA> <NA> <NA> <NA> <NA>
#t4 <NA> <NA> <NA> <NA> <NA>   BA

I would like it to look like this after the operation:

newdf
#     X1   X2   X3   X4   X5   X6
#v1 <NA> <NA> <NA> <NA> <NA> <NA>
#v2 AU78 <NA> <NA> <NA> <NA> <NA>
#v3  AK5 <NA> <NA> <NA> <NA> <NA>
#v4 <NA> <NA> <NA> <NA> <NA>   BA

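The process should first search for values that begin with "A", in this case df[1,3] and df[2,2]. Each such value should then be pasted together with the number to its right (there will always be a number to the right). To help, there will never be a stray character between the target element (such as 'AK') and the number to its right; only NAs will separate them.

These new combined values need to be brought to the first column, one row down from their original position. It does not matter if values already present in the first row are overwritten.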

My pattern locator is:

pat.locate <- lapply(df, function(x) grep('^A', x))
un.pat <- unlist(pat.locate)
#X2 X3 
# 2  1 

This looks like a good start. From there,

df[un.pat, names(un.pat)]
#     X2   X3
#t2   AK <NA>
#t1 <NA>   AU

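So the target values can be found, along with their column and row indices. But I need the values to the right of these indices. To subset the whole rows,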

full.row <- df[un.pat, ]
#     X1   X2   X3   X4   X5   X6
#t2   dA   AK <NA> <NA>    5 <NA>
#t1 <NA> <NA>   AU   78 <NA> <NA>

I paste the non-NA values together, but you can see what happens:

paste(full.row[!is.na(full.row)], collapse='')
#[1] "dAAKAU785"

To split this up per row, use apply over the rows:

pasty <- function(x) paste(x[!is.na(x)], collapse='')
pasted.rows <- apply(full.row, 1, pasty)
#     t2      t1 
#"dAAK5"  "AU78" 

This still leaves the stray string at the start. If I could find a good regex that throws it away, I would have

good.regex
#    t2     t1 
# "AK5" "AU78"

Then I could subset the whole data frame according to these indices:

df[names(good.regex), 1] <- good.regex
df
#     X1   X2   X3   X4   X5   X6
#t1 AU78 <NA>   AU   78 <NA> <NA>
#t2  AK5   AK <NA> <NA>    5 <NA>
#t3   ip <NA> <NA> <NA> <NA> <NA>
#t4 <NA> <NA> <NA> <NA> <NA>   BA

But I still need to move the pasted values down by one row.

df[names(good.regex)+1, 1] <- good.regex
#Error in names(good.regex) + 1 : non-numeric argument to binary operator

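We obviously cannot add a number to a name-based subset. I feel I missed something early on that put me on the hard road to a solution. The regex would have to use the output of the pattern match together with a lookbehind that I cannot crack. I think I am working myself into an unnecessary corner. Any help is appreciated.

Data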

df <- structure(list(X1 = c(NA, "dA", "ip", NA), X2 = c(NA, "AK", NA, 
NA), X3 = c("AU", NA, NA, NA), X4 = c("78", NA, NA, NA), X5 = c(NA, 
"5", NA, NA), X6 = c(NA, NA, NA, "BA")), .Names = c("X1", "X2", 
"X3", "X4", "X5", "X6"), row.names = c("t1", "t2", "t3", "t4"
), class = "data.frame")

newdf <- structure(list(X1 = structure(c(NA, 2L, 1L, NA), .Names = c("v1", 
"v2", "v3", "v4"), .Label = c("AK5", "AU78"), class = "factor"), 
    X2 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), .Names = c("v1", "v2", "v3", "v4"), .Label = character(0), class = "factor"), 
    X3 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), .Names = c("v1", "v2", "v3", "v4"), .Label = character(0), class = "factor"), 
    X4 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), .Names = c("v1", "v2", "v3", "v4"), .Label = character(0), class = "factor"), 
    X5 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), .Names = c("v1", "v2", "v3", "v4"), .Label = character(0), class = "factor"), 
    X6 = structure(c(NA, NA, NA, 1L), .Names = c("v1", "v2", 
    "v3", "v4"), .Label = "BA", class = "factor")), .Names = c("X1", 
"X2", "X3", "X4", "X5", "X6"), row.names = c("v1", "v2", "v3", 
"v4"), class = "data.frame")

1 Answer:

Answer 0: (score: 0)

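From your example output, what I understand is that the point is to collapse the A* value and the number that follows it in the same row, then move this new entity down to the first column of the row below, while "erasing" the original row (row 1 of newdf is filled with NA) and leaving non-matching rows intact if they are not affected by the previous operation (row 4).

Your main problem was collapsing over the whole row instead of only over its end.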

## original data
df <- structure(list(X1 = c(NA, "dA", "ip", NA), 
                     X2 = c(NA, "AK", NA, NA), 
                     X3 = c("AU", NA, NA, NA), 
                     X4 = c("78", NA, NA, NA), 
                     X5 = c(NA, "5", NA, NA), 
                     X6 = c(NA, NA, NA, "BA")), 
                .Names = c("X1", "X2", "X3", "X4", "X5", "X6"), 
                row.names = c("t1", "t2", "t3", "t4"), class = "data.frame")

df
     X1   X2   X3   X4   X5   X6
t1 <NA> <NA>   AU   78 <NA> <NA>
t2   dA   AK <NA> <NA>    5 <NA>
t3   ip <NA> <NA> <NA> <NA> <NA>
t4 <NA> <NA> <NA> <NA> <NA>   BA

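The following function grabs the rows with a matching pattern, but collapses only from that pattern to the end of the row, ignoring the beginning. This avoids the problem caused by non-matching stray values (dA in the example):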

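locateAndPaste <- function(x){
  ## values of row x of the global df, as a character vector
  rowValues <- unlist(df[x, ])
  hits <- grep('^A', rowValues)
  if(length(hits) > 0){
    ## keep only the part of the row from the first match to the last column
    endRow <- rowValues[hits[1]:length(rowValues)]
    paste(endRow[!is.na(endRow)], collapse='')
  }
  else{NA}
}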

The if test prevents an error from being thrown when no match is found, and the else element returns NA for such rows.
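
To see what the guard protects against: when grep finds nothing it returns a zero-length result, and using that as the start of a `:` range is an error. A minimal sketch, run outside the function (the names noHit and res are only for illustration):

```r
## grep() returns integer(0) when nothing matches
noHit <- grep("^A", c("ip", NA, NA))

## a zero-length value cannot start a range; tryCatch() captures
## the error object instead of stopping the script
res <- tryCatch(noHit:6, error = function(e) e)
inherits(res, "error")
```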

newEntity <- sapply(1:nrow(df),  locateAndPaste)
# [1] "AU78" "AK5"  NA     NA

Two matching patterns were found, in rows 1 and 2, and none in rows 3 and 4. As you can see, the collapsing part works well.
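
As a side note, the same collapsing step can be sketched without relying on a global df. The helper name collapseFromMatch and its pattern argument are hypothetical, not part of the code above; the sketch assumes character columns, as in the example data.

```r
## Hypothetical helper: collapse each row from its first match of
## `pattern` to the end of the row; rows without a match give NA.
collapseFromMatch <- function(data, pattern = "^A") {
  vapply(seq_len(nrow(data)), function(i) {
    rowValues <- unlist(data[i, ], use.names = FALSE)
    hits <- grep(pattern, rowValues)          # NA values never match
    if (length(hits) == 0) return(NA_character_)
    endRow <- rowValues[hits[1]:length(rowValues)]
    paste(endRow[!is.na(endRow)], collapse = "")
  }, character(1))
}

## same values as the example data
df <- data.frame(X1 = c(NA, "dA", "ip", NA),
                 X2 = c(NA, "AK", NA, NA),
                 X3 = c("AU", NA, NA, NA),
                 X4 = c("78", NA, NA, NA),
                 X5 = c(NA, "5", NA, NA),
                 X6 = c(NA, NA, NA, "BA"),
                 row.names = c("t1", "t2", "t3", "t4"),
                 stringsAsFactors = FALSE)

collapsed <- collapseFromMatch(df)
collapsed
# [1] "AU78" "AK5"  NA     NA
```

Using vapply with character(1) keeps the result a character vector even when no row matches.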

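Your second problem was moving down one row, and not being able to add a number to a string. Since I subset by index rather than by name, this problem is easily avoided:

(For completeness, I added a line at the end of this post about converting those names to numbers.)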

## the newEntity element is already ordered according to the original row numbers
originalRowNumbers <- grep("^A", newEntity)
# [1] 1 2

From there, it is quite straightforward:

newdf <- df   ## all operations can be done on the original df, 
              ## this copy is made only for the sake of the example.

## as per your example, "erase" the original lines where a matching pattern was found
## that will also prevent orphan lines if no match has been found in the line above
newdf[originalRowNumbers, ] <- rep(NA, length(df))

## place the new entity in the first column one row below
newdf[originalRowNumbers+1, 1] <- newEntity[originalRowNumbers]
## fill the rest of this row with NA as per your example
newdf[originalRowNumbers+1, 2:length(df)] <- NA


newdf
     X1   X2   X3   X4   X5   X6
t1 <NA> <NA> <NA> <NA> <NA> <NA>
t2 AU78 <NA> <NA> <NA> <NA> <NA>
t3  AK5 <NA> <NA> <NA> <NA> <NA>
t4 <NA> <NA> <NA> <NA> <NA>   BA

However, if a matching pattern is found in the last row, an extra row will be added to newdf. To avoid this, the initial selection can be shortened:

newEntity <- sapply(1:(nrow(df)-1),  locateAndPaste)
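
Putting the pieces together, the whole procedure with the last row excluded from the search might look like the sketch below. The function name shiftMatchesDown is hypothetical, and the test data is an assumption: it is the example df with "BA" replaced by "AB" so that the last row contains a match.

```r
## Hypothetical wrapper: collapse, erase the source rows, and move the
## collapsed values one row down into the first column.
shiftMatchesDown <- function(data, pattern = "^A") {
  ## search every row except the last one, which has no row below it
  searchable <- seq_len(max(nrow(data) - 1, 0))
  pasted <- vapply(searchable, function(i) {
    rowValues <- unlist(data[i, ], use.names = FALSE)
    hits <- grep(pattern, rowValues)
    if (length(hits) == 0) return(NA_character_)
    endRow <- rowValues[hits[1]:length(rowValues)]
    paste(endRow[!is.na(endRow)], collapse = "")
  }, character(1))
  rows <- searchable[!is.na(pasted)]
  out <- data
  if (length(rows) > 0) {
    out[rows, ] <- NA                  # erase the rows that matched
    out[rows + 1, ] <- NA              # clear the target rows
    out[rows + 1, 1] <- pasted[rows]   # place the collapsed values
  }
  out
}

## assumed variant of the example data with a match in the last row
df2 <- data.frame(X1 = c(NA, "dA", "ip", NA),
                  X2 = c(NA, "AK", NA, NA),
                  X3 = c("AU", NA, NA, NA),
                  X4 = c("78", NA, NA, NA),
                  X5 = c(NA, "5", NA, NA),
                  X6 = c(NA, NA, NA, "AB"),
                  row.names = c("t1", "t2", "t3", "t4"),
                  stringsAsFactors = FALSE)

res <- shiftMatchesDown(df2)
nrow(res)   # still 4: no extra row is appended
```

Because the last row is never searched, its "AB" value is left in place rather than erased, which mirrors the shortened sapply call above.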


For completeness: in your example, you could simply extract the numbers from the names of good.regex and feed them to your subset:

idx.good.regex <- as.numeric(gsub("t","", names(good.regex)))
# [1] 2 1
df[idx.good.regex+1, 1] <- good.regex

Note that this only works if good.regex is of class character. If good.regex is a data.frame, it will fail.
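
The gsub call depends on the row names being a "t" followed by the row number. A sketch that does not depend on that naming scheme looks the names up with match() instead. Here good.regex is typed in by hand from the desired output shown in the question, since the question never actually computes it.

```r
df <- data.frame(X1 = c(NA, "dA", "ip", NA),
                 X2 = c(NA, "AK", NA, NA),
                 X3 = c("AU", NA, NA, NA),
                 X4 = c("78", NA, NA, NA),
                 X5 = c(NA, "5", NA, NA),
                 X6 = c(NA, NA, NA, "BA"),
                 row.names = c("t1", "t2", "t3", "t4"),
                 stringsAsFactors = FALSE)

## assumed values, copied from the question's intended result
good.regex <- c(t2 = "AK5", t1 = "AU78")

## position of each name among the row names of df
idx <- match(names(good.regex), rownames(df))
idx
# [1] 2 1

## numeric indices can be shifted by one row
df[idx + 1, 1] <- good.regex
df$X1
# [1] NA     "AU78" "AK5"  NA
```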