Question

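So I solved my initial problem using G. Grothendieck's suggestion. Thanks again, it was exactly the clean approach I was after. The original post is here. The catch is that my file is actually a bit more subtle.

It really looks like this (see also the Data section at the end of the post for a reproducible format):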

A1
100
200
txt 
A2
STRING
300
400
txt txt
txt
txt txt txt
A3
STRING
STRING
150
250
A2
.
.
.

The initial data wrangling looks like this:

type <- cumsum(raw_data[[1]] %in% c("A1","A2","A3"))
v <- tapply(raw_data[[1]], type, c, simplify = FALSE)
m <- t(do.call(cbind, lapply(v, ts)))

raw_data <- as.data.frame(m, stringsAsFactors = FALSE)
raw_data[] <- lapply(raw_data, type.convert, as.is = TRUE)

raw_data$Occurences <- 0

Which gives:

  V1     V2     V3   V4      V5   V6          V7
1 A1    100    200 txt     <NA> <NA>        <NA>
2 A2 String    300  400 txt txt  txt txt txt txt
3 A3 String String  150     250 <NA>        <NA>
4 A2   <NA>   <NA> <NA>    <NA> <NA>        <NA>

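The problem is that df[3,4] should be in df[3,2], and I should record "2" in a new column. The same goes for row 2, where df[2,3] should be in df[2,2], with "1" recorded in the same additional column. In other words, I am after this: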

  V1  V2  V3      V4   V5          V6 Occurences
1 A1 100 200    txt  <NA>        <NA>          0
2 A2 300 400 txt txt  txt txt txt txt          1
3 A3 150 250    <NA> <NA>        <NA>          2
4 A2  NA  NA    <NA> <NA>        <NA>          0

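STRING comes right after the A header; sometimes it does not occur at all, sometimes it occurs once or several times. Here is what I did to try to solve this: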

#Count "STRING" occurences and readjust values in expected columns
formatString <- function(df) {
  z <<- which(df[,2] %in% "STRING")
  if (length(z) > 0){
    for (i in z){
      df$Occurences = df$Occurences + 1
      for (j in 2:ncol(df)-1){
        if (is.na(df[i,j]) | is.na(df[i,j+1])){
          df[i,j] = NA
        } else {
          df[i,j] = df[i,j+1]
        }
      }
    }
  }
  z <<- which(df[,2] %in% "STRING")
  if(length(z) > 0){formatString(df)}
}

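This function should only process the rows where STRING is found in column 2. It increments the last column (Occurences) and then shifts all values one column to the left, so that they end up where they are expected to be. The is.na check just tries to stop the loop once we start seeing NA. Once those rows are processed, we check again whether STRING is in column 2 and, if so, call the function again.

Now my problem is that the function appears to do its processing (it takes 20 seconds on almost 19k observations and 261 columns, and I am not sure that is the best in terms of processing time), except that my data frame is not updated at the end of the loop. However, z does get updated, so it seems to work the way it should.

What am I missing?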

Data

Data in reproducible form:

DF <- structure(list(V1 = c("A1", "100", "200", "txt ", "A2", "String",
"300","400", "txt txt", "txt", "txt txt txt", "A3",
"String", "String", "150", "250", "A2")), .Names = "V1",
row.names = c(NA, -17L), class = "data.frame")

Answer 1

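In contrast to the OP's approach, the general idea of this suggestion is to do all the data cleaning in long format and to reshape the data from long to wide as the last step.

It seems the main goal is to align the integer values in columns V2 and V3 of the final wide table, while keeping track of the number of dropped STRING rows between the group header and the first integer row in each group.

Therefore, the data.table (development version 1.9.7) approach below looks for the first row containing an integer value in each group, rather than deleting any rows that explicitly contain the string STRING. This makes the approach more flexible.

In addition, it is assumed that the same group header may appear more than once.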
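The bookkeeping described above (group counter, row id within group, first integer row) can be illustrated without data.table. The following is a minimal base R sketch, not the code of this answer; the vector x and the names is_header, grp_cnt, id, first_int, and occurrences are chosen here purely for illustration.

```r
# Illustrative input: the single column from the question as a character vector
x <- c("A1", "100", "200", "txt ", "A2", "STRING", "300", "400",
       "txt txt", "txt", "txt txt txt", "A3", "STRING", "STRING",
       "150", "250", "A2")

# A group starts at every header row; cumsum() numbers the groups,
# so a repeated header (the second A2) gets its own group
is_header <- grepl("^A[1-3]$", x)
grp_cnt <- cumsum(is_header)

# Row position within each group (the header itself is position 1)
id <- ave(seq_along(x), grp_cnt, FUN = seq_along)

# Position of the first all-digit row per group, NA if the group has none
is_int <- grepl("^[0-9]+$", x)
first_int <- tapply(ifelse(is_int, id, NA_integer_), grp_cnt, function(v) {
  if (all(is.na(v))) NA_integer_ else min(v, na.rm = TRUE)
})

# Rows skipped between the header and the first integer row
occurrences <- first_int - 2L
print(occurrences)
```

Groups without any integer row (here the trailing A2) end up with NA, which corresponds to the group being dropped in the data.table version.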

library(data.table)

# read data (to make it a reproducible example)
dt <- fread("A1
            100
            200
            txt 
            A2
            STRING
            300
            400
            txt txt
            txt
            txt txt txt
            A3
            STRING
            STRING
            150
            250
            A2
            ", header = FALSE, sep = "\n")

# Identify group headers by regular expression and push them down
dt <- dt[V1 %like% "^A[1-3]$", grp := V1][, grp := zoo::na.locf(grp)]
# Count groups in case of multiple appearances of the same group headers
dt[V1 == grp, grp_cnt := .I][, grp_cnt := zoo::na.locf(grp_cnt)]

# Add row count within each individual group
dt[, id := seq_len(.N), by = grp_cnt]

# find first occurrence of an integer in each group by regex
first_int <- dt[V1 %like% "^\\d+$", .(min_id = min(id)), by = grp_cnt]

# non-equi join to start each group with a row containing the first integer
# (requires data.table development version 1.9.7)
dt <- dt[first_int, on = c("grp_cnt", "id>=min_id")]

# compute Occurences as the number of dropped "STRING" rows
dt[, Occurences := id - 2L]

print(dt)
#    grp_cnt          V1 grp id
# 1:       1         100  A1  2
# 2:       1         200  A1  3
# 3:       1         txt  A1  4
# 4:       2         300  A2  3
# 5:       2         400  A2  4
# 6:       2     txt txt  A2  5
# 7:       2         txt  A2  6
# 8:       2 txt txt txt  A2  7
# 9:       3         150  A3  4
#10:       3         250  A3  5

# prepare for dcast: add column names for each group
# (one added to have the column names in line with Q)
dt[, col := paste0("V", seq_len(.N) + 1), by = grp_cnt]

# reshape from long to wide form
z <- dcast(dt, grp_cnt + grp + Occurences ~ col, value.var = "V1")[, grp_cnt := NULL]

# do type conversion on the new columns
new_cols <- dt[, unique(col)]
z[, (new_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = new_cols]

print(z)
#   grp Occurences  V2  V3      V4  V5          V6
#1:  A1          0 100 200     txt  NA          NA
#2:  A2          1 300 400 txt txt txt txt txt txt
#3:  A3          2 150 250      NA  NA          NA

str(z)
#Classes ‘data.table’ and 'data.frame': 3 obs. of  7 variables:
# $ grp       : chr  "A1" "A2" "A3"
# $ Occurences: int  0 1 2
# $ V2        : int  100 300 150
# $ V3        : int  200 400 250
# $ V4        : chr  "txt" "txt txt" NA
# $ V5        : chr  NA "txt" NA
# $ V6        : chr  NA "txt txt txt" NA
# - attr(*, ".internal.selfref")=<externalptr>

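Note that the second occurrence of A2 has been dropped, because none of the rows below the second A2 (in the original single-column file) contains an integer value.

If the production data contain group headers other than A1, A2, and A3, the regular expression used to identify group headers has to be modified accordingly.

The column names are in line with the OP's expected result (except that V1 is named grp, for clarity). The column order is slightly different, which should not matter.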
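As an illustration of such a modification, the sketch below uses a pattern that accepts "A" followed by any number of digits. The pattern and the sample values are assumptions made for this example only; the real header format is not known from the question. It uses base R's grepl(), which takes the same kind of regular expression as the %like% call above.

```r
# Hypothetical sample values, not taken from the production data
x <- c("A1", "100", "A12", "txt", "B3", "STRING")

# Assumed header format: the letter A followed by one or more digits
header_pattern <- "^A[0-9]+$"

is_header <- grepl(header_pattern, x)
print(x[is_header])
```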

Answer 2

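So, because my example was not as accurate as I thought, the data.table approach does not fit my needs. In reality, STRING does not necessarily sit between integers.

I came up with this function: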

#Count "STRING" occurences and readjust values in expected columns
formatSearching <- function(df) {
  z <<- which(df[,2] %in% "STRING")
  if (length(z) > 0){
    for (i in z){
      df[i,"String_occurences"] = df[i,"String_occurences"] + 1
      for (j in 2:(ncol(df)-1)){
        if (is.na(df[i,j]) | is.na(df[i,j+1])){
          df[i,j] = NA
        } else {
          df[i,j] = df[i,j+1]
        }
      }
    }
  }
  z <<- which(df[,2] %in% "STRING")
  #if(length(z) > 0){formatSearching(df)} This somehow does work, but does not update df...
  return(df)
}

Because of that last comment, I call it like this:

raw_data <- formatSearching(raw_data)
while(length(z) > 0){raw_data <- formatSearching(raw_data)}

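There are a few issues with all this. First, my intent was not to have a while loop in the data processing, but a fully functional recursive function. I probably missed an assignment somewhere that would get my raw_data data frame updated.

Second, this process takes time, especially the looping step. It may happen that some rows have up to 10 occurrences while others have only 1. I am sure we can do better, and by better I mean faster and more efficient.

For now this does the job the way I want; I would just like to gain some processing speed.

Thanks, everyone.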
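A likely candidate for the missing assignment: R functions work on a copy of their arguments, so a recursive call only has an effect if its return value is passed back to the caller. The following is a minimal sketch on toy data, assuming the value columns are known by name; shift_strings and value_cols are hypothetical names, and this is not the original formatSearching function.

```r
# Toy data (made up for this sketch): leading "STRING" values in V2
df <- data.frame(V1 = c("A1", "A2", "A3"),
                 V2 = c("100", "STRING", "STRING"),
                 V3 = c("200", "300", "STRING"),
                 V4 = c(NA, "400", "150"),
                 V5 = c(NA, NA, "250"),
                 stringsAsFactors = FALSE)
df$Occurences <- 0

shift_strings <- function(df, value_cols) {
  z <- which(df[[value_cols[1]]] %in% "STRING")
  if (length(z) == 0) return(df)            # base case: nothing left to shift
  df$Occurences[z] <- df$Occurences[z] + 1  # only the affected rows
  n <- length(value_cols)
  # shift the affected rows one column to the left, left to right
  for (k in seq_len(n - 1)) {
    df[z, value_cols[k]] <- df[z, value_cols[k + 1]]
  }
  df[z, value_cols[n]] <- NA
  # the result of the recursive call is the value of the function
  shift_strings(df, value_cols)
}

out <- shift_strings(df, paste0("V", 2:5))
print(out)
```

The caller still has to assign the result (out <- ...), since df itself is left untouched; no global variable such as z is needed.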
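One way to avoid repeated passes might be to count the leading STRING values of each row once and then shift every row by that amount in a single step. The sketch below assumes that STRING only needs to be removed when it appears as an unbroken run at the start of the value columns; shift_leading, value_cols, and marker are hypothetical names. It has not been timed on the 19k by 261 data, so any speed gain is an expectation, not a measurement.

```r
# Toy data (made up for this sketch)
df <- data.frame(V1 = c("A1", "A2", "A3"),
                 V2 = c("100", "STRING", "STRING"),
                 V3 = c("200", "300", "STRING"),
                 V4 = c(NA, "400", "150"),
                 V5 = c(NA, NA, "250"),
                 stringsAsFactors = FALSE)

shift_leading <- function(df, value_cols, marker = "STRING") {
  m <- as.matrix(df[value_cols])              # character matrix of the values
  n <- ncol(m)
  is_marker <- !is.na(m) & m == marker
  # cumprod() stays 1 only while the run of markers is unbroken,
  # so the column sums give the number of leading markers per row
  lead <- unname(colSums(apply(is_marker, 1, cumprod)))
  # source column for every target cell; NA when it falls off the right edge
  src <- outer(lead, seq_len(n), `+`)
  src[src > n] <- NA
  idx <- cbind(as.vector(row(src)), as.vector(src))
  shifted <- matrix(m[idx], nrow = nrow(m))   # NA index rows yield NA cells
  df[value_cols] <- lapply(seq_len(n), function(k) shifted[, k])
  df[value_cols] <- lapply(df[value_cols], type.convert, as.is = TRUE)
  df$Occurences <- lead
  df
}

res <- shift_leading(df, paste0("V", 2:5))
print(res)
```

Because the shift is done with matrix indexing instead of cell-by-cell assignment into a data frame, each row is touched once regardless of how many STRING values it starts with.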
