Question

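I am trying to split the following data frame column into 3 columns, depending on the content. I tried using dplyr and mutate because I want to learn them better, but any suggestion is welcome.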

exampledf<-data.frame(c("Argentina","2005/12","2005/11","Bolivia","2006/12"),stringsAsFactors=F)
mutate(exampledf,month=strsplit(exampledf[,1],"/")[1],month=strsplit(exampledf[,1],"/")[2])

My goal:

Year     Month    Country
2005     12       Argentina
2005     11       Argentina
2006     12       Bolivia

This is very close to this SO post, but that post does not solve my problem of the repeated country.

Answer 1

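We create a logical index ('i1') for the rows that contain no digits, take its cumulative sum, split the dataset by that grouping index, extract 'year' and 'month' with sub and take 'Country' from the first element of each group, build a data.frame per group, and rbind the list contents.

i1 <- grepl('^[^0-9]+$', exampledf$Col1)
lst <- lapply(split(exampledf, cumsum(i1)), function(x)
   data.frame(year = as.numeric(sub('\\/.*', '', x[-1,1])),
              month = as.numeric(sub('.*\\/', '', x[-1,1])),
              Country = x[1,1]))
res <- do.call(rbind, lst)
row.names(res) <- NULL
res
#  year month   Country
#1 2005    12 Argentina
#2 2005    11 Argentina
#3 2006    12   Bolivia

Alternatively, using data.table, we convert the 'data.frame' to a 'data.table' (setDT(exampledf)). Grouped by the cumsum of the index from above, we split 'Col1' (with the first element removed) on the delimiter (/) using tstrsplit. This gives two columns. We then concatenate the first element to create three columns and change the column names with setnames. If the grouping variable is not needed, it can be assigned (:=) to NULL.

Data

The code above refers to the question's data frame with its single column named Col1.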
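The data.table steps described in this answer can be sketched roughly as follows. This is a reconstruction under stated assumptions, not the answerer's original code: the column name Col1 is taken from the base R snippet, while the names dt, grp and res are chosen here for illustration.

```r
library(data.table)

# Assumed data: the question's vector stored in a column named Col1
exampledf <- data.frame(
  Col1 = c("Argentina", "2005/12", "2005/11", "Bolivia", "2006/12"),
  stringsAsFactors = FALSE
)

dt <- as.data.table(exampledf)

# TRUE for rows without digits, i.e. the country rows
i1 <- grepl("^[^0-9]+$", dt$Col1)

# Group by the running count of country rows; within each group, split the
# date rows on "/" and append the group's first element (the country)
res <- dt[, c(tstrsplit(Col1[-1], "/", fixed = TRUE), list(Col1[1])),
          by = .(grp = cumsum(i1))]

setnames(res, c("grp", "year", "month", "Country"))
res[, grp := NULL]  # drop the grouping variable (grp is a hypothetical name)
res
```

In this sketch year and month stay character columns; as.integer can convert them if numbers are needed.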

Answer 2

My approach is not very elegant, but it tries to clean the data step by step...

edf<-data.frame(c("Argentina","2005/12","2005/11","Bolivia","2006/12"),
                stringsAsFactors=F)

names(edf) <- "x"  # just to give a concise name

# flag if the row shows the month or not
edf$isMonth <- (regexpr("^[0-9]+/[0-9]+$", edf$x) > 0)

# expand the country 
# (i.e. if the row is month, reuse the country from the previous row)
edf$country <- edf$x
for (i in seq(2, nrow(edf))) {
  if (edf$isMonth[i]) {
    edf$country[i] <- edf$country[i-1]
  }
}

# now only the rows with month are relevant
edf <- edf[edf$isMonth,]

This gives you:

     x isMonth   country
2005/12    TRUE Argentina
2005/11    TRUE Argentina
2006/12    TRUE   Bolivia
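The row-by-row loop that carries the country forward can also be written without a loop. The following is a minimal base R sketch, assuming the first row is always a country name; the helper vector countries is introduced here for illustration only.

```r
edf <- data.frame(
  x = c("Argentina", "2005/12", "2005/11", "Bolivia", "2006/12"),
  stringsAsFactors = FALSE
)
edf$isMonth <- grepl("^[0-9]+/[0-9]+$", edf$x)

# cumsum(!isMonth) counts how many country rows have been seen so far,
# so it can index into the vector of country names for every row
countries <- edf$x[!edf$isMonth]
edf$country <- countries[cumsum(!edf$isMonth)]

# keep only the rows that hold a year/month value
edf <- edf[edf$isMonth, ]
edf
```

The result should match the table above; the loop version may still be easier to follow when the fill rule becomes more complicated.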

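Now the remaining task is to split the year-month variable into year and month. In your example, the strsplit code fails because strsplit returns a list, while mutate operates on whole vectors rather than element by element.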
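To make the point about the list return value concrete, here is a small base R sketch (the vector x and the names parts, year and month are illustrative). It shows that indexing the result of strsplit with [1] yields a list of length 1, not the years, and one way to pull out each piece per element.

```r
x <- c("2005/12", "2005/11", "2006/12")

parts <- strsplit(x, "/", fixed = TRUE)
# parts is a list with one character vector per input string;
# parts[1] is a list of length 1, not a vector of years

# extract the first and second piece of every element
year  <- vapply(parts, `[`, character(1), 1)
month <- vapply(parts, `[`, character(1), 2)
```

Vectors built this way have one value per row, so they can be assigned as columns or used inside mutate.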

In this particular case, I find stringr::str_match useful.

library(stringr)
matched <- str_match(edf$x, "([0-9]+)/([0-9]+)")
edf$year <- matched[, 2]
edf$month <- matched[, 3]

The result is:

      x isMonth   country year month    
2005/12    TRUE Argentina 2005    12
2005/11    TRUE Argentina 2005    11
2006/12    TRUE   Bolivia 2006    12

Answer 3

An alternative strategy. It is not concise, but it is easy to understand.

library(tidyr)
df <-data.frame(Country = c("Argentina","2005/12","2005/11","Bolivia","2006/12"),stringsAsFactors=F)
df$dates[grep("[0-9]",df$Country)] <- df$Country[grep("[0-9]",df$Country)]
df$Country[grep("[0-9]",df$Country)] <- NA

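# carry the most recent country name forward into the NA rows
x <- df$Country   # working copy; x was never initialized in the original
replace_with <- 1
for(i in 1:length(df$Country)) {
  if(!is.na(df$Country[i])) {
    replace_with <- df$Country[i]
    next
  } else {
    x[i] <- replace_with
  }
}
df$Country <- x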
df <- separate(df, dates, c("Year", "Month"), "/")
df <- na.omit(df)
df
    Country Year Month
2 Argentina 2005    12
3 Argentina 2005    11
5   Bolivia 2006    12
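Since the question asks about dplyr, the same strategy (blank out the date rows, fill the country downwards, then split the dates) can be sketched as a single pipeline. This is an illustrative variant, not part of the answer above; the column names x and is_date are assumptions, and fill comes from tidyr.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  x = c("Argentina", "2005/12", "2005/11", "Bolivia", "2006/12"),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(is_date = grepl("/", x, fixed = TRUE),          # flag the date rows
         Country = ifelse(is_date, NA_character_, x)) %>%
  fill(Country) %>%                # carry the last country name downwards
  filter(is_date) %>%              # keep only the date rows
  separate(x, c("Year", "Month"), sep = "/") %>%
  select(Year, Month, Country)

res
```

Year and Month are left as character here; separate also accepts convert = TRUE if numeric columns are preferred.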

Answer 4

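Here is another option. You can use read.mtable from my "SOfun" package, together with cSplit from "splitstackshape" and rbindlist from "data.table".

Assuming you have at least loaded the read.mtable function (in case you don't want to install the package), the approach would be: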

library(SOfun)
library(splitstackshape)

rbindlist(lapply(read.mtable(textConnection(exampledf[[1]]), "[a-z]"), 
                 cSplit, "V1", "/"), idcol = TRUE)
#          .id V1_1 V1_2
# 1: Argentina 2005   12
# 2: Argentina 2005   11
# 3:   Bolivia 2006   12

Alternatively, you can use read.mtable itself to split the data (though I suspect cSplit may be faster). The approach would then be:

# library(SOfun)
# library(data.table)
rbindlist(read.mtable(textConnection(exampledf[[1]]), "[a-z]", 
                      sep = "/", col.names = c("Year", "Month")), idcol = TRUE)
#          .id Year Month
# 1: Argentina 2005    12
# 2: Argentina 2005    11
# 3:   Bolivia 2006    12

With this approach, you can name the columns in the process.
