Splitting strings into columns in R, where each string may have a different number of entries

Time: 2013-02-24 20:44:37

Tags: r

I have a data frame in the following format:

pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[page 1, page 3, page 4]      12

What I need to do is split the first column on commas and create enough new columns to cover the longest sequence. The result should be:

First Page      Second Page  Third Page     Count
page 1          page 2       page 3         23
page 2          page 4       null           4
page 1          page 3       page 4         12

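It is fine if null is a zero-length string, and I can handle stripping the brackets myself.

3 Answers:

Answer 0: (score: 4)

My "splitstackshape" package has a function for this kind of problem. In this case the relevant function is concat.split, which works as follows (using "myDat" from Ricardo's answer):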

# Get rid of "[" and "]" from your "pages" variable
myDat$pages <- gsub("\\[|\\]", "", myDat$pages)
# Specify the source data.frame, the variable that needs to be split up
#   and whether to drop the original variable or not
library(splitstackshape)
concat.split(myDat, "pages", ",", drop = TRUE)
#   count pages_1 pages_2 pages_3
# 1    23  page 1  page 2  page 3
# 2     4  page 2  page 4        
# 3    12  page 1  page 3  page 4

Answer 1: (score: 3)

Sample data

myDat <- read.table(text=
  "pages|count
[page 1, page 2, page 3]|23
[page 2, page 4]|4
[page 1, page 3, page 4]|12", header=TRUE, sep="|") 

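We can pull pages out of myDat and work on it separately.

# if factors, convert to characters
# (use the full column name; myDat$page only works through partial matching)
pages <- as.character(myDat$pages)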

# remove brackets.  Note the doubled backslashes needed to escape in R
pages <- gsub("(\\[|\\])", "", pages)

# split on comma
pages <- strsplit(pages, ",")

# find the largest element
maxLen <- max(sapply(pages, length))

# fill in any blanks. The t() is to transpose the return from sapply
pages <- 
t(sapply(pages, function(x)
      # append to x, NA's.  Note that if (0 == (maxLen - length(x))), then no NA's are appended 
      c(x, rep(NA, maxLen - length(x)))
  ))

# add column names as necessary
colnames(pages) <- paste(c("First", "Second", "Third"), "Page")

# Put it all back together
data.frame(pages, Count=myDat$count)



Result

> data.frame(pages, Count=myDat$count)
  First.Page Second.Page Third.Page Count
1     page 1      page 2     page 3    23
2     page 2      page 4       <NA>     4
3     page 1      page 3     page 4    12
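
The steps above hard-code three column names, so they fail when the longest row has a different number of entries. As a minimal sketch, assuming base R 3.2.0 or later (for `lengths` and `trimws`), the same steps can be wrapped in a helper that generates the names from the data. The function name `split_to_columns` and its `prefix` argument are hypothetical and do not come from any of the answers.

```r
# Hypothetical helper (not part of the original answers): split a delimited
# character vector into a character matrix padded with NA.
# Assumes x is non-empty.
split_to_columns <- function(x, sep = ",", prefix = "Page") {
  x <- gsub("\\[|\\]", "", as.character(x))           # drop the brackets
  parts <- lapply(strsplit(x, sep, fixed = TRUE), trimws)
  max_len <- max(lengths(parts))                       # longest sequence
  # `length<-` pads a shorter vector with NA up to max_len
  padded <- lapply(parts, function(p) { length(p) <- max_len; p })
  out <- do.call(rbind, padded)                        # one row per input string
  colnames(out) <- paste(prefix, seq_len(max_len))
  out
}

myDat <- data.frame(
  pages = c("[page 1, page 2, page 3]", "[page 2, page 4]",
            "[page 1, page 3, page 4]"),
  count = c(23, 4, 12),
  stringsAsFactors = FALSE
)

# check.names = FALSE keeps the space in names such as "Page 1"
result <- data.frame(split_to_columns(myDat$pages), Count = myDat$count,
                     check.names = FALSE, stringsAsFactors = FALSE)
```

Unlike the original code, this version also trims the space that follows each comma, and it uses rbind rather than t(sapply(...)) so that the result stays a one-row-per-input matrix even when every string has a single entry.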

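Answer 2: (score: 2)

Using read.table

read.table with fill=TRUE can pad the short rows. The names(DF2) <- line can be omitted if nice column names are not important. No packages are used.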

# test data

Lines <- "pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[page 1, page 3, page 4]      12"

# code - replace text=Lines with something like "myfile.dat"

DF <- read.table(text = Lines, skip = 1, sep = "]", as.is = TRUE)
DF2 <- read.table(text = DF[[1]], sep = ",", fill = TRUE, as.is = TRUE)
names(DF2) <- paste0(read.table(text = Lines, nrow = 1, as.is = TRUE)[[1]], seq_along(DF2))
DF2$count <- DF[[2]]
DF2[[1]] <- sub(".", "", DF2[[1]]) # remove [

This gives:

> DF2
  pages1  pages2  pages3 count
1 page 1  page 2  page 3    23
2 page 2  page 4             4
3 page 1  page 3  page 4    12
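
In the output above, every field after the first keeps the space that followed its comma, because read.table does not trim character fields by default. A small sketch, assuming that exact matches on the page names matter: the standard strip.white argument of read.table removes that whitespace. The vector `x` and the names `raw` and `trimmed` are made up for this illustration.

```r
# Illustration only: x stands in for the bracket-free strings held in DF[[1]]
x <- c("page 1, page 2, page 3", "page 2, page 4", "page 1, page 3, page 4")

# Default behaviour: fields keep the space after each comma
raw <- read.table(text = x, sep = ",", fill = TRUE, as.is = TRUE)

# strip.white = TRUE trims leading and trailing whitespace from each field
trimmed <- read.table(text = x, sep = ",", fill = TRUE, as.is = TRUE,
                      strip.white = TRUE)

raw$V2[1]      # " page 2" (leading space kept)
trimmed$V2[1]  # "page 2"
```

strip.white only takes effect when sep is given explicitly, which is the case here.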

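Note: this gives the column headings pages1, pages2, and so on. If it is important to reproduce exactly the column headings shown in the question, replace the names(DF2) <- line with the following, which uses those headings when there are fewer than 20 page columns: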

# ordinal names for up to 19 page columns
ord <- c('First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh',
  'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth', 'Thirteenth',
  'Fourteenth', 'Fifteenth', 'Sixteenth', 'Seventeenth', 'Eighteenth',
  'Nineteenth')
ix <- seq_along(DF2)
# fall back to "Page n" when there are 20 or more columns
names(DF2) <- if (ncol(DF2) < 20) paste(ord[ix], "Page") else paste("Page", ix)