How to split a data frame column that has no defined delimiter

Asked: 2017-03-16 20:19:16

Tags: r string split

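How can I split the column 'seriesID' into multiple columns so that the result looks like the second table below? Basically, I need to split each string into substrings of lengths (3, 3, 6, 1, 1, 3).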

  seriesID
1 ISU111aaaaaa33001
2 ISU222bbbbbb33001
3 ISU000cccccc63001
4 ISU333dddddd63001


  seriesID           pre  supp  ind     data  case  area
1 ISU111aaaaaa33001  ISU  111   aaaaaa  3     3     001
2 ISU222bbbbbb33001  ISU  222   bbbbbb  3     3     001
3 ISU000cccccc63001  ISU  000   cccccc  6     3     001
4 ISU333dddddd63001  ISU  333   dddddd  6     3     001

Thanks!

5 answers:

Answer 0 (score: 2)

You can also use substr:

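widths <- c(3, 3, 6, 1, 1, 3)
end <- cumsum(widths)               # last character position of each piece
start <- c(1, head(end, -1) + 1)    # first character position of each piece

# one substr() call per (start, end) pair, each vectorised over the column
as.data.frame(mapply(substr, start, end, MoreArgs = list(x = df$seriesID)))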

#   V1  V2     V3 V4 V5  V6
#1 ISU 000 000000  3  3 001
#2 ISU 000 000000  3  3 001
#3 ISU 000 000000  6  3 001
#4 ISU 000 000000  6  3 001
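
The same idea can be wrapped in a small helper that also names the columns and runs on the question's sample data. This is only a sketch: the function name split_fixed and its arguments are made up for illustration, and it assumes every string is at least sum(widths) characters long.

```r
# Hypothetical helper (not from any package): split each string in x
# into fixed-width pieces and return them as a data frame.
split_fixed <- function(x, widths, col_names) {
  end   <- cumsum(widths)            # last character of each piece
  start <- c(1, head(end, -1) + 1)   # first character of each piece
  # One substr() call per piece; each call is vectorised over x
  pieces <- Map(function(s, e) substr(x, s, e), start, end)
  names(pieces) <- col_names
  as.data.frame(pieces, stringsAsFactors = FALSE)
}

df <- data.frame(
  seriesID = c("ISU111aaaaaa33001", "ISU222bbbbbb33001",
               "ISU000cccccc63001", "ISU333dddddd63001"),
  stringsAsFactors = FALSE
)

parts <- split_fixed(df$seriesID, c(3, 3, 6, 1, 1, 3),
                     c("pre", "supp", "ind", "data", "case", "area"))
result <- cbind(df, parts)   # keep the original column next to the pieces
print(result)
```

All pieces come back as character vectors, so values such as "001" keep their leading zeros.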

Answer 1 (score: 1)

You can use readr to "re-read" the data as a fixed-width file. For example:

library(readr)

series <- c("ISU00000000033001", "ISU00000000033001", "ISU00000000063001", "ISU00000000063001")

read_fwf(paste(series, collapse = "\n"), fwf_widths(c(3, 3, 6, 1, 1, 3)))
# A tibble: 4 × 6
#      X1    X2     X3    X4    X5    X6
#   <chr> <chr>  <chr> <int> <int> <chr>
# 1   ISU   000 000000     3     3   001
# 2   ISU   000 000000     3     3   001
# 3   ISU   000 000000     6     3   001
# 4   ISU   000 000000     6     3   001

Note that this collapses the vector of strings into a single newline-separated string, which may be inefficient for large vectors.

Answer 2 (score: 1)

seriesID <- c('ISU00000000033001',
              'ISU00000000033001',
              'ISU00000000063001',
              'ISU00000000063001')

df <- data.frame(pre  = substr(seriesID, 1, 3),
                 supp = substr(seriesID, 4, 6),
                 ind  = substr(seriesID, 7, 12),
                 data = substr(seriesID, 13, 13),
                 case = substr(seriesID, 14, 14),
                 area = substr(seriesID, 15, 17))

df


  pre supp    ind data case area
1 ISU  000 000000    3    3  001
2 ISU  000 000000    3    3  001
3 ISU  000 000000    6    3  001
4 ISU  000 000000    6    3  001

Answer 3 (score: 1)

You can use separate from the tidyr package:

df <- data.frame(series=c("ISU00000000033001","ISU00000000033001","ISU00000000063001","ISU00000000063001"), stringsAsFactors=FALSE)

library(tidyr)
df %>%
  separate(series, 
           c("pre", "supp", "ind", "data", "case", "area"), 
           sep=cumsum(c(3,3,6,1,1)))

  pre supp    ind data case area
1 ISU  000 000000    3    3  001
2 ISU  000 000000    3    3  001
3 ISU  000 000000    6    3  001
4 ISU  000 000000    6    3  001
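
When sep is numeric, separate() interprets the values as character positions to split at, which is why five cumulative sums yield six columns. A base-R sketch of what those positions mean (the variable names here are illustrative only, not part of tidyr):

```r
x    <- "ISU00000000033001"
cuts <- cumsum(c(3, 3, 6, 1, 1))    # split after characters 3, 6, 12, 13, 14

start <- c(1, cuts + 1)             # first character of each piece
end   <- c(cuts, nchar(x))          # last character of each piece
pieces <- substring(x, start, end)  # substring() is vectorised over start/end
print(pieces)
```

The last width (3) is not needed in sep because the final piece simply runs to the end of the string.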

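Answer 4 (score: 0)

It sounds like this should really be handled when the data is first read in, using something like read.fwf() (https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.fwf.html).

But to solve the problem as posed, use substr():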

seriesID <- c('ISU00000000033001', 'ISU00000000033001', 'ISU00000000063001', 'ISU00000000063001')

df <- data.frame(seriesID = seriesID,
    pre = substr(seriesID, 1, 3),
    supp = substr(seriesID, 4, 6),
    ind = substr(seriesID, 7, 12),
    data = substr(seriesID, 13, 13),
    case = substr(seriesID, 14, 14),
    area = substr(seriesID, 15, 17))

print(df)
#            seriesID pre supp    ind data case area
# 1 ISU00000000033001 ISU  000 000000    3    3  001
# 2 ISU00000000033001 ISU  000 000000    3    3  001
# 3 ISU00000000063001 ISU  000 000000    6    3  001
# 4 ISU00000000063001 ISU  000 000000    6    3  001
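
For the read.fwf() route mentioned at the start of this answer, a minimal sketch might look like the following. It assumes the strings are already in memory, so textConnection() stands in for a real file, and the column names are taken from the question.

```r
seriesID <- c("ISU00000000033001", "ISU00000000033001",
              "ISU00000000063001", "ISU00000000063001")

# textConnection() lets read.fwf() treat the character vector like a file
con <- textConnection(seriesID)
parsed <- read.fwf(con,
                   widths = c(3, 3, 6, 1, 1, 3),
                   col.names = c("pre", "supp", "ind", "data", "case", "area"),
                   colClasses = "character")  # keep leading zeros such as "001"
close(con)
print(parsed)
```

Without colClasses = "character", the all-digit columns would be read as numbers and "001" would lose its leading zeros.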