Split numbers and characters on multiple delimiters, then build a matrix

Time: 2017-12-14 09:25:14

Tags: r matrix split

I have a dataset that looks like this:

1       Saturday,SatAug 13, 2016-5:30 PM
2                  54.362·Robert Madley
3         Sunday,SunAug 14, 2016-1:30 PM
4        11.355 sold out·Andre Marriner

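I want to split the dataset on "," or "·" and turn the result into a matrix or data frame. In row 4, 11.355 and "sold out" also need to be separated. So the final dataset should look like this: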

date       date1       time           a        f                s
Saturday   SatAug 13   2016-5:30 PM   54.362   Robert Madley
Sunday     SunAug 14   2016-1:30 PM   11.355   Andre Marriner   sold out

1 answer:

Answer 0 (score: 0)

Assuming each observation always consists of two consecutive rows in the raw data, here is a dplyr + tidyr solution:

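library(dplyr)
library(tidyr)

df %>%
  # Pair consecutive rows: rows 1-2 get ID 0, rows 3-4 get ID 1, and so on
  mutate(ID = c(0, 1:(n()-1) %/% 2)) %>%
  group_by(ID) %>%
  # Label the rows in each pair V1 (date line) and V2 (detail line)
  mutate(ID2 = paste0('V', row_number())) %>%
  # Reshape to one row per observation, with columns V1 and V2
  spread(ID2, V2) %>%
  # Split the date line on a comma plus optional whitespace
  separate(V1, c("date", "date1", "time"), sep = ",\\s?") %>%
  # Capture the number, the optional note, and the name after "·"
  extract(V2, c("a", "s", "f"), regex = "^(\\d+\\.\\d+\\b)(\\b.+)?·(.+)", convert = TRUE)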
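
The `ID` column is what pairs each date row with the detail row that follows it. A minimal base R sketch of that index arithmetic, assuming four raw rows as in the sample (`n`, `id_original`, and `id_simple` are illustrative names, not part of the answer):

```r
n <- 4  # number of raw rows in the sample data

# `:` binds tighter than `%/%`, so this evaluates as (1:(n - 1)) %/% 2
id_original <- c(0, 1:(n - 1) %/% 2)

# An equivalent form that some readers may find easier to follow
id_simple <- (seq_len(n) - 1) %/% 2

print(id_original)  # 0 0 1 1
print(id_simple)    # 0 0 1 1
```

Both forms rely on the rows alternating strictly; a missing or extra line would shift every pair after it.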

Result:

# A tibble: 2 x 7
# Groups:   ID [2]
     ID     date     date1         time      a         s              f
* <dbl>    <chr>     <chr>        <chr>  <dbl>     <chr>          <chr>
1     0 Saturday SatAug 13 2016-5:30 PM 54.362      <NA>  Robert Madley
2     1   Sunday SunAug 14 2016-1:30 PM 11.355  sold out Andre Marriner
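
To see what each capture group in the `extract()` pattern picks up, the same regular expression can be tried in base R. This is only a sketch: `x`, `pattern`, and `parts` are illustrative names, `perl = TRUE` is my choice rather than something the answer uses, and an unmatched optional group comes back here as an empty string rather than the `NA` shown above.

```r
# Detail line of each observation; "\u00b7" is the middle dot
x <- c("54.362\u00b7Robert Madley",
       "11.355 sold out\u00b7Andre Marriner")

# Same pattern as in the extract() call above
pattern <- "^(\\d+\\.\\d+\\b)(\\b.+)?\u00b7(.+)"

# regexec() yields the full match followed by one entry per capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("a", "s", "f")

# The optional group keeps its leading space, so trim it
parts[, "s"] <- trimws(parts[, "s"])
print(parts)
```

The second group is optional, so a line without a note such as "sold out" still matches and only the number and the name are filled in.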

Data:

df = read.table(text = "1|Saturday,SatAug 13, 2016-5:30 PM
2|54.362·Robert Madley
3|Sunday,SunAug 14, 2016-1:30 PM
4|11.355 sold out·Andre Marriner", sep = "|", row.names = 1,
                stringsAsFactors = FALSE)
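
For comparison, the same reshaping can be sketched in base R without dplyr or tidyr. This is an alternative under the same two-rows-per-observation assumption, not the answer's method; `raw`, `pairs`, `dates`, `detail`, and `result` are hypothetical names introduced for illustration.

```r
# Raw lines as in the sample; "\u00b7" is the middle dot
raw <- c("Saturday,SatAug 13, 2016-5:30 PM",
         "54.362\u00b7Robert Madley",
         "Sunday,SunAug 14, 2016-1:30 PM",
         "11.355 sold out\u00b7Andre Marriner")

# Assumes strict alternation: date line, then detail line
pairs <- matrix(raw, ncol = 2, byrow = TRUE)

# Split date lines on a comma plus optional whitespace
dates <- do.call(rbind, strsplit(pairs[, 1], ",\\s?"))

# Split detail lines at the middle dot, then peel the number off the front
detail <- do.call(rbind, strsplit(pairs[, 2], "\u00b7", fixed = TRUE))
a <- as.numeric(sub("^(\\d+\\.\\d+).*$", "\\1", detail[, 1]))
s <- trimws(sub("^\\d+\\.\\d+", "", detail[, 1]))
s[s == ""] <- NA  # no note after the number

result <- data.frame(date = dates[, 1], date1 = dates[, 2], time = dates[, 3],
                     a = a, s = s, f = detail[, 2],
                     stringsAsFactors = FALSE)
print(result)
```

The intermediate `pairs` object is itself a character matrix, which may be enough if a matrix rather than a data frame is the goal.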