I have a dataset like this:
1 Saturday,SatAug 13, 2016-5:30 PM
2 54.362·Robert Madley
3 Sunday,SunAug 14, 2016-1:30 PM
4 11.355 sold out·Andre Marriner
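What I want to do is split the dataset on "," or "·" and turn the result into a matrix or data frame. In line 4, 11.355 and "sold out" also need to be split apart. So the final dataset should be: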
date      date1      time          a       f               s
Saturday  SatAug 13  2016-5:30 PM  54.362  Robert Madley
Sunday    SunAug 14  2016-1:30 PM  11.355  Andre Marriner  sold out
Answer 0 (score: 0)
Assuming each observation always consists of two consecutive rows in the raw data, here is a dplyr + tidyr solution:
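library(dplyr)
library(tidyr)

df %>%
  # pair consecutive rows: rows 1-2 get ID 0, rows 3-4 get ID 1, ...
  mutate(ID = (row_number() - 1) %/% 2) %>%
  group_by(ID) %>%
  # label the two rows of each pair V1 (first line) and V2 (second line)
  mutate(ID2 = paste0('V', row_number())) %>%
  # reshape to one row per pair, with columns V1 and V2
  spread(ID2, V2) %>%
  # split the first line on commas
  separate(V1, c("date", "date1", "time"), sep = ",\\s?") %>%
  # number, optional text before "·", then the name after "·"
  extract(V2, c("a", "s", "f"), regex = "^(\\d+\\.\\d+\\b)(\\b.+)?·(.+)", convert = TRUE)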
Result:
# A tibble: 2 x 7
# Groups:   ID [2]
     ID date     date1     time              a s        f
* <dbl> <chr>    <chr>     <chr>         <dbl> <chr>    <chr>
1     0 Saturday SatAug 13 2016-5:30 PM 54.362 <NA>     Robert Madley
2     1 Sunday   SunAug 14 2016-1:30 PM 11.355 sold out Andre Marriner
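The two least obvious steps in the pipeline are the pairing index and the regular expression passed to extract(). As a rough sketch in base R only, they can be checked in isolation. The vector name `lines` is made up for illustration and stands in for the V2 column of df; regmatches()/regexec() are used here instead of tidyr, so an unmatched optional group is expected to come back as an empty string rather than NA.

```r
# Hypothetical stand-in for df$V2; "\u00b7" is the middle dot "·".
lines <- c("Saturday,SatAug 13, 2016-5:30 PM",
           "54.362\u00b7Robert Madley",
           "Sunday,SunAug 14, 2016-1:30 PM",
           "11.355 sold out\u00b7Andre Marriner")

# Pairing index: integer-divide the zero-based row position by 2,
# so rows 1-2 share ID 0 and rows 3-4 share ID 1.
id <- (seq_along(lines) - 1) %/% 2

# Same pattern as in extract(): a decimal number, optional free text,
# the middle dot, then everything after it.
pattern <- "^(\\d+\\.\\d+\\b)(\\b.+)?\u00b7(.+)"

# Every second line holds the number / note / name fields.
second_rows <- lines[c(FALSE, TRUE)]
m <- regmatches(second_rows, regexec(pattern, second_rows, perl = TRUE))

# Element 1 of each match is the full match; 2-4 are the capture groups.
parts <- do.call(rbind, m)[, 2:4, drop = FALSE]
colnames(parts) <- c("a", "s", "f")

# The optional group keeps the space after the number, so trim it.
parts[, "s"] <- trimws(parts[, "s"])

print(id)
print(parts)
```

The optional middle group is what lets the same pattern handle both "54.362·Robert Madley" and "11.355 sold out·Andre Marriner".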
Data:
df = read.table(text = "1|Saturday,SatAug 13, 2016-5:30 PM
2|54.362·Robert Madley
3|Sunday,SunAug 14, 2016-1:30 PM
4|11.355 sold out·Andre Marriner", sep = "|", row.names = 1,
stringsAsFactors = FALSE)
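A side note on why the pipeline refers to a column called V2 even though df has only one data column: read.table() names the two "|"-separated fields V1 and V2, and row.names = 1 then moves V1 into the row names. A small sketch with made-up placeholder rows (not the real data) shows the effect:

```r
# Placeholder rows, purely for illustration.
demo <- read.table(text = "1|first line
2|second line", sep = "|", row.names = 1, stringsAsFactors = FALSE)

print(names(demo))     # the single remaining data column
print(rownames(demo))  # the first field, now used as row names
```

If your own data is read in differently (for example with readLines()), the column name in spread() and extract() would need to be adjusted to match.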