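My data frame contains a column of strings: a numeric ID, followed by " - ", followed by a month and a year. I am trying to parse these strings to get the month and the year. As a first step, I used dplyr::mutate() with regexpr(), specifically regexpr("-",yearid)[1], to create a new column holding the character position of the "-". But regexpr() seems to behave quite differently inside mutate() than when it is called on its own. It does not appear to update for each string and instead seems to inherit the position from the previous row. In the example below, I expected the position of the "-" to be 4, 4 and 5 for the respective yearid values. Instead I get 4, 4 and 4, so the last 4 is incorrect. When I run regexpr on its own, I don't see this problem.
Am I missing something, and how can I get the position of the "-" computed separately for each value of yearid? There may also be a simpler way to get the month (January) and the year (1997).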
library(dplyr)
library(stringr)
yearid <- c("50 - January 1995","51 - January 1996","100 - January 1997")
data.df <- data.frame(yearid)
data.df <- mutate(data.df, trimpos = regexpr("-",str_trim(yearid))[1],
                  pos = regexpr("-",yearid)[1])
> data.df
              yearid trimpos pos
1  50 - January 1995       4   4
2  51 - January 1996       4   4
3 100 - January 1997       4   4
On the other hand, calling regexpr on each element separately gives the expected output:
> regexpr("-",yearid[1])[1]
[1] 4
> regexpr("-",yearid[2])[1]
[1] 4
> regexpr("-",yearid[3])[1]
[1] 5
Finally, my sessionInfo() is below:
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.0.0 dplyr_0.4.1 readr_0.2.2.9000
loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 knitr_1.10.5 lazyeval_0.1.10.9000 magrittr_1.5 parallel_3.1.1
[7] Rcpp_0.11.6 stringi_0.4-1 tools_3.1.1
Answer 0 (score: 1)
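The regexpr function (part of base R, not stringr) is vectorized: it returns a vector of positions, one per input string, with additional attributes such as match.length and useBytes. Subsetting that result with [1] keeps only the first position, and mutate then recycles that single value to every row, which is why every row shows 4. As noted in the comments, the full vector can be assigned directly to the data frame, with or without the mutate function.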
library(dplyr)
library(stringr)
id_month_year <- c(
  "50 - January 1995",
  "51 - January 1996",
  "100 - January 1997"
)
data <- data.frame(id_month_year, another_column = 1)
## create new column using mutate (columns can be referenced by name, no data$ needed)
data <- data %>% mutate(pos1 = regexpr("-", id_month_year))
## create new column without mutate
data$pos2 <- regexpr("-", data$id_month_year)
print(data)
Here are the new columns:
       id_month_year another_column pos1 pos2
1  50 - January 1995              1    4    4
2  51 - January 1996              1    4    4
3 100 - January 1997              1    5    5
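To make the difference with the question's code concrete, here is a minimal base R sketch (no dplyr needed) comparing the full regexpr result with the [1]-subsetted one. The names all_pos, first_pos and df are illustrative and do not come from the original post.

```r
yearid <- c("50 - January 1995", "51 - January 1996", "100 - January 1997")

# regexpr() is vectorized: it returns one match position per input string.
# as.integer() drops the match.length/useBytes attributes for a plain vector.
all_pos <- as.integer(regexpr("-", yearid))

# Subsetting with [1] keeps only the position found in the first string.
first_pos <- regexpr("-", yearid)[1]

# A length-one value is recycled to every row when placed in a data frame,
# which reproduces the 4, 4, 4 seen in the question.
df <- data.frame(yearid, pos = all_pos, recycled = first_pos)
print(df)
```

The pos column varies by row, while recycled repeats the first row's position, which is the same thing mutate does with a length-one result.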
I would suggest using the separate function from the tidyr library. Here is an example snippet:
library(dplyr)
library(tidyr)
id_month_year <- c(
  "50 - January 1995",
  "51 - January 1996",
  "100 - January 1997"
)
data <- tbl_df(data.frame(id_month_year, another_column = 1))
clean <- data %>%
  separate(
    id_month_year,
    into = c("id", "month", "year"),
    sep = "[- ]+",
    convert = TRUE
  )
print(clean)
Here is the final, clean data frame:
Source: local data frame [3 x 4]

     id   month  year another_column
  (int)   (chr) (int)          (dbl)
1    50 January  1995              1
2    51 January  1996              1
3   100 January  1997              1
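If adding tidyr is not an option, the month and year can also be pulled out with base R alone. The sketch below is one possible approach, not the answer's method. It assumes every string has the shape "ID - Month Year" with single spaces. The names month_year, parts, month and year are illustrative.

```r
yearid <- c("50 - January 1995", "51 - January 1996", "100 - January 1997")

# Remove everything up to and including " - ", leaving "Month Year".
month_year <- sub("^.* - ", "", yearid)

# Split each "Month Year" string on the single space between the two parts.
parts <- strsplit(month_year, " ", fixed = TRUE)

# Take the first piece as the month and the second as the (integer) year.
month <- vapply(parts, `[`, character(1), 1)
year <- as.integer(vapply(parts, `[`, character(1), 2))

print(data.frame(yearid, month, year))
```

Because sub and strsplit are vectorized, no position bookkeeping with regexpr is needed at all.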