Finding a string with dplyr mutate

Asked: 2016-05-11 17:57:52

Tags: regex r string dataframe dplyr

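My data frame has a column of strings: a numeric ID followed by " - " and then a month and a year. I am trying to parse each string to get the month and the year. As a first step, I used dplyr::mutate() with regexpr():

regexpr("-",yearid)[1]

to create a new column showing the position of the "-" character. But regexpr() seems to behave completely differently inside mutate() than when used on its own. It does not seem to update per string; instead it appears to carry the position over from the previous row. In the example below, I expect the positions of "-" in the respective yearid values to be 4, 4 and 5. But I get 4, 4 and 4, so the last 4 is incorrect. When I run regexpr on its own, I do not see this problem.

Am I missing something? How can I get the position of "-" dynamically, specific to each value of yearid? There may also be a simpler way to get "January" and "1997".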

library(dplyr)
library(stringr)

yearid <- c("50 - January 1995", "51 - January 1996", "100 - January 1997")
data.df <- data.frame(yearid)
data.df <- mutate(data.df,
                  trimpos = regexpr("-", str_trim(yearid))[1],
                  pos = regexpr("-", yearid)[1])

> data.df
              yearid trimpos pos
1  50 - January 1995       4   4
2  51 - January 1996       4   4
3 100 - January 1997       4   4

On the other hand, using regexpr on its own, one string at a time, I get the expected output:

> regexpr("-",yearid[1])[1]
[1] 4
> regexpr("-",yearid[2])[1]
[1] 4
> regexpr("-",yearid[3])[1]
[1] 5

Finally, my sessionInfo() is below:

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.0.0    dplyr_0.4.1      readr_0.2.2.9000

loaded via a namespace (and not attached):
[1] assertthat_0.1       DBI_0.3.1            knitr_1.10.5               lazyeval_0.1.10.9000 magrittr_1.5         parallel_3.1.1      
[7] Rcpp_0.11.6          stringi_0.4-1        tools_3.1.1         

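1 answer:

Answer 0 (score: 1)

The base R function regexpr() (it is not part of stringr) returns an integer vector of match positions, one per input string, with extra attributes attached, including match.length and useBytes. Subscripting that result with [1], as in the question, keeps only the first position, which mutate() then recycles to every row. As mentioned in the comments, the full vector can be assigned directly to the data frame. This can be done with or without the mutate function: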
library(dplyr)
library(stringr)

id_month_year <- c(
    "50 - January 1995",
    "51 - January 1996",
    "100 - January 1997"
)
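data <- data.frame(id_month_year, another_column = 1, stringsAsFactors = FALSE)

## create new column using mutate; refer to the column by its bare name,
## and use as.integer() to drop the match attributes
data <- data %>% mutate(pos1 = as.integer(regexpr("-", id_month_year)))

## create new column without mutate
data$pos2 <- as.integer(regexpr("-", data$id_month_year))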

print(data)

Here are the new columns:

       id_month_year another_column pos1 pos2
1  50 - January 1995              1    4    4
2  51 - January 1996              1    4    4
3 100 - January 1997              1    5    5
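
To see why the version in the question returned 4 for every row, it may help to look at what the [1] subscript does to the result of regexpr(). The following is a minimal base R sketch, not the asker's exact code; the names all_pos, first_pos, pos_recycled and pos_per_row are made up for illustration. regexpr() is vectorised and returns one position per string, while [1] keeps only the first of them, and that single value is then recycled down the column.

```r
yearid <- c("50 - January 1995", "51 - January 1996", "100 - January 1997")

# regexpr() is vectorised: one match position per input string
all_pos <- as.integer(regexpr("-", yearid))

# Subscripting with [1] keeps only the position found in the first string
first_pos <- regexpr("-", yearid)[1]

# A length-one value is recycled to every row, as happened in the question
df <- data.frame(yearid, stringsAsFactors = FALSE)
df$pos_recycled <- first_pos
df$pos_per_row  <- all_pos
print(df)
```

Dropping the [1] is therefore enough to get a per-row position, whether the column is created with mutate() or by direct assignment.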

I would suggest using the separate function from the tidyr library instead. Here is an example snippet:

library(dplyr)
library(tidyr)

id_month_year <- c(
    "50 - January 1995",
    "51 - January 1996",
    "100 - January 1997"
)
data <- tbl_df(data.frame(id_month_year, another_column = 1))

clean <- data %>%
    separate(
        id_month_year,
        into = c("id", "month", "year"),
        sep = "[- ]+",
        convert = TRUE
    )

print(clean)

Here is the final clean data frame:

Source: local data frame [3 x 4]

     id   month  year another_column
  (int)   (chr) (int)          (dbl)
1    50 January  1995              1
2    51 January  1996              1
3   100 January  1997              1
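
If tidyr is not available, roughly the same split can be sketched in base R with regexec() and regmatches(). The pattern and the names pattern, matches and parts below are assumptions based only on the three example strings (digits, " - ", a month name, a four-digit year); this is an illustration, not a general parser.

```r
id_month_year <- c(
    "50 - January 1995",
    "51 - January 1996",
    "100 - January 1997"
)

# Capture groups: (id) - (month) (year); pattern assumed from the examples
pattern <- "^([0-9]+) - ([A-Za-z]+) ([0-9]{4})$"
matches <- regmatches(id_month_year, regexec(pattern, id_month_year))

# Each list element holds the full match followed by the three captured groups
parts <- do.call(rbind, matches)

clean <- data.frame(
    id    = as.integer(parts[, 2]),
    month = parts[, 3],
    year  = as.integer(parts[, 4]),
    stringsAsFactors = FALSE
)
print(clean)
```

A string that does not fit the assumed pattern yields an empty match, so the rbind step would need a guard before this is used on messier data.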