How to separate a string field into two different numeric columns in R

Time: 2018-03-16 14:54:06

Tags: r regex grep substring delimiter

I have a data frame with a text field that captures how long a person has stayed in a city. Its format is y year(s) m months, where y and m are numbers. If the person has lived in the city for less than a year, the value is just m months.

I want to convert this column into two separate numeric columns, one showing the years and the other showing the months.

Here is a sample of my data frame:

df <- structure(list(Time.in.current.role = c("1 year 1 month", "11 months",
"3 years 11 months", "1 year 1 month", "8 months"), City = c("Philadelphia",
"Seattle", "Washington D.C.", "Ashburn", "Cork, Ireland")),
.Names = c("Time.in.current.role", "City"), row.names = c(NA, 5L),
class = "data.frame")

My desired data frame looks like this:

result <- structure(list(Year = c(1, 0, 3, 1, 0), Month = c(1, 11, 11, 1, 8),
City = structure(c(3L, 4L, 5L, 1L, 2L), .Label = c("Ashburn",
"Cork, Ireland", "Philadelphia", "Seattle", "Washington D.C."),
class = "factor")), .Names = c("Year", "Month", "City"),
row.names = c(NA, -5L), class = "data.frame")

I was thinking of using grep to find which rows contain the substring "year" and which rows contain the substring "month". But after that, I am having trouble extracting the number associated with "year" or "month".

*Edit* In my original post, I forgot to account for the case where there is only y year(s). Here are the new original data frame and the desired data frame:

df <- structure(list(Time.in.current.role = c("1 year 1 month", "11 months",
"3 years 11 months", "1 year 1 month", "8 months", "2 years"),
City = c("Philadelphia", "Seattle", "Washington D.C.", "Ashburn",
"Cork, Ireland", "Washington D.C.")),
.Names = c("Time.in.current.role", "City"),
row.names = c(1L, 2L, 3L, 4L, 5L, 18L), class = "data.frame")

result <- structure(list(Year = c(1, 0, 3, 1, 0, 2), Month = c(1, 11, 11, 1, 8, 0),
City = structure(c(3L, 4L, 5L, 1L, 2L, 5L), .Label = c("Ashburn",
"Cork, Ireland", "Philadelphia", "Seattle", "Washington D.C."),
class = "factor")), .Names = c("Year", "Month", "City"),
row.names = c(NA, -6L), class = "data.frame")

4 answers:

Answer 0 (score: 1)

You can do the following:

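# Extract every run of digits from each string (one character vector per row)
z = regmatches(x = df$Time.in.current.role, gregexpr("\\d+", df$Time.in.current.role))
# A single number is assumed to be months, so the year defaults to 0
years = sapply(z, function(x){ifelse(length(x)==1, 0, x[1])})
# With one number it is the month; with two, the month is the second one
months = sapply(z, function(x){ifelse(length(x)==1, x[1], x[2])})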

This gives:

> years
[1] "1" "0" "3" "1" "0"
> months
[1] "1"  "11" "11" "1"  "8" 

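This approach works when there are two numbers. If there is only one, it is assumed to be the number of months, so it fails for a value such as "5 years".

In that case, you can do the following:

# Match each number together with the first letter of its unit ("m" or "y")
m = regmatches(x = df$Time.in.current.role, gregexpr("\\d+ m", df$Time.in.current.role))
y = regmatches(x = df$Time.in.current.role, gregexpr("\\d+ y", df$Time.in.current.role))
# Keep only the digits; rows without a match get 0
y2 = sapply(y, function(x){ifelse(length(x)==0,0,gsub("\\D+","",x))})
m2 = sapply(m, function(x){ifelse(length(x)==0,0,gsub("\\D+","",x))})

Example: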

> df
  Time.in.current.role            City
1       1 year 1 month    Philadelphia
2            11 months         Seattle
3    3 years 11 months Washington D.C.
4       1 year 1 month         Ashburn
5             8 months   Cork, Ireland
6              5 years           Miami

> y2
[1] "1" "0" "3" "1" "0" "5"
> m2
[1] "1"  "11" "11" "1"  "8"  "0" 
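
Note that y2 and m2 above are character vectors, while the question asks for numeric columns. A minimal sketch of one way to get numeric columns from the same matches, assuming base R only; the helper name to_num and the rebuilt df are illustrative and not part of the original answer:

```r
# Example data matching the six rows shown above (rebuilt here for illustration)
df <- data.frame(
  Time.in.current.role = c("1 year 1 month", "11 months", "3 years 11 months",
                           "1 year 1 month", "8 months", "5 years"),
  City = c("Philadelphia", "Seattle", "Washington D.C.",
           "Ashburn", "Cork, Ireland", "Miami"),
  stringsAsFactors = FALSE
)

# Same patterns as in the answer: a number followed by " m" or " y"
m <- regmatches(df$Time.in.current.role, gregexpr("\\d+ m", df$Time.in.current.role))
y <- regmatches(df$Time.in.current.role, gregexpr("\\d+ y", df$Time.in.current.role))

# Hypothetical helper: 0 when nothing matched, otherwise the digits as a number
to_num <- function(hits) {
  if (length(hits) == 0) 0 else as.numeric(gsub("\\D+", "", hits[1]))
}

# vapply with numeric(1) guarantees that both columns are numeric
result <- data.frame(
  Year  = vapply(y, to_num, numeric(1)),
  Month = vapply(m, to_num, numeric(1)),
  City  = df$City
)
str(result)
```

Using vapply instead of sapply here is a design choice: it fails loudly if a row ever produces something other than a single number.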

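Answer 1 (score: 1)

Another approach is to use the splitstackshape package to split the column in two. To do that, first use gsub to put a delimiter between the years and the months, then remove all letters and whitespace, and finally split with cSplit: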

# replace delimiter year with ;
df$Time.in.current.role <- gsub("year", ";", df$Time.in.current.role)

# If no year was found add 0; at the beginning of the cell
df$Time.in.current.role[!grepl(";", df$Time.in.current.role)] <- paste0("0;", df$Time.in.current.role[!grepl(";", df$Time.in.current.role)])

# remove characters and whitespace
df$Time.in.current.role <- gsub("[[:alpha:]]|\\s+", "", df$Time.in.current.role)

# Split column by ;
df <- splitstackshape::cSplit(df, "Time.in.current.role", sep = ";")

# Rename new columns
colnames(df)[2:3] <- c("Year", "Month")

df
              City  Year  Month
1:    Philadelphia     1      1
2:         Seattle     0     11
3: Washington D.C.     3     11
4:         Ashburn     1      1
5:   Cork, Ireland     0      8
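
To make the intermediate delimiter step easier to follow, here is a minimal sketch that runs the same gsub calls on a plain character vector and then splits with base R's strsplit instead of cSplit. The vector x and the name split_df are illustrative, and the sketch assumes every row has a month part (a value such as "2 years" would leave nothing after the delimiter):

```r
x <- c("1 year 1 month", "11 months", "3 years 11 months")

# Replace "year" with the delimiter, as in the answer
x <- gsub("year", ";", x)

# Rows without a year get "0;" prepended
no_year <- !grepl(";", x)
x[no_year] <- paste0("0;", x[no_year])

# Drop letters and whitespace, leaving strings of the form "years;months"
x <- gsub("[[:alpha:]]|\\s+", "", x)
print(x)

# Base R alternative to cSplit (an assumption of this sketch, not the answer's method)
parts <- do.call(rbind, strsplit(x, ";", fixed = TRUE))
split_df <- data.frame(Year = as.numeric(parts[, 1]),
                       Month = as.numeric(parts[, 2]))
print(split_df)
```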

Answer 2 (score: 1)

A quick solution:

Code:

ym <- gsub("[^0-9 ]", "", df$Time.in.current.role)
ym <- gsub("^ | $", "", ym)
df$Year <- ifelse(
  grepl(" ", ym), 
  gsub("([0-9]+) .+", "\\1", ym), 
  0
)
df$Month <- gsub(".+ ([0-9]+)$", "\\1", ym)
df$Time.in.current.role <- NULL
df

             City Year Month
1    Philadelphia    1     1
2         Seattle    0    11
3 Washington D.C.    3    11
4         Ashburn    1     1
5   Cork, Ireland    0     8

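Explanation:

  • First remove everything that is not a digit or a space
  • Remove any space at the beginning or end of the string
  • If the string contains two numbers, extract the first one as the year; otherwise year = 0
  • The last number is the month.
  • Remove the original column from the data.frame
  • Enjoy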
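
The steps above can be traced on a small vector. This is a minimal sketch, assuming base R only; the input vector x is illustrative and includes a years-only value to show that this approach reads it as months, because a lone number is always taken to be the month:

```r
x <- c("1 year 1 month", "11 months", "3 years 11 months", "2 years")

# Keep only digits and spaces, then trim one leading or trailing space
ym <- gsub("[^0-9 ]", "", x)
ym <- gsub("^ | $", "", ym)
print(ym)

# Two numbers: the first is the year; one number: the year is 0
year <- as.numeric(ifelse(grepl(" ", ym), gsub("([0-9]+) .+", "\\1", ym), 0))

# The last (or only) number is taken as the month
month <- as.numeric(gsub(".+ ([0-9]+)$", "\\1", ym))

print(data.frame(input = x, Year = year, Month = month))
# "2 years" ends up with Year = 0 and Month = 2
```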

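Answer 3 (score: 1)

This defines a function extr (see also the alternative definition at the end) that extracts from its first argument the match to the capture group of its second argument, i.e. the match to the part of the regular expression within parentheses. The match is then converted to numeric, or 0 is returned if the pattern is not found.

It is only 3 lines of code, has a pleasing symmetry in its handling of years and months, and handles not only values with both years and months but also values with only years or only months. It allows junk before the y and m, such as the \n shown in the example data in the question.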

library(gsubfn)

extr <- function(x, pat) strapply(x, pat, as.numeric, empty = 0, simplify = TRUE)
transform(df, Year = extr(Time.in.current.role, "(\\d+) +\\W*y"),
              Month = extr(Time.in.current.role, "(\\d+) +\\W*m"))

This gives (for the data frame defined in the question):

  Time.in.current.role            City Year Month
1       1 year 1 month    Philadelphia    1     1
2          11 \nmonths         Seattle    0    11
3    3 years 11 months Washington D.C.    3    11
4       1 year 1 month         Ashburn    1     1
5             8 months   Cork, Ireland    0     8

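Note that strapply uses the tcl regex engine by default. If tcltk does not work on your system, use this slightly longer version of extr or, better still, fix your installation: tcltk is a base package, so if it does not work your R installation is broken.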

extr <- function(x, pat) {
  sapply(strapply(x, pat, as.numeric), function(x) if (is.null(x)) 0 else x)
}
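
For readers who do not have gsubfn installed, the same idea can be sketched in base R with regexec and regmatches. This is a rough equivalent under stated assumptions, not the answer's own code: the function name extr_base and the test vector x are hypothetical, and perl = TRUE in regexec assumes R 3.3.0 or later.

```r
# Hypothetical base R counterpart of extr: return the first capture group
# as a number, or 0 when the pattern does not match
extr_base <- function(x, pat) {
  hits <- regmatches(x, regexec(pat, x, perl = TRUE))
  # Each element is c(full match, capture group), or character(0) if no match
  vapply(hits, function(h) if (length(h) < 2) 0 else as.numeric(h[2]), numeric(1))
}

x <- c("1 year 1 month", "11 \nmonths", "3 years 11 months", "2 years")

year  <- extr_base(x, "(\\d+) +\\W*y")
month <- extr_base(x, "(\\d+) +\\W*m")

print(data.frame(Year = year, Month = month))
```

The same two patterns as in the answer are reused, so years-only and months-only values both fall back to 0 for the missing part.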