
Time: 2019-09-11 10:29:36

Tags: r regex string dataframe


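I need to find phrases such as "on the right car5" and "on the left car2". These phrases share a common string pattern (left/right + car + number). I am trying to work out how to find them and add the car number in new columns.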

text.v <- c("Max","John")
text.t <- c("True story about the area on the left car2, and a parking on the right car4 not far away","but there is a garage on the right car3 in another place")

text.data <- cbind(text.v,text.t)



Current data:

|text.v|text.t|
|Max   |True story about the area on the left car2, and a parking on
|John  |but there is a garage on the right car3 in another place

Desired output:

|text.v|text.t|left|right|
|Max   |True story about the area on the left car2, and a parking on |car2|car4|
|John  |but there is a garage on the right car3 in another place     |x   |car3|


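2 Answers:

Answer 0 (score: 1)

We can use str_extract to get the car number that follows the words "left" and "right" respectively. If no match is found, it returns NA, which can later be changed to any value we want.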


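library(dplyr)
library(stringr)

# cbind() returns a character matrix, so build a data frame for mutate()
text.data <- data.frame(text.v, text.t)

text.data %>%
   # the lookbehind includes the trailing space, so only "carN" is returned
   mutate(left = str_extract(text.t, "(?<=left )car\\d+"),
          right = str_extract(text.t, "(?<=right )car\\d+")) %>%
   select(left, right) # To display results

#   left right
#1  car2  car4
#2  <NA>  car3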
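
For readers who would rather not load stringr, the same lookbehind idea can be sketched in base R. This is only an illustration: extract_car() is a hypothetical helper name, and the "x" placeholder for rows without a match is an assumption taken from the desired output in the question.

```r
# Base-R sketch; extract_car() is a hypothetical helper, not part of any package.
text.v <- c("Max", "John")
text.t <- c("True story about the area on the left car2, and a parking on the right car4 not far away",
            "but there is a garage on the right car3 in another place")

extract_car <- function(text, side, missing = "x") {
  # The lookbehind keeps "left "/"right " out of the returned match
  pattern <- paste0("(?<=", side, " )car[0-9]+")
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(missing, length(text))    # placeholder where nothing matched
  out[m != -1] <- regmatches(text, m)  # first match per element only
  out
}

text.data <- data.frame(text.v, text.t, stringsAsFactors = FALSE)
text.data$left  <- extract_car(text.data$text.t, "left")
text.data$right <- extract_car(text.data$text.t, "right")
print(text.data[, c("text.v", "left", "right")])
```

Like str_extract, regexpr() keeps only the first match in each string, so this sketch assumes at most one "left" and one "right" car per row.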

Answer 1 (score: 1)




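library(dplyr)
library(stringi)

# mydf has the columns person and text (printed at the end of this answer)
group_by(mydf, person) %>%
  # collect every match in a row and collapse them into one comma-separated
  # string; a row without a match yields the string "NA"
  mutate(left = stri_extract_all_regex(str = text,
                                       pattern = "(?<=on the left )car[0-9]+") %>%
                unlist %>% toString,
         right = stri_extract_all_regex(str = text,
                                        pattern = "(?<=on the right )car[0-9]+") %>%
                unlist %>% toString) %>%
  ungroup -> temp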


person text                                                                                      left  right     
<chr>  <chr>                                                                                     <chr> <chr>     
1 Max    Ana is on the left car2. Bob is on the right car4. They are not far away from each other. car2  car4      
2 John   I saw a garage on the right car1. There is a garage on the right car3.                    NA    car1, car3
3 Ana    There is a garage on the right car3. There is another garage on the right car3.           NA    car3, car3
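
Because one text can mention several cars on the same side, all matches have to be collected and then collapsed into a single string. The following base-R sketch shows that step on one sentence from the data, and also why a greedy quantifier is the safer choice at the end of the pattern. The two-digit string "car12" is a made-up example, not part of the original data.

```r
text <- "I saw a garage on the right car1. There is a garage on the right car3."

# gregexpr() finds every match; regmatches() returns them as a list
hits <- regmatches(text, gregexpr("(?<=on the right )car[0-9]+", text, perl = TRUE))[[1]]
collapsed <- toString(hits)  # joins the matches with ", "

# Lazy vs greedy quantifier on a hypothetical two-digit car number
s <- "on the left car12"
lazy   <- regmatches(s, regexpr("car[0-9]+?", s, perl = TRUE))  # stops after one digit
greedy <- regmatches(s, regexpr("car[0-9]+",  s, perl = TRUE))  # takes all digits

print(collapsed)
print(c(lazy = lazy, greedy = greedy))
```

With single-digit car numbers both quantifiers give the same result, which is why the difference does not show up in the sample data.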

library(tidyr)
# separate_rows_() is deprecated; split one column at a time instead
dplyr::select(temp, person, left, right) %>%
       separate_rows(left) %>% separate_rows(right) -> out

count(out, person, left, name = "left_total") %>%
full_join(count(out, person, right, name = "right_total")) 

person left  left_total right right_total
  <chr>  <chr>      <int> <chr>       <int>
1 Ana    NA             2 car3            2
2 John   NA             2 car1            1
3 John   NA             2 car3            1
4 Max    car2           1 car4            1




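library(quanteda)
# newer quanteda releases expect a tokens object rather than a character vector
kwic(tokens(mydf$text), pattern = "car[0-9]+?",
     window = 1, valuetype = "regex") %>%
as.data.frame %>%
dplyr::select(docname, pre, keyword) %>%
count(docname, keyword, pre, name = "frequency")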

  docname keyword pre   frequency
  <chr>   <chr>   <chr>     <int>
1 text1   car2    left          1
2 text1   car4    right         1
3 text2   car1    right         1
4 text2   car3    right         1
5 text3   car3    right         2


  person                                                                                      text
1    Max Ana is on the left car2. Bob is on the right car4. They are not far away from each other.
2   John                    I saw a garage on the right car1. There is a garage on the right car3.
3    Ana           There is a garage on the right car3. There is another garage on the right car3.
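
The printed data above can be rebuilt as follows. This is a sketch that assumes both columns are plain character vectors, as the <chr> column types in the earlier output suggest.

```r
# Rebuild the sample data frame used in this answer
mydf <- data.frame(
  person = c("Max", "John", "Ana"),
  text = c("Ana is on the left car2. Bob is on the right car4. They are not far away from each other.",
           "I saw a garage on the right car1. There is a garage on the right car3.",
           "There is a garage on the right car3. There is another garage on the right car3."),
  stringsAsFactors = FALSE  # keep strings as character, not factor
)
print(mydf)
```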