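Find words in text and create separate columns from them

Time: 2019-09-11 10:29:36

Tags: r regex string dataframe

I am searching strings for a pattern: I want to find which side (left or right) precedes the keyword car0-10 (the keyword car followed by a number from 0 to 10). If the keyword is found, I want to add it to a new column (left and/or right). If there is no keyword, I want to add an 'x' marker or NA.

I need to find phrases such as on the right car5 and on the left car2. These phrases share a common string pattern (left/right + car + number). I am trying to figure out how to find them and add the car number to a new column.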

text.v <- c("Max","John")
text.t <- c("True story about the area on the left car2, and a parking on the right car4 not far away","but there is a garage on the right car3 in another place")
#View(text.v)

text.data <- cbind(text.v,text.t)

View(text.data)

The data I have:

| text.v | text.t |
| Max | True story about the area on the left car2, and a parking on |
| John | but there is a garage on the right car3 in another place |

Expected result:

| text.v | text.t | left | right |
| Max | True story about the area on the left car2, and a parking on | car2 | car4 |
| John | but there is a garage on the right car3 in another place | x | car3 |

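If there is a quick way to do this, with regular expressions or otherwise, I would like to know it. As an additional feature, I would also like to know whether I can add the count of each keyword (for example, car2 appearing twice after the word right).

2 answers:

Answer 0: (score: 1)

We can use str_extract to get the car number that follows the words "left" and "right", respectively. If no match is found, it returns NA, which can later be changed to any value we want.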

library(dplyr)
library(stringr)

text.data %>%
   # The space sits inside the lookbehind so the match is "car2", not " car2"
   mutate(left = str_extract(text.t, "(?<=left )car\\d+"),
          right = str_extract(text.t, "(?<=right )car\\d+")) %>%
   select(left, right) # To display results

#   left right
#1  car2  car4
#2  <NA>  car3
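
To get the 'x' marker the question asks for, the NA values can be replaced after extraction. The following is a minimal sketch of one way to do that, assuming the question's two example rows and plain base R indexing for the replacement; it is not part of the original answer.

```r
library(stringr)

# Example data from the question
text.v <- c("Max", "John")
text.t <- c("True story about the area on the left car2, and a parking on the right car4 not far away",
            "but there is a garage on the right car3 in another place")
text.data <- data.frame(text.v, text.t, stringsAsFactors = FALSE)

# Extract the car that directly follows "left " or "right "
text.data$left  <- str_extract(text.data$text.t, "(?<=left )car\\d+")
text.data$right <- str_extract(text.data$text.t, "(?<=right )car\\d+")

# Replace missing matches with the "x" marker
text.data$left[is.na(text.data$left)]   <- "x"
text.data$right[is.na(text.data$right)] <- "x"

text.data[, c("text.v", "left", "right")]
```

Note that str_extract returns only the first match per string, so a row with several cars on the same side would keep just the first one.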

Data

text.data <- data.frame(text.v, text.t, stringsAsFactors = FALSE)

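Answer 1: (score: 1)

In addition to Ronak's answer, I am leaving code that handles your additional question. Here I created a dataset that is slightly more complex than yours so that it covers that extra question. Like Ronak, I created two columns. The difference is that for each row I build a single string containing all of the cars found. For example, see the second row in temp.

For the additional question, I created another data frame. You may have multiple cars in left and right. I split the strings in left and right apart and expanded the data frame into one row per car. This is out. Then I counted how often each car appears on the left and on the right, and joined the two results.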

library(tidyverse)
library(stringi)

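# Note: a trailing lazy quantifier ("+?") would stop after one digit,
# so "car10" would be extracted as "car1"; the greedy "+" is used instead.
group_by(mydf, person) %>%
mutate(left = stri_extract_all_regex(str = text,
                                     pattern = "(?<=on the left )car[0-9]+") %>%
              unlist %>% toString,
       right = stri_extract_all_regex(str = text,
                                      pattern = "(?<=on the right )car[0-9]+") %>%
              unlist %>% toString) %>%
ungroup -> temp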

temp

person text                                                                                      left  right     
<chr>  <chr>                                                                                     <chr> <chr>     
1 Max    Ana is on the left car2. Bob is on the right car4. They are not far away from each other. car2  car4      
2 John   I saw a garage on the right car1. There is a garage on the right car3.                    NA    car1, car3
3 Ana    There is a garage on the right car3. There is another garage on the right car3.           NA    car3, car3


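# separate_rows_() is deprecated in tidyr; chaining separate_rows() splits
# left first and then right, one row per car
dplyr::select(temp, person, left, right) %>%
       separate_rows(left) %>%
       separate_rows(right) -> out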

count(out, person, left, name = "left_total") %>%
full_join(count(out, person, right, name = "right_total")) 

person left  left_total right right_total
  <chr>  <chr>      <int> <chr>       <int>
1 Ana    NA             2 car3            2
2 John   NA             2 car1            1
3 John   NA             2 car3            1
4 Max    car2           1 car4            1
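
As an alternative sketch for the counting step (not the answer's method), the side and the car can be captured together in one pattern and then tallied. This assumes the example sentences from this answer, a capture-group pattern of my own choosing, and base R's aggregate for the tally; the names matches, long and counts are purely illustrative.

```r
library(stringr)

person <- c("Max", "John", "Ana")
text <- c(
  "Ana is on the left car2. Bob is on the right car4. They are not far away from each other.",
  "I saw a garage on the right car1. There is a garage on the right car3.",
  "There is a garage on the right car3. There is another garage on the right car3."
)

# One matrix per text: column 1 = full match, column 2 = side, column 3 = car
matches <- str_match_all(text, "on the (left|right) (car[0-9]+)")

# Stack the matches into one long data frame, one row per occurrence
long <- do.call(rbind, lapply(seq_along(matches), function(i) {
  m <- matches[[i]]
  data.frame(person = rep(person[i], nrow(m)),
             side = m[, 2], car = m[, 3],
             stringsAsFactors = FALSE)
}))

# Count occurrences per person, side and car
long$n <- 1L
counts <- aggregate(n ~ person + side + car, data = long, FUN = sum)
counts
```

This keeps one row per person, side and car, so a car that appears twice on the same side shows up with n equal to 2.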

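Another solution

Another way is to use the quanteda package together with the tidyverse. This makes it easy to find word frequencies. You still need to map docname back to the person, but that is easy to do.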

library(quanteda)

# Recent quanteda versions expect a tokens object rather than a raw character vector
kwic(tokens(mydf$text), pattern = "car[0-9]+",
     window = 1, valuetype = "regex") %>%
as.data.frame %>%
dplyr::select(docname, pre, keyword) %>%
count(docname, keyword, pre, name = "frequency")

  docname keyword pre   frequency
  <chr>   <chr>   <chr>     <int>
1 text1   car2    left          1
2 text1   car4    right         1
3 text2   car1    right         1
4 text2   car3    right         1
5 text3   car3    right         2
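
The docname column can be mapped back to a person in several ways. Below is a minimal base R sketch; it assumes the document names follow the default text1, text2, ... pattern in the same order as the rows of the input data, and it rebuilds the frequency table shown above by hand (under the hypothetical name freq) so the example stands alone.

```r
# People in the same order as the rows of the input data
person <- c("Max", "John", "Ana")

# Frequency table as printed above, rebuilt by hand for illustration
freq <- data.frame(
  docname   = c("text1", "text1", "text2", "text2", "text3"),
  keyword   = c("car2", "car4", "car1", "car3", "car3"),
  pre       = c("left", "right", "right", "right", "right"),
  frequency = c(1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Strip the "text" prefix and use the remaining number as a row index
freq$person <- person[as.integer(sub("^text", "", freq$docname))]
freq
```

If the documents were given custom names, this index trick would not apply and a join on the name would be needed instead.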

Data

  person                                                                                      text
1    Max Ana is on the left car2. Bob is on the right car4. They are not far away from each other.
2   John                    I saw a garage on the right car1. There is a garage on the right car3.
3    Ana           There is a garage on the right car3. There is another garage on the right car3.
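
For reproducibility, the printed data above can be rebuilt as a data frame. This is a sketch that assumes the column names person and text used in the code of this answer and plain character columns.

```r
# Rebuild the example data printed above
mydf <- data.frame(
  person = c("Max", "John", "Ana"),
  text = c(
    "Ana is on the left car2. Bob is on the right car4. They are not far away from each other.",
    "I saw a garage on the right car1. There is a garage on the right car3.",
    "There is a garage on the right car3. There is another garage on the right car3."
  ),
  stringsAsFactors = FALSE
)

str(mydf)
```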