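I am searching strings by pattern: I want to find which side precedes the keyword car0-10 (the keyword car followed by a number from 0 to 10). If the keyword is found, it should be added to a new column (left and/or right). If there is no keyword, I want to add an 'x' marker or N/A.
I need to find phrases such as on the right car5 or on the left car2. These phrases share a common string pattern (left/right + car + number). I am trying to work out how to find them and add the car number in a new column.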
text.v <- c("Max","John")
text.t <- c("True story about the area on the left car2, and a parking on the right car4 not far away","but there is a garage on the right car3 in another place")
#View(text.v)
text.data <- cbind(text.v,text.t)
View(text.data)
The data I have:
| text.v | text.t |
| Max | True story about the area on the left car2, and a parking on |
| John | but there is a garage on the right car3 in another place |
Expected result:
| text.v | text.t | left | right |
| Max | True story about the area on the left car2, and a parking on | car2 | car4 |
| John | but there is a garage on the right car3 in another place | x | car3 |
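If there is a quick way to do this, with regular expressions or otherwise, I would like to know it. As an additional feature, I would like to know whether the number of keyword occurrences can also be added (for example, car2 appearing twice after the word right).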
Answer 0: (score: 1)
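We can use str_extract to get the car number that follows each of the words "left" and "right". If no match is found it returns NA, which can later be changed to whatever value we want.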
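library(dplyr)
library(stringr)
# text.data must be a data frame (see Data below); cbind() would give a matrix
text.data %>%
  # the space sits inside the lookbehind so the match is "car2", not " car2"
  mutate(left = str_extract(text.t, "(?<=left )car\\d+"),
         right = str_extract(text.t, "(?<=right )car\\d+")) %>%
  select(left, right) # to display results
#   left right
#1  car2  car4
#2  <NA>  car3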
Data
text.data <- data.frame(text.v,text.t)
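The answer notes that NA can be changed afterwards to any value. As a minimal sketch of that step, here is a base R variant that fills in 'x' directly when no car is found. The helper name extract_side and its missing argument are hypothetical and are not part of the answer's stringr code.

```r
text.v <- c("Max", "John")
text.t <- c("True story about the area on the left car2, and a parking on the right car4 not far away",
            "but there is a garage on the right car3 in another place")
text.data <- data.frame(text.v, text.t, stringsAsFactors = FALSE)

# Hypothetical helper: first car after "on the <side> ", or `missing` if none
extract_side <- function(x, side, missing = "x") {
  pattern <- paste0("(?<=on the ", side, " )car[0-9]+")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  out <- rep(missing, length(x))          # start with the placeholder everywhere
  out[m != -1] <- regmatches(x, m)        # overwrite rows that did match
  out
}

text.data$left  <- extract_side(text.data$text.t, "left")
text.data$right <- extract_side(text.data$text.t, "right")
text.data[, c("text.v", "left", "right")]
```

If you stay with the stringr pipeline, wrapping each str_extract() call in dplyr::coalesce(..., "x") should give the same result.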
Answer 1: (score: 1)
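In addition to Ronak's answer, I am leaving code that handles your additional question. Here I created a dataset that is a little more complex than yours, with that additional question in mind. As in Ronak's answer, I create two columns. The difference is that for each row I build a single string containing all of the cars. For example, see the second row in temp.
For the additional question, I created another data frame. You may have more than one car in left and right, so I split the strings in left and right and expanded the data frame. This is out. Then I counted how often each car appears on the left and on the right, and joined the two summaries.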
library(tidyverse)
library(stringi)
# greedy [0-9]+ so that car10 is not cut short to car1 (a lazy +? stops after one digit)
group_by(mydf, person) %>%
  mutate(left = stri_extract_all_regex(str = text,
                                       pattern = "(?<=on the left )car[0-9]+") %>%
           unlist %>% toString,
         right = stri_extract_all_regex(str = text,
                                        pattern = "(?<=on the right )car[0-9]+") %>%
           unlist %>% toString) %>%
  ungroup -> temp
temp
person text left right
<chr> <chr> <chr> <chr>
1 Max Ana is on the left car2. Bob is on the right car4. They are not far away from each other. car2 car4
2 John I saw a garage on the right car1. There is a garage on the right car3. NA car1, car3
3 Ana There is a garage on the right car3. There is another garage on the right car3. NA car3, car3
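# separate_rows_() is deprecated; split left and right one after the other instead
dplyr::select(temp, person, left, right) %>%
  separate_rows(left, sep = ", ") %>%
  separate_rows(right, sep = ", ") -> out
count(out, person, left, name = "left_total") %>%
  full_join(count(out, person, right, name = "right_total"), by = "person")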
person left left_total right right_total
<chr> <chr> <int> <chr> <int>
1 Ana NA 2 car3 2
2 John NA 2 car1 1
3 John NA 2 car3 1
4 Max car2 1 car4 1
Another solution
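Another way is to use the quanteda package together with the tidyverse packages. This makes it easy to find word frequencies. You still need to modify docname, but that is easy to do.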
library(quanteda)
# recent quanteda versions expect a tokens object here rather than a character vector
kwic(tokens(mydf$text), pattern = "car[0-9]+",
     window = 1, valuetype = "regex") %>%
  as.data.frame %>%
  dplyr::select(docname, pre, keyword) %>%
  count(docname, keyword, pre, name = "frequency")
docname keyword pre frequency
<chr> <chr> <chr> <int>
1 text1 car2 left 1
2 text1 car4 right 1
3 text2 car1 right 1
4 text2 car3 right 1
5 text3 car3 right 2
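The docname adjustment mentioned above could look roughly like this. It is a sketch that assumes the default document names text1, text2, ... seen in the output, where the number is the position of the text in mydf$text. The freq data frame is a hand-typed stand-in for the summarised result shown above, and the new person column is my own addition.

```r
# Persons in the same order as mydf$text
person <- c("Max", "John", "Ana")

# Stand-in for the summarised kwic() result shown above (typed by hand)
freq <- data.frame(
  docname   = c("text1", "text1", "text2", "text2", "text3"),
  keyword   = c("car2", "car4", "car1", "car3", "car3"),
  pre       = c("left", "right", "right", "right", "right"),
  frequency = c(1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# "textN" refers to the N-th text, so strip the prefix and use N as an index
idx <- as.integer(sub("^text", "", freq$docname))
freq$person <- person[idx]
freq
```

This only works while the documents keep their default names. If you name the texts yourself before calling kwic(), the lookup step is not needed.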
Data
person text
1 Max Ana is on the left car2. Bob is on the right car4. They are not far away from each other.
2 John I saw a garage on the right car1. There is a garage on the right car3.
3 Ana There is a garage on the right car3. There is another garage on the right car3.