Variable substring matching between two columns

Date: 2017-03-04 09:12:08

Tags: r stringr

I have a dataset with 20,000 rows which, in its purest form, looks like this:

    v1                   v2
1   Case 1 (A v. B)      A v. B 
2   Case 2 (A v. C)      A v. B 
3   Case 2 (A v. C)      C v. B 
4   Case 4 (X v. Z)      X v. Z 
5   Case 5 (B v. A)      A v. B 
6   Case 6 (X v. A)      X v. A 
7   Case 6 (X v. A)      A v. X 
...

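...except with n variants of v1 and v2 (in practice about 150, which is still too many to list).

I want to return a third column, v3, containing a logical indicator of whether any substring of v1 matches the string in v2.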

    v1                   v2           v3
1   Case 1 (A v. B)      A v. B       TRUE
2   Case 2 (A v. C)      A v. B       FALSE
3   Case 2 (A v. C)      C v. B       FALSE
4   Case 4 (X v. Z)      X v. Z       TRUE
5   Case 5 (B v. A)      A v. B       FALSE
6   Case 6 (X v. A)      X v. A       TRUE
7   Case 6 (X v. A)      A v. X       FALSE

I have been playing around with something like this, which I think is on the right track:

library(stringr)
x$v3 <- with(x, str_detect(v1, v2))

If anyone could point me to the right solution or a workaround, I would be very grateful.

An MWE showing that my str_detect() approach does not work:

x <- structure(list(v1 = c("Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation", 
                          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia  v  Russian Federation"
), v2 = c("Georgia v Russian Federation", " Ethiopia v South Africa Liberia v South Africa", 
             " Cameroon v United Kingdom", " New Zealand v France", " Australia v France", 
             " Nicaragua v United States of America", " Nicaragua v Honduras", 
             " Nauru v Anustralia", " Nnew Zealand v France", " Islamic Republic of Iran v United States of America", 
             " Bosnia and Herzegovina v Serbia and Montenegro", " Spain v Cananda", 
             " Libyan Arab Jamahiriya v United States of America", " Libyan Arab Jamahiriya v United Kingdom", 
             " Democratic Republic of the Congo v Burundi", " Germany v United States of America", 
             " Democratic Republic of the Congo v Belgium", " Liechtenstein v Germany", 
             " Democratic Republic of the Congo v Ugandan", " Democratic Republic of the Congo v Rwandan", 
             " Nicaragua v Colombia", " Djibouti v France", " Georgia v Russian Federation", 
             " Croatia v Serbia", " Mexico v United States of American", " Democratic Republic of the Congo v Rwanda", 
             " Spain v  Canada", " Australia v  France", " New Zealand v France", 
             " New Zealand v France")), .Names = c("v1", "v2"
             ), row.names = c(NA, 30L), class = "data.frame")

1 Answer:

Answer 0 (score: 1)

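grepl can be used to compare a single value from v2 against the possible substrings of v1.

grepl uses only the first element of a pattern vector, so you need to apply it to each row separately. A quick solution could be: apply(data.frame(v1,v2), MARGIN=1, FUN=function(x) {grepl(x[2],x[1])})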
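The row-wise idea above can also be sketched with mapply(), which pairs the two columns element by element. This is only an illustration, not the answerer's code: the three-row data frame is a made-up sample modelled on the question's first table, and fixed = TRUE is an added assumption that v2 should be treated as literal text rather than as a regular expression.

```r
# Toy data modelled on the question's first table (hypothetical values).
x <- data.frame(
  v1 = c("Case 1 (A v. B)", "Case 2 (A v. C)", "Case 5 (B v. A)"),
  v2 = c("A v. B", "A v. B", "A v. B"),
  stringsAsFactors = FALSE
)

# mapply() calls grepl(pattern = v2[i], x = v1[i]) once per row.
# fixed = TRUE (an assumption here) matches v2 literally, so "." is not
# interpreted as a regex wildcard. unname() drops the names that mapply()
# would otherwise copy from the pattern vector.
x$v3 <- unname(mapply(grepl, x$v2, x$v1, MoreArgs = list(fixed = TRUE)))

print(x)
```

With this sample, only the first row contains its v2 value as a substring of v1.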

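If you want to ignore differences in the number of spaces (as in row 1), you can use gsub to turn the value in x[2] into a corresponding regular expression: each " " is replaced with " *", which allows any number of spaces (including none).

In that case, this call will work: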

apply(x,MARGIN=1, FUN=function(x) {grepl(gsub(" "," *",x[2]),x[1])})
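An alternative to building a regular expression is to normalise the whitespace in both columns first and then match literally. The sketch below is one possible way to do that, not part of the original answer: squish() is a hypothetical helper defined here, and the two sample strings are shortened versions of values from the MWE.

```r
# Hypothetical helper: collapse runs of whitespace to a single space
# and trim leading/trailing whitespace.
squish <- function(s) trimws(gsub("\\s+", " ", s))

# Shortened sample values based on the MWE (note the double spaces in v1
# and the leading space in v2).
v1 <- c("Racial Discrimination Georgia  v  Russian Federation",
        "Racial Discrimination Georgia  v  Russian Federation")
v2 <- c(" Georgia v Russian Federation", " New Zealand v France")

# Compare row by row on the normalised strings, treating v2 as literal text.
v3 <- unname(mapply(grepl, squish(v2), squish(v1),
                    MoreArgs = list(fixed = TRUE)))

print(v3)
```

Unlike the " *" pattern, this requires at least one whitespace character wherever v2 has a space, and it leaves any other regex metacharacters in v2 uninterpreted. Misspelled names in v2 would still fail to match under either approach.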