R - How to record the number of grepl matches based on another data frame?

Time: 2016-02-28 14:17:31

Tags: regex r twitter

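This is a rather tricky question, and it would be great if someone could help me.

Here is what I am trying to do. I have a data frame in R containing every locality in a given state, scraped from Wikipedia. It looks like this (first 10 rows). Let's call it NewHampshire.df: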

 Municipality       County Population
1       Acworth     Sullivan        891
2        Albany      Carroll        735
3    Alexandria      Grafton       1613
4    Allenstown    Merrimack       4322
5       Alstead     Cheshire       1937
6         Alton      Belknap       5250
7       Amherst Hillsborough      11201
8       Andover    Merrimack       2371
9        Antrim Hillsborough       2637
10      Ashland      Grafton       2076

I have also compiled a new variable called grep_term, which combines the values of Municipality and County into a single string that serves as an or-statement, like this:

 Municipality       County Population  grep_term
1       Acworth     Sullivan        891  "Acworth|Sullivan"
2       Albany      Carroll        735   "Albany|Carroll"

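And so on. In addition, I have another dataset containing the self-disclosed locations of 2000 Twitter users. I call it location.df, and it looks something like this: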

[1] "London"                     "Orleans village VT USA"     "The World"                 
 [4] "D M V Towson "              "Playa del Sol Solidaridad"  "Beautiful Downtown Burbank"
 [7] NA                           "US"                         "Gaithersburg Md"           
[10] NA                           "California "                "Indy"                      
[13] "Florida"                    "exsnaveen com"              "Houston TX"    

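I want to do two things:

1: Go through every observation in the location.df dataset and save TRUE or FALSE to a new variable, depending on whether the self-disclosed location is part of the list in the first dataset.

2: Save the number of matches for each row of the NewHampshire.df dataset to a new variable. That is, if there are 4 matches for Acworth in the Twitter location dataset, observation 1 should have the value "4" in a newly created "matches" variable in NewHampshire.df.

What I have done so far: I have solved task 1 as follows: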

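# OR the results together; plain assignment would keep only the last pattern's result
location.df$isRelevant <- FALSE
for (i in 1:234) {
  location.df$isRelevant <- location.df$isRelevant |
    grepl(NH_Places[i], location.df$location, ignore.case = TRUE)
}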

How can I solve task 2, ideally in the same loop?

Thanks in advance, any help is much appreciated!

1 answer:

Answer 0: (score: 1)

For task 1, you could also use:

# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)

# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
                      sep = "|"), 
                collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location, 
                                 function(s) grepl(places, s, ignore.case = TRUE))

which gives:

> location.df
      location isRelevant
1      Acworth       TRUE
2 Hillsborough       TRUE
3   California      FALSE
4      Amherst       TRUE
5      Grafton       TRUE
6      Ashland       TRUE
7       London      FALSE
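
As a side note, grepl() is already vectorized over the strings it searches, so the sapply() wrapper is optional. The following is only a minimal sketch under assumed sample data: the locations and the shortened pattern are made up for illustration, and in practice the pattern would be the full places string built above.

```r
# Hypothetical sample data: a few self-reported locations, including an NA
location.df <- data.frame(
  location = c("Acworth", "amherst NH", "California", NA, "London"),
  stringsAsFactors = FALSE
)

# Hypothetical shortened pattern; the real one would cover every row
places <- "Acworth|Sullivan|Amherst|Hillsborough"

# One call checks every location; NA entries yield FALSE rather than NA
location.df$isRelevant <- grepl(places, location.df$location, ignore.case = TRUE)
```

Because missing locations come back as FALSE, no separate NA handling should be needed for this step.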

To get the number of matches in the location vector for each value of the grep_term column, you can use:

NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))

which gives:

> NewHampshire
   Municipality       County Population            grep_term n.matches
1       Acworth     Sullivan        891     Acworth|Sullivan         1
2        Albany      Carroll        735       Albany|Carroll         0
3    Alexandria      Grafton       1613   Alexandria|Grafton         1
4    Allenstown    Merrimack       4322 Allenstown|Merrimack         0
5       Alstead     Cheshire       1937     Alstead|Cheshire         0
6         Alton      Belknap       5250        Alton|Belknap         0
7       Amherst Hillsborough      11201 Amherst|Hillsborough         2
8       Andover    Merrimack       2371    Andover|Merrimack         0
9        Antrim Hillsborough       2637  Antrim|Hillsborough         1
10      Ashland      Grafton       2076      Ashland|Grafton         2
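
Two things may be worth tightening when this is run on real Twitter locations: the count above is case-sensitive (unlike the task 1 check), and a plain alternation also matches inside longer words. The sketch below is a possible variation, not the approach used above. The sample rows and locations are hypothetical, and the word-boundary wrapping and perl = TRUE are assumptions added for illustration.

```r
# Hypothetical sample data mirroring the structure above
NewHampshire <- data.frame(
  Municipality = c("Acworth", "Alton"),
  County       = c("Sullivan", "Belknap"),
  stringsAsFactors = FALSE
)

# Build the alternation column, e.g. "Acworth|Sullivan"
NewHampshire$grep_term <- paste(NewHampshire$Municipality,
                                NewHampshire$County, sep = "|")

# Hypothetical locations: mixed case, a near miss ("Dalton"), and an NA
loc.vec <- c("acworth nh", "Alton", "Dalton GA", NA)

# Wrap each alternation in word boundaries so "Alton" does not match "Dalton",
# ignore case, and count the matching locations for every row
NewHampshire$n.matches <- vapply(
  NewHampshire$grep_term,
  function(x) sum(grepl(paste0("\\b(", x, ")\\b"), loc.vec,
                        ignore.case = TRUE, perl = TRUE)),
  integer(1),
  USE.NAMES = FALSE
)
```

Whether word boundaries are wanted depends on the data: they avoid substring hits such as "Dalton", but they would also skip locations written without spaces.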