Question

This is a fairly tricky problem, and I would be grateful for any help.

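Here is what I am trying to do. I have a data frame in R that contains every place in a given state, scraped from Wikipedia. It looks like this (first 10 rows). Let's call it NewHampshire.df: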

   Municipality       County Population
1       Acworth     Sullivan        891
2        Albany      Carroll        735
3    Alexandria      Grafton       1613
4    Allenstown    Merrimack       4322
5       Alstead     Cheshire       1937
6         Alton      Belknap       5250
7       Amherst Hillsborough      11201
8       Andover    Merrimack       2371
9        Antrim Hillsborough       2637
10      Ashland      Grafton       2076

I then built a new variable called grep_term, which combines the values of Municipality and County into a single string that serves as a regex "or" pattern, like this:

   Municipality       County Population  grep_term
1       Acworth     Sullivan        891  "Acworth|Sullivan"
2        Albany      Carroll        735  "Albany|Carroll"
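
A column like this can be built with paste(). The following is only a minimal sketch: it assumes the data frame is named NewHampshire.df as above, reproduces just the first two rows for illustration, and assumes the place names contain no regex metacharacters.

```r
# Illustrative subset of the table shown above (first two rows only)
NewHampshire.df <- data.frame(
  Municipality = c("Acworth", "Albany"),
  County       = c("Sullivan", "Carroll"),
  Population   = c(891, 735),
  stringsAsFactors = FALSE
)

# Join municipality and county with "|" so the string acts as a regex alternation
NewHampshire.df$grep_term <- paste(NewHampshire.df$Municipality,
                                   NewHampshire.df$County,
                                   sep = "|")
```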

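And so on. In addition, I have a second data set containing the self-reported locations of 2,000 Twitter users. I call it location.df, and it looks roughly like this: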

 [1] "London"                     "Orleans village VT USA"     "The World"                 
 [4] "D M V Towson "              "Playa del Sol Solidaridad"  "Beautiful Downtown Burbank"
 [7] NA                           "US"                         "Gaithersburg Md"           
[10] NA                           "California "                "Indy"                      
[13] "Florida"                    "exsnaveen com"              "Houston TX"

I want to do two things:

1: Go through every observation in location.df and store TRUE or FALSE in a new variable, depending on whether the self-reported location matches any entry in the first data set.

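2: Store the number of matches for each row of NewHampshire.df in a new variable. That is, if Acworth has 4 matches in the Twitter location data, then observation 1 should have the value 4 in a newly created "matches" variable in NewHampshire.df.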

What I have done so far: I have solved task 1 as follows:

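location.df$isRelevant <- FALSE
for (i in seq_along(NH_Places)) {
  # OR in each pattern's result so earlier matches are not overwritten
  location.df$isRelevant <- location.df$isRelevant |
    grepl(NH_Places[i], location.df$location, ignore.case = TRUE)
}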

How can I solve task 2, ideally in the same loop?

Thanks in advance, any help is greatly appreciated!

Answer 1

For task 1, you could also use:

# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)

# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
                      sep = "|"), 
                collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location, 
                                 function(s) grepl(places, s, ignore.case = TRUE))

This gives:

> location.df
      location isRelevant
1      Acworth       TRUE
2 Hillsborough       TRUE
3   California      FALSE
4      Amherst       TRUE
5      Grafton       TRUE
6      Ashland       TRUE
7       London      FALSE
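
Since grepl() is already vectorised over its x argument, the sapply() wrapper is optional. The sketch below assumes a hypothetical character vector locs standing in for the real Twitter locations and a hand-written example pattern places. It also illustrates that grepl() returns FALSE for NA entries, which is relevant because the real data contain missing locations.

```r
# Hypothetical sample of self-reported locations, including a missing value
locs <- c("Acworth", "amherst nh", NA, "London")

# Example alternation pattern (assumed; in practice built from the data frame)
places <- "Acworth|Sullivan|Amherst|Hillsborough"

# grepl() checks every element of locs in one call; NA yields FALSE
isRelevant <- grepl(places, locs, ignore.case = TRUE)
isRelevant
```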

To get the number of matches in location.df for each entry of the grep_term column, you can use:

NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))

This gives:

> NewHampshire
   Municipality       County Population            grep_term n.matches
1       Acworth     Sullivan        891     Acworth|Sullivan         1
2        Albany      Carroll        735       Albany|Carroll         0
3    Alexandria      Grafton       1613   Alexandria|Grafton         1
4    Allenstown    Merrimack       4322 Allenstown|Merrimack         0
5       Alstead     Cheshire       1937     Alstead|Cheshire         0
6         Alton      Belknap       5250        Alton|Belknap         0
7       Amherst Hillsborough      11201 Amherst|Hillsborough         2
8       Andover    Merrimack       2371    Andover|Merrimack         0
9        Antrim Hillsborough       2637  Antrim|Hillsborough         1
10      Ashland      Grafton       2076      Ashland|Grafton         2
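
Both tasks can also be derived from a single logical match matrix, which is one way to address the "same loop" part of the question. This is a sketch rather than part of the approach above; terms and locs are small hypothetical stand-ins for NewHampshire$grep_term and the Twitter locations.

```r
# Hypothetical stand-ins for the real columns
terms <- c("Acworth|Sullivan", "Albany|Carroll", "Amherst|Hillsborough")
locs  <- c("Acworth", "Hillsborough", "California", "Amherst", NA)

# Build a logical matrix: one row per location, one column per search term
hits <- vapply(terms,
               function(p) grepl(p, locs, ignore.case = TRUE),
               logical(length(locs)))

is_relevant <- rowSums(hits) > 0   # task 1: does a location match any term?
n_matches   <- colSums(hits)       # task 2: number of matching locations per term
```

Note that these are substring matches, so with ignore.case = TRUE a term such as Alton would also count a location like "Dalton". Wrapping each name in \\b word boundaries is one way to tighten this if it matters for your data.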

R - How to record the number of grepl matches against another data frame?
