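This is a fairly tricky problem, and I would be grateful for any help.
What I want to do is the following. I have a data frame in R containing every place in a given state, scraped from Wikipedia. It looks like this (first 10 rows). Let's call it NewHampshire.df: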
Municipality County Population
1 Acworth Sullivan 891
2 Albany Carroll 735
3 Alexandria Grafton 1613
4 Allenstown Merrimack 4322
5 Alstead Cheshire 1937
6 Alton Belknap 5250
7 Amherst Hillsborough 11201
8 Andover Merrimack 2371
9 Antrim Hillsborough 2637
10 Ashland Grafton 2076
I then created a new variable called grep_term, which combines the values of Municipality and County into a single string that serves as a regex OR pattern, like this:
Municipality County Population grep_term
1 Acworth Sullivan 891 "Acworth|Sullivan"
2 Albany Carroll 735 "Albany|Carroll"
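And so on. In addition, I have another dataset containing the self-disclosed locations of 2000 Twitter users. I call it location.df, and it looks something like this: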
[1] "London" "Orleans village VT USA" "The World"
[4] "D M V Towson " "Playa del Sol Solidaridad" "Beautiful Downtown Burbank"
[7] NA "US" "Gaithersburg Md"
[10] NA "California " "Indy"
[13] "Florida" "exsnaveen com" "Houston TX"
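I want to do two things:
1: Go through every observation in the location.df dataset and save TRUE or FALSE to a new variable, depending on whether the self-disclosed location is part of the list in the first dataset.
2: Save the number of matches for each row of the NewHampshire.df dataset to a new variable. That is, if there are 4 matches for Acworth in the Twitter location dataset, observation 1 should have the value 4 in a newly created "matches" variable in NewHampshire.df.
What I have done so far: I have solved task 1 as follows: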
for(i in 1:234){
location.df$isRelevant <- sapply(location.df$location, function(s) grepl(NH_Places[i], s, ignore.case = TRUE))
}
How can I solve task 2, ideally in the same loop?
Thanks in advance, any help is much appreciated!
Answer 0 (score: 1)
For task one, you could also use:
# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)
# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
                      sep = "|"),
                collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location,
                                 function(s) grepl(places, s, ignore.case = TRUE))
which gives:
> location.df
location isRelevant
1 Acworth TRUE
2 Hillsborough TRUE
3 California FALSE
4 Amherst TRUE
5 Grafton TRUE
6 Ashland TRUE
7 London FALSE
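One caveat with a plain alternation pattern is that it also matches inside longer words, and real Twitter locations contain NA values. The following is a minimal sketch, not part of the original answer, that compares the plain pattern with one wrapped in `\\b` word boundaries. The sample vectors `places` and `locations` are made up for illustration. Since `grepl` is vectorized over its second argument, no `sapply` is needed here.

```r
# Hypothetical sample data (not from the original post)
places    <- c("Alton", "Amherst")
locations <- c("Dalton GA", "Alton NH", NA, "amherst")

# Plain alternation matches substrings, so "Dalton GA" is a false positive
loose      <- paste(places, collapse = "|")
loose_hits <- grepl(loose, locations, ignore.case = TRUE)

# Wrapping the alternation in \\b ... \\b accepts whole-word matches only
strict      <- paste0("\\b(", paste(places, collapse = "|"), ")\\b")
strict_hits <- grepl(strict, locations, ignore.case = TRUE)

# grepl() returns FALSE (not NA) for the missing location in both cases
print(data.frame(locations, loose_hits, strict_hits))
```

Whether the stricter pattern is preferable depends on the data: place names that contain regex metacharacters would need escaping first, which this sketch does not attempt.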
To get the number of matches in location.df for each entry in the grep_term column, you can use:
NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))
which gives:
> NewHampshire
Municipality County Population grep_term n.matches
1 Acworth Sullivan 891 Acworth|Sullivan 1
2 Albany Carroll 735 Albany|Carroll 0
3 Alexandria Grafton 1613 Alexandria|Grafton 1
4 Allenstown Merrimack 4322 Allenstown|Merrimack 0
5 Alstead Cheshire 1937 Alstead|Cheshire 0
6 Alton Belknap 5250 Alton|Belknap 0
7 Amherst Hillsborough 11201 Amherst|Hillsborough 2
8 Andover Merrimack 2371 Andover|Merrimack 0
9 Antrim Hillsborough 2637 Antrim|Hillsborough 1
10 Ashland Grafton 2076 Ashland|Grafton 2
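Since the question asks for both tasks in one pass, here is a possible sketch (an alternative, not the method used above) that builds a single logical match matrix and derives both results from it. It uses small stand-in data frames, assumes more than one location (so that `sapply` returns a matrix), and passes `ignore.case = TRUE` for both tasks, whereas the `n.matches` line above is case-sensitive. Note also that a loop which assigns `location.df$isRelevant` on every iteration would appear to keep only the result for the last place, which the matrix approach avoids.

```r
# Small stand-ins for the data frames in the post
NewHampshire <- data.frame(
  Municipality = c("Acworth", "Albany", "Amherst"),
  County       = c("Sullivan", "Carroll", "Hillsborough"),
  stringsAsFactors = FALSE
)
NewHampshire$grep_term <- paste(NewHampshire$Municipality,
                                NewHampshire$County, sep = "|")
location.df <- data.frame(
  location = c("Acworth", "Hillsborough", "California", "amherst", NA),
  stringsAsFactors = FALSE
)

# One row per location, one column per place; TRUE where the term matches
hits <- sapply(NewHampshire$grep_term, grepl,
               x = location.df$location, ignore.case = TRUE)

# Task 1: a location is relevant if it matches at least one place
location.df$isRelevant <- unname(rowSums(hits) > 0)

# Task 2: number of matching locations for each place
NewHampshire$n.matches <- unname(colSums(hits))
```

With 2000 locations and a few hundred places the matrix stays small, so this should be cheap; for much larger inputs a per-place loop that accumulates counts may use less memory.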