Splitting a column by pattern matching

Date: 2017-06-09 18:28:41

Tags: r dplyr tidyr grepl

                                       Province                   ElecDistName                               Candidate Votes Majority  Vper MajPer
                                          <chr>                          <chr>                                   <chr> <int>    <int> <dbl>  <dbl>
1 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est                     Nick Whalen Liberal 20974      646  46.7    1.4
2 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est Jack Harris ** NDP-New Democratic Party 20328       NA  45.3     NA
3 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est           Deanne Stapleton Conservative  2938       NA   6.5     NA
4 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est        David Anthony Peters Green Party   500       NA   1.1     NA
5 Newfoundland and Labrador/Terre-Neuve-et-Labrador St. John's East/St. John's-Est                   Sean Burton Communist   140       NA   0.3     NA
6                   New Brunswick/Nouveau-Brunswick                    Fundy Royal                 Alaina Lockhart Liberal 19136     1775  40.9    3.8

Top of Dataset

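Amateur question here: I am trying to split the Candidate column into two, one containing the name and the other containing the party. I have tried a few things with the separate function: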

separate(ElecResults, Candidate, into = c("Name", "Party"), sep = " (?=[^ ]+$)")

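But this seems to miss a lot of observations. For candidates with three names the problem is obvious, but some others seem to be missed entirely (for example, the candidate marked with an unexplained double asterisk).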
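To see what that pattern does, here is a minimal base R sketch that applies the same regular expression to three of the sample rows. It uses `strsplit` instead of `separate`, which is only a stand-in for illustration. The lookahead matches the last space in the string, so only the final word ends up on the party side.

```r
candidates <- c("Nick Whalen Liberal",
                "David Anthony Peters Green Party",
                "Jack Harris ** NDP-New Democratic Party")

# Same pattern as in the separate() call: split at the last space only
pieces <- strsplit(candidates, " (?=[^ ]+$)", perl = TRUE)

# The piece after the last space is what would land in the Party column
last_piece <- vapply(pieces, function(p) p[length(p)], character(1))
print(last_piece)
```

Any party whose name has more than one word ("Green Party", "NDP-New Democratic Party") is cut in the wrong place, which suggests matching on the party names themselves rather than on whitespace.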

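I was thinking that a function combined with grepl could recognize the most common party names, such as Liberal, Conservative, NDP and Green, and create a new column called Party containing the party name, but every attempt gives me an error message.

If anyone knows how I can split this column, that would be a huge help.

Thanks!

Here is the data, produced with dput: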

structure(list(Province = c("Newfoundland and Labrador/Terre-Neuve-et-Labrador", 
"Newfoundland and Labrador/Terre-Neuve-et-Labrador", "Newfoundland and Labrador/Terre-Neuve-et-Labrador", 
"Newfoundland and Labrador/Terre-Neuve-et-Labrador", "Newfoundland and Labrador/Terre-Neuve-et-Labrador", 
"New Brunswick/Nouveau-Brunswick"), ElecDistName = c("St. John's East/St. John's-Est", 
"St. John's East/St. John's-Est", "St. John's East/St. John's-Est", 
"St. John's East/St. John's-Est", "St. John's East/St. John's-Est", 
"Fundy Royal"), Candidate = c("Nick Whalen Liberal", "Jack Harris ** NDP-New Democratic Party", 
"Deanne Stapleton Conservative", "David Anthony Peters Green Party", 
"Sean Burton Communist", "Alaina Lockhart Liberal"), Votes = c(20974L, 
20328L, 2938L, 500L, 140L, 19136L), Majority = c(646L, NA, NA, 
NA, NA, 1775L), Vper = c(46.7, 45.3, 6.5, 1.1, 0.3, 40.9), MajPer = c(1.4, 
NA, NA, NA, NA, 3.8)), .Names = c("Province", "ElecDistName", 
"Candidate", "Votes", "Majority", "Vper", "MajPer"), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

2 Answers:

Answer 0: (score: 0)

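Here is some basic code that you will need to adapt. Put every party name inside the quotes, separated by |

library(dplyr)
library(stringr)

df <- data.frame(Candidate = "Nick Whalen Liberal", Majority = 1)
# All party names in one regular expression, alternatives separated by |
parties <- "Liberal|Conservative"
# Use the "start" column of str_locate() so that every row gets its own position
df %>% mutate(Name = str_trim(str_sub(Candidate, 1, str_locate(Candidate, parties)[, "start"] - 1)))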
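The same idea can be extended to the whole sample and to the Party column as well. The following is a minimal sketch, assuming every party of interest is listed in the pattern; the vector `candidates` simply repeats the Candidate values from the question, and `result` is a name chosen for illustration.

```r
library(stringr)

candidates <- c("Nick Whalen Liberal",
                "Jack Harris ** NDP-New Democratic Party",
                "Deanne Stapleton Conservative",
                "David Anthony Peters Green Party",
                "Sean Burton Communist",
                "Alaina Lockhart Liberal")

# One regular expression listing every expected party
parties <- "Liberal|Conservative|NDP-New Democratic Party|Green Party|Communist"

# The party is the matched text; the name is everything before the match
party <- str_extract(candidates, parties)
name  <- str_trim(str_sub(candidates, 1, str_locate(candidates, parties)[, "start"] - 1))

result <- data.frame(Name = name, Party = party, stringsAsFactors = FALSE)
print(result)
```

Rows whose party is not in the pattern come back as NA in both columns, and the double asterisk stays attached to the name, so both cases still need a decision on how to handle them.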

Answer 1: (score: 0)

Here is another approach, using fuzzyjoin:

library(tidyverse)
library(fuzzyjoin)

parties <- data_frame(party = c("Liberal", "NDP-New Democratic Party", "Conservative", "Green Party", "Communist"))

df %>%
  regex_left_join(parties, by = c(Candidate = "party")) %>%
  replace_na(list(party = "minor")) %>%
  mutate(Candidate = str_replace(Candidate, party, "")) %>%
  select(Candidate, party)
#> # A tibble: 6 x 2
#>               Candidate                    party
#>                   <chr>                    <chr>
#> 1          Nick Whalen                   Liberal
#> 2       Jack Harris **  NDP-New Democratic Party
#> 3     Deanne Stapleton              Conservative
#> 4 David Anthony Peters               Green Party
#> 5          Sean Burton                 Communist
#> 6      Alaina Lockhart                   Liberal

Note that the final select was added only to show that the approach works. I particularly like this approach because any other parties that turn up in the data frame can be handled neatly with replace_na.
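As a small illustration of that fallback, here is a sketch that uses `str_extract` from stringr rather than the fuzzyjoin join, purely to keep the example short. The second candidate and its party are made up for the example.

```r
library(stringr)
library(tidyr)

# Second entry is a hypothetical candidate from a party that is not listed
candidates <- c("Nick Whalen Liberal", "Jane Doe Rhinoceros Party")
parties <- "Liberal|Conservative"

# Unmatched rows come back as NA and are then labelled "minor"
party <- replace_na(str_extract(candidates, parties), "minor")
print(party)
```

One thing to keep in mind: a row labelled "minor" still carries the unlisted party name inside its Candidate text, because there was no matching pattern to remove.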