Handling values that contain spaces when subsetting by excluding a list of strings

Time: 2019-08-28 18:00:38

Tags: r subset

I have a data frame that looks like this:

Author ID     Country Year
A      12345  US      2011
B      13254  Germany 2018
C      54952  Belgium 2005
D      58774  UK      2009
E      88569  Lebanon 2015
...

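I want to exclude all countries that belong to the EU, as well as the United States. However, I run into trouble with countries whose names contain a space, such as the Czech Republic and the United Kingdom.

So far I have tried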

non_other_countries <- c("Belgium", "Bulgaria", "Denmark", "Germany", "Estonia", "Finland", "France", "Greece", "Ireland", "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Austria", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden", "Czech Republic", "Hungary", "United Kingdom", "Cyprus", "United States")
other_post_2011 <- other_post_2011_with_id[, setdiff(names(other_post_2011_with_id), non_other_countries)]

other_post_2011 <- subset(other_post_2011_with_id, !Country %in% c("Belgium", "Bulgaria", "Denmark", "Germany", "Estonia", "Finland", "France", "Greece", "Ireland", "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Austria", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden", "Czech Republic", "Hungary", "United Kingdom", "Cyprus", "United States", "USA"))

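However, neither attempt excludes the countries whose names contain a space.

I have now come up with an (IMO) ugly workaround: replacing every "Czech Republic" with "Czechia" and every "United Kingdom" with "UK":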

other_post_2011_with_id$Country[other_post_2011_with_id$Country == "Czech Republic"] <- "Czechia"
other_post_2011_with_id$Country[other_post_2011_with_id$Country == "United Kingdom"] <- "UK"

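But I keep wondering whether there is a more elegant, more general solution. Thank you very much!

2 answers:

Answer 0: (score: 1)

Because the data you provided is incomplete, it is hard to tell exactly what is wrong with your code. Try the following approach.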

head(dat)
#   a id        country year
# 1 a  1 United Kingdom 2006
# 2 b  5  Bouvet Island 2010
# 3 c  8        Hungary 2010
# 4 d 10 Czech Republic 2004
# 5 e 12  Bouvet Island 2001
# 6 f 19 United Kingdom 2004

excl <- c("Czech Republic", "Hungary", "United Kingdom", "Cyprus", 
          "United States")

dat[!dat$country %in% excl, ]
#    a id       country year
# 2  b  5 Bouvet Island 2010
# 5  e 12 Bouvet Island 2001
# 7  g 20      Dominica 2004
# 9  i 32       Namibia 2000
# 10 j 34 Bouvet Island 2011
# 11 k 35 Bouvet Island 2001
# 12 l 52 Bouvet Island 2010
# 13 m 54      Dominica 2005
# 14 n 56       Namibia 2000
# 17 q 77 Bouvet Island 2001
# 18 r 79         Qatar 2011
# 19 s 82 Bouvet Island 2002
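If some names in the exclusion vector still survive the filter, one way to investigate is to check which of them never occur in the column at all. The sketch below is only an illustration: the data frame `df` and the short `excl` vector are made-up values modelled on the question's sample table, where the countries appear as "US" and "UK" rather than under their full names.

```r
# Hypothetical sample modelled on the question's table (assumed values)
df <- data.frame(
  Author  = c("A", "B", "C", "D", "E"),
  Country = c("US", "Germany", "Belgium", "UK", "Lebanon"),
  stringsAsFactors = FALSE
)

# Shortened exclusion list, for illustration only
excl <- c("Belgium", "Germany", "United Kingdom", "United States")

# Names in the exclusion list that match nothing in the data
unmatched <- setdiff(excl, unique(df$Country))
print(unmatched)

# Rows that remain after the filter
kept <- df[!df$Country %in% excl, ]
print(kept)
```

In this made-up case the two names containing a space are reported as unmatched because the data stores them under different spellings, not because of the space itself. Printing `unique(df$Country)` on the real data would show which spellings are actually present.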

Data

dat <- structure(list(a = structure(1:20, .Label = c("a", "b", "c", 
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", 
"q", "r", "s", "t"), class = "factor"), id = c(1L, 5L, 8L, 10L, 
12L, 19L, 20L, 31L, 32L, 34L, 35L, 52L, 54L, 56L, 61L, 67L, 77L, 
79L, 82L, 90L), country = structure(c(8L, 1L, 5L, 3L, 1L, 8L, 
4L, 2L, 6L, 1L, 1L, 1L, 4L, 6L, 5L, 2L, 1L, 7L, 1L, 3L), .Label = c("Bouvet Island", 
"Cyprus", "Czech Republic", "Dominica", "Hungary", "Namibia", 
"Qatar", "United Kingdom"), class = "factor"), year = c(2006L, 
2010L, 2010L, 2004L, 2001L, 2004L, 2004L, 2009L, 2000L, 2011L, 
2001L, 2010L, 2005L, 2000L, 2001L, 2006L, 2001L, 2011L, 2002L, 
2003L)), class = "data.frame", row.names = c(NA, -20L))

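Answer 1: (score: 1)

A solution that is a little more elegant than the one you proposed:

You can replace the spaces with underscores before running your code: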

df$Country <- gsub(" ", "_", df$Country)

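Then run your code (with the same replacement applied to the names in the exclusion vector),

and undo the replacement: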

df$Country <- gsub("_", " ", df$Country)

However, the spaces are unlikely to be the cause of the problem. Try excluding the countries you want like this:

# Note the closing parenthesis and the trailing comma, which selects rows
df <- df[!(df$Country %in% c("Country1", "Country2", "Country3")), ]

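If spaces are used consistently in the strings, they do not affect the result. This is only a guess, but some country names may contain more than one consecutive space. Because "United States" written with one space and with two spaces is hard to tell apart by eye, using "_" is the safer choice.

Hope this helps!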
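An alternative to the underscore round trip is to normalize whitespace before matching. The following is a minimal sketch under the same guess that stray or repeated spaces are the culprit; `normalize_name` is a hypothetical helper and the sample vectors are invented for illustration.

```r
# Hypothetical helper: collapse runs of whitespace and trim both ends
normalize_name <- function(x) {
  x <- as.character(x)              # also handles factor columns
  x <- gsub("[[:space:]]+", " ", x) # any run of whitespace becomes one space
  trimws(x)                         # drop leading and trailing whitespace
}

# Invented examples: a double space, a leading space, a trailing space
countries <- c("United  Kingdom", " Czech Republic", "Lebanon ", "United States")
excl <- c("Czech Republic", "United Kingdom", "United States")

raw_hits   <- countries %in% excl                  # exact matching on raw values
clean_hits <- normalize_name(countries) %in% excl  # matching after cleanup

print(raw_hits)
print(clean_hits)
```

With the real data, the filter could then be written as `df[!normalize_name(df$Country) %in% excl, ]`, which leaves the stored values untouched.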