I have a data frame that looks like this:
Author ID Country Year
A 12345 US 2011
B 13254 Germany 2018
C 54952 Belgium 2005
D 58774 UK 2009
E 88569 Lebanon 2015
...
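I want to exclude all countries that belong to the EU, as well as the US. However, I run into trouble with countries whose names contain spaces, such as the Czech Republic and the United Kingdom.
So far I have tried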
non_other_countries <- c("Belgium", "Bulgaria", "Denmark", "Germany", "Estonia", "Finland", "France", "Greece", "Ireland", "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Austria", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden", "Czech Republic", "Hungary", "United Kingdom", "Cyprus", "United States")
other_post_2011 <- other_post_2011_with_id[, setdiff(names(other_post_2011_with_id), non_other_countries)]
and
other_post_2011 <- subset(other_post_2011_with_id, !Country %in% c("Belgium", "Bulgaria", "Denmark", "Germany", "Estonia", "Finland", "France", "Greece", "Ireland", "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Austria", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden", "Czech Republic", "Hungary", "United Kingdom", "Cyprus", "United States", "USA"))
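However, neither of the two excludes the countries whose names contain spaces.
I have now come up with an (imo) ugly workaround: replacing every "Czech Republic" with "Czechia" and every "United Kingdom" with "UK",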
other_post_2011_with_id$Country[other_post_2011_with_id$Country == "Czech Republic"] <- "Czechia"
other_post_2011_with_id$Country[other_post_2011_with_id$Country == "United Kingdom"] <- "UK"
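But I keep wondering whether there is a more elegant, more general solution. Thank you very much!
Answer 0: (score: 1)
Since the data you provided is incomplete, it is hard to tell what exactly went wrong with your code. Try the following.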
head(dat)
# a id country year
# 1 a 1 United Kingdom 2006
# 2 b 5 Bouvet Island 2010
# 3 c 8 Hungary 2010
# 4 d 10 Czech Republic 2004
# 5 e 12 Bouvet Island 2001
# 6 f 19 United Kingdom 2004
excl <- c("Czech Republic", "Hungary", "United Kingdom", "Cyprus",
"United States")
dat[!dat$country %in% excl, ]
# a id country year
# 2 b 5 Bouvet Island 2010
# 5 e 12 Bouvet Island 2001
# 7 g 20 Dominica 2004
# 9 i 32 Namibia 2000
# 10 j 34 Bouvet Island 2011
# 11 k 35 Bouvet Island 2001
# 12 l 52 Bouvet Island 2010
# 13 m 54 Dominica 2005
# 14 n 56 Namibia 2000
# 17 q 77 Bouvet Island 2001
# 18 r 79 Qatar 2011
# 19 s 82 Bouvet Island 2002
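If some countries still slip through a filter like this, one thing worth checking is whether every entry of the exclusion vector actually occurs in the data. The sample in the question shows `UK` and `US`, whereas the exclusion vector spells out `United Kingdom` and `United States`, so a spelling mismatch (rather than the space) may be the culprit. The sketch below uses made-up values, including a deliberately misspelled `"Demnark"`, to show how `setdiff()` can surface such entries; the variable names are hypothetical.

```r
# Hypothetical data and exclusion list; "Demnark" is a deliberate misspelling
country <- c("Denmark", "UK", "Czech Republic", "Lebanon")
excl <- c("Demnark", "United Kingdom", "Czech Republic")

# Entries of the exclusion list that never occur in the data
never_matched <- setdiff(excl, unique(country))
print(never_matched)

# Countries that survive the filter, to check for names that should have gone
survivors <- sort(unique(country[!country %in% excl]))
print(survivors)
```

Anything listed in `never_matched` is either absent from the data or spelled differently there.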
# Data used in the example above
dat <- structure(list(a = structure(1:20, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p",
"q", "r", "s", "t"), class = "factor"), id = c(1L, 5L, 8L, 10L,
12L, 19L, 20L, 31L, 32L, 34L, 35L, 52L, 54L, 56L, 61L, 67L, 77L,
79L, 82L, 90L), country = structure(c(8L, 1L, 5L, 3L, 1L, 8L,
4L, 2L, 6L, 1L, 1L, 1L, 4L, 6L, 5L, 2L, 1L, 7L, 1L, 3L), .Label = c("Bouvet Island",
"Cyprus", "Czech Republic", "Dominica", "Hungary", "Namibia",
"Qatar", "United Kingdom"), class = "factor"), year = c(2006L,
2010L, 2010L, 2004L, 2001L, 2004L, 2004L, 2009L, 2000L, 2011L,
2001L, 2010L, 2005L, 2000L, 2001L, 2006L, 2001L, 2011L, 2002L,
2003L)), class = "data.frame", row.names = c(NA, -20L))
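As a side note on the first attempt in the question: `setdiff(names(df), ...)` compares the column names of the data frame with the country list, so it selects columns and cannot drop rows, whether or not a name contains a space. A minimal sketch with a made-up miniature data frame (`df`, `excl` and the values are assumptions for illustration):

```r
# Hypothetical miniature of the question's data frame
df <- data.frame(
  Author  = c("A", "B", "C", "D"),
  Country = c("US", "United Kingdom", "Lebanon", "Czech Republic"),
  stringsAsFactors = FALSE
)
excl <- c("US", "United Kingdom", "Czech Republic")

# setdiff() on names(df) compares COLUMN names with the country list;
# no column is named after a country, so every column and row is kept
cols_kept <- setdiff(names(df), excl)
unchanged <- df[, cols_kept]

# Filtering ROWS needs a logical index before the comma
filtered <- df[!df$Country %in% excl, ]
print(filtered)
```

Here `unchanged` still has all four rows, while `filtered` keeps only the Lebanon row.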
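Answer 1: (score: 1)
A solution that is somewhat more elegant than the one you proposed:
You can replace the spaces with underscores before running your code: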
df$Country <- gsub(" ", "_", df$Country)
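then run your code (with the same replacement applied to the names in your exclusion list),
and undo the replacement afterwards: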
df$Country <- gsub("_", " ", df$Country)
However, the spaces are unlikely to be the cause of the problem. Try excluding the countries you want like this:
df <- df[!(df$Country %in% c("Country1", "Country2", "Country3")), ]
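As long as you use the spaces in your strings consistently, they will not affect the result. This is only a guess, but some country names in your data may contain more than one space. Because "United States" and "United  States" (with a double space) are hard to tell apart by eye, using "_" is always recommended.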
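If stray whitespace is indeed the issue, an alternative to the underscore round trip is to normalise the whitespace once before matching. The following is only a sketch with made-up values (`country` and `excl` are hypothetical): it collapses runs of whitespace into a single space with `gsub()` and strips leading and trailing whitespace with `trimws()`.

```r
# Hypothetical country values with stray whitespace
country <- c("United  States", " Czech Republic", "United Kingdom ", "Lebanon")
excl <- c("United States", "Czech Republic", "United Kingdom")

# Without cleaning, none of the padded values match exactly
raw_match <- country %in% excl
print(raw_match)

# Collapse runs of whitespace to one space, then trim both ends
clean <- trimws(gsub("\\s+", " ", country))
clean_match <- clean %in% excl
print(clean_match)
```

After cleaning, the three padded names match the exclusion list and only the Lebanon entry remains unmatched.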
Hope this helps!