My df2:
League freq
18 England 108
27 Italy 79
20 Germany 74
43 Spain 64
19 France 49
39 Russia 34
31 Mexico 27
47 Turkey 24
32 Netherlands 23
37 Portugal 21
49 United States 18
29 Japan 16
25 Iran 15
7 Brazil 13
22 Greece 13
14 Costa 11
45 Switzerland 11
5 Belgium 10
17 Ecuador 10
23 Honduras 10
42 South Korea 9
2 Argentina 8
48 Ukraine 7
3 Australia 6
11 Chile 6
12 China 6
15 Croatia 6
35 Norway 6
41 Scotland 6
34 Nigeria 5
I tried to select europe:
europe <- subset(df2, nrow(x=18, 27, 20) select=c(1, 2))
What is the most efficient way to select europe, africa, Asia, ... from df2?
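Answer 0 (score: 5)
You either need to work out by hand which countries belong to which continent, or you can pull that information from somewhere:
(basic strategy from "Scraping html tables into R data frames using the XML package")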
library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_European_countries_by_area"
tables <- readHTMLTable(theurl)
library(stringr)
europe_names <- str_extract(as.character(tables[[1]]$Country),"[[:alpha:] ]+")
head(sort(europe_names))
## [1] "Albania" "Andorra" "Austria" "Azerbaijan" "Belarus"
## [6] "Belgium"
## there's also a 'Total' entry in here but it's probably harmless ...
subset(df2,League %in% europe_names)
Of course, you will have to work this out again for Asia, America, etc.
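If the list of countries is short, the manual route can be sketched in base R as below. This is only an illustration: the small df2 is rebuilt from a few rows of the question, and the `continent_of` lookup vector and the `continent` column are hypothetical names introduced here, not part of the original answer.

```r
# Rebuild a few rows of df2 from the question so the sketch is self-contained.
df2 <- data.frame(
  League = c("England", "Italy", "Japan", "Brazil", "Nigeria"),
  freq   = c(108, 79, 16, 13, 5),
  stringsAsFactors = FALSE
)

# Hypothetical hand-written lookup: country name -> continent.
continent_of <- c(
  England = "Europe", Italy = "Europe",
  Japan = "Asia", Brazil = "Americas", Nigeria = "Africa"
)

# Attach a continent to every row (names missing from the lookup become NA),
# then split into one data frame per continent.
df2$continent <- unname(continent_of[df2$League])
by_continent  <- split(df2, df2$continent)

europe <- by_continent[["Europe"]]
africa <- by_continent[["Africa"]]
```

Splitting once gives every continent in a single pass, so there is no need to write a separate subset() call per continent.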
Answer 1 (score: 3)
Here is a slightly different take on @BenBolker's approach, using the countrycode package.
library(countrycode)
cdb <- countrycode_data # database of countries
df2[toupper(df2$League) %in% cdb[cdb$continent=="Europe",]$country.name,]
# League freq
# 27 Italy 79
# 20 Germany 74
# 43 Spain 64
# 19 France 49
# 32 Netherlands 23
# 37 Portugal 21
# 22 Greece 13
# 45 Switzerland 11
# 5 Belgium 10
# 48 Ukraine 7
# 15 Croatia 6
# 35 Norway 6
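One problem you will run into is that "England" is not a country in any database (the entry is "United Kingdom" instead), so you will have to handle it as a special case.
Also, this database treats the "Americas" as a single continent.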
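One possible way to handle such special cases is to translate the names before matching, leaving df2 itself untouched. The sketch below uses base R only. The `aliases` vector and the `european_names` stand-in list are hypothetical, and treating "Scotland" the same way as "England" is an assumption added here.

```r
# A few rows of df2 from the question.
df2 <- data.frame(
  League = c("England", "Italy", "Scotland", "Brazil"),
  freq   = c(108, 79, 6, 13),
  stringsAsFactors = FALSE
)

# Hypothetical alias table: names used in df2 -> names a country database uses.
aliases <- c(England = "United Kingdom", Scotland = "United Kingdom")

# Use the alias where one exists, otherwise keep the original name.
lookup_name <- unname(ifelse(df2$League %in% names(aliases),
                             aliases[df2$League],
                             df2$League))

# Stand-in for the list of European names taken from a database.
european_names <- c("United Kingdom", "Italy")

# Match on the translated name, but return the original rows.
europe <- df2[lookup_name %in% european_names, ]
```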
df2[toupper(df2$League) %in% cdb[cdb$continent=="Americas",]$country.name,]
So to get South America, you have to use the region field:
df2[toupper(df2$League) %in% cdb[cdb$region=="South America",]$country.name,]
# League freq
# 7 Brazil 13
# 17 Ecuador 10
# 2 Argentina 8
# 11 Chile 6
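As far as I know, more recent releases of countrycode no longer ship the `countrycode_data` object, so the lookups above may need adapting. A rough equivalent using the package's `countrycode()` function might look like the sketch below. It assumes an installed, reasonably current version of the package. The `custom_match` entry for "England" is a manual override added for this example, and the region labels offered by newer versions may differ from the "South America" value used above.

```r
library(countrycode)

# A few rows of df2 from the question.
df2 <- data.frame(
  League = c("England", "Italy", "Brazil", "Japan"),
  freq   = c(108, 79, 13, 16),
  stringsAsFactors = FALSE
)

# Map each name to a continent. "England" is supplied by hand through
# custom_match, since (per the answer above) it is not a database country.
df2$continent <- countrycode(
  df2$League,
  origin       = "country.name",
  destination  = "continent",
  custom_match = c(England = "Europe")
)

# %in% keeps rows with an unmatched (NA) continent out of the result.
europe <- df2[df2$continent %in% "Europe", ]
```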