Select "Europe" from a df

Time: 2014-06-05 22:05:00

Tags: r wc

My df2:

          League freq
18       England  108
27         Italy   79
20       Germany   74
43         Spain   64
19        France   49
39        Russia   34
31        Mexico   27
47        Turkey   24
32   Netherlands   23
37      Portugal   21
49 United States   18
29         Japan   16
25          Iran   15
7         Brazil   13
22        Greece   13
14         Costa   11
45   Switzerland   11
5        Belgium   10
17       Ecuador   10
23      Honduras   10
42   South Korea    9
2      Argentina    8
48       Ukraine    7
3      Australia    6
11         Chile    6
12         China    6
15       Croatia    6
35        Norway    6
41      Scotland    6
34       Nigeria    5

I tried to select europe:

europe <- subset(df2, nrow(x=18, 27, 20) select=c(1, 2))

What is the most efficient way to select europe, africa, Asia ... from df2?

2 Answers:

Answer 0: (score: 5)

You either need to work out by hand which countries belong to which continent, or you can get this information from somewhere:

(basic strategy from Scraping html tables into R data frames using the XML package)

library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_European_countries_by_area"
tables <- readHTMLTable(theurl)
library(stringr)
europe_names <- str_extract(as.character(tables[[1]]$Country),"[[:alpha:] ]+")
head(sort(europe_names))
## [1] "Albania"    "Andorra"    "Austria"    "Azerbaijan" "Belarus"     
## [6] "Belgium"   
## there's also a 'Total' entry in here but it's probably harmless ...
subset(df2,League %in% europe_names)

Of course, you will have to work this out again for Asia, the Americas, and so on.
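For the "by hand" route, one way to avoid repeating the work per continent is a single lookup table that labels every row once. The following is only a minimal sketch in base R: the `continent_of` vector and the `continent` column are hypothetical names introduced for illustration, the continent assignments are typed in by hand, and only a few rows of df2 are reproduced.

```r
# A few rows of df2 from the question (row names dropped for brevity)
df2 <- data.frame(
  League = c("England", "Italy", "Mexico", "Japan", "Brazil", "Nigeria"),
  freq   = c(108, 79, 27, 16, 13, 5),
  stringsAsFactors = FALSE
)

# Hand-written lookup (an assumption of this sketch, extend as needed):
# names are countries, values are continents
continent_of <- c(
  England = "Europe",   Italy   = "Europe",
  Mexico  = "Americas", Brazil  = "Americas",
  Japan   = "Asia",     Nigeria = "Africa"
)

# Label each row; countries missing from the lookup become NA
df2$continent <- unname(continent_of[as.character(df2$League)])

# One continent at a time ...
europe <- subset(df2, continent == "Europe", select = c(League, freq))

# ... or all continents at once, as a named list of data frames
by_continent <- split(df2[, c("League", "freq")], df2$continent)
```

With this layout, `by_continent$Asia` or `by_continent$Africa` gives the other subsets without any further matching.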

Answer 1: (score: 3)

So, a slightly different approach from @BenBolker's, using the countrycode package.

library(countrycode)
cdb <- countrycode_data         # database of countries

df2[toupper(df2$League) %in% cdb[cdb$continent=="Europe",]$country.name,]
#         League freq
# 27       Italy   79
# 20     Germany   74
# 43       Spain   64
# 19      France   49
# 32 Netherlands   23
# 37    Portugal   21
# 22      Greece   13
# 45 Switzerland   11
# 5      Belgium   10
# 48     Ukraine    7
# 15     Croatia    6
# 35      Norway    6

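One problem you will run into is that "England" is not a country in any database (it is listed as "United Kingdom" instead), so you will have to handle it as a special case.

Also, this database treats "Americas" as a single continent.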
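One possible way to handle such special cases is to map the league name to the name the database uses before matching. This is a rough sketch in base R, not part of the original answer: `aliases` and `lookup_name` are hypothetical names, `europe_names` is a short hand-typed stand-in for the real list of European countries, and "Scotland" (which also appears in df2) is included on the assumption that it should be treated the same way as "England".

```r
# A few rows of df2 from the question
df2 <- data.frame(
  League = c("England", "Italy", "Scotland", "Brazil"),
  freq   = c(108, 79, 6, 13),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the European country names from a database
europe_names <- c("Italy", "Germany", "United Kingdom")

# Special cases: name used in df2 -> name used in the database
aliases <- c(England = "United Kingdom", Scotland = "United Kingdom")

# Build the lookup key without touching the original League column
lookup_name <- as.character(df2$League)
is_alias <- lookup_name %in% names(aliases)
lookup_name[is_alias] <- unname(aliases[lookup_name[is_alias]])

# Match on the translated name, but keep the original rows
europe <- df2[lookup_name %in% europe_names, ]
```

Keeping the translated names in a separate vector means the output still shows "England" and "Scotland" rather than "United Kingdom".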

df2[toupper(df2$League) %in% cdb[cdb$continent=="Americas",]$country.name,]

So to get South America, you have to use the region field:

df2[toupper(df2$League) %in% cdb[cdb$region=="South America",]$country.name,]
#       League freq
# 7     Brazil   13
# 17   Ecuador   10
# 2  Argentina    8
# 11     Chile    6