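With few exceptions, species lists (bird lists in particular) are distributed as Word (.doc) documents, and they are usually structured in a way that makes them useless for any kind of data analysis.
A typical list looks like this, blank lines and all. It includes the taxonomy (family) and the species with their common and scientific names: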
1 STRUTHIONIDAE (1)
Common Ostrich Struthio camelus
2 DIOMEDEIDAE (5 – 1 + 1)
++Northern Royal Albatross Diomedea sanfordi
Black-browed Albatross Thalassarche melanophris
Shy Albatross Thalassarche cauta
Grey-headed Albatross Thalassarche chrysostoma
Atlantic Yellow-nosed Albatross Thalassarche chlororhynchos
3 Procellaridae (11 – 1 + 1)
Southern Giant Petrel Macronectes giganteus
Pintado Petrel Daption capense
Great-winged Petrel Pterodroma macroptera
Soft-plumaged Petrel Pterodroma mollis
Antarctic Prion Pachyptila desolata
White-chinned Petrel Procellaria aequinoctialis
++Spectacled Petrel Procellaria conspicillata
Cory's Shearwater Calonectris [diomedea] borealis
Great Shearwater Puffinus gravis
Sooty Shearwater Puffinus griseus
Manx Shearwater Puffinus puffinus
4 HYDROBATIDAE (3)
Wilson's Storm-Petrel Oceanites oceanicus
British Storm-Petrel Hydrobates pelagicus
Leach's Storm-Petrel Oceanodroma leucorhoa
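Lists like this are an extraordinary source of information for technical reports, geographic distribution work, regional conservation status, summaries, and so on. They are especially interesting for regions where little or nothing has been published (the example above is part of the Angola bird list from www.birdsangola.org). The data would be put to much better use if they were properly formatted.
I would like to convert the list above into something usable, extracting the common name, the scientific name, and the taxonomic family of each species. A data.frame would be a good, natural candidate for the subsequent analysis.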
Answer 0 (score: 0)
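library(stringr)

# Read the list from the clipboard (the 'clipboard' file name works on Windows).
# Blank lines are skipped by default (blank.lines.skip = TRUE).
orig.list <- read.delim2('clipboard', header = FALSE, stringsAsFactors = FALSE)

l.species <- data.frame()
for (i in seq_len(nrow(orig.list))) {
  # Keep only runs of letters: digits, '++', brackets, hyphens and
  # apostrophes are dropped, so "Black-browed" becomes "Black browed"
  tmp.string <- unlist(str_extract_all(orig.list[i, ], "[A-Za-z]+"))
  # A single word is a family header; otherwise the common name is
  # everything except the last two words
  l.species[i, 1] <- ifelse(length(tmp.string) == 1, tmp.string,
                            paste(tmp.string[1:(length(tmp.string) - 2)],
                                  collapse = ' '))
  # The last two words form the scientific name (for a family header
  # this is just the family name)
  l.species[i, 2] <- paste(tmp.string[(length(tmp.string) - 1):length(tmp.string)],
                           collapse = ' ')
  # Flag family headers
  l.species[i, 3] <- ifelse(length(tmp.string) == 1, 1, 0)
}
names(l.species) <- c('common name', 'species', 'is.family')

# Family names and the rows on which they appear
taxon.family <- toupper(subset(l.species, is.family == 1,
                               select = species)$species)
rows.family <- as.numeric(row.names(subset(l.species, is.family == 1)))

# Repeat each family name down to the row before the next family header
l.species$family <- rep(taxon.family, times = diff(c(rows.family,
                                                     nrow(l.species) + 1)))

# Drop the family header rows and the helper column
l.spec.family <- subset(l.species, is.family == 0, select = -is.family)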
> head(l.spec.family)
                      common name                     species        family
2                  Common Ostrich            Struthio camelus STRUTHIONIDAE
4        Northern Royal Albatross           Diomedea sanfordi   DIOMEDEIDAE
5          Black browed Albatross    Thalassarche melanophris   DIOMEDEIDAE
6                   Shy Albatross          Thalassarche cauta   DIOMEDEIDAE
7           Grey headed Albatross    Thalassarche chrysostoma   DIOMEDEIDAE
8 Atlantic Yellow nosed Albatross Thalassarche chlororhynchos   DIOMEDEIDAE
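The output above shows a side effect of the letters-only pattern: hyphenated names lose their hyphen ("Black browed"). An entry such as "Cory's Shearwater Calonectris [diomedea] borealis" would also be split in the wrong place. The following is a minimal alternative sketch in base R, not the method of the answer above. It assumes that family headers start with a number, that the list begins with a family header, and that each scientific name ends the line as a capitalised genus, an optional bracketed name, and a lowercase epithet. The function name `parse.species.list` and the sample vector `raw.lines` are hypothetical.

```r
# Hypothetical sample in the same layout as the list in the question
raw.lines <- c("1 STRUTHIONIDAE (1)",
               "Common Ostrich Struthio camelus",
               "",
               "2 DIOMEDEIDAE (5 - 1 + 1)",
               "++Northern Royal Albatross Diomedea sanfordi",
               "Black-browed Albatross Thalassarche melanophris",
               "3 Procellaridae (11 - 1 + 1)",
               "Cory's Shearwater Calonectris [diomedea] borealis")

parse.species.list <- function(x) {
  x <- trimws(x)
  x <- x[nzchar(x)]                                  # drop blank lines
  # Assumption: family headers are the lines that start with a number
  is.family <- grepl("^[0-9]+\\s", x, perl = TRUE)
  family <- toupper(sub("^[0-9]+\\s+(\\S+).*$", "\\1",
                        x[is.family], perl = TRUE))
  # cumsum() labels every line with the index of the last header seen
  family.idx <- cumsum(is.family)
  sp <- sub("^\\++", "", x[!is.family], perl = TRUE) # strip leading '++'
  # Group 1: common name; group 2: genus, optional [name], epithet
  pattern <- "^(.*?)\\s+([A-Z][a-z]+(?:\\s+\\[[a-z]+\\])?\\s+[a-z]+)$"
  ok <- grepl(pattern, sp, perl = TRUE)
  data.frame(
    common.name = ifelse(ok, sub(pattern, "\\1", sp, perl = TRUE), NA),
    species     = ifelse(ok, sub(pattern, "\\2", sp, perl = TRUE), NA),
    family      = family[family.idx[!is.family]],
    stringsAsFactors = FALSE
  )
}

l.spec.alt <- parse.species.list(raw.lines)
print(l.spec.alt)
```

Lines that do not fit the assumed pattern get NA in both name columns, which makes them easy to find and fix by hand. With real data, `raw.lines` could come from `readLines()` on a plain-text export of the Word document.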
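library(plyr)
# all.data stands for the data frame built from the complete list
# (assumed here to be l.spec.family itself)
all.data <- l.spec.family
# Percentage of all species that belongs to each family
summary.nesp <- ddply(l.spec.family, .(family), summarise,
                      prop_esp = length(family) / nrow(all.data) * 100)
top.summary.nesp <- head(summary.nesp[order(summary.nesp$prop_esp, decreasing = TRUE), ], 6)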
> top.summary.nesp
          family prop_esp
79     SYLVIIDAE 8.076514
1   ACCIPITRIDAE 5.419766
48    PASSERIDAE 5.100956
24   ESTRILDIDAE 4.250797
83      TURDIDAE 3.613177
44 NECTARINIIDAE 3.506908
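The same kind of per-family percentage can also be computed without plyr. The sketch below uses `table()` and `prop.table()` from base R on a small made-up data frame. The object `demo.species` and its values are hypothetical and only stand in for `l.spec.family`.

```r
# Hypothetical miniature of l.spec.family (made-up values)
demo.species <- data.frame(
  species = c("sp a", "sp b", "sp c", "sp d"),
  family  = c("FAM_X", "FAM_X", "FAM_Y", "FAM_Z"),
  stringsAsFactors = FALSE
)

# Share of species per family, as a percentage of all species
fam.share <- prop.table(table(demo.species$family)) * 100

summary.demo <- data.frame(family   = names(fam.share),
                           prop_esp = as.numeric(fam.share),
                           stringsAsFactors = FALSE)

# Largest families first, keeping at most six rows
summary.demo <- summary.demo[order(summary.demo$prop_esp, decreasing = TRUE), ]
top.summary.demo <- head(summary.demo, 6)
print(top.summary.demo)
```

Because `prop.table()` divides by the total count of the table, the denominator is always the number of rows that were tabulated, so no separate total has to be passed in.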