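I would like to scrape the following Wikipedia article: http://en.wikipedia.org/wiki/Periodic_table
so that the output of my R code is a table with one row per chemical element, holding the element's symbol, its full name, and the URL of its Wikipedia page.
I am trying to use the XML package to get at the values in the page, but I seem to be stuck right at the beginning, so I would appreciate an example of how to do this (and/or links to relevant examples).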
library(XML)
library(RCurl)  # getURLContent() comes from RCurl, not XML
base_url<-"http://en.wikipedia.org/wiki/Periodic_table"
base_html<-getURLContent(base_url)[[1]]
parsed_html <- htmlTreeParse(base_html, useInternalNodes = TRUE)
xmlChildren(parsed_html)
getNodeSet(parsed_html, "//html", c(x = base_url))
[[1]]
attr(,"class")
[1] "XMLNodeSet"
Answer 0 (score: 13)
Try this:
library(XML)
URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)
# extract attributes and value of all 'a' tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
m1 <- xpathApply(root, "//table[3]//a", f)
m2 <- suppressWarnings(do.call(rbind, m1))
# extract rows that correspond to chemical symbols
ix <- grep("^[[:upper:]][[:lower:]]{0,2}", m2[, "class"])
m3 <- m2[ix, 1:3]
colnames(m3) <- c("URL", "Name", "Symbol")
m3[,1] <- sub("^", "http://en.wikipedia.org", m3[,1])
m3[,2] <- sub(" .*", "", m3[,2])
A bit of the output:
> dim(m3)
[1] 118 3
> head(m3)
URL Name Symbol
[1,] "http://en.wikipedia.org/wiki/Hydrogen" "Hydrogen" "H"
[2,] "http://en.wikipedia.org/wiki/Helium" "Helium" "He"
[3,] "http://en.wikipedia.org/wiki/Lithium" "Lithium" "Li"
[4,] "http://en.wikipedia.org/wiki/Beryllium" "Beryllium" "Be"
[5,] "http://en.wikipedia.org/wiki/Boron" "Boron" "B"
[6,] "http://en.wikipedia.org/wiki/Carbon" "Carbon" "C"
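We can make this more compact by starting from Jeffrey's XPath expression (since it nearly gets the elements at the top) and adding a qualification that tightens it. In this case xpathSApply can be used to eliminate the need for do.call or the plyr package. The last bit, where we fix up odds and ends, is the same as before. This produces a matrix rather than a data frame, which seems more appropriate since the content is entirely character.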
library(XML)
URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)
# extract attributes and value of all a tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
M <- t(xpathSApply(root, "//table[3]/tr/td/a[.!='']", f))[1:118,]
# nicer column names, fix up URLs, fix up Mercury.
colnames(M) <- c("URL", "Name", "Symbol")
M[,1] <- sub("^", "http://en.wikipedia.org", M[,1])
M[,2] <- sub(" .*", "", M[,2])
View(M)
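To see what the `[.!='']` predicate and `xpathSApply` are doing without depending on the live page, here is a minimal sketch on a hand-written toy table. The HTML snippet (including the empty `/wiki/Empty` link) is an assumption made up for illustration and is far simpler than the real Wikipedia markup.

```r
library(XML)

# Hand-written toy table (an assumption, not the real Wikipedia markup).
# The third link has no text, like the blank cells in the real table.
html <- '<html><body><table>
<tr><td><a href="/wiki/Hydrogen" title="Hydrogen">H</a></td>
    <td><a href="/wiki/Helium" title="Helium">He</a></td></tr>
<tr><td><a href="/wiki/Empty" title="Empty"></a></td>
    <td><a href="/wiki/Lithium" title="Lithium">Li</a></td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Same helper as in the answer: attributes first, then the link text.
f <- function(x) c(xmlAttrs(x), xmlValue(x))

# The predicate [. != ''] keeps only links whose text is non-empty.
# xpathSApply returns one column per link, so transpose to one row per link.
M <- t(xpathSApply(doc, "//table//a[. != '']", f))
colnames(M) <- c("URL", "Name", "Symbol")
M
```

Because every surviving link has exactly two attributes plus its text, `xpathSApply` can simplify the result to a matrix directly, which is why no `do.call(rbind, ...)` step is needed here.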
Answer 1 (score: 4)
Alas, this is not what you want:
library(XML)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
tables = readHTMLTable(url)
# ... look through the list to find the one you want...
table = tables[3]
table
$`NULL`
Group # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 Period <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 1 1H 2He <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 2 3Li 4Be 5B 6C 7N 8O 9F 10Ne <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 3 11Na 12Mg 13Al 14Si 15P 16S 17Cl 18Ar <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 4 19K 20Ca 21Sc 22Ti 23V 24Cr 25Mn 26Fe 27Co 28Ni 29Cu 30Zn 31Ga 32Ge 33As 34Se 35Br 36Kr
6 5 37Rb 38Sr 39Y 40Zr 41Nb 42Mo 43Tc 44Ru 45Rh 46Pd 47Ag 48Cd 49In 50Sn 51Sb 52Te 53I 54Xe
7 6 55Cs 56Ba * 72Hf 73Ta 74W 75Re 76Os 77Ir 78Pt 79Au 80Hg 81Tl 82Pb 83Bi 84Po 85At 86Rn
8 7 87Fr 88Ra ** 104Rf 105Db 106Sg 107Bh 108Hs 109Mt 110Ds 111Rg 112Cn 113Uut 114Uuq 115Uup 116Uuh 117Uus 118Uuo
9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 * Lanthanoids 57La 58Ce 59Pr 60Nd 61Pm 62Sm 63Eu 64Gd 65Tb 66Dy 67Ho 68Er 69Tm 70Yb 71Lu <NA> <NA>
11 ** Actinoids 89Ac 90Th 91Pa 92U 93Np 94Pu 95Am 96Cm 97Bk 98Cf 99Es 100Fm 101Md 102No 103Lr <NA> <NA>
The names are gone, and the atomic numbers run into the symbols.
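The fused cells could in principle be pulled apart with a regular expression, as in the sketch below. The `cells` vector is a hand-picked sample of values from the output above, not the full table, and this still leaves the element names and URLs missing.

```r
# A few cell values copied from the readHTMLTable() output, plus non-element cells.
cells <- c("1H", "2He", "3Li", "113Uut", "*", NA)

# Keep only cells that look like digits followed by letters.
is_element <- !is.na(cells) & grepl("^[0-9]+[A-Za-z]+$", cells)
kept <- cells[is_element]

# Split each cell into its leading digits and its trailing letters.
elements <- data.frame(
  Number = as.integer(sub("^([0-9]+).*$", "\\1", kept)),
  Symbol = sub("^[0-9]+", "", kept),
  stringsAsFactors = FALSE
)
elements
```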
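Back to the drawing board...
My DOM walk-fu is not very strong, so this isn't pretty. It gets every link in a table cell, keeps only those with a 'title' attribute (which holds the element's name, while the link text holds the symbol, so the `symbol` and `text` column labels in the code are swapped), and sticks what you want in a data.frame. It also picks up every other such link on the page, but we're lucky: the elements are the first 118 such links: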
library(XML)
library(plyr)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
# don't forget to parse the HTML, doh!
doc = htmlParse(url)
# get every link in a table cell:
links = getNodeSet(doc, '//table/tr/td/a')
# make a data.frame for each node with non-blank text, link, and 'title' attribute:
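df = ldply(links, function(x) {
  text = xmlValue(x)
  if (text == '') text = NULL
  symbol = xmlGetAttr(x, 'title')
  link = xmlGetAttr(x, 'href')
  # return a one-row data.frame only when all three pieces are present;
  # otherwise the function returns NULL and ldply drops the node
  if (!is.null(text) && !is.null(symbol) && !is.null(link))
    data.frame(symbol, text, link)
} )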
# only keep the actual elements -- we're lucky they're first!
df = head(df, 118)
head(df)
symbol text link
1 Hydrogen H /wiki/Hydrogen
2 Helium He /wiki/Helium
3 Lithium Li /wiki/Lithium
4 Beryllium Be /wiki/Beryllium
5 Boron B /wiki/Boron
6 Carbon C /wiki/Carbon
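The links in this result are relative, and the `symbol` column actually holds the element names while `text` holds the symbols. A small sketch of tidying that up follows; the three-row `df` is re-typed from the output above as a stand-in, and the output column names `Symbol`, `Name`, and `URL` are my own choice.

```r
# Stand-in for the first rows of the scraped data.frame shown above.
df <- data.frame(
  symbol = c("Hydrogen", "Helium", "Lithium"),
  text   = c("H", "He", "Li"),
  link   = c("/wiki/Hydrogen", "/wiki/Helium", "/wiki/Lithium"),
  stringsAsFactors = FALSE
)

# Relabel the columns by what they really contain and make the links absolute.
result <- data.frame(
  Symbol = df$text,
  Name   = df$symbol,
  URL    = paste0("http://en.wikipedia.org", df$link),
  stringsAsFactors = FALSE
)
result
```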
Answer 2 (score: 0)
Do you have to scrape Wikipedia? You could instead run this SPARQL query against Wikidata:
SELECT
  ?elementLabel
  ?symbol
  ?article
WHERE
{
  ?element wdt:P31 wd:Q11344;
           wdt:P1086 ?n;
           wdt:P246 ?symbol.
  OPTIONAL {
    ?article schema:about ?element;
             schema:inLanguage "en";
             schema:isPartOf <https://en.wikipedia.org/>.
  }
  FILTER (?n >= 1 && ?n <= 118).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?n
Sorry if this doesn't directly answer your question, but it should help people who want to get the same information in a clean way.
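To get such a query result back into R, one option is to request JSON from the endpoint (for Wikidata, https://query.wikidata.org/sparql) and flatten the standard SPARQL JSON results format. The sketch below assumes the jsonlite package and only covers the flattening step; `sample_json` is a hand-written two-row stand-in for a real response, and `get_value` is a hypothetical helper name.

```r
library(jsonlite)

# Hand-written sample in the SPARQL JSON results format (not real query output).
# The second binding has no "article", as can happen with the OPTIONAL clause.
sample_json <- '{
  "head": {"vars": ["elementLabel", "symbol", "article"]},
  "results": {"bindings": [
    {"elementLabel": {"type": "literal", "xml:lang": "en", "value": "hydrogen"},
     "symbol": {"type": "literal", "value": "H"},
     "article": {"type": "uri", "value": "https://en.wikipedia.org/wiki/Hydrogen"}},
    {"elementLabel": {"type": "literal", "xml:lang": "en", "value": "helium"},
     "symbol": {"type": "literal", "value": "He"}}
  ]}
}'

# Keep the nested list structure so missing variables are easy to detect.
bindings <- fromJSON(sample_json, simplifyVector = FALSE)$results$bindings

# Hypothetical helper: the value of one variable in one binding, or NA if unbound.
get_value <- function(b, var) {
  if (is.null(b[[var]])) NA_character_ else b[[var]]$value
}

elements <- data.frame(
  Name   = vapply(bindings, get_value, character(1), var = "elementLabel"),
  Symbol = vapply(bindings, get_value, character(1), var = "symbol"),
  URL    = vapply(bindings, get_value, character(1), var = "article"),
  stringsAsFactors = FALSE
)
elements
```

Parsing with `simplifyVector = FALSE` is a deliberate choice here: it keeps each binding as a plain list, so an unbound `?article` shows up as a missing entry that can be turned into `NA`.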