Question

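I loaded some data from a Wikipedia article. The URL is: http://en.wikipedia.org/wiki/List_of_countries_by_number_of_police_officers

I used the XML package and it worked fine. However, when I read the data, the numbers come with an unwanted pattern attached.

Here is the code I used to read and save the data: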

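library(XML)
u <- 'http://en.wikipedia.org/wiki/List_of_countries_by_number_of_police_officers'
# Parse every HTML table on the page into a list of data frames
t <- readHTMLTable(u)
# Keep the first table (still a one-element list, hence the "NULL." column prefix later)
t1 <- t[1]
# Round-trip through CSV so that every column is read back as character
write.csv(t1, 'test1.csv', row.names = FALSE)
d <- read.csv('test1.csv', colClasses = 'character')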

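I need to remove the 700xxxx000000000 prefix and the bracketed number at the end, so that only the digits after the zeros remain.

For example, the number in the first row is 7005122000000000000122,000 [3] and I need: 122000

Any suggestions? I thought about using gsub or a similar function, but I don't know which pattern to look for. I could do it by hand, but that would not be efficient.

Thanks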

Answer 1

I would use sub as well. For example:

d[, c(2, 4)] <- sapply(d[, c(2, 4)], 
                       sub, 
                       pattern = ".*0{8,}([0-9,]+).*", 
                       replacement = "\\1")
head(d)
#          NULL.Country NULL.Size NULL.Year NULL.Police.per.100.000.people
# 1         Afghanistan   122,000      2012                            401
# 2      American Samoa       200      2012                            720
# 3             Andorra       237      2012                            278
# 4 Antigua and Barbuda       600      2012                            733
# 5           Argentina   205,902      2000                            558
# 6           Australia    49,242      2009                            217
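
The result above still contains thousands separators, while the question asks for 122000. As a minimal sketch of the remaining step, the same pattern can be applied to a few sample strings and the commas stripped afterwards. The strings in raw are hypothetical, shaped like the scraped cells rather than read from the page, and the names raw, visible and size are illustrative only. Note that sub returns its input unchanged when the pattern does not match, so a cell without a run of at least eight zeros would pass through uncleaned.

```r
# Hypothetical sample strings shaped like the scraped "Size" cells:
# hidden sort key + visible number + optional citation marker.
raw <- c("7005122000000000000122,000 [3]",
         "7002200000000000000200",
         "700450000000000000050,000 [100]")

# Step 1: keep only the digits and commas that follow the run of 8+ zeros.
visible <- sub(".*0{8,}([0-9,]+).*", "\\1", raw)

# Step 2: drop the thousands separators and convert to numeric.
size <- as.numeric(gsub(",", "", visible, fixed = TRUE))

print(visible)
print(size)
```

The same two steps could be applied to d[, 2] from the question once the column has been cleaned with sub.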

Answer 2

Define a modified elFun:

testFun <- function(x){xmlValue(xmlChildren(x)$text)}
out <- readHTMLTable(u, elFun = testFun)[[1]]

> head(out)
  Country    Size Year Police per\n100,000 people
1         122,000 2012                        401
2             200 2012                        720
3             237 2012                        278
4             600 2012                        733
5         205,902 2000                        558
6          49,242 2009                        217

Explanation:

Some entries have several child elements inside the cell node, for example:

> xmlChildren(getNodeSet(htmlParse(u), "//table[1]/tr/td")[[534]])
$span
<span style="display:none" class="sortkey">7004500000000000000</span> 

$text
50,000 

$sup
<sup id="cite_ref-100" class="reference">
  <a href="#cite_note-100"><span>[</span>100<span>]</span></a>
</sup>

We are targeting the "text" node:

> xmlValue(xmlChildren(getNodeSet(htmlParse(u), "//table[1]/tr/td")[[534]])$text)
[1] "50,000"

rather than the span or sup nodes in the example above.
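
To try this without fetching the live page, here is a small sketch that parses an inline HTML fragment with the same cell structure. The fragment in html is a hypothetical stand-in for the Wikipedia markup, and the names html, doc and cell are illustrative only; it assumes the XML package is installed.

```r
library(XML)

# Hypothetical one-cell table mimicking the structure shown above:
# a hidden sort key, the visible number, and a citation marker.
html <- paste0(
  '<table><tr><td>',
  '<span style="display:none" class="sortkey">7004500000000000000</span>',
  '50,000',
  '<sup class="reference">[100]</sup>',
  '</td></tr></table>'
)

# asText = TRUE tells htmlParse that the argument is markup, not a file or URL.
doc <- htmlParse(html, asText = TRUE)
cell <- getNodeSet(doc, "//td")[[1]]

# List the child nodes of the cell by name.
print(names(xmlChildren(cell)))

# xmlValue() on the whole cell concatenates the text of every child ...
print(xmlValue(cell))

# ... while the "text" child holds only the visible number.
print(xmlValue(xmlChildren(cell)$text))
```

This is why passing a function that reads only the text child as elFun leaves the sort key and the citation out of the resulting table.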

Given a pattern in R, how do I clean my numeric entries?