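I wrote a simple function to scrape the names of all MLB pitchers from Baseball-Reference.com. I built a vector from the scraped names, removed the "Â" from the raw names, and coerced the result to a character vector.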
library(rvest)
url <- "http://www.baseball-reference.com/leagues/MLB/2016-standard-pitching.shtml"
mlbpitcherdata <- read_html(url)
mlbpitchers <- mlbpitcherdata %>% html_nodes("td:nth-child(2) a") %>% html_text()
mlbpitchers <- as.character(sapply(mlbpitchers, function(x) gsub("Â","",x))) # Remove "Â" from all raw pitcher names
I then tried to find the index of a specific name in the character vector, but which() returns integer(0).
# Search for pitcher name in list of pitchers = Returns integer(0)!
which(mlbpitchers=="Chad Bettis")
integer(0)
# But, mlbpitchers CLEARLY has Chad Bettis inside of it.
mlbpitchers[26]
[1] "Chad Bettis"
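I am confused about why which() does not find the name. I would really appreciate any help. I know this is probably something very simple, but I can't figure it out. Thanks!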
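(Note: after removing the "Â" character, I was prompted to choose an encoding for saving the file. I chose the system default, ISO 8859-1. I'm not sure whether this plays a role in the problem.)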
Answer 0 (score: 3)
This is an encoding issue. In particular, if you look at
R> substr(mlbpitchers[26], 1, 4) == "Chad"
[1] TRUE
R> substr(mlbpitchers[26], 5, 5) == " "
[1] FALSE
you can see that the fifth character is not an ordinary space. As Joran suggested, using
R> rawToChar(charToRaw(mlbpitchers[26]),multiple = TRUE)
[1] "C" "h" "a" "d" "\xc2" "\xa0" "B" "e" "t" "t"
[11] "i" "s"
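also highlights the problem. These two bytes (thanks, Nicola) are the UTF-8 encoding of the HTML non-breaking space. To replace them with a regular space, use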
# Replace each non-breaking space with a regular space and keep the result
mlbpitchers <- gsub("\xc2\xa0", " ", mlbpitchers)
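To reproduce the behaviour without scraping the page, here is a minimal sketch. The vector `pitchers` and the first name in it are made-up stand-ins for the scraped data, and it assumes the string holds the non-breaking space as U+00A0, which is what the raw bytes above suggest.

```r
# Hypothetical stand-in for the scraped names; "\u00a0" is a non-breaking space.
pitchers <- c("Chris Archer", "Chad\u00a0Bettis")

# The comparison finds nothing, because U+00A0 is not the ordinary space U+0020.
which(pitchers == "Chad Bettis")

# Inspect the raw bytes: in UTF-8 the non-breaking space appears as c2 a0.
charToRaw(pitchers[2])

# Replace every non-breaking space with a regular space (literal match, no regex).
clean <- gsub("\u00a0", " ", pitchers, fixed = TRUE)

# The lookup now finds the name.
which(clean == "Chad Bettis")
```

Writing the pattern as "\u00a0" rather than "\xc2\xa0" names the character instead of its bytes, so it should not depend on the encoding the script file was saved in. Either form should work in a UTF-8 session.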