R regular expressions do not seem to work correctly on Linux

Asked: 2017-09-16 19:40:59

Tags: r regex linux web-scraping

I am trying to scrape the Fangraphs page with alphabetical player indices to get a single-column data frame containing each letter reference.

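I have been able to get the code below to run successfully on the Windows version of R 3.4.1, but I cannot get it to work on Linux at all, and I can't figure out what is going wrong or what is different.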

library(XML)

# Scrape to get the webpage
url <- "http://www.fangraphs.com/players.aspx?"
table <- readHTMLTable(url, stringsAsFactors = FALSE)
letterz <- table[[2]]
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=", ")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters") 
letterz$letters <- as.character(letterz$letters)

# Below this is where I can notice that the code is not operating the same
# as on my Windows machine. None of the gsub commands seem to impact
# the strings at all.

# Stripping the trailing whitespace
letterz$letters <- gsub("[[:space:]]+$", "", letterz$letters)

# Replacing patterns like "AzB   Ba" to instead have "Az,Ba"
letterz$letters <- gsub("[[:upper:]]+?[[:space:]]+?[[:space:]]+?[[:space:]]+", ",", letterz$letters)

# Final cleaning up
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=",")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters") 
letterz$letters <- as.character(letterz$letters)
letterz$letters <- gsub('c\\("|"\\)|"', "", letterz$letters)
letterz$letters <- gsub('^$', NA, letterz$letters)
letterz$letters <- gsub("^[[:space:]]+","", letterz$letters)
letterz$letters <- gsub("[[:space:]]+$","", letterz$letters)
letterz$letters <- gsub("'", "%27", letterz$letters)
letterz <- na.omit(letterz)

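From what I have been able to find, the only real difference between Windows and Linux regular expressions is how line endings (line feeds) are handled. I went back and tried to see whether that made a difference, but still nothing changed.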
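One way to check what the patterns are actually being asked to match is to list the code points in a scraped string. The sketch below uses a made-up sample string, not real output from the page; the no-break space (U+00A0) in it is purely an assumption, included because it is a whitespace-like character that line-ending checks would not reveal.

```r
# Hypothetical sample standing in for one scraped cell. The \u00a0
# (no-break space) characters are an assumption, not taken from the page.
sample_cell <- "AzB\u00a0\u00a0\u00a0Ba"

# Convert the string to its Unicode code points.
code_points <- utf8ToInt(enc2utf8(sample_cell))

# Show each character next to its code point.
data.frame(char = intToUtf8(code_points, multiple = TRUE),
           code = sprintf("U+%04X", code_points))

# Keep only characters outside printable ASCII (0x20 to 0x7E).
non_ascii <- code_points[code_points < 32 | code_points > 126]
sprintf("U+%04X", unique(non_ascii))
```

Running the same inspection on `letterz$letters` on both machines would show whether the two platforms are even looking at the same characters.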

I also tried replacing the R-specific "[[:space:]]" and "[[:upper:]]" style notation with the more standardized "\s" to see whether that would fix anything.
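For reference, a minimal sketch of the `\s` variant on a made-up ASCII-only string. In an R string literal the backslash has to be doubled (`"\\s"`), and `perl = TRUE` asks `gsub` to use the PCRE engine. The sample value is hypothetical.

```r
# Hypothetical sample containing only plain ASCII spaces.
x <- "AzB   Ba  "

# Strip trailing whitespace; note the doubled backslash.
trimmed <- gsub("\\s+$", "", x, perl = TRUE)
trimmed    # "AzB   Ba"

# Replace an uppercase run followed by three or more whitespace
# characters with a comma.
collapsed <- gsub("[[:upper:]]+\\s{3,}", ",", trimmed, perl = TRUE)
collapsed  # "Az,Ba"
```

If this works on both machines but the real data does not, the difference is more likely in the scraped characters than in the pattern syntax.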

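As for a fix, I know there are other packages I could look at to simply get the result I am after. More generally, though, is there a difference in how Windows and Linux implement regular expressions that I am unaware of and have overlooked? If so, how would I account for it in gsub to get the same results as on Windows?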
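R's `?regex` help notes that the interpretation of POSIX classes such as `[[:space:]]` depends on the locale, so one way to reduce platform dependence is to spell out the whitespace characters explicitly. The following is only a sketch under that assumption: `ws` and `normalize_cell` are hypothetical names, and including U+00A0 in the class is a guess about what the page might contain, not something confirmed here.

```r
# Explicit whitespace class: space, tab, CR, LF, and no-break space.
# Listing U+00A0 is an assumption about the scraped data.
ws <- "[ \t\r\n\u00a0]"

# Hypothetical helper mirroring the cleaning steps in the question.
normalize_cell <- function(x) {
  x <- enc2utf8(x)                          # work on UTF-8 input
  x <- gsub(paste0(ws, "+$"), "", x)        # strip trailing whitespace
  x <- gsub(paste0("^", ws, "+"), "", x)    # strip leading whitespace
  # Uppercase run followed by 3+ whitespace characters becomes a comma.
  gsub(paste0("[[:upper:]]+", ws, "{3,}"), ",", x)
}

cleaned <- normalize_cell("AzB\u00a0\u00a0\u00a0Ba \u00a0")
cleaned
```

Because the class lists its members literally, the result does not depend on what a given platform considers to be whitespace.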

Thanks.

0 answers:

No answers yet