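I have (HTML) text and I want to change entities like &ouml; into real characters such as ä, ü, ö, because otherwise the XML package does not accept it.
So I wrote a small function that loops over a replacement table (link1, link2) and replaces each entity with its special character... The function looks like this (only looonger):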
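library(stringr)  # provides str_replace_all()

html.charconv <- function(text){
    # column 1: real character, column 2: HTML entity to be replaced
    replacer <- matrix(c(
        "Á", "&Aacute;",
        "á", "&aacute;",
        "Â", "&Acirc;",
        "â", "&acirc;",
        "´", "&acute;"
        )
        ,ncol=2,byrow=TRUE)
    for(i in seq_len(nrow(replacer))){
        text <- str_replace_all(text,replacer[i,2],replacer[i,1])
    }
    text
}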
How can I speed this up? I thought about vectorization but found no helpful solution, because each iteration starts from the result of the previous one.
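Answer 0 (score: 8)
You can get a significant speedup by constructing the function a bit differently and forgetting about the text tools. Basically you split each string at & and ;, match the resulting pieces against the entity names, replace the matches with the new characters, and paste everything back together.
You can do this with the following function: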
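html.fastconv <- function(x,old,new){
    # split each string at the entity delimiters '&' and ';'
    xs <- strsplit(x,"&|;")
    # strip the delimiters from the entities so they match the pieces
    old <- gsub("&|;","",old)
    xs <- lapply(xs,function(i){
        id <- match(i,old,0L)      # 0 where a piece is not an entity name
        i[id!=0] <- new[id]        # new[id] drops the zero indices
        return(i)
    })
    sapply(xs,paste,collapse="")
}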
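To see why this works, the individual steps can be traced on a single string. This is only a minimal sketch in base R: the example string, the entity table, and the variable names `pieces`, `keys` and `id` are made up for illustration and are not part of the answer.

```r
# Illustrative inputs (not from the original post)
x   <- "caf&eacute; cr&egrave;me"
old <- c("&eacute;", "&egrave;")
new <- c("\u00e9", "\u00e8")   # the real characters for the two entities

# 1. Split at '&' and ';' so each entity name becomes its own piece
pieces <- strsplit(x, "&|;")[[1]]

# 2. Strip the delimiters from the lookup table as well
keys <- gsub("&|;", "", old)

# 3. match() gives the table position of each piece, 0 when there is none
id <- match(pieces, keys, nomatch = 0L)

# 4. Overwrite only the matched pieces; indexing with 0 selects nothing,
#    so new[id] has exactly one element per matched piece
pieces[id != 0] <- new[id]

# 5. Glue the pieces back together
result <- paste(pieces, collapse = "")
```

The point of the design is that `match()` looks up all pieces against the whole table in one vectorised call, instead of scanning the full text once per entity.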
This works as follows:
> sometext <- c("&Aacute;dd som&aacute; le&Acirc;tter&acirc; acute problems et&acute; cetera",
+ "&Aacute;dd som&aacute; le&Acirc;tter&acirc; acute p ..." ... [TRUNCATED]
> newchar <- c("Á","á","Â","â","´")
> oldchar <- c("&Aacute;","&aacute;","&Acirc;","&acirc;","&acute;")
> html.fastconv(sometext,oldchar,newchar)
[1] "Ádd somá leÂtterâ acute problems et´ cetera" "Ádd somá leÂtterâ acute problems et´ cetera"
For the record, some benchmarking:
require(rbenchmark)
benchmark(html.fastconv(sometext,oldchar,newchar),html.charconv(sometext),
columns=c("test","elapsed","relative"),
replications=1000)
                                       test elapsed relative
2                   html.charconv(sometext)    0.79    5.643
1 html.fastconv(sometext, oldchar, newchar)    0.14    1.000
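Before reading too much into the timings, it may be worth checking that the fast version returns the same strings as a plain replacement loop. The sketch below is an illustration under assumptions: `loop_conv` is a hypothetical stand-in for a sequential replacement loop, and both test strings are made up.

```r
# Copy of html.fastconv from this answer, so the block runs on its own
html.fastconv <- function(x, old, new) {
  xs  <- strsplit(x, "&|;")
  old <- gsub("&|;", "", old)
  xs  <- lapply(xs, function(i) {
    id <- match(i, old, 0L)
    i[id != 0] <- new[id]
    i
  })
  sapply(xs, paste, collapse = "")
}

# Hypothetical stand-in: replace one entity after the other, literally
loop_conv <- function(x, old, new) {
  for (i in seq_along(old)) x <- gsub(old[i], new[i], x, fixed = TRUE)
  x
}

old <- c("&Aacute;", "&aacute;")
new <- c("\u00c1", "\u00e1")

# Made-up inputs: the second contains a semicolon that is not part of an entity
clean <- "&Aacute;dd som&aacute; text"
stray <- "som&aacute; text; more text"

same_clean <- identical(html.fastconv(clean, old, new), loop_conv(clean, old, new))
same_stray <- identical(html.fastconv(stray, old, new), loop_conv(stray, old, new))
```

On this input `same_clean` is TRUE and `same_stray` is FALSE: splitting at `&|;` also drops a semicolon or ampersand that does not belong to an entity, so the split-and-match approach is only safe when the text contains no such stray delimiters.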
Answer 1 (score: 8)
Just for fun, here is a version based on Rcpp.
#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
CharacterVector rcpp_conv(
    CharacterVector text, CharacterVector old , CharacterVector new_){

    int n = text.size() ;
    int nr = old.size() ;
    std::string buffer, current_old, current_new ;
    size_t pos, current_size ;
    CharacterVector res(n) ;

    for( int i=0; i<n; i++){
        buffer = text[i] ;
        for( int j=0; j<nr; j++){
            current_old = old[j] ;
            current_size = current_old.size() ;
            current_new = new_[j] ;
            pos = buffer.find( current_old ) ;
            while( pos != std::string::npos ){
                buffer.replace(
                    pos, current_size,
                    current_new
                ) ;
                // resume after the inserted text, so a replacement that
                // contains the pattern cannot cause an endless loop
                pos = buffer.find( current_old, pos + current_new.size() ) ;
            }
        }
        res[i] = buffer ;
    }
    return res ;
}
With this I get an even bigger performance gain:
> microbenchmark(
+ html.fastconv( sometext,oldchar,newchar),
+ html.fastconvJC(sometext, oldchar, newchar),
+ rcpp_conv( sometext, oldchar, newchar)
+ )
Unit: microseconds
                                         expr    min      lq   median      uq     max
1   html.fastconv(sometext, oldchar, newchar) 97.588 99.9845 101.4195 103.072 256.061
2 html.fastconvJC(sometext, oldchar, newchar) 19.945 23.3060  25.8110  28.134  40.647
3       rcpp_conv(sometext, oldchar, newchar)  4.047  5.1555   6.2340   9.275  25.763
Below is an implementation based on the Rcpp::String functionality, available from Rcpp >= 0.10.2:
class StringConv{
public:
    typedef String result_type ;
    StringConv( CharacterVector old_, CharacterVector new__):
        nr(old_.size()), old(old_), new_(new__){}

    String operator()(String text) const {
        for( int i=0; i<nr; i++){
            text.replace_all( old[i], new_[i] ) ;
        }
        return text ;
    }

private:
    int nr ;
    CharacterVector old ;
    CharacterVector new_ ;
} ;

// [[Rcpp::export]]
CharacterVector test_sapply_string(
    CharacterVector text, CharacterVector old , CharacterVector new_
){
    CharacterVector res = sapply( text, StringConv( old, new_ ) ) ;
    return res ;
}
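Answer 2 (score: 5)
I'm guessing that the 36,000 file reads and writes are your bottleneck, and how you code this in R can't help much with that. Some things just take a while. Your function looks like it will work properly, so just let it run. There are a few small improvements you can make.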
replacer <- matrix(c(
    "Á", "&Aacute;",
    "á", "&aacute;",
    "Â", "&Acirc;",
    "â", "&acirc;",
    "´", "&acute;"
    )
    ,ncol=2, byrow=TRUE)
html.fastconvJC <- function(x,old,new){
    n <- length(new)
    s <- x # make a copy because I'm scared of scoping in R :)
    # fixed = TRUE treats each entity as a literal string, not a regex
    for (i in seq_len(n)) s <- gsub(old[i], new[i], s, fixed = TRUE)
    s
}
# borrowing the strings from Joris Meys
benchmark(html.fastconvJC(sometext, replacer[,2], replacer[,1]),
html.charconv(sometext), columns = c("test", "elapsed", "relative"),
replications=1000)
                                                     test elapsed relative
2                                 html.charconv(sometext)   0.727    17.31
1 html.fastconvJC(sometext, replacer[, 2], replacer[, 1])   0.042     1.00
That is quite a speedup, more than I expected. Note that a large part of it comes from fixed = TRUE; without it, Joris Meys's answer is about the same.
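The effect of `fixed = TRUE` can be checked in isolation. The following is a rough sketch with made-up input; timings depend on the machine and R version, so none are quoted here.

```r
# Made-up input: one short string repeated many times
x <- rep("&Aacute;dd som&aacute; text", 1e4)

# Default: the pattern is interpreted as a regular expression
t_regex <- system.time(r1 <- gsub("&aacute;", "\u00e1", x))

# fixed = TRUE: the pattern is matched as a literal string
t_fixed <- system.time(r2 <- gsub("&aacute;", "\u00e1", x, fixed = TRUE))

# The results agree here because the pattern contains no regex metacharacters
same_result <- identical(r1, r2)

# Elapsed seconds for both variants, side by side
c(regex = t_regex[["elapsed"]], fixed = t_fixed[["elapsed"]])
```

Since HTML entity names contain no regex metacharacters, literal matching gives the same output and only changes how the search is performed.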
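If this doesn't get your overall run time where you want it, you know your bottleneck is elsewhere, probably in the file reads and writes. Unless you have a solid-state drive or a RAID array, running it in parallel won't speed it up and may even slow it down.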
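One way to test that guess is to time reading, converting and writing separately. This is a sketch under assumptions, not a measurement of the asker's setup: `convert` is a hypothetical stand-in for the real conversion function, and the files are small throwaway files created in a temporary directory.

```r
# Hypothetical stand-in for the conversion function
convert <- function(s) gsub("&aacute;", "\u00e1", s, fixed = TRUE)

# Create a few throwaway input files in a temporary directory
dir_in <- tempfile("in_")
dir.create(dir_in)
files <- file.path(dir_in, sprintf("f%03d.html", 1:50))
for (f in files) writeLines(rep("som&aacute; text", 200), f)

# Time each stage on its own
t_read    <- system.time(texts <- lapply(files, readLines))
t_convert <- system.time(out   <- lapply(texts, convert))
t_write   <- system.time(
  for (i in seq_along(files)) writeLines(out[[i]], paste0(files[i], ".out"))
)

# Elapsed seconds per stage
rbind(read = t_read, convert = t_convert, write = t_write)[, "elapsed"]
```

If the read and write rows dominate on the real files, a faster replacement function will change little, which is the situation this answer describes.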
Answer 3 (score: -1)
I would try using plyr:
# .parallel=TRUE needs a registered foreach backend (e.g. via doParallel)
input.data <- llply(input.files, html.charconv, .parallel=TRUE)