R Hadoop mapper error: subscript out of bounds

Asked: 2014-06-16 06:40:07

Tags: r hadoop mapreduce hadoop-streaming

I am trying to write a basic MapReduce job in R using Hadoop Streaming. Here is the mapper I wrote:

#! /usr/bin/env Rscript

con <- file("stdin",open = "r")

while(length(line <- readLines(con = con,n = 1,warn = FALSE)) > 0 )
{
  line1 <- gsub("^\\s+|\\s+$", "", line)
  if(is.null(strsplit(line1," ")) == FALSE){
    x <- as.numeric(unlist(strsplit(line1," "))[[1]])
    y <- as.numeric(unlist(strsplit(line1," "))[[2]])
    x2 <- x*x
    xy <- x*y
    cat(x,"\t",y,"\t",xy,"\t",x2,"\n")   
  }
}

close(con)

The input file contains two columns, as shown below:

1  15.55511341
2   27.53983952
3   39.7767569
4   47.44065279
5   55.0606804
6   68.57527802
7   77.03639749
8   80.92939421
9   94.4431412
10  106.5353655

I tried running the mapper directly from the command prompt with the following command:

cat ../data/Input.txt | ./mapper.R

However, I get the following error message:

Error in unlist(strsplit(line1, " "))[[2]] : subscript out of bounds
  In addition: Warning message:
  NAs introduced by coercion 
  Execution halted

It looks like I have made some basic mistake in the code. Can someone help me fix it?
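
The error can be reproduced in a plain R session, without Hadoop. The sketch below uses made-up sample strings (assumptions, not lines copied from Input.txt) to show what `strsplit(line, " ")` returns for differently separated input, and when a second element does not exist at all:

```r
# Hypothetical sample lines; the real Input.txt may differ.
two_spaces <- "1  15.55511341"   # columns separated by two spaces
tab_sep    <- "1\t15.55511341"   # columns separated by a tab
blank      <- ""                 # empty line, e.g. at the end of the file

# Splitting on a single space keeps an empty field between the two spaces,
# so the second field is "" rather than the number.
f_spaces <- unlist(strsplit(two_spaces, " "))
print(f_spaces)

# A tab is not a space, so the whole line stays in one field.
f_tab <- unlist(strsplit(tab_sep, " "))
print(length(f_tab))

# An empty line yields no fields at all.
f_blank <- unlist(strsplit(blank, " "))
print(length(f_blank))

# With fewer than two fields, [[2]] raises an error instead of returning NA.
result <- tryCatch(f_tab[[2]], error = function(e) e)
print(inherits(result, "error"))
```

Any line that splits into fewer than two fields (a blank line, a header, or a tab-separated line) would stop the script in the same way.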

1 answer:

Answer 0 (score: 1)

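The problem is in how the line is cleaned up with gsub before it is split. The columns in your input are separated by a varying number of spaces, so splitting on a single space does not reliably give two fields. Try the following code, which first collapses each run of whitespace into one space and only processes lines that split into exactly two fields.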

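#! /usr/bin/env Rscript

con <- file('stdin', open = 'r')

# Read stdin one line at a time until it is exhausted
while (length(line <- readLines(con = con, n = 1, warn = FALSE)) > 0) {
  # Collapse every run of whitespace (spaces or tabs) into a single space
  line1 <- gsub('\\s+', ' ', line)
  # Trim leading and trailing whitespace
  line1 <- gsub("^\\s+|\\s+$", '', line1)
  res <- unlist(strsplit(line1, ' '))
  # Only process lines that contain exactly two fields
  if (length(res) == 2) {
    x <- as.numeric(res[1])
    y <- as.numeric(res[2])
    x2 <- x * x
    xy <- x * y
    cat(x, "\t", y, "\t", xy, "\t", x2, "\n")
  }
}

close(con)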

It works for me.
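
As a possible variation on the code above, the parsing and formatting can be pulled into small functions that are easy to test without Hadoop. This is only a sketch: `parse_line` and `format_record` are hypothetical helper names, and the sample lines are made up. It splits directly on runs of whitespace, skips lines whose fields are not numeric, and joins the output with `sep = "\t"`, because `cat(x, "\t", y, ...)` with the default separator also puts a space on each side of every tab.

```r
# parse_line() and format_record() are hypothetical helper names.
parse_line <- function(line) {
  trimmed <- gsub("^\\s+|\\s+$", "", line)          # drop leading/trailing whitespace
  fields <- strsplit(trimmed, "\\s+")[[1]]          # split on runs of spaces or tabs
  if (length(fields) != 2) return(NULL)             # skip blank or malformed lines
  values <- suppressWarnings(as.numeric(fields))    # non-numeric text becomes NA
  if (any(is.na(values))) return(NULL)              # skip e.g. a header line
  values
}

format_record <- function(values) {
  x <- values[1]
  y <- values[2]
  # sep = "\t" puts exactly one tab between fields, with no padding spaces
  paste(x, y, x * y, x * x, sep = "\t")
}

# Small self-check on made-up lines (not taken from Input.txt)
sample_lines <- c("1  2.5", "3\t4", "", "x y", "  5   6  ")
for (line in sample_lines) {
  values <- parse_line(line)
  if (!is.null(values)) cat(format_record(values), "\n", sep = "")
}
```

Inside the mapper, the same two calls would replace the body of the `while` loop that reads from `con`.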