How to use biglm with more than 2^31 observations

Date: 2017-06-11 17:43:53

Tags: r bigdata

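I am working with a large dataset containing more than 2^31 observations. The actual number of observations is close to 3.5 billion.

I am using the R package "biglm" to run a regression with approximately 70 predictors. I read the data in one million rows at a time and update the regression results. The data have been saved in ffdf format using the R packages "ff" and "ffbase", so that they load quickly and do not use up all my RAM.

Here is the basic outline of the code I am using: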

library(ff)
library(ffbase)
library(biglm)
load.ffdf(dir='home')

dim(data) # the ffdf contains about 70 predictors and 3.5 billion rows

chunk_1 <- data[1:1000000,]
# Start after the first chunk so that row 1000000 is not counted twice
rest_of_data <- data[1000001:nrow(data),]

# Run biglm on the first chunk
b <- biglm(y~x1+x2+...+x70, chunk_1)

chunks <- ceiling(nrow(rest_of_data)/1000000)

# Update the biglm results by iterating through the remaining chunks
for (i in seq(1,chunks)){
      start <- 1+(i-1)*1000000
      end <- min(i*1000000,nrow(rest_of_data))
      d_chunk <- rest_of_data[start:end,]
      b <- update(b,d_chunk)
}

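The results looked great and everything was going smoothly until the cumulative number of observations, after updating the model with each chunk, exceeded 2^31. Then I got an error that reads: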

In object$n + NROW(mm) : NAs produced by integer overflow

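How can I work around this overflow issue? Thanks in advance for your help!

1 answer:

Answer 0: (score: 7)

I believe I have found the source of the problem in the biglm code.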

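The number of observations (n) is stored as an integer, which has a max value of 2^31 - 1.

The numeric type is not subject to this limit and, as far as I can tell, can be used instead of an integer to store n.

The problem can be fixed with one additional line of code that converts the integer n to a numeric. As the model is updated, the number of rows in the new batch is added to the old n, so the type of n remains numeric.

I was able to reproduce the error described in this question and verify that my fix works with this code:

(Warning: this consumes a large amount of memory. If you have tight memory constraints, consider doing more iterations with a smaller array.)

library(biglm)
df = as.data.frame(replicate(3, rnorm(10000000)))
a = biglm(V1 ~ V2 + V3, df)
for (i in 1:300) {
    a = update(a, df)
}
print(summary(a))

In the original biglm library, this code outputs:

Large data regression model: biglm(ff, df)
Sample size =  NA 
              Coef (95% CI) SE  p
(Intercept) -1e-04   NA  NA NA NA
V2          -1e-04   NA  NA NA NA
V3          -2e-04   NA  NA NA NA

My patched version outputs:

Large data regression model: biglm(V1 ~ V2 + V3, df)
Sample size =  3.01e+09 
              Coef   (95%    CI) SE p
(Intercept) -3e-04 -3e-04 -3e-04  0 0
V2          -2e-04 -2e-04 -1e-04  0 0
V3           3e-04  3e-04  3e-04  0 0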

The SE and p values are not zero; they are just rounded in the output above.
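The overflow itself can be reproduced in base R without biglm. The following is only a minimal sketch of the mechanism, not the package's code; the variable names `int_max`, `overflowed`, and `n` are made up for illustration.

```r
# Largest value an R integer can hold (2^31 - 1)
int_max <- .Machine$integer.max

# integer + integer beyond the limit gives NA and a warning
# ("NAs produced by integer overflow"); the warning is suppressed here
overflowed <- suppressWarnings(int_max + 1L)

# Converting the counter to numeric (double) first avoids the overflow.
# numeric + integer stays numeric, so later additions are safe as well.
n <- as.numeric(int_max)
n <- n + 1000000L

print(is.na(overflowed))
print(is.double(n))
print(n > int_max)
```

This is the same reason a single conversion is enough: once n is numeric, every later `n + NROW(...)` produces a numeric result.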

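I am new to the R ecosystem, so I would appreciate it if someone could tell me how to submit this patch so that the original authors can review it and eventually include it in the upstream package.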
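Until a patched package is available, the same idea might be applied from user code. This is an untested-at-scale sketch that rests on an assumption: that the fitted object keeps its observation count in a list element named `n`, as the `object$n + NROW(mm)` error message suggests. The small synthetic data frame and the name `fit` are only for illustration.

```r
library(biglm)

set.seed(42)
# Small synthetic data so that the sketch runs quickly
df <- as.data.frame(replicate(3, rnorm(1000)))

fit <- biglm(V1 ~ V2 + V3, df)

# Assumption: the count is stored in fit$n (inferred from the error message).
# Converting it once to numeric keeps later additions out of integer range limits.
fit$n <- as.numeric(fit$n)

# Each update adds the rows of the new chunk to fit$n
for (i in 1:3) {
  fit <- update(fit, df)
}

print(is.double(fit$n))
print(fit$n)
```

If the assumption holds, the conversion only needs to happen once, right after the first `biglm()` call and before the first `update()`.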