Strange error when plotting by group

Time: 2016-04-04 21:48:39

Tags: r plot data.table

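Apologies for the large data dump, but I could not reproduce this on any of the data subsets I tried. Copy and paste the dput of the data (165 rows, so not too crazy) from this Gist.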

I am trying to plot the data in DT by sport (the same table is called all_sports in the later snippets), as follows:

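  1. Create an empty plot with limits wide enough to hold all of the data
  2. Plot the column gini as a scatterplot, with color varying by sport
  3. Plot the column five_yr_ma as a line, with colors matching those in 2.

This should be simple; I have done similar things before. Here is what should work: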

    #empty plot with proper axes
    DT[ , plot(
      NA, ylim = range(gini), xlim = range(season), 
      xlab = "Season", ylab = "Gini",
      main = "Comparison of Gini Coefficient Across Sports")]
    
    #pick colors for each sport
    cols <- c(NHL="black", NBA="red")
    
    DT[ , {
      #add points to current plot
      points(season, gini, col = cols[.BY$sport])
    
      #add lines to current plot
      lines(season, five_yr_ma, col = cols[.BY$sport], lwd = 3)},
      by = sport]
    

But this gives me the following output/error:

    # Empty data.table (0 rows) of 1 col: sport
    
      

    Error: x and y lengths differ in plot.xy()

This is strange. If we skip the grouping and just do it by hand, it works perfectly:

    all_sports[sport == "NBA", {
      points(season, gini, col = "red")
      lines(season, five_yr_ma, col = "red", lwd = 3)}]
    
    all_sports[sport == "NHL", {
      points(season, gini, col = "black")
      lines(season, five_yr_ma, col = "black", lwd = 3)}]
    

[Image: the expected plot]

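Moreover, even in the grouped case, it is unclear why plot.xy would be receiving arguments of different lengths. If we make the following adjustment to force R to record the inputs as they are sent, nothing appears to be wrong: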

    all_sports[ , {
      cat("\n\nPlotting for sport: ", .BY$sport)
      points(x1 <- season, y1 <- gini, col = cols[.BY$sport])
      lines(x2 <- season, y2 <- five_yr_ma, col = cols[.BY$sport], lwd = 3)
      cat("\npoints/season: ",length(x1),
          "\npoints/gini: ", length(y1),
          "\nlines/season: ", length(x2),
          "\nlines/five_yr_ma: ", length(y2))},
      by = sport]
    

With output:

    # Plotting for sport:  NHL
    # points/season:  98 
    # points/gini:  98 
    # lines/season:  98 
    # lines/five_yr_ma:  98
    
    # Plotting for sport:  NBA
    # points/season:  67 
    # points/gini:  67 
    # lines/season:  67 
    # lines/five_yr_ma:  67
    

What could be going on?

Since this does not seem to happen on every machine, here is my sessionInfo():

    R version 3.2.4 (2016-03-10)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 14.04.3 LTS
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
     [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] data.table_1.9.7
    
    loaded via a namespace (and not attached):
    [1] rsconnect_0.4.1.11 tools_3.2.4  
    

1 Answer:

Answer 0 (score: 2)

Indeed, as @Arun pointed out, this appears to be a resurfacing of the (still unresolved) bug behind this question:

Values of the wrong group are used when using plot() within a data.table() in RStudio

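As @Arun found there, RStudio's native graphics device seems to be tripped up by the changing pointers used for the different subgroups that are created when j is evaluated with by present. This suggests the workaround of simply copy-ing everything taken from .SD each time, e.g.: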

    points(copy(season), copy(gini),
           col = cols[.BY$sport])
    lines(copy(season), copy(five_yr_ma),
          col = cols[.BY$sport], lwd = 3)

or

    # copy the whole subgroup once, then plot from the copy
    x <- copy(.SD)
    with(x, {points(season, gini, col = cols[.BY$sport]);
             lines(copy(season), copy(five_yr_ma),
               col = cols[.BY$sport], lwd = 3)})

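Both of these work for me (since the subgroups are so small, there is no computational efficiency concern here; we can copy away without noticeably affecting performance).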
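For anyone who wants to try the copy workaround without downloading the Gist, here is a minimal self-contained sketch. The table toy and its values are made up for illustration (they are not the question's data), the running mean is only a stand-in for five_yr_ma, and pdf(NULL) is used so that the code runs without opening a plot window. Because it draws to a null device rather than RStudio's, it shows the shape of the workaround but does not reproduce the bug itself.

```r
library(data.table)

# Hypothetical stand-in for the question's data (values are invented)
set.seed(1)
toy <- data.table(
  sport  = rep(c("NHL", "NBA"), times = c(6, 4)),
  season = c(2001:2006, 2003:2006),
  gini   = runif(10, 0.3, 0.6)
)
# Crude running mean per sport, standing in for five_yr_ma
toy[, five_yr_ma := cumsum(gini) / seq_len(.N), by = sport]

cols <- c(NHL = "black", NBA = "red")

pdf(NULL)  # null graphics device: nothing is written to disk

# Empty plot with axes wide enough for every group
toy[, plot(NA, xlim = range(season), ylim = range(gini),
           xlab = "Season", ylab = "Gini")]

n_drawn <- toy[, {
  # copy() each vector before handing it to the graphics calls,
  # as suggested in the answer
  x  <- copy(season)
  y  <- copy(gini)
  ma <- copy(five_yr_ma)
  points(x, y, col = cols[.BY$sport])
  lines(x, ma, col = cols[.BY$sport], lwd = 3)
  # return the lengths that were actually plotted, one row per sport
  .(n_points = length(x), n_line = length(ma))
}, by = sport]

invisible(dev.off())
print(n_drawn)
```

Returning the lengths from j gives a per-sport record of what was sent to points and lines, which plays the same role as the cat() logging in the question.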

This is #1524 on the data.table GitHub page, and I have filed this bug report with RStudio support; I will update this answer if a fix is pushed.