dcast.data.table problem with large data and small decimal values

Time: 2018-12-05 11:37:26

Tags: r datatable dplyr tidyverse int64

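I used the function from this answer to read multiple files and build a single data table. I want each file name to become its own column, and any variable that does not occur in one of the other "file names" should be filled with 0.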

Part of the dataset:

    dput(dt[1:4])
structure(list(FileName = c("Sample_4C_NaIO4", "Sample_4C_NaIO4", 
"Sample_4C_NaIO4", "Sample_4C_NaIO4"), smallRNA = c("TCGTACGACTCTTAGCGG", 
"GTACGACTCTTAGCGG", "CTCGTACGACTCTTAGCGG", "CGTACGACTCTTAGCGG"
), counts = c(4166178L, 564940L, 89932L, 52670L)), class = c("data.table", 
"data.frame"), row.names = c(NA, -4L), .internal.selfref = <pointer: 0x180a460>)

My code:

temp <- list.files(pattern = ".txt")
dt <- rbindlist( sapply(temp,fread,simplify=FALSE),
use.names = TRUE, idcol = "FileName")
dt$FileName <- gsub(".txt","",dt$FileName)
finaldt <- dcast.data.table(dt, smallRNA+counts ~FileName,
drop=FALSE,fill=0)

Result:

    finaldt <- dcast.data.table(dt,smallRNA+counts ~ FileName,drop = FALSE,fill = 0)
Using 'counts' as value column. Use 'value.var' to override
Error in CJ(smallRNA = c("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAA", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAG",  : 
  Cross product of elements provided to CJ() would result in 70585808594 rows which exceeds .Machine$integer.max == 2147483647

I considered using the bit64 package, but I am not sure how.

Version:

version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          5.1                         
year           2018                        
month          07                          
day            02                          
svn rev        74947                       
language       R                           
version.string R version 3.5.1 (2018-07-02)
nickname       Feather Spray
  

Edit


The last part of the code had to be changed to:

finaldt <- dcast.data.table(dt, smallRNA ~FileName,
drop=FALSE,fill=0,value.var="counts")
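
For reference, the reshape can be sketched on a toy table. With counts on the left-hand side of the formula and drop=FALSE, dcast builds every combination of the unique smallRNA and counts values through CJ(), which is what the error message above reports. Keeping only smallRNA on the left and passing the value column by name as a string avoids that. The data below are hypothetical stand-ins for "dt", not values from my files.

```r
library(data.table)

# Hypothetical toy data standing in for the combined table "dt"
dt <- data.table(
  FileName = c("A", "A", "B"),
  smallRNA = c("TCGT", "GTAC", "TCGT"),
  counts   = c(10L, 5L, 7L)
)

# One row per smallRNA, one column per file.
# value.var must be a character string naming the value column;
# combinations that do not occur in a file are filled with 0.
wide <- dcast(dt, smallRNA ~ FileName, value.var = "counts", fill = 0L)
print(wide)
```

Here "GTAC" only occurs in file A, so its entry in column B comes from the fill value.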
  

Edit 2: problem with numbers smaller than 1


In the combined dataset "dt", no value is smaller than 1:

filter(dt,counts<1)
[1] FileName smallRNA counts  
<0 rows> (or 0-length row.names)
> myfiles[[1]] %>% filter(counts<1) %>% tail()
# A tibble: 6 x 2
  smallRNA                                                                                counts
  <chr>                                                                                    <dbl>
1 ENST00000592744.1 ncrna chromosome:GRCh38:9:81946438:81976806:-1 gene:ENSG00000267559… 0.00106
2 ENST00000594089.1 ncrna chromosome:GRCh38:11:64778954:64779405:1 gene:ENSG00000269038… 0.00106
3 ENST00000607991.1 ncrna chromosome:GRCh38:22:38743495:38743910:1 gene:ENSG00000273076… 0.00106
4 ENST00000608972.1 ncrna chromosome:GRCh38:7:29008926:29010252:1 gene:ENSG00000272568.… 0.00106
5 ENST00000618845.1 ncrna chromosome:GRCh38:14:49863072:49864379:1 gene:ENSG00000278002… 0.00106
6 ENST00000625800.1 ncrna chromosome:GRCh38:CHR_HG2232_PATCH:233205199:233205479:1 gene… 0.00106

Is it possible to include these values as well?
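
One thing that may be worth checking is the type of counts in each file before binding. This is an assumption, not a diagnosis: the dput output above shows counts stored as integer, while the tibble shows it as double, so the fractional values could be lost if the column ends up as integer somewhere along the way. A minimal sketch that forces counts to double in every table before rbindlist, using hypothetical tables and a hypothetical helper to_double():

```r
library(data.table)

# Hypothetical per-file tables; in my case these would come from fread()
files <- list(
  Sample_A = data.table(smallRNA = c("TCGT", "GTAC"),
                        counts   = c(4166178L, 564940L)),
  Sample_B = data.table(smallRNA = "ENST_x", counts = 0.00106)
)

# Hypothetical helper: return a copy with counts stored as double
to_double <- function(x) {
  x <- copy(x)
  set(x, j = "counts", value = as.numeric(x[["counts"]]))
  x
}

files <- lapply(files, to_double)

# Bind and reshape; fractional counts should now be preserved
dt   <- rbindlist(files, use.names = TRUE, idcol = "FileName")
wide <- dcast(dt, smallRNA ~ FileName, value.var = "counts", fill = 0)
print(dt[counts < 1])
```

If dt[counts < 1] still returns no rows on the real data, the values are probably being dropped earlier, for example when the files are read.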

0 answers:

No answers yet