When I read a large table with fread, it slightly changes the numbers in one of the columns

Time: 2016-08-30 09:03:22

Tags: r fread

I have a large file that looks like this

region              type    coeff      p-value  distance    count
82365593523656436   A      -0.9494     0.050    -16479472.5 8
82365593523656436   B      0.47303     0.526    57815363.0  8
82365593523656436   C      -0.8938     0.106    42848210.5  8

When I read it with fread, 82365593523656436 can suddenly no longer be found

correlations <- data.frame(fread('all_to_all_correlations.txt'))
> "82365593523656436" %in% correlations$region
[1] FALSE

I can find a slightly different number instead

> "82365593523656432" %in% correlations$region
[1] TRUE

But that number is not in the actual file

grep 82365593523656432 all_to_all_correlations.txt 

returns nothing, whereas

grep 82365593523656436 all_to_all_correlations.txt 

does return matches.

When I try to read the small sample file shown above instead of the full file, I get

Warning message:
In fread("test.txt") :
  Some columns have been read as type 'integer64' but package bit64 isn't loaded.
Those columns will display as strange looking floating point data.
There is no need to reload the data.
Just require(bit64) to obtain the integer64 print method and print the data again.

and the data looks like this

     region type    coeff       p.value  distance      count
1 3.758823e-303    A -0.94940   0.050    -16479472     8
2 3.758823e-303    B  0.47303   0.526     57815363     8
3 3.758823e-303    C -0.89380   0.106     42848210     8

So I think that 82365593523656436 was changed to 82365593523656432 while the file was being read. How can I prevent this from happening?
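
The off-by-four value is what one would expect if the column ended up stored as an ordinary double. A double has a 53-bit significand, and between 2^56 and 2^57 neighbouring doubles are 16 apart; 82365593523656436 leaves remainder 4 when divided by 16, so the nearest representable value is 82365593523656432. The following is a minimal base R sketch of that rounding, assuming the column really was read as a double; the variable names are illustrative only.

```r
# Assumption: the region column was stored as a double (R's default numeric type).
x <- as.numeric("82365593523656436")

# A double has a 53-bit significand, so not every integer above 2^53 is representable.
x > 2^53                        # TRUE

# The value lies between 2^56 and 2^57, where adjacent doubles are 16 apart.
x >= 2^56 && x < 2^57           # TRUE

# Printing all digits shows the nearest representable value, not the original ID.
stored <- sprintf("%.0f", x)
stored                          # "82365593523656432"

# Kept as text, the ID is compared character by character and nothing is rounded.
id <- "82365593523656436"
id == stored                    # FALSE
```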

1 answer:

Answer 0: (score: 1)

IDs (which is evidently what the first column contains) should generally be read as character:

library(data.table)

correlations <- setDF(fread('region              type    coeff      p-value  distance    count
                                 82365593523656436   A      -0.9494     0.050    -16479472.5 8
                                 82365593523656436   B      0.47303     0.526    57815363.0  8
                                 82365593523656436   C      -0.8938     0.106    42848210.5  8',
                            colClasses = c(region = "character")))
str(correlations)
#'data.frame':  3 obs. of  6 variables:
# $ region  : chr  "82365593523656436" "82365593523656436" "82365593523656436"
# $ type    : chr  "A" "B" "C"
# $ coeff   : num  -0.949 0.473 -0.894
# $ p-value : num  0.05 0.526 0.106
# $ distance: num  -16479473 57815363 42848211
# $ count   : int  8 8 8
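
The same idea applied to a file on disk can be sketched as follows. The temporary file is only a stand-in for all_to_all_correlations.txt, and the names by_col and by_type are made up for illustration. The second call uses fread's integer64 argument, which only affects columns that fread detects as integer64; if the column is detected as some other type in the full file, colClasses is the more dependable choice.

```r
library(data.table)

# Hypothetical stand-in for the real file: write the sample rows to a temp file.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "region type coeff p-value distance count",
  "82365593523656436 A -0.9494 0.050 -16479472.5 8",
  "82365593523656436 B 0.47303 0.526 57815363.0 8",
  "82365593523656436 C -0.8938 0.106 42848210.5 8"
), tmp)

# Option 1: force only the ID column to character.
by_col <- fread(tmp, colClasses = c(region = "character"))

# Option 2: read every column detected as integer64 as character instead.
by_type <- fread(tmp, integer64 = "character")

# The original ID is now found, because it is stored as text.
"82365593523656436" %in% by_col$region
"82365593523656436" %in% by_type$region

unlink(tmp)
```

With either option the lookup compares strings, so no digits are lost to numeric rounding.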