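Two CSV files return different data types

Time: 2013-04-07 03:55:48

Tags: r csv

I have two CSV data files. One contains 1-minute bars and the other contains 5-minute bars. Both files have the same format.

File 1 is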

> 2007-01-02 10:00:00.000,NIFTY,ABB,2007-01-02 10:00:00.000,750.4,750.4,750,750.2
  2007-01-02 10:01:00.000,NIFTY,ABB,2007-01-02 10:01:00.000,750.38,750.4,749.8,749.8
  2007-01-02 10:02:00.000,NIFTY,ABB,2007-01-02 10:02:00.000,749.8,750,749.6,750
  2007-01-02 10:03:00.000,NIFTY,ABB,2007-01-02 10:03:00.000,749.6,752.4,749.6,752
  2007-01-02 10:04:00.000,NIFTY,ABB,2007-01-02 10:04:00.000,752,755.8,752,754.2
  2007-01-02 10:05:00.000,NIFTY,ABB,2007-01-02 10:05:00.000,754.02,755,752.05,753.6
  2007-01-02 10:06:00.000,NIFTY,ABB,2007-01-02 10:06:00.000,753,753,751,751
  2007-01-02 10:07:00.000,NIFTY,ABB,2007-01-02 10:07:00.000,751,751.62,750.5,751
  2007-01-02 10:08:00.000,NIFTY,ABB,2007-01-02 10:08:00.000,750.8,751,750.2,750.62 

File 2 is

 > 2007-01-02 10:00:00.000,NIFTY,ABB,2007-01-02 10:00:00.000,750.2,754.2,749.8,753.6
   2007-01-02 10:05:00.000,NIFTY,ABB,2007-01-02 10:05:00.000,753.6,753.6,750.62,752.8
   2007-01-02 10:10:00.000,NIFTY,ABB,2007-01-02 10:10:00.000,752.8,752.8,750.2,751.5
   2007-01-02 10:15:00.000,NIFTY,ABB,2007-01-02 10:15:00.000,751.5,752,751,751.6
   2007-01-02 10:20:00.000,NIFTY,ABB,2007-01-02 10:20:00.000,751.6,751.6,750.8,751
   2007-01-02 10:25:00.000,NIFTY,ABB,2007-01-02 10:25:00.000,751,751.2,749,749
   2007-01-02 10:30:00.000,NIFTY,ABB,2007-01-02 10:30:00.000,749,751.8,749,751.8
   2007-01-02 10:35:00.000,NIFTY,ABB,2007-01-02 10:35:00.000,751.8,752,751.1,751.4

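Now I ran a <- read.csv("file1.csv")

class(a[, 1:4]) is factor

class(a[, 5:8]) is numeric

whereas in the case of file 2

b <- read.csv("file2.csv")

class(b[, 1:4]) is factor

class(b[, 5:8]) is factor.

Why are columns 5:8 of class factor? This factor data type does not let me continue my analysis. Any ideas?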

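2 Answers:

Answer 0: (score: 4)

It is hard to say without looking at the actual file. There may be some stray characters hidden in those columns.

To find out, use stringsAsFactors = F in read.csv so that the factors are read as character. Next, use as.numeric to convert the character columns to numeric. This will introduce NA in place of any entry that is not a valid number. Finally, locate those entries using is.na.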
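The steps above can be sketched as follows. This is only an illustration under stated assumptions: the two-row sample and the bad value "75O.62" (with a letter O) are made up to stand in for whatever is hidden in the real file, and the names csv_text, v7_num and bad_rows are hypothetical.

```r
# Hypothetical sample: the letter "O" in "75O.62" plays the role of a hidden bad value.
csv_text <- "2007-01-02 10:00:00.000,NIFTY,ABB,2007-01-02 10:00:00.000,750.2,754.2,749.8,753.6
2007-01-02 10:05:00.000,NIFTY,ABB,2007-01-02 10:05:00.000,753.6,753.6,75O.62,752.8"

# Read text columns as character rather than factor.
b <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Convert the suspect column; entries that are not valid numbers become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
v7_num <- suppressWarnings(as.numeric(b[[7]]))

# Rows where conversion failed although the original entry was not missing.
bad_rows <- which(is.na(v7_num) & !is.na(b[[7]]))
bad_rows        # row numbers of the offending entries
b[bad_rows, 7]  # the offending entries themselves

# Count unparseable entries in each of columns 5 to 8 in one pass.
sapply(b[5:8], function(col) sum(is.na(suppressWarnings(as.numeric(as.character(col))))))
```

With a real file, replace the text = csv_text argument with the file name, for example "file2.csv".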

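Note: as.numeric can be applied to a factor directly, but in this case it gives undesired results, because it returns the factor's internal integer level codes rather than the numbers shown in the labels.

Answer 1: (score: 2)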
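A small illustration of that pitfall, using a made-up three-element factor f (not taken from the question's files):

```r
# Hypothetical factor holding prices as labels.
f <- factor(c("750.2", "753.6", "749"))

# Direct conversion returns the integer level codes (levels are sorted: "749", "750.2", "753.6").
as.numeric(f)                # 2 3 1

# Going through character first recovers the values written in the labels.
as.numeric(as.character(f))  # 750.2 753.6 749.0
```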

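When I pasted your sample data into two files and read both with read.csv(), both came back with the first four columns as factors and the second four columns as numeric, so unfortunately I cannot reproduce your problem. It may be caused by something in the file that does not appear in your example.

When I modified "file2.csv" to read: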

2007-01-02 10:00:00.000,NIFTY,ABB,2007-01-02 10:00:00.000,750.2,754.2,749.8,753.6
2007-01-02 10:05:00.000,NIFTY,ABB,2007-01-02 10:05:00.000,753.6,753.6,750.62,752.8
2007-01-02 10:10:00.000,NIFTY,ABB,2007-01-02 10:10:00.000,752.8,752.8,750.2,751.5
2007-01-02 10:15:00.000,NIFTY,ABB,2007-01-02 10:15:00.000,751.5,752,751,751.6
2007-01-02 10:20:00.000,NIFTY,ABB,2007-01-02 10:20:00.000,751.6,751.6,750.8,751
2007-01-02 10:25:00.000,NIFTY,ABB,2007-01-02 10:25:00.000,751,751.2,749,749
2007-01-02 10:30:00.000,NIFTY,ABB,2007-01-02 10:30:00.000,749,751.8,749,751.8
2007-01-02 10:35:00.000,NIFTY,ABB,2007-01-02 10:35:00.000,a,b,c,d

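...I did find that the last four columns were read in as factors, which suggests that there may be non-numeric data somewhere in those columns of "file2.csv".

I also noticed that you might want to use something like: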
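The experiment can be reproduced without touching any files. This is a rough sketch: the helper read_classes and the variables clean and dirty are hypothetical names, and the data is just the last two rows of the listing. stringsAsFactors = TRUE is set explicitly because its default changed to FALSE in R 4.0.0, where such columns come back as character instead of factor.

```r
# Last two rows of the sample, unmodified.
clean <- "2007-01-02 10:30:00.000,NIFTY,ABB,2007-01-02 10:30:00.000,749,751.8,749,751.8
2007-01-02 10:35:00.000,NIFTY,ABB,2007-01-02 10:35:00.000,751.8,752,751.1,751.4"

# Same rows, with the final four values replaced by letters.
dirty <- "2007-01-02 10:30:00.000,NIFTY,ABB,2007-01-02 10:30:00.000,749,751.8,749,751.8
2007-01-02 10:35:00.000,NIFTY,ABB,2007-01-02 10:35:00.000,a,b,c,d"

# Hypothetical helper: read CSV text and report the class of columns 5 to 8.
read_classes <- function(txt) {
  df <- read.csv(text = txt, header = FALSE, stringsAsFactors = TRUE)
  sapply(df[5:8], class)
}

read_classes(clean)  # all "numeric"
read_classes(dirty)  # all "factor": one non-numeric entry changes the whole column
```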

a<-read.csv("file1.csv",header=F)
b<-read.csv("file2.csv",header=F)

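...to avoid having the first row turned into a header; however, whether or not I used header=F did not change the result. FYI, I am using R 2.15.3 on Windows 7, 64-bit.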