I have two CSV data files: one with 1-minute bars and the other with 5-minute bars. Both files have the same format.
File 1 is:
2007-01-02 10:00:00.000,NIFTY,ABB,2007-01-02 10:00:00.000,750.4,750.4,750,750.2
2007-01-02 10:01:00.000,NIFTY,ABB,2007-01-02 10:01:00.000,750.38,750.4,749.8,749.8
2007-01-02 10:02:00.000,NIFTY,ABB,2007-01-02 10:02:00.000,749.8,750,749.6,750
2007-01-02 10:03:00.000,NIFTY,ABB,2007-01-02 10:03:00.000,749.6,752.4,749.6,752
2007-01-02 10:04:00.000,NIFTY,ABB,2007-01-02 10:04:00.000,752,755.8,752,754.2
2007-01-02 10:05:00.000,NIFTY,ABB,2007-01-02 10:05:00.000,754.02,755,752.05,753.6
2007-01-02 10:06:00.000,NIFTY,ABB,2007-01-02 10:06:00.000,753,753,751,751
2007-01-02 10:07:00.000,NIFTY,ABB,2007-01-02 10:07:00.000,751,751.62,750.5,751
2007-01-02 10:08:00.000,NIFTY,ABB,2007-01-02 10:08:00.000,750.8,751,750.2,750.62
File 2 is:
2007-01-02 10:00:00.000,NIFTY,ABB,2007-01-02 10:00:00.000,750.2,754.2,749.8,753.6
2007-01-02 10:05:00.000,NIFTY,ABB,2007-01-02 10:05:00.000,753.6,753.6,750.62,752.8
2007-01-02 10:10:00.000,NIFTY,ABB,2007-01-02 10:10:00.000,752.8,752.8,750.2,751.5
2007-01-02 10:15:00.000,NIFTY,ABB,2007-01-02 10:15:00.000,751.5,752,751,751.6
2007-01-02 10:20:00.000,NIFTY,ABB,2007-01-02 10:20:00.000,751.6,751.6,750.8,751
2007-01-02 10:25:00.000,NIFTY,ABB,2007-01-02 10:25:00.000,751,751.2,749,749
2007-01-02 10:30:00.000,NIFTY,ABB,2007-01-02 10:30:00.000,749,751.8,749,751.8
2007-01-02 10:35:00.000,NIFTY,ABB,2007-01-02 10:35:00.000,751.8,752,751.1,751.4
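Now I ran a <- read.csv("file1.csv")
class(a[,1:4]) is factor
class(a[,5:8]) is numeric
whereas in the case of file 2
b <- read.csv("file2.csv")
class(b[,1:4]) is factor
class(b[,5:8]) is factor.
Why are columns 5:8 of class factor? The factor type does not let me continue with my analysis. Any ideas?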
Answer 0 (score: 4)
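It is hard to say without seeing the actual file. There may be some stray characters hidden in those columns.
To find them, use stringsAsFactors = FALSE in read.csv to read the factors in as character strings. Next, use as.numeric to convert the character columns to numeric. This will introduce NA in place of the offending characters. Finally, locate them with is.na.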
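The steps above can be sketched as follows. This is only an illustration under stated assumptions: the inline sample and the stray value "751.4x" are made up for the demonstration, and the real file may hide something different (a space, a thousands separator, a text marker). Note that since R 4.0.0 the default for stringsAsFactors is already FALSE, so the argument mainly matters on older versions such as the one used in this thread.

```r
# Hypothetical sample in the same 8-column layout as the question.
# The last value "751.4x" is an invented stray character for illustration.
csv_text <- "2007-01-02 10:30:00.000,NIFTY,ABB,2007-01-02 10:30:00.000,749,751.8,749,751.8
2007-01-02 10:35:00.000,NIFTY,ABB,2007-01-02 10:35:00.000,751.8,752,751.1,751.4x"

# Read non-numeric columns as character instead of factor.
b <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Convert the suspect column; entries that are not numbers become NA.
# The coercion warning is expected here, so it is suppressed.
v8 <- suppressWarnings(as.numeric(b[[8]]))

# Rows where the conversion failed, and the original text in those rows.
bad_rows <- which(is.na(v8))
b[bad_rows, 8]
```

For a real file, replace the text = argument with the file name and repeat the check for each of columns 5 to 8.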
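Note: as.numeric can also be applied directly to a factor, but in this case it would give undesired results, because it returns the internal integer level codes rather than the values shown.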
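A small sketch of that pitfall, using an invented factor rather than the asker's data: converting through as.character first recovers the printed values, while a direct as.numeric call returns the level codes.

```r
# Invented example factor: two numbers stored as text plus one stray letter.
f <- factor(c("750.2", "753.6", "a"))

# Direct conversion returns the internal integer codes of the levels,
# not the numbers that are printed.
codes <- as.numeric(f)

# Going through character first gives the actual values;
# "a" cannot be parsed and becomes NA (the warning is suppressed here).
vals <- suppressWarnings(as.numeric(as.character(f)))

codes
vals
```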
Answer 1 (score: 2)
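When I pasted your example data into two files and read both with read.csv(), both came back with the first four columns as factors and the second set of four columns as numeric, so unfortunately I cannot reproduce your problem. It is probably caused by something in the files that does not appear in your example.
When I modified "file2.csv" to read: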
2007-01-02 10:00:00.000,NIFTY,ABB,2007-01-02 10:00:00.000,750.2,754.2,749.8,753.6
2007-01-02 10:05:00.000,NIFTY,ABB,2007-01-02 10:05:00.000,753.6,753.6,750.62,752.8
2007-01-02 10:10:00.000,NIFTY,ABB,2007-01-02 10:10:00.000,752.8,752.8,750.2,751.5
2007-01-02 10:15:00.000,NIFTY,ABB,2007-01-02 10:15:00.000,751.5,752,751,751.6
2007-01-02 10:20:00.000,NIFTY,ABB,2007-01-02 10:20:00.000,751.6,751.6,750.8,751
2007-01-02 10:25:00.000,NIFTY,ABB,2007-01-02 10:25:00.000,751,751.2,749,749
2007-01-02 10:30:00.000,NIFTY,ABB,2007-01-02 10:30:00.000,749,751.8,749,751.8
2007-01-02 10:35:00.000,NIFTY,ABB,2007-01-02 10:35:00.000,a,b,c,d
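...I did find that the last four columns were read in as factors, which suggests that there may be non-numeric data somewhere in those columns of "file2.csv".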
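To track down where such entries sit, one option is to scan the price columns cell by cell. The sketch below is an assumption-laden illustration: find_non_numeric is a hypothetical helper name, not a base R function, and the inline sample reuses the last two rows of the modified file shown above.

```r
# Hypothetical helper: list the cells in the given columns that do not parse as numbers.
find_non_numeric <- function(df, cols) {
  hits <- lapply(cols, function(j) {
    x <- as.character(df[[j]])  # also works if the column is a factor
    bad <- which(is.na(suppressWarnings(as.numeric(x))) & !is.na(x))
    if (length(bad) == 0) return(NULL)
    data.frame(row = bad, col = j, value = x[bad], stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)  # NULL when every cell is numeric
}

# Last two rows of the modified file, supplied inline instead of from disk.
csv_text <- "2007-01-02 10:30:00.000,NIFTY,ABB,2007-01-02 10:30:00.000,749,751.8,749,751.8
2007-01-02 10:35:00.000,NIFTY,ABB,2007-01-02 10:35:00.000,a,b,c,d"

b <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)
res <- find_non_numeric(b, 5:8)
res
```

Each reported row gives the line of the data frame, the column index, and the text that prevented the column from being read as numeric.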
I also noticed that you might want to use something like:
a<-read.csv("file1.csv",header=F)
b<-read.csv("file2.csv",header=F)
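...to avoid having the first row turned into a header; however, whether or not I used header=F did not change the result. FYI, I am using R 2.15.3 on Windows 7, 64-bit.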
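The effect of the header argument can be seen in a short sketch. It uses three rows of the example data with the duplicated timestamp column left out for brevity, which is a simplification and not the asker's exact layout.

```r
# Three data rows and no header line (simplified 7-column layout).
csv_text <- "2007-01-02 10:00:00.000,NIFTY,ABB,750.2,754.2,749.8,753.6
2007-01-02 10:05:00.000,NIFTY,ABB,753.6,753.6,750.62,752.8
2007-01-02 10:10:00.000,NIFTY,ABB,752.8,752.8,750.2,751.5"

# Default header = TRUE: the first data row is consumed as column names.
with_header <- read.csv(text = csv_text)

# header = FALSE: every row is kept as data and columns are named V1, V2, ...
without_header <- read.csv(text = csv_text, header = FALSE)

nrow(with_header)
nrow(without_header)

# Per-column classes; class() on a multi-column subset reports "data.frame" instead.
sapply(without_header, class)
```

In both cases the price columns stay numeric, which matches the observation that the header setting does not explain the factor columns.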