解析R中的不规则数据表

时间:2014-06-06 20:50:24

标签: r parsing csv for-loop

我是R的新手,所以我可能无法找到正确的搜索字词,所以会喜欢任何方向。

我需要解析一个非常不规则的,非常大的.csv。单个列包含10类分类数据,后跟14列,其中可能包含一个或多个数字记录值。这个.csv是手工格式化的,看起来像一个数据透视表,所以我的每个记录之间有不同的行数。这些数据的输入方式存在很多不一致之处。这只是一个很小的片段。

Categories       x1      x2       x3

12123           222.0   206.7   236.7
Novartis Seeds  222.0       
N67-T4          220.8       
4/19/2000       220.8       
32000           220.8       
Soybean         220.8       
Y               220.8       
No-Till         220.8       
N7070BT         223.2       
4/19/2000       223.2       
32000           223.2       
Soybeans        223.2       
Y               223.2       
No-Till         223.2       
Syngenta               206.7    236.7
N68-K7                          236.7
4/24/2002                       236.7
36500                           236.7
Soybeans                        236.7
Y                               236.7
No-Till                         236.7
NX7210                 206.7    
5/8/2001               206.7    
38000                  206.7    
Corn                   206.7    
Y                      206.7    
No-Till                 206.7   

我想我已经想出了一个系统,因为(虽然我一行一行地阅读,我已经看到提到的是在R中编码效率最低的方式):

      #yc is my data table.
    #This function was designed to identify character strings in the only category of data (tillage record)
    #which would be consistantly associated with a single record (yield). I create a new column of 0's and 1's, where
    #1's are associated with a single record
    tillTFfn<-function(yc){
    yc$tillTF<-rep(NA, length(yc$Categories))
                             for (i in 1:length(yc$Categories))
                               if (grepl("till", yc$Categories[i], ignore.case=TRUE)==TRUE){
                                 yc$tillTF[i]<-1
                               } else if (grepl(" Minimum-Till ", yc$Categories[i], ignore.case=TRUE)==TRUE){
                                 yc$tillTF[i]<-1
                               } else if (grepl("Conv", yc$Categories[i], ignore.case=TRUE)==TRUE){
                                 yc$tillTF[i]<-1
                               } else if (grepl("Not", yc$Categories[i], ignore.case=TRUE)==TRUE){
                                 yc$tillTF[i]<-1
                               } else {
                                 yc$tillTF[i]<-0}


                             return(yc)
    }
    YC<-tillTFfn(yc)
    #I then create another new column with the sum of all records reported in each row. Values in "colsum" which coincide with
    #"tillTF" are my record
    YC$colsums<-rowSums(YC[,2:16], na.rm=TRUE)
    #now I'm attempting to create a function that reads YC row by row, and returns a row value with each categorical variable
    array<-rep(0, length(12))
    #The assumption here is that the first 10 rows of column 1 will contain a single value of each category
    for (i in 1:12))
      if(YC$tillTF[i]==1){
        array1[12]<-(YC$colsums[i])
        array1[11]<-(YC$Categories[i])
        array1[10]<-(YC$Categories[i-1])
        array1[9]<-(YC$Categories[i-2])
        array1[8]<-(YC$Categories[i-3])
        array1[7]<-(YC$Categories[i-4])
        array1[6]<-(YC$Categories[i-5])
        array1[5]<-(YC$Categories[i-6])
        array1[4]<-(YC$Categories[i-7])
        array1[3]<-(YC$Categories[i-8])
        array1[2]<-(YC$Categories[i-9])
        array1[1]<-(YC$Categories[i-10])
      }
    #This is my imaginary way to create a data table where the array created above is the first row of a new data table YC_NT
    YC_NT<-rbind(array)

    #This is my imaginary function for the remainder of YC. The idea is that the loop will run through each row of YC, stop
    #when YC$tillTF = 1, rewrite values of the array by reading back up through the column until YC$tillTF=1 again, and then print 
    #that array magically as a row on the new data table YC_NT
    for (i in 13:length(YC$tillTF))
      if (YC$tillTF[i]=1)
        array[12]<-(YC$colsums[i])
        array[11]<-(YC$Categories[i])
        if (YC$tillTF[i-1]==0)
          array[10]<-YC$Categories[i-1]
                else 
                  rbind(array, YC_NT)
        if (YC$tillTF[i-2]==0)
          array[9]<-YC$Categories[i-2]
              else
                rbind(array, YC_NT)
        if(YC$tillTF[i-3]==0)
          array[8]<-YC$Categories[i-3]
              else
                rbind(array, YC_NT)
        if(YC$tillTF[i-4]==0)
          array[7]<-YC$Categories[i-4]
              else
                rbind(array, YC_NT)
        if(YC$tillTFF[i-5]==0)
          array[6]<-YC$Categories[i-5]
              else
                rbind(array, YC_NT)
        if(YC$tillTFF[i-6]==0)
          array[5]<-YC$Categories[i-6]
              else
                rbind(array, YC_NT)
        if(YC$tillTFF[i-7]==0)
          array[4]<-YC$Categories[i-7]
              else
                rbind(array, YC_NT)
        if(YC$tillTFF[i-8]==0)
          array[3]<-YC$Categories[i-8]
              else
                rbind(array, YC_NT)
        if(YC$tillTFF[i-9]==0)
          array[2]<-YC$Categories[i-8]
              else
                rbind(array, YC_NT)
    else 
      array<-array
    return(YC_NT)

#I recognize that my parenthesis and brackets aren't in yet, and once again that this is not how rbind() works.

1)我可以在R中执行嵌套条件语句,就像我在这里完成的那样吗? 2)是否有一个函数可以用来将矢量作为一行打印到数据表而不先命名单个矢量并将其键入rbind(apply()似乎没有工作)

0 个答案:

没有答案