将大数据帧拆分为较小的段

时间:2013-01-04 20:33:33

标签: r split dataframe

我有以下数据框,我想将其分解为10个不同的数据框。我想将最初的100行数据帧分成10行10个数据帧。我可以做以下事情并获得理想的结果。

df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), three=c(rnorm(100)))

df1 = df[1:10,]
df2 = df[11:20,]
df3 = df[21:30,]
df4 = df[31:40,]
df5 = df[41:50,]
...

当然,当初始数据帧较大或者没有可以分解的简单数量的段时,这不是执行此任务的优雅方式。

所以考虑到上述情况,假设我们有以下数据框架。

df = data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))

现在我想将它拆分为由200行组成的新数据帧,以及包含剩余行的最终数据帧。什么是更优雅(又称“快速”)的方式来执行此任务。

5 个答案:

答案 0 :(得分:23)

 > str(split(df, (as.numeric(rownames(df))-1) %/% 200))
List of 6
 $ 0:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -1.592 1.664 -1.231 0.269 0.912 ...
  ..$ two  : num [1:200] 0.639 -0.525 0.642 1.347 1.142 ...
  ..$ three: num [1:200] -0.45 -0.877 0.588 1.188 -1.977 ...
 $ 1:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -0.0017 1.9534 0.0155 -0.7732 -1.1752 ...
  ..$ two  : num [1:200] -0.422 0.869 0.45 -0.111 0.073 ...
  ..$ three: num [1:200] -0.2809 1.31908 0.26695 0.00594 -0.25583 ...
 $ 2:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -1.578 0.433 0.277 1.297 0.838 ...
  ..$ two  : num [1:200] 0.913 0.378 0.35 -0.241 0.783 ...
  ..$ three: num [1:200] -0.8402 -0.2708 -0.0124 -0.4537 0.4651 ...
 $ 3:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] 1.432 1.657 -0.72 -1.691 0.596 ...
  ..$ two  : num [1:200] 0.243 -0.159 -2.163 -1.183 0.632 ...
  ..$ three: num [1:200] 0.359 0.476 1.485 0.39 -1.412 ...
 $ 4:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -1.43 -0.345 -1.206 -0.925 -0.551 ...
  ..$ two  : num [1:200] -1.343 1.322 0.208 0.444 -0.861 ...
  ..$ three: num [1:200] 0.00807 -0.20209 -0.56865 1.06983 -0.29673 ...
 $ 5:'data.frame':  123 obs. of  3 variables:
  ..$ one  : num [1:123] -1.269 1.555 -0.19 1.434 -0.889 ...
  ..$ two  : num [1:123] 0.558 0.0445 -0.0639 -1.934 -0.8152 ...
  ..$ three: num [1:123] -0.0821 0.6745 0.6095 1.387 -0.382 ...

如果某些代码可能更改了rownames,则使用起来会更安全:

 split(df, (seq(nrow(df))-1) %/% 200) 

答案 1 :(得分:4)

require(ff)
df <- data.frame(one=c(rnorm(1123)), two=c(rnorm(1123)), three=c(rnorm(1123)))
for(i in chunk(from = 1, to = nrow(df), by = 200)){
  print(df[min(i):max(i), ])
}

答案 2 :(得分:3)

如果您可以生成定义组的向量,则可以split任何内容:

f <- rep(seq_len(ceiling(1123 / 200)),each = 200,length.out = 1123)
> df1 <- split(df,f = f)
> lapply(df1,dim)
$`1`
[1] 200   3

$`2`
[1] 200   3

$`3`
[1] 200   3

$`4`
[1] 200   3

$`5`
[1] 200   3

$`6`
[1] 123   3

答案 3 :(得分:2)

这样的东西......?

b <- seq(10, 100, 10)
lapply(seq_along(b), function(i) df[(b-9)[i]:b[i], ])

[[1]]
          one        two      three
1  -2.4157992 -0.6232517  1.0531358
2   0.6769020  0.3908089 -1.9543895
3   0.9804026 -2.5167334  0.7120919
4  -1.2200089  0.5108479  0.5599177
5   0.4448290 -1.2885275 -0.7665413
6   0.8431848 -0.9359947  0.1068137
7  -1.8168134 -0.2418887  1.1176077
8   1.4475904 -0.8010347  2.3716663
9   0.7264027 -0.3573623 -1.1956806
10  0.2736119 -1.5553148  0.2691115

[[2]]
          one         two       three
11 -0.3273536 -1.92475496 -0.08031696
12  1.5558892 -1.20158371  0.09104958
13  1.9202047 -0.13418754  0.32571632
14 -0.0515136 -2.15669216  0.23099397
15  0.1909732 -0.30802742 -1.28651457
16  0.8545580 -0.18238266  1.57093844
17  0.4903039  0.02895376 -0.47678196
18  0.5125400  0.97052082 -0.70541908
19 -1.9324370  0.22093545 -0.34436105
20 -0.5763433  0.10442551 -2.05597985

[[3]]
          one         two       three
21  0.7168771 -1.22902943 -0.18728871
22  1.2785641  0.14686576 -1.74738091
23 -1.1856173  0.43829361  0.41269975
24  0.0220843  1.57428924 -0.80163986
25 -1.0012255  0.05520813  0.50871603
26 -0.1842323 -1.61195239  0.04843504
27  0.2328831 -0.38432225  0.95650710
28  0.8821687 -1.32456215 -1.33367967
29 -0.8902177  0.86414661 -1.39629358
30 -0.6586293 -2.27325919  0.27367902

[[4]]
          one        two       three
31  1.3810437 -1.0178835  0.07779591
32  0.6102753  0.3538498  1.92316801
33 -1.5034439  0.7926925  2.21706284
34  0.8251638  0.3992922  0.56781321
35 -1.0832114  0.9878058 -0.16820827
36 -0.4132375 -0.9214491  1.06681472
37 -0.6787631  1.3497766  2.18327887
38 -3.0082585 -1.3047024 -0.04913214
39 -0.3433300  1.1008951 -2.02065141
40  0.6009334  1.2334421  0.15623298

[[5]]
          one         two       three
41 -1.8608051 -0.08589437  0.02370983
42 -0.1829953  0.91139017 -0.01356590
43  1.1146731  0.42384993 -0.68717391
44  1.9039900 -1.70218225  0.06100297
45 -0.4851939  1.38712015 -1.30613414
46 -0.4661664  0.23504099 -0.29335162
47  0.5807227 -0.87821946 -0.14816121
48 -2.0168910 -0.47657382  0.90503226
49  2.5056404  0.27574224  0.10326333
50  0.2238735  0.34441325 -0.17186115

[[6]]
           one        two      three
51  1.51613140 -2.5630782 -0.6720399
52  0.03859537 -2.6688365  0.3395574
53 -0.08695292 -0.5114117 -0.1378789
54 -0.51878363 -0.5401962  0.3946324
55 -2.20482710  0.1716744  0.1786546
56 -0.28133749 -0.4497112  0.5936497
57 -2.38269088 -0.4625695  1.0048914
58  0.37865952  0.5055141  0.3337986
59  0.09329172  0.1560469  0.2835735
60 -1.10818863 -0.2618910  0.3650042

[[7]]
          one        two       three
61 -1.2507208 -1.5050083 -0.63871084
62  0.1379394  0.7996674 -1.80196762
63  0.1582008 -0.3208973  0.40863693
64 -0.6224605  0.1416938 -0.47174711
65  1.1556149 -1.4083576 -1.12619693
66 -0.6956604  0.7994991  1.16073748
67  0.6576676  1.4391007  0.04134445
68  1.4610598 -1.0066840 -1.82981058
69  1.1951788 -0.4005535  1.57256648
70 -0.1994519  0.2711574 -1.04364396

[[8]]
           one        two       three
71  1.23897065  0.4473611 -0.35452535
72  0.89015916  2.3747385  0.87840852
73 -1.17339703  0.7433220  0.40232381
74 -0.24568490 -0.4776862  1.24082294
75 -0.47187443 -0.3271824  0.38542703
76 -2.20899136 -1.1131712 -0.33663075
77 -0.05968035 -0.6023045 -0.23747388
78  1.19687199 -1.3390960 -1.37884241
79 -1.29310506  0.3554548 -0.05936756
80 -0.17470891  1.6198307  0.69170207

[[9]]
           one         two       three
81 -1.06792315  0.04801998  0.08166394
82  0.84152560 -0.45793907  0.27867619
83  0.07619456 -1.21633682 -2.51290495
84  0.55895466 -1.01844178 -0.41887672
85  0.33825508 -1.15061381  0.66206732
86 -0.36041720  0.32808609 -1.83390913
87 -0.31595401 -0.87081019  0.45369366
88  0.92331087  1.22055348 -1.91048757
89  1.30491142  1.22582353 -1.32244004
90 -0.32906839  1.76467263  1.84479228

[[10]]
            one        two       three
91   2.80656707 -0.9708417  0.25467304
92   0.35770119 -0.6132523 -1.11467041
93   0.09598908 -0.5710063 -0.96412216
94  -1.08728715  0.3019572 -0.04422049
95   0.14317455  0.1452287 -0.46133199
96  -1.00218917 -0.1360570  0.88864256
97  -0.25316855  0.6341925 -1.37571664
98   0.36375921  1.2244921  0.12718650
99   0.13345555  0.5330221 -0.29444683
100  2.28548261 -2.0413222 -0.53209956

答案 4 :(得分:-1)

将df分成100万个行组,并一次推送并向SQL中的df追加一百万个

batchsize = 1000000 # vary to your liking

# cycles through data by batchsize
for (i in 1:ceiling(nrow(df)/batchsize)) 
 {
print(i) # just to show the progress

# below shows how to cycle through data 
batch <- df[(((i-1)*batchsize)+1(batchsize*i),,drop=FALSE] # drop = FALSE keeps it from being converted to a vector 

# if below not done then the last batch has Nulls above the number of rows of actual data
batch <- batch[!is.na(batch$ID),] # ID is a variable I presume is in every row

#in this case the table already existed, if new table overwrite = TRUE
(dbWriteTable(con, "df", batch, append = TRUE,row.names = FALSE)) 
}