dplyr - Conditional extension of intervals

Time: 2016-04-11 19:49:21

Tags: r algorithm dplyr

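I have a data.frame in R, which I have attached below. The first column contains the row.names, which I think can be ignored for this example.

Here is what I would like to do:

For each consecutive run in the value column, I would like to produce the longest combination of start and end, that is, a single interval running from the smallest start to the largest end of that run.

The solution for the data.frame below would look like this: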

     start      end  value
1 11498007 11675212   NIC2
2 11675212 11675695 ED3048
3 11675695 12007383   NIC2

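I can do this in R with a for loop, but that is prohibitively slow because I am working with much larger data sets.

Is there an easy way to do this with dplyr or some other fast method?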

       start      end  value
1   11498007 11500185   NIC2
2   11500185 11503809   NIC2
3   11503809 11504028   NIC2
4   11504028 11505268   NIC2
5   11505268 11506382   NIC2
6   11506382 11506414   NIC2
7   11506414 11506422   NIC2
8   11506422 11506659   NIC2
9   11506659 11506790   NIC2
10  11506790 11506921   NIC2
11  11506921 11507408   NIC2
12  11507408 11507482   NIC2
13  11507482 11508111   NIC2
14  11508111 11510776   NIC2
15  11510776 11514107   NIC2
16  11514107 11514141   NIC2
17  11514141 11514941   NIC2
18  11514941 11515753   NIC2
19  11515753 11516308   NIC2
20  11516308 11520681   NIC2
21  11520681 11522554   NIC2
22  11522554 11523088   NIC2
23  11523088 11525130   NIC2
24  11525130 11527377   NIC2
25  11527377 11527525   NIC2
26  11527525 11527939   NIC2
27  11527939 11528408   NIC2
28  11528408 11528420   NIC2
29  11528420 11528444   NIC2
30  11528444 11528453   NIC2
31  11528453 11528611   NIC2
32  11528611 11529008   NIC2
33  11529008 11529017   NIC2
34  11529017 11529257   NIC2
35  11529257 11530157   NIC2
36  11530157 11530186   NIC2
37  11530186 11530421   NIC2
38  11530421 11530518   NIC2
39  11530518 11530624   NIC2
40  11530624 11530666   NIC2
41  11530666 11530994   NIC2
42  11530994 11532649   NIC2
43  11532649 11532738   NIC2
44  11532738 11533042   NIC2
45  11533042 11533454   NIC2
46  11533454 11533912   NIC2
47  11533912 11534304   NIC2
48  11534304 11537299   NIC2
49  11537299 11539754   NIC2
50  11539754 11541846   NIC2
51  11541846 11543431   NIC2
52  11543431 11557925   NIC2
53  11557925 11558476   NIC2
54  11558476 11559622   NIC2
55  11559622 11562905   NIC2
56  11562905 11569135   NIC2
57  11569135 11569433   NIC2
58  11569433 11570277   NIC2
59  11570277 11570284   NIC2
60  11570284 11574102   NIC2
61  11574102 11577288   NIC2
62  11577288 11579868   NIC2
63  11579868 11584487   NIC2
64  11584487 11585017   NIC2
65  11585017 11585996   NIC2
66  11585996 11586122   NIC2
67  11586122 11587155   NIC2
68  11587155 11588850   NIC2
69  11588850 11601008   NIC2
70  11601008 11605243   NIC2
71  11605243 11606089   NIC2
72  11606089 11609905   NIC2
73  11609905 11611376   NIC2
74  11611376 11621733   NIC2
75  11621733 11623480   NIC2
76  11623480 11625922   NIC2
77  11625922 11634546   NIC2
78  11634546 11634930   NIC2
79  11634930 11639416   NIC2
80  11639416 11640314   NIC2
81  11640314 11641999   NIC2
82  11641999 11643118   NIC2
83  11643118 11650865   NIC2
84  11650865 11658435   NIC2
85  11658435 11660037   NIC2
86  11660037 11660064   NIC2
87  11660064 11660490   NIC2
88  11660490 11660544   NIC2
89  11660544 11666281   NIC2
90  11666281 11667555   NIC2
91  11667555 11675212   NIC2
92  11675212 11675638 ED3048
93  11675638 11675695 ED3048
94  11675695 11677084   NIC2
95  11677084 11677388   NIC2
96  11677388 11683114   NIC2
97  11683114 11685474   NIC2
98  11685474 11689877   NIC2
99  11689877 11694696   NIC2
100 11694696 11702279   NIC2
101 11702279 11703345   NIC2
102 11703345 11703916   NIC2
103 11703916 11704719   NIC2
104 11704719 11705706   NIC2
105 11705706 11714124   NIC2
106 11714124 11714678   NIC2
107 11714678 11715411   NIC2
108 11715411 11716478   NIC2
109 11716478 11717317   NIC2
110 11717317 11720168   NIC2
111 11720168 11734503   NIC2
112 11734503 11744967   NIC2
113 11744967 11759069   NIC2
114 11759069 11759607   NIC2
115 11759607 11766365   NIC2
116 11766365 11769861   NIC2
117 11769861 11769896   NIC2
118 11769896 11769916   NIC2
119 11769916 11769931   NIC2
120 11769931 11769932   NIC2
121 11769932 11769935   NIC2
122 11769935 11769994   NIC2
123 11769994 11770048   NIC2
124 11770048 11770088   NIC2
125 11770088 11770090   NIC2
126 11770090 11771234   NIC2
127 11771234 11772929   NIC2
128 11772929 11781474   NIC2
129 11781474 11781973   NIC2
130 11781973 11783884   NIC2
131 11783884 11784493   NIC2
132 11784493 11784498   NIC2
133 11784498 11784732   NIC2
134 11784732 11785308   NIC2
135 11785308 11785860   NIC2
136 11785860 11789778   NIC2
137 11789778 11792506   NIC2
138 11792506 11794567   NIC2
139 11794567 11801832   NIC2
140 11801832 11802161   NIC2
141 11802161 11802507   NIC2
142 11802507 11802508   NIC2
143 11802508 11803263   NIC2
144 11803263 11803364   NIC2
145 11803364 11803373   NIC2
146 11803373 11803568   NIC2
147 11803568 11803980   NIC2
148 11803980 11804107   NIC2
149 11804107 11804369   NIC2
150 11804369 11805042   NIC2
151 11805042 11805711   NIC2
152 11805711 11805863   NIC2
153 11805863 11806743   NIC2
154 11806743 11806942   NIC2
155 11806942 11808615   NIC2
156 11808615 11808839   NIC2
157 11808839 11809970   NIC2
158 11809970 11810603   NIC2
159 11810603 11811912   NIC2
160 11811912 11813086   NIC2
161 11813086 11820680   NIC2
162 11820680 11820771   NIC2
163 11820771 11820818   NIC2
164 11820818 11820984   NIC2
165 11820984 11821011   NIC2
166 11821011 11821360   NIC2
167 11821360 11821380   NIC2
168 11821380 11821597   NIC2
169 11821597 11823045   NIC2
170 11823045 11824456   NIC2
171 11824456 11824484   NIC2
172 11824484 11824622   NIC2
173 11824622 11825060   NIC2
174 11825060 11825674   NIC2
175 11825674 11825769   NIC2
176 11825769 11826152   NIC2
177 11826152 11826183   NIC2
178 11826183 11826192   NIC2
179 11826192 11826220   NIC2
180 11826220 11826222   NIC2
181 11826222 11826229   NIC2
182 11826229 11826236   NIC2
183 11826236 11826259   NIC2
184 11826259 11826262   NIC2
185 11826262 11826275   NIC2
186 11826275 11826284   NIC2
187 11826284 11826311   NIC2
188 11826311 11826354   NIC2
189 11826354 11826363   NIC2
190 11826363 11826366   NIC2
191 11826366 11826450   NIC2
192 11826450 11826495   NIC2
193 11826495 11826522   NIC2
194 11826522 11827132   NIC2
195 11827132 11827151   NIC2
196 11827151 11827178   NIC2
197 11827178 11827257   NIC2
198 11827257 11827281   NIC2
199 11827281 11827309   NIC2
200 11827309 11827341   NIC2
201 11827341 11827418   NIC2
202 11827418 11827450   NIC2
203 11827450 11827751   NIC2
204 11827751 11828070   NIC2
205 11828070 11828970   NIC2
206 11828970 11832662   NIC2
207 11832662 11833369   NIC2
208 11833369 11833706   NIC2
209 11833706 11833787   NIC2
210 11833787 11834531   NIC2
211 11834531 11835129   NIC2
212 11835129 11835167   NIC2
213 11835167 11836265   NIC2
214 11836265 11836393   NIC2
215 11836393 11838190   NIC2
216 11838190 11839047   NIC2
217 11839047 11840050   NIC2
218 11840050 11842764   NIC2
219 11842764 11845235   NIC2
220 11845235 11849208   NIC2
221 11849208 11855696   NIC2
222 11855696 11856301   NIC2
223 11856301 11860647   NIC2
224 11860647 11861397   NIC2
225 11861397 11875177   NIC2
226 11875177 11880848   NIC2
227 11880848 11881762   NIC2
228 11881762 11882261   NIC2
229 11882261 11887769   NIC2
230 11887769 11895586   NIC2
231 11895586 11898469   NIC2
232 11898469 11898719   NIC2
233 11898719 11900746   NIC2
234 11900746 11901060   NIC2
235 11901060 11901664   NIC2
236 11901664 11905614   NIC2
237 11905614 11905670   NIC2
238 11905670 11906209   NIC2
239 11906209 11910442   NIC2
240 11910442 11910450   NIC2
241 11910450 11912061   NIC2
242 11912061 11912249   NIC2
243 11912249 11913903   NIC2
244 11913903 11917884   NIC2
245 11917884 11919309   NIC2
246 11919309 11922775   NIC2
247 11922775 11923192   NIC2
248 11923192 11923408   NIC2
249 11923408 11924092   NIC2
250 11924092 11925352   NIC2
251 11925352 11925626   NIC2
252 11925626 11926682   NIC2
253 11926682 11928066   NIC2
254 11928066 11928440   NIC2
255 11928440 11928450   NIC2
256 11928450 11928495   NIC2
257 11928495 11928500   NIC2
258 11928500 11928528   NIC2
259 11928528 11928883   NIC2
260 11928883 11930073   NIC2
261 11930073 11931553   NIC2
262 11931553 11933250   NIC2
263 11933250 11936043   NIC2
264 11936043 11937320   NIC2
265 11937320 11937813   NIC2
266 11937813 11942138   NIC2
267 11942138 11945949   NIC2
268 11945949 11947373   NIC2
269 11947373 11949849   NIC2
270 11949849 11951251   NIC2
271 11951251 11952909   NIC2
272 11952909 11956032   NIC2
273 11956032 11956098   NIC2
274 11956098 11956192   NIC2
275 11956192 11956361   NIC2
276 11956361 11956809   NIC2
277 11956809 11957113   NIC2
278 11957113 11957238   NIC2
279 11957238 11958013   NIC2
280 11958013 11964579   NIC2
281 11964579 11964696   NIC2
282 11964696 11964715   NIC2
283 11964715 11972147   NIC2
284 11972147 11974077   NIC2
285 11974077 11974946   NIC2
286 11974946 11975462   NIC2
287 11975462 11975463   NIC2
288 11975463 11975981   NIC2
289 11975981 11977701   NIC2
290 11977701 11978314   NIC2
291 11978314 11978494   NIC2
292 11978494 11978866   NIC2
293 11978866 11980251   NIC2
294 11980251 11981137   NIC2
295 11981137 11981470   NIC2
296 11981470 11981767   NIC2
297 11981767 11981769   NIC2
298 11981769 11981786   NIC2
299 11981786 11981867   NIC2
300 11981867 11983276   NIC2
301 11983276 11983333   NIC2
302 11983333 11983494   NIC2
303 11983494 11983699   NIC2
304 11983699 11983876   NIC2
305 11983876 11983926   NIC2
306 11983926 11983968   NIC2
307 11983968 11984130   NIC2
308 11984130 11984180   NIC2
309 11984180 11984185   NIC2
310 11984185 11984277   NIC2
311 11984277 11984457   NIC2
312 11984457 11984855   NIC2
313 11984855 11986267   NIC2
314 11986267 11986269   NIC2
315 11986269 11986535   NIC2
316 11986535 11987332   NIC2
317 11987332 11989515   NIC2
318 11989515 11989615   NIC2
319 11989615 11991259   NIC2
320 11991259 11991905   NIC2
321 11991905 11991922   NIC2
322 11991922 11992083   NIC2
323 11992083 11992132   NIC2
324 11992132 11992133   NIC2
325 11992133 11992665   NIC2
326 11992665 11993396   NIC2
327 11993396 11993616   NIC2
328 11993616 11994093   NIC2
329 11994093 11994280   NIC2
330 11994280 11994287   NIC2
331 11994287 11995665   NIC2
332 11995665 11995678   NIC2
333 11995678 11995684   NIC2
334 11995684 11995716   NIC2
335 11995716 11995775   NIC2
336 11995775 11995802   NIC2
337 11995802 11995982   NIC2
338 11995982 11995997   NIC2
339 11995997 11996008   NIC2
340 11996008 11996011   NIC2
341 11996011 11996014   NIC2
342 11996014 11996018   NIC2
343 11996018 11996028   NIC2
344 11996028 11996035   NIC2
345 11996035 11996142   NIC2
346 11996142 11996284   NIC2
347 11996284 11996418   NIC2
348 11996418 11996452   NIC2
349 11996452 11998022   NIC2
350 11998022 12002709   NIC2
351 12002709 12003081   NIC2
352 12003081 12006843   NIC2
353 12006843 12007383   NIC2

2 Answers:

Answer 0: (score: 2)

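The rleid function from data.table is very handy for this kind of task. It can also be called inside a dplyr chain; with data.table itself, you can use it like this: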

library(data.table)
dt <- as.data.table(df)
dt[, .(start = min(start), end = max(end)), by = .(value, rleid(value))][
   , !"rleid", with=FALSE]
#    value    start      end
#1:   NIC2 11498007 11675212
#2: ED3048 11675212 11675695
#3:   NIC2 11675695 12007383
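
For the dplyr route, a minimal sketch might look like the following. It assumes dplyr is installed alongside data.table, calls rleid through the data.table:: prefix to avoid attaching both packages, and runs on a small made-up data frame rather than the question's data. The grouping column name run is hypothetical.

```r
library(dplyr)

# Toy data with the same columns as the question's data frame (values are made up).
df <- data.frame(
  start = c(1L, 5L, 9L, 12L, 20L),
  end   = c(5L, 9L, 12L, 20L, 31L),
  value = c("A", "A", "B", "A", "A"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # rleid() gives every consecutive run of `value` its own integer id
  group_by(value, run = data.table::rleid(value)) %>%
  summarise(start = min(start), end = max(end)) %>%
  ungroup() %>%
  arrange(run) %>%            # put the runs back in their original order
  select(start, end, value)

print(res)
```

Grouping by both value and the run id keeps the value column in the result; the arrange(run) step is only needed if the output should follow the order of the input rows.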

Answer 1: (score: 0)

Borrowing the rle trick seen here, you can do this:

library(dplyr)

df$value <- as.character(df$value)

df %>%
  group_by(cont_run_val = paste(value, 
                                {tmp = rle(value); rep(seq_along(tmp$lengths), tmp$lengths)},
                                sep = "_")) %>%
  summarize(min_start = min(start),
            max_end = max(end))

#   cont_run_val min_start  max_end
#          (chr)     (int)    (int)
# 1     ED3048_2  11675212 11675695
# 2       NIC2_1  11498007 11675212
# 3       NIC2_3  11675695 12007383
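
Note that the result above is sorted by the pasted key, so the ED3048 run is listed first. As a rough illustration of how the rle trick builds the run index, and of one way to keep the runs in their original order, here is a base R sketch on made-up data. The names runs, run_id and res are hypothetical, and the sketch assumes value is a character vector, since rle does not accept factors.

```r
# Toy data with the same columns as the question's data frame (values are made up).
df <- data.frame(
  start = c(1L, 5L, 9L, 12L, 20L),
  end   = c(5L, 9L, 12L, 20L, 31L),
  value = c("A", "A", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# rle() returns the length and value of each consecutive run. Repeating
# 1, 2, 3, ... by those lengths labels every row with the index of its run.
runs   <- rle(df$value)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Aggregate per run: smallest start, largest end, and the run's value.
# tapply() orders its groups by run_id, which matches the order of runs$values.
res <- data.frame(
  start = as.vector(tapply(df$start, run_id, min)),
  end   = as.vector(tapply(df$end, run_id, max)),
  value = runs$values,
  stringsAsFactors = FALSE
)

print(res)
```

Because the run index is an integer that increases along the rows, grouping on it directly avoids the alphabetical reordering that comes from pasting it onto the value.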