Grouping and summarizing

Time: 2016-09-26 14:44:52

Tags: r

For research purposes, I need to process data from a csv table. The table looks like this:

    Frame Nr. 0      frame_type  I_frame
    Frame Nr. 1      frame_type  P_frame
    Frame Nr. 2      frame_type  P_frame
    Frame Nr. 3      frame_type  B_frame
    Frame Nr. 4      frame_type  P_frame
    Frame Nr. 5      frame_type  P_frame
    Frame Nr. 6      frame_type  B_frame
    Frame Nr. 7      frame_type  P_frame
    Frame Nr. 8      frame_type  P_frame
    Frame Nr. 9      frame_type  I_frame
    Frame Nr. 10     frame_type  P_frame
    Frame Nr. 11     frame_type  P_frame
    Frame Nr. 12     frame_type  P_frame
    Frame Nr. 13     frame_type  I_frame
    Frame Nr. 14     frame_type  P_frame
    Frame Nr. 15     frame_type  P_frame
    Frame Nr. 16     frame_type  B_frame
    Frame Nr. 17     frame_type  P_frame
    Frame Nr. 18     frame_type  P_frame
    Frame Nr. 19     frame_type  P_frame
    Frame Nr. 20     frame_type  P_frame
    Frame Nr. 21     frame_type  I_frame
    Frame Nr. 22     frame_type  P_frame
    Frame Nr. 23     frame_type  P_frame
    Frame Nr. 24     frame_type  P_frame
    Frame Nr. 25     frame_type  I_frame
    ...

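I would like R to first group the frames so that each group starts at an I_frame, and then count the P frames and B frames that follow until the next I_frame. For this example, my R program should produce a result like this: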

I2PB2PB2P I3P I2PB4P I3P ...

Is there a way to do this in R?

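1 answer:

Answer 0: (score: 1)

Edited from a previous wrong answer, and borrowing rle from @akron, you can do it like this. Assuming your data is in a data frame named "df" and your frame classes are in a column named "frame_class", as in the code below, this should work: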

    df <- data.frame(n_frame = 1:13, frame_type = "frame_type",
                     frame_class = c("I_frame", "P_frame", "P_frame", "B_frame", "P_frame", "P_frame",
                                     "B_frame", "I_frame", "B_frame", "P_frame", "I_frame", "P_frame", "I_frame"),
                     stringsAsFactors = FALSE)
    df$frame_letter <- substring(df$frame_class, 1, 1)  # keep only the first letter

    # Find the positions of the I_frames
    where_i <- which(df$frame_class == "I_frame")
    num_i <- length(where_i)
    out_codes <- list()

    # Loop over the "sandwiches" between consecutive I_frames
    for (ind_i in seq_len(max(num_i - 1, 0))) {
      start <- where_i[ind_i]
      end <- where_i[ind_i + 1]
      # Frames strictly between the two I_frames (empty if they are adjacent)
      sub_data <- df$frame_letter[start + seq_len(end - start - 1)]
      count_reps <- rle(sub_data)  # find the repetition pattern

      # Build the code; if a letter occurs only once, don't add "1" to the string
      counts <- ifelse(count_reps$lengths == 1, "", count_reps$lengths)
      out_codes[[ind_i]] <- paste0("I", paste0(counts, count_reps$values, collapse = ""))
    }
    out_codes

which gives:

    > out_codes
    [[1]]
    [1] "I2PB2PB"

    [[2]]
    [1] "IBP"

    [[3]]
    [1] "IP"
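
For reference, the counting inside each sandwich is done by `rle` (run-length encoding, from base R). A small sketch on the letters between the first and second I_frame of the example data; the variable names here are only illustrative:

```r
# Letters between the first and second I_frame in the example data
letters_between <- c("P", "P", "B", "P", "P", "B")

runs <- rle(letters_between)
runs$lengths  # how many times each letter repeats in a row: 2 1 2 1
runs$values   # the letter of each run: "P" "B" "P" "B"
```

Pasting each length (skipping the 1s) in front of its value is what turns these runs into "2PB2PB".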

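Note that this is very quick and dirty: at the very least you should implement some checks to make sure that the series always starts and ends with an "I_frame", but this may point you in the right direction...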
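
A minimal sketch of such a check, assuming the frame classes are available as a character vector (or factor). The helper name `check_i_bounds` is hypothetical and not part of the code above:

```r
# Hypothetical helper: stop early if the series does not begin and end
# with an I_frame.
check_i_bounds <- function(frame_class) {
  frame_class <- as.character(frame_class)
  n <- length(frame_class)
  if (n == 0) stop("no frames supplied")
  if (frame_class[1] != "I_frame") stop("series does not start with an I_frame")
  if (frame_class[n] != "I_frame") stop("series does not end with an I_frame")
  invisible(TRUE)
}

check_i_bounds(c("I_frame", "P_frame", "I_frame"))  # passes silently
```

It could be called as `check_i_bounds(df$frame_class)` before running the loop.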

Also note that this could be slow for large datasets.
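
One possible alternative, sketched here as an assumption rather than a tested replacement, is to label the groups with `cumsum` and encode each group with `tapply` and `rle`, which avoids the explicit loop and the index bookkeeping. The helper name `encode_groups` is hypothetical, and whether it is actually faster would need to be measured on your data:

```r
# Hypothetical helper: one code per group of frames, where a new group
# starts at every I_frame.
encode_groups <- function(frame_class) {
  letter <- substring(as.character(frame_class), 1, 1)
  group <- cumsum(letter == "I")  # group id increases at each I_frame
  keep <- group > 0               # ignore frames before the first I_frame
  codes <- tapply(letter[keep], group[keep], function(x) {
    runs <- rle(x)
    counts <- ifelse(runs$lengths == 1, "", runs$lengths)  # omit a count of 1
    paste0(counts, runs$values, collapse = "")
  })
  unname(as.vector(codes))
}

frames <- c("I_frame", "P_frame", "P_frame", "B_frame", "P_frame", "P_frame",
            "B_frame", "I_frame", "B_frame", "P_frame", "I_frame", "P_frame",
            "I_frame")
encode_groups(frames)
```

Unlike the loop, this sketch also returns a code for the frames after the last I_frame (here a lone "I"), so drop the final element if only complete sandwiches are wanted.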

Lorenzo