Adding rows for skipped stages in a process

Time: 2019-04-11 06:40:57

Tags: r dataframe analytics

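I want to take a data frame in R and expand it based on what I see in two columns, V1 and V2. In short, I have stages S1-S6, stored as strings.

For every row with a gap between its stages, I need to add rows. Looking at the data frames below: if I see "S 3" and "S 3" in the same row, nothing needs to be done. Likewise, if I see "S 3" and "S 4" in the same row, nothing needs to be done.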
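
To make the rule concrete, here is a minimal base R sketch of the "gap" test. The helper names `stage_number` and `has_gap` are hypothetical and not part of the original data. It assumes every stage string contains exactly one number.

```r
# Hypothetical helpers (names are assumptions, not from the original post).
# Extract the numeric part of a stage label such as "S 3".
stage_number <- function(x) as.integer(gsub("[^0-9]", "", x))

# A row has a gap when its two stages are more than one step apart,
# in either direction.
has_gap <- function(v1, v2) abs(stage_number(v2) - stage_number(v1)) > 1

has_gap("S 3", "S 3")  # FALSE: same stage, nothing to add
has_gap("S 3", "S 4")  # FALSE: adjacent stages, nothing to add
has_gap("S 2", "S 5")  # TRUE: S 3 and S 4 were skipped
has_gap("S 5", "S 3")  # TRUE: S 4 was skipped on the way down
```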

Example 1

Input:

----------------------------------
|Var1               | V1   | V2  |    
----------------------------------
|0060a00000fUbAnAAK |'S 2' |'S 5'|
----------------------------------

Output:

----------------------------------
|Var1               | V1   | V2  |    
----------------------------------
|0060a00000fUbAnAAK |'S 2' |'S 3'|
----------------------------------
|0060a00000fUbAnAAK |'S 3' |'S 4'|
----------------------------------
|0060a00000fUbAnAAK |'S 4' |'S 5'|
----------------------------------

Example 2

Input:

----------------------------------
|Var1               | V1   | V2  |    
----------------------------------
|0060a00000fUbAnAAK |'S 5' |'S 3'|
----------------------------------

Output:

----------------------------------
|Var1               | V1   | V2  |    
----------------------------------
|0060a00000fUbAnAAK |'S 5' |'S 4'|
----------------------------------
|0060a00000fUbAnAAK |'S 4' |'S 3'|
----------------------------------

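2 answers:

Answer 0: (score: 0)

An idea using the tidyverse is to convert to long format, separate the number from the S, and complete the sequence. Once we have that, we paste the columns back together (to rebuild the S values) and convert back to wide format. Finally, we take the lag of V1 and remove the NAs, i.e.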

library(tidyverse)

df %>% 
 gather(var, val, -1) %>% 
 separate(val, into = c('char', 'number'), sep = ' ') %>% 
 mutate(number = as.numeric(number)) %>% 
 complete(nesting(var, Var1, char), number = full_seq(min(number):max(number), 1)) %>%
 unite('V1_2', c('char', 'number'), sep = ' ') %>% 
 group_by(var) %>% 
 mutate(new = row_number()) %>% 
 spread(var, V1_2) %>% 
 mutate(V1 = lag(V1)) %>% 
 na.omit() %>% 
 select(-new)

which gives,

# A tibble: 3 x 3
   Var1  V1    V2   
  <chr> <chr> <chr>
1 xxx   S 2   S 3  
2 xxx   S 3   S 4  
3 xxx   S 4   S 5 
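
The lag step is easier to see on a bare vector. The following is a sketch of the same idea in base R rather than the pipeline above: once the full stage sequence exists, each stage is paired with its successor. The variable names `stages` and `pairs` are illustrative assumptions.

```r
# Full sequence of stages between S 2 and S 5 (the completed sequence).
stages <- paste("S", 2:5)   # "S 2" "S 3" "S 4" "S 5"

# Pair each stage with the next one: dropping the last element gives V1,
# dropping the first gives V2. This mirrors lagging V1 and removing the NA row.
pairs <- data.frame(
  V1 = head(stages, -1),
  V2 = tail(stages, -1),
  stringsAsFactors = FALSE
)
pairs
#    V1  V2
# 1 S 2 S 3
# 2 S 3 S 4
# 3 S 4 S 5
```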

Answer 1: (score: 0)

Updated answer

This update also handles decreasing stages.

Sample data

library(data.table)
DT <- fread("Var1               | V1   | V2
  0060a00000fUbAnAAK |S 2 |S 5
  0060a00000fUbAnAAK_ |S 5 |S 3")

#                   Var1  V1  V2
# 1:  0060a00000fUbAnAAK S 2 S 5
# 2: 0060a00000fUbAnAAK_ S 5 S 3

Code

#determine order of stages
DT[ as.numeric( gsub("[^0-9]", "", V2 ) ) < as.numeric( gsub("[^0-9]", "", V1 ) ), order := "desc" ]
DT[ is.na( order) , order := "asc" ]
#melt DT to long format
DT <- melt( DT, id.vars = c("Var1","order"), value.name = "stage")
#get stage as numeric and clean up unwanted columns
DT[, `:=`(stage = as.numeric( gsub("[^0-9]", "", stage)))]
#create new stages based on minimum and maximum stage per Var1-value
#use different methods for ascending and descending stages, then bind the rows together
rbind(
  DT[order == "asc", .( V1 = paste0( "S ", min(stage): (max(stage) - 1 ) ), 
                        V2 = paste0( "S ", (min(stage)+1):max(stage) ) ), by = .(Var1)],
  DT[order == "desc", .( V1 = paste0( "S ", max(stage): (min(stage) + 1 ) ), 
                         V2 = paste0( "S ", (max(stage)-1):min(stage) ) ), by = .(Var1)]
)

Output

#                   Var1  V1  V2
# 1:  0060a00000fUbAnAAK S 2 S 3
# 2:  0060a00000fUbAnAAK S 3 S 4
# 3:  0060a00000fUbAnAAK S 4 S 5
# 4: 0060a00000fUbAnAAK_ S 5 S 4
# 5: 0060a00000fUbAnAAK_ S 4 S 3
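
For comparison, here is a base R sketch that handles both directions in one pass and leaves rows without a gap untouched, as the question asks. It is not the data.table approach above. The function name `expand_stages` is hypothetical, and the sketch assumes the input has exactly the character columns Var1, V1 and V2. It relies on `seq(from, to)` counting downwards when `from` is larger than `to`.

```r
# Hypothetical helper: expand every row whose stages are more than one step apart.
# Assumes df has exactly the character columns Var1, V1 and V2.
expand_stages <- function(df) {
  num <- function(x) as.integer(gsub("[^0-9]", "", x))
  rows <- lapply(seq_len(nrow(df)), function(i) {
    from <- num(df$V1[i])
    to   <- num(df$V2[i])
    # Same or adjacent stages: keep the row unchanged.
    if (abs(to - from) <= 1) return(df[i, , drop = FALSE])
    # seq() runs upwards or downwards depending on the endpoints.
    steps <- seq(from, to)
    data.frame(
      Var1 = df$Var1[i],
      V1 = paste("S", head(steps, -1)),
      V2 = paste("S", tail(steps, -1)),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

df <- data.frame(
  Var1 = c("a", "b", "c"),
  V1   = c("S 2", "S 5", "S 3"),
  V2   = c("S 5", "S 3", "S 3"),
  stringsAsFactors = FALSE
)
expand_stages(df)
#   Var1  V1  V2
# 1    a S 2 S 3
# 2    a S 3 S 4
# 3    a S 4 S 5
# 4    b S 5 S 4
# 5    b S 4 S 3
# 6    c S 3 S 3
```

A row-by-row `lapply` is slow on large tables, so this is meant to show the logic and is no substitute for the data.table version.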