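I want to take a data frame in R and expand it based on what I see in the two columns V1 and V2. In short, I have stages S1-S6, stored as strings.
For every row where there is a gap between the stages, I need to add rows. Looking at the data frames below: if I see "S 3" and "S 3" in the same row, nothing needs to be done. Likewise, if I see "S 3" and "S 4" in the same row, nothing needs to be done.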
Input:
----------------------------------
|Var1 | V1 | V2 |
----------------------------------
|0060a00000fUbAnAAK |'S 2' |'S 5'|
----------------------------------
Output:
----------------------------------
|Var1 | V1 | V2 |
----------------------------------
|0060a00000fUbAnAAK |'S 2' |'S 3'|
----------------------------------
|0060a00000fUbAnAAK |'S 3' |'S 4'|
----------------------------------
|0060a00000fUbAnAAK |'S 4' |'S 5'|
----------------------------------
Input:
----------------------------------
|Var1 | V1 | V2 |
----------------------------------
|0060a00000fUbAnAAK |'S 5' |'S 3'|
----------------------------------
Output:
----------------------------------
|Var1 | V1 | V2 |
----------------------------------
|0060a00000fUbAnAAK |'S 5' |'S 4'|
----------------------------------
|0060a00000fUbAnAAK |'S 4' |'S 3'|
----------------------------------
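Answer 0 (score: 0)
One idea using tidyverse is to convert the data to long format, separate the number from the S, and complete the sequence. Once we have that, we paste the columns back together (the S and the values) and convert back to wide format. Finally, we take the lag of V1 and remove the NA rows, i.e.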
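library(tidyverse)
df %>%
  # long format: one row per (Var1, var, val)
  gather(var, val, -1) %>%
  # split 'S 2' into the letter and the stage number
  separate(val, into = c('char', 'number'), sep = ' ') %>%
  mutate(number = as.numeric(number)) %>%
  # fill in every stage between the smallest and the largest one
  complete(nesting(var, Var1, char), number = full_seq(min(number):max(number), 1)) %>%
  # paste letter and number back together
  unite('V1_2', c('char', 'number'), sep = ' ') %>%
  group_by(var) %>%
  mutate(new = row_number()) %>%
  # back to wide format
  spread(var, V1_2) %>%
  # shift V1 down by one so each row reads (previous stage, next stage)
  mutate(V1 = lag(V1)) %>%
  na.omit() %>%
  select(-new)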
which gives,
# A tibble: 3 x 3
  Var1  V1    V2
  <chr> <chr> <chr>
1 xxx   S 2   S 3
2 xxx   S 3   S 4
3 xxx   S 4   S 5
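The final "lag and drop NA" step can be illustrated on its own in base R. This is only a minimal sketch of the idea, not the answer's pipeline; the names `stages` and `pairs` are made up for illustration, and the input is assumed to be a single ascending run from stage 2 to stage 5.

```r
# Assumed input: one ascending run of stages, already completed (2, 3, 4, 5)
stages <- 2:5

# Every stage is a candidate "to" value
V2 <- paste("S", stages)

# Lag by one: the "from" value of each row is the previous stage
V1 <- c(NA, head(V2, -1))

# The first row has no predecessor, so it is dropped
pairs <- na.omit(data.frame(V1 = V1, V2 = V2, stringsAsFactors = FALSE))
print(pairs)
```

Each remaining row pairs a stage with its successor, which is why the lag is applied only after the sequence has been completed.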
Answer 1 (score: 0)
This update also takes decreasing stages into account.
Sample data
library(data.table)
DT <- fread("Var1 | V1 | V2
0060a00000fUbAnAAK |S 2 |S 5
0060a00000fUbAnAAK_ |S 5 |S 3")
# Var1 V1 V2
# 1: 0060a00000fUbAnAAK S 2 S 5
# 2: 0060a00000fUbAnAAK_ S 5 S 3
Code
#determine order of stages
DT[ as.numeric( gsub("[^0-9]", "", V2 ) ) < as.numeric( gsub("[^0-9]", "", V1 ) ), order := "desc" ]
DT[ is.na( order) , order := "asc" ]
#melt DT to long format
DT <- melt( DT, id.vars = c("Var1","order"), value.name = "stage")
#get stage as numeric and clean up unwanted columns
DT[, `:=`(stage = as.numeric( gsub("[^0-9]", "", stage)))]
#create new stages based on minimum and maximum stage per Var1-value
#use different methods for ascending and descending stages, then bind the rows together
rbind(
DT[order == "asc", .( V1 = paste0( "S ", min(stage): (max(stage) - 1 ) ),
V2 = paste0( "S ", (min(stage)+1):max(stage) ) ), by = .(Var1)],
DT[order == "desc", .( V1 = paste0( "S ", max(stage): (min(stage) + 1 ) ),
V2 = paste0( "S ", (max(stage)-1):min(stage) ) ), by = .(Var1)]
)
# Var1 V1 V2
# 1: 0060a00000fUbAnAAK S 2 S 3
# 2: 0060a00000fUbAnAAK S 3 S 4
# 3: 0060a00000fUbAnAAK S 4 S 5
# 4: 0060a00000fUbAnAAK_ S 5 S 4
# 5: 0060a00000fUbAnAAK_ S 4 S 3
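The same expansion can also be sketched in base R, row by row, without reshaping. This is a rough sketch under stated assumptions rather than a replacement for the code above: `expand_stages` is a hypothetical helper, the stage labels are assumed to follow the "S <number>" pattern, and the Var1 values "a", "b", "c" are made-up sample data. It also returns rows unchanged when the two stages are equal or adjacent, as the question asks.

```r
# Hypothetical helper: expand one (V1, V2) pair into consecutive steps
expand_stages <- function(id, v1, v2) {
  from <- as.numeric(gsub("[^0-9]", "", v1))
  to   <- as.numeric(gsub("[^0-9]", "", v2))

  # Same or adjacent stages: no gap to fill, keep the row as it is
  if (abs(to - from) <= 1) {
    return(data.frame(Var1 = id, V1 = v1, V2 = v2, stringsAsFactors = FALSE))
  }

  # seq() counts up or down depending on the direction
  steps <- seq(from, to)
  n <- length(steps)
  data.frame(Var1 = id,
             V1 = paste("S", steps[-n]),   # all but the last stage
             V2 = paste("S", steps[-1]),   # all but the first stage
             stringsAsFactors = FALSE)
}

# Made-up sample data: ascending gap, descending gap, no gap
df <- data.frame(Var1 = c("a", "b", "c"),
                 V1   = c("S 2", "S 5", "S 3"),
                 V2   = c("S 5", "S 3", "S 3"),
                 stringsAsFactors = FALSE)

result <- do.call(rbind, Map(expand_stages, df$Var1, df$V1, df$V2))
rownames(result) <- NULL
print(result)
```

Looping over rows is slower than the grouped data.table approach on large inputs, but it keeps the handling of each direction and of the no-gap case in one place.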