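Map rows based on column values

Asked: 2017-03-06 12:27:32

Tags: python r merge

If the log file were a CSV, this would be easy to do with merge in R / Python.

However, the log file is written in the following syntax: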

Key=1|Time=146656456446
Key=2|Time=146656456447
Key=1|Time=146656456448|field=10
Key=2|Time=146656456450|field=11

Is there a way to merge these rows by key and get the time difference, in the following form?

Key,Time1,Time2,diff,field
Key=1,146656456446,146656456448,2,10
Key=2,146656456447,146656456450,3,11

2 Answers:

Answer 0 (score: 1)

Converting my comment to an answer, here is an approach using the "data.table" package.

library(data.table)
x <- "path/to/yourLogFile.txt"      
mydt <- fread(x, header = FALSE, col.names = c("Key", "Time"))

dcast(mydt[, Time := as.numeric(sub("Time=", "", Time))][
  , Ind := sequence(.N), Key], Key ~ Ind, value.var = "Time")[
    , Diff := `2` - `1`][]
#      Key            1            2 Diff
# 1: Key=1 146656456446 146656456448    2
# 2: Key=2 146656456447 146656456450    3

Another, similar approach using my "splitstackshape" package, with the same steps for reading the data, might look like this:

library(splitstackshape)
dcast(getanID(cSplit(mydt, "Time", "="), "Key"), 
      Key ~ Time_1 + .id, value.var = "Time_2")[
        , Diff := Time_2 - Time_1, by = Key][]
#      Key       Time_1       Time_2 Diff
# 1: Key=1 146656456446 146656456448    2
# 2: Key=2 146656456447 146656456450    3

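To read the log file, I have made the following assumptions:

  • You know that two columns are expected.
  • Your log file currently has no column names (hence header = FALSE).
  • You expect the data to be delimited by the | character, which fread can detect automatically.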
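
Since the question is tagged with Python as well as R, the same reading step can be sketched with the Python standard library alone. This is only an illustration under the assumptions listed above; the helper name parse_log_line is made up for this example, and field names are lowercased because the sample log mixes "Key"/"Time" with "field".

```python
def parse_log_line(line):
    """Split one 'Key=1|Time=146656456446|field=10' line into a dict.

    Hypothetical helper, not part of any library. Field names are
    lowercased so that 'Time' and 'field' are handled uniformly.
    """
    record = {}
    for part in line.strip().split("|"):
        if not part:
            continue  # tolerate empty segments such as a trailing "|"
        name, _, value = part.partition("=")
        record[name.strip().lower()] = value.strip()
    return record


# Sample lines copied from the question
sample = [
    "Key=1|Time=146656456446",
    "Key=2|Time=146656456447",
    "Key=1|Time=146656456448|field=10",
    "Key=2|Time=146656456450|field=11",
]
records = [parse_log_line(line) for line in sample]
```

Lines without a third segment simply produce a dict with no "field" entry, which plays the same role as fill = TRUE in fread.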

Update

It isn't pretty, but it works...

dcast(getanID(cSplit(mydt, names(mydt), "="), "Key_2"), 
      Key_2 ~ .id, fun=list(I, I), value.var = list("Field_2", "Time_2"), fill = 0)[
        , c("Field_2_I_1", "Diff") := list(NULL, Time_2_I_2 - Time_2_I_1)][]
##    Key_2 Field_2_I_2   Time_2_I_1   Time_2_I_2 Diff
## 1:     1          10 146656456446 146656456448    2
## 2:     2          11 146656456447 146656456450    3

Sample data

## Just to simulate a log file like the one you describe....
## "temp" would be your actual file....
x <- c("Key=1|Time=146656456446", "Key=2|Time=146656456447", 
       "Key=1|Time=146656456448|field=10", "Key=2|Time=146656456450|field=11")
temp <- tempfile() 
writeLines(x, temp)

mydt <- fread(temp, header = FALSE, fill = TRUE, 
              col.names = c("Key", "Time", "Field"))
mydt
##      Key              Time    Field
## 1: Key=1 Time=146656456446         
## 2: Key=2 Time=146656456447         
## 3: Key=1 Time=146656456448 field=10
## 4: Key=2 Time=146656456450 field=11
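
The reshaping in this answer (one row per key, the two times side by side, their difference, and the optional field) can also be sketched in plain Python. This is a rough equivalent rather than a translation of the data.table code; the function name pair_events and the variable names are invented for the example, and it assumes every key appears exactly twice, as in the sample log.

```python
import csv
import io

# Sample lines copied from the question
log_lines = [
    "Key=1|Time=146656456446",
    "Key=2|Time=146656456447",
    "Key=1|Time=146656456448|field=10",
    "Key=2|Time=146656456450|field=11",
]


def pair_events(lines):
    """Build [Key, Time1, Time2, diff, field] rows from raw log lines."""
    times = {}   # key -> list of integer times, in file order
    fields = {}  # key -> value of 'field', if any line carried one
    for line in lines:
        rec = dict(part.split("=", 1) for part in line.strip().split("|"))
        key = rec["Key"]
        times.setdefault(key, []).append(int(rec["Time"]))
        if "field" in rec:
            fields[key] = rec["field"]
    rows = []
    for key, ts in times.items():
        if len(ts) != 2:
            continue  # assumption: exactly two events per key
        rows.append(["Key=" + key, ts[0], ts[1], ts[1] - ts[0],
                     fields.get(key, "")])
    return rows


rows = pair_events(log_lines)

# Write the rows in the CSV layout requested in the question
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["Key", "Time1", "Time2", "diff", "field"])
writer.writerows(rows)
csv_text = out.getvalue()
```

Keys with more or fewer than two events are skipped here; whether to skip, raise an error, or emit several differences is a design choice the question leaves open.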

Answer 1 (score: 0)

If you don't need the times in separate columns, the following will work:

library(tidyverse)
library(data.table)

df <- read_table(
"test       
Key=1|Time=146656456446  
Key=2|Time=146656456447  
Key=1|Time=146656456448  
Key=2|Time=146656456450" )

Split the string on "|", then on "=", to get the numbers:

df <-
df %>% 
  separate(test, into = c("Key", "Time"), sep = "\\|") %>% 
  separate(Time, into = c("Timepoint", "Time"), sep = "=")

df
# A tibble: 4 × 3
    Key Timepoint         Time
* <chr>     <chr>        <chr>
1 Key=1      Time 146656456446
2 Key=2      Time 146656456447
3 Key=1      Time 146656456448
4 Key=2      Time 146656456450

Convert Time to numeric, then group by Key to compute the difference:

df$Time <- as.numeric(df$Time)

df <-
df %>% 
  group_by(Key) %>% 
  summarise(Diff = diff(Time))

df
# A tibble: 2 × 2
    Key  Diff
  <chr> <dbl>
1 Key=1     2
2 Key=2     3
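
Note that diff() returns one value fewer than it is given, so the summarise() step above yields a single number per key only because each key appears exactly twice. A minimal Python sketch of the same group-then-difference idea, which keeps every successive difference when a key has more than two events, might look like this (diffs_by_key is a name made up for the example):

```python
from collections import defaultdict


def diffs_by_key(pairs):
    """Return {key: [t2 - t1, t3 - t2, ...]} for (key, time) pairs."""
    times = defaultdict(list)
    for key, t in pairs:
        times[key].append(t)
    return {
        key: [later - earlier for earlier, later in zip(ts, ts[1:])]
        for key, ts in times.items()
    }


# (Key, Time) pairs taken from the sample log, in file order
events = [
    ("Key=1", 146656456446),
    ("Key=2", 146656456447),
    ("Key=1", 146656456448),
    ("Key=2", 146656456450),
]
result = diffs_by_key(events)
```

A key that occurs only once maps to an empty list, which makes incomplete pairs easy to spot.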