Elegantly parsing a huge text table with R

Time: 2018-10-31 16:09:20

Tags: python r bash

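I have a huge tab-delimited table with 3 columns that contains an in-depth analysis of hamburgers. Here is a small portion of the table (apologies for the delimiter).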

NAME@FEATURE@DESCRIPTION
Kburger@supplier@Jackobs12
Kburger@type@Minkins Carnivorous
Kburger@deposit@http://minkins.sord/deposit/Kburger
Kburger@opinions@shitty::very gummy|shitty::grey
Kburger@code@HB3T2323|Tried just once; IT IS: disgusting…
Aburger@supplier@Mark11
Aburger@type@Minkins Carnivorous
Aburger@deposit@http://minkins.sord/deposit/Aburger
Aburger@opinions@shitty::very gummy
Cburger@supplier@David22
Cburger@type@Minkins Carnivorous
Cburger@deposit@http://minkins.sord/deposit/Cburger
Cburger@opinions@shitty::very gummy|slightly hard::never again
Dburger@supplier@Jason99
Dburger@type@Minkins Carnivorous
Dburger@deposit@http://minkins.sord/deposit/Dburger
Dburger@opinions@shitty::very gummy
Eburger@supplier@CarlX
Eburger@type@Minkins Vegetarian
Eburger@deposit@http://minkins.sord/deposit/Eburger
Eburger@opinions@shitty::very gummy|shitty::abnormal
Fburger@supplier@Dino12
Fburger@type@Minkins Vegetarian
Fburger@deposit@http://minkins.sord/deposit/Fburger
Fburger@opinions@shitty::very gummy
Fburger@code@HB3T5554|Reminds me about vomit (N3VER); MEMORIES: No other comments...

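As you can see, some burger names have 5 features instead of 4: the extra one is the "code" feature (its value always starts with HB), which is missing in many cases. What I want is to parse this table with R and obtain a 3-column tab-delimited summary table with one row per code, holding the code, the name and the opinions, like this: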

CODE    NAME    DESCRIPTION
HB3T2323    Kburger shitty::very gummy|shitty::grey
HB3T5554    Fburger shitty::very gummy

Here is my crude bash solution, which I am quite ashamed of:

cat table.txt  | tr '@' '\t' | grep "HB" | cut -d '|' -f1,3 | cut -f1,3 | while read i; do echo $(echo $i | awk '{print $2}') && grep $(echo $i | awk '{print $1}') table.txt | grep "opinions" ; done  | tr '@' '\t' | paste - - | cut -f1,2,4 > new_table.txt

If you can also provide a solution in bash or python, that would be appreciated too, but honestly I prefer R.

1 answer:

Answer 0 (score: 1)

Here is a potential R solution using the dplyr and tidyr packages.

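library(tidyr)
library(dplyr)

# Read the tab-delimited input; keep text columns as character
# (for the "@"-delimited sample shown in the question, add sep = "@")
df <- read.delim(file = "PATH/TO/TAB/DELIMITED/FILE/HERE.txt", stringsAsFactors = FALSE)

df <- df %>%
      # One row per NAME, one column per FEATURE
      spread(key = "FEATURE", value = "DESCRIPTION") %>%
      # Keep only the part of "code" before the first "|"
      separate(col = "code", into = c("code", "code_desc"), sep = "\\|", extra = "merge", fill = "right") %>%
      # Drop burgers that have no code
      filter(!is.na(code)) %>%
      select("CODE" = code, NAME, "DESCRIPTION" = opinions)

# quote = FALSE keeps the output free of surrounding double quotes
write.table(df, "PATH/TO/TAB/DELIMITED/OUTPUT/FILE/HERE.txt", sep = "\t", na = "", quote = FALSE, row.names = FALSE)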
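
Since the question also asks for a Python alternative, here is a minimal sketch of the same idea using only the standard library: group the features by burger name, keep the burgers that have a "code" feature, and take the part of the code before the first "|". The function name summarize and the shortened inline sample are assumptions made for illustration. The sample uses "@" as the delimiter, as in the excerpt in the question; for the real tab-delimited file, keep the default delimiter.

```python
import csv
import io


def summarize(lines, delimiter="\t"):
    """Return (code, name, opinions) tuples for every burger with a 'code' feature.

    `lines` is any iterable of text lines whose header is NAME, FEATURE, DESCRIPTION.
    """
    reader = csv.DictReader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)

    # name -> {feature: description}; dicts keep first-seen order of names
    features = {}
    for row in reader:
        features.setdefault(row["NAME"], {})[row["FEATURE"]] = row["DESCRIPTION"]

    summary = []
    for name, feats in features.items():
        if "code" in feats:
            # The code is the text before the first "|" in the description
            code = feats["code"].split("|", 1)[0]
            summary.append((code, name, feats.get("opinions", "")))
    return summary


# Shortened copy of the sample from the question ("@"-delimited)
sample = """NAME@FEATURE@DESCRIPTION
Kburger@supplier@Jackobs12
Kburger@opinions@shitty::very gummy|shitty::grey
Kburger@code@HB3T2323|Tried just once; IT IS: disgusting…
Aburger@supplier@Mark11
Aburger@opinions@shitty::very gummy
Fburger@opinions@shitty::very gummy
Fburger@code@HB3T5554|Reminds me about vomit (N3VER); MEMORIES: No other comments...
"""

rows = summarize(io.StringIO(sample), delimiter="@")

# Print the summary as a tab-delimited table
print("\t".join(("CODE", "NAME", "DESCRIPTION")))
for row in rows:
    print("\t".join(row))
```

To process a real file, an open file handle can be passed instead of the StringIO object, for example summarize(open("table.txt", encoding="utf-8")). The whole table is held in memory as a dictionary, which may or may not be acceptable depending on how huge the file is.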