Data frame processing

Time: 2016-07-21 10:32:16

Tags: r dataframe bioinformatics

I have a data frame, which I read with Match <- read.table("Match.txt", sep="", fill =T, stringsAsFactors = FALSE, quote = "", header = F), and it looks like this:

> ab
           V1       V2  V3                       V4 V5    V6 V7    V8 V9               V10
1  Inspecting sequence  ID chr1:173244300-173244500       NA       NA                     
2   V$ATF3_Q6        |  19                      (-)  | 0.877  | 0.622  |    aagtccCATCAggg
3   V$ATF3_Q6        |  34                      (-)  | 0.788  | 0.655  |    agggaaCGACAcag
4   V$ATF3_Q6        | 102                      (+)  | 0.738  | 0.685  |    cccTGAGCttagga
5  V$CEBPB_01        |  24                      (+)  | 0.950  | 0.882  |    ccatcagGGAAGgg
72   V$YY1_01        | 117                      (+)  | 0.996  | 0.984  | acttCCCATcttttaag
73 Inspecting sequence  ID chr1:173244350-173244550       NA       NA                     
74  V$ATF3_Q6        |  52                      (+)  | 0.738  | 0.685  |    cccTGAGCttagga
75  V$ATF3_Q6        | 160                      (+)  | 0.862  | 0.687  |    gtcTGACCtggaga
76 V$CEBPB_01        |  57                      (+)  | 0.966  | 0.958  |    agcttagGAAACtt

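It contains millions of such repeats, where the first line is: Inspecting sequence ID chr1:173244300-173244500, followed by some values as shown above. I want to process it with the following in mind: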

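  1. Extract the first line and split it on : and -, so that I get three columns: chr1 173244300 173244500
  2. The 4th column should come from V1: take the first element of row 2 (V$ATF3_Q6), split it on $ and _, and keep the 2nd piece (ATF3). I have 30 such identifiers (let's call them names); in each case some will be observed and some will not (case 1 runs from row 1 to row 72, case 2 starts at row 73).
  3. If a name appears in a case, its column is assigned the value B; otherwise it is assigned U.
  4. So, based on my input, I would like to get the following output: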

    chr     start       stop        ATF3  CEBPB  YY1    ..(All which appear e.g from row 1 to 72, ignoring duplicates)
    chr1    173244300   173244500   B     B      B  
    chr1    173244350   173244550   B     B      U
    

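    I want to fix the number of columns in the header (I know there are 32 such names), so that if a name appears in a case B is assigned, and otherwise U is assigned.

    It would be a great help if someone could help me do this.

    Here is the input for this sample data frame: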

    > ab <- dput(Match[c(1:5,72:76), ])
    structure(list(V1 = c("Inspecting", "V$ATF3_Q6", "V$ATF3_Q6", 
    "V$ATF3_Q6", "V$CEBPB_01", "V$YY1_01", "Inspecting", "V$ATF3_Q6", 
    "V$ATF3_Q6", "V$CEBPB_01"), V2 = c("sequence", "|", "|", "|", 
    "|", "|", "sequence", "|", "|", "|"), V3 = c("ID", "19", "34", 
    "102", "24", "117", "ID", "52", "160", "57"), V4 = c("chr1:173244300-173244500", 
    "(-)", "(-)", "(+)", "(+)", "(+)", "chr1:173244350-173244550", 
    "(+)", "(+)", "(+)"), V5 = c("", "|", "|", "|", "|", "|", "", 
    "|", "|", "|"), V6 = c(NA, 0.877, 0.788, 0.738, 0.95, 0.996, 
    NA, 0.738, 0.862, 0.966), V7 = c("", "|", "|", "|", "|", "|", 
    "", "|", "|", "|"), V8 = c(NA, 0.622, 0.655, 0.685, 0.882, 0.984, 
    NA, 0.685, 0.687, 0.958), V9 = c("", "|", "|", "|", "|", "|", 
    "", "|", "|", "|"), V10 = c("", "aagtccCATCAggg", "agggaaCGACAcag", 
    "cccTGAGCttagga", "ccatcagGGAAGgg", "acttCCCATcttttaag", "", 
    "cccTGAGCttagga", "gtcTGACCtggaga", "agcttagGAAACtt")), .Names = c("V1", 
    "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10"), row.names = c(1L, 
    2L, 3L, 4L, 5L, 72L, 73L, 74L, 75L, 76L), class = "data.frame")
    

3 answers:

Answer 0 (score: 4)

With the input file from this question saved as /c/tmp.txt

and this awk script saved as SO-38563400.awk:

BEGIN {
  OFS="\t" # Set the output separator
  i=0 # Init the counter so that the first sequence gets index 1
}

/Inspecting sequence ID/ { # New sequence: initialize a new entry with start and end
  split($4,arr,"[:-]") # Split the 4th field on : and -
  seq[++i,"chr"]=arr[1] # Increase the sequence counter first, then save the chr part
  seq[i,"start"]=arr[2] # Save the start coordinate
  seq[i,"end"]=arr[3] # Save the end coordinate
}

/V[$][^_]+_.*/ { # V line type
  split($1,arr,"[$_]") # Split on $ and underscore
  seq[i,arr[2]]="B" # This name has been seen, set it to B
  seq[i,"print"]=1
  names[arr[2]]++ # Save the name for output
  # (and count occurrences, just for fun, well mainly because an int is cheaper to store)
  # Main reason is that it allows quicker access to the array keys in the END block
}

END {
  head=sprintf("char%sstart%sstop",OFS,OFS)
  for (h in names) { # Same traversal order as in the per-line loop below
    head=sprintf("%s%s%s",head,OFS,h)
  }
  print(head)
  for (l=1; l<=i; l++) { # Loop over each line/sequence, including the last one
    line=sprintf("%s%s%s%s%s",seq[l,"chr"],OFS,seq[l,"start"],OFS,seq[l,"end"])
    for (h in names) {
      if (seq[l,h]=="B") line=sprintf("%s%s%s",line,OFS,"B")
      else line=sprintf("%s%s%s",line,OFS,"U")
    }
    if (seq[l,"print"]) print line # Skip sequences without any V line
  }
}

Running this command:

awk -f SO-38563400.awk /c/tmp.txt > /c/Rtable.txt

gives:

$ cat /c/Rtable.txt
char    start   stop    STAT3   ATF3    TEAD4   GATA3   JUND    HNF4A   FOXA2   MAX     CEBPB   SPI1    GABPA   CMYC    P300    E2F1    CTCF    ATF2
chr22   16049850        16050050        B       B       U       B       U       B       B       U       U       U       U       U       B       B       U       B
chr22   16049900        16050100        B       B       B       B       B       B       B       B       B       B       B       B       B       B       B       B

Then read it in R:

> x <- read.table("/c/Rtable.txt", sep="\t",  stringsAsFactors = FALSE, header=T)
> x
char    start     stop STAT3 ATF3 TEAD4 GATA3 JUND HNF4A FOXA2 MAX CEBPB SPI1 GABPA CMYC P300 E2F1 CTCF ATF2
1 chr22 16049850 16050050     B    B     U     B    U     B     B   U     U    U     U    U    B    B    U    B
2 chr22 16049900 16050100     B    B     B     B    B     B     B   B     B    B     B    B    B    B    B    B

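Please ignore the /c/ paths in this setup; this works on Windows or Linux. There are ports of awk for Windows, but I recommend Linux as the operating system for its ability to stream large files.

We could save even more memory by not reading the whole file before printing the results, but that requires a fixed set of "names". Since you were too lazy to extract the names yourself and only gave me a bunch of entries, adapting the script is left to you as an exercise: list the names in the BEGIN block, use that list as the entries of each seq, and on each new seq print the previous result before processing.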
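The streaming idea described above can be sketched as follows. This is written in R rather than awk to match the question's language, and it is only a rough sketch under assumptions: the function name stream_presence, its arguments, and the exact whitespace-separated layout of the raw lines are hypothetical, and the fixed list of names has to be supplied by the caller. Reading one line at a time is slow in R, so for millions of lines the awk script is likely the faster route.

```r
# Hypothetical helper (not part of the answer above): stream the raw match
# file block by block. `con_in` and `con_out` must be connections that are
# already open, otherwise readLines() would reopen the input on every call.
stream_presence <- function(con_in, con_out, factor_names) {
  writeLines(paste(c("chr", "start", "stop", factor_names), collapse = "\t"),
             con_out)
  coords <- NULL          # chr/start/stop of the block being read
  seen <- character(0)    # names seen so far in that block

  flush_block <- function() {
    if (is.null(coords)) return(invisible(NULL))
    flags <- ifelse(factor_names %in% seen, "B", "U")
    writeLines(paste(c(coords, flags), collapse = "\t"), con_out)
  }

  while (length(line <- readLines(con_in, n = 1)) > 0) {
    fields <- strsplit(trimws(line), "[[:space:]]+")[[1]]
    if (length(fields) >= 4 && fields[1] == "Inspecting") {
      flush_block()                               # emit the previous block
      coords <- strsplit(fields[4], "[:-]")[[1]]  # chr, start, stop
      seen <- character(0)
    } else if (length(fields) > 0 && startsWith(fields[1], "V$")) {
      seen <- c(seen, strsplit(fields[1], "[$_]")[[1]][2])
    }
  }
  flush_block()                                   # emit the last block
  invisible(NULL)
}

# Usage on a few lines shaped like the question's data (assumed raw layout).
raw <- c(
  "Inspecting sequence ID chr1:173244300-173244500",
  "V$ATF3_Q6 | 19 (-) | 0.877 | 0.622 | aagtccCATCAggg",
  "V$CEBPB_01 | 24 (+) | 0.950 | 0.882 | ccatcagGGAAGgg",
  "V$YY1_01 | 117 (+) | 0.996 | 0.984 | acttCCCATcttttaag",
  "Inspecting sequence ID chr1:173244350-173244550",
  "V$ATF3_Q6 | 52 (+) | 0.738 | 0.685 | cccTGAGCttagga",
  "V$CEBPB_01 | 57 (+) | 0.966 | 0.958 | agcttagGAAACtt"
)
con_in <- textConnection(raw)
con_out <- textConnection(NULL, "w")
stream_presence(con_in, con_out, c("ATF3", "CEBPB", "YY1"))
result <- textConnectionValue(con_out)
close(con_in)
close(con_out)
```

For a real file, con_in would be something like file("/c/tmp.txt", "r") and con_out a file opened with "w"; only one block is held in memory at a time.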

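I hope that next time you will take some time to ask a proper question, and that you understand you have to put in some effort yourself for others to help you, especially after a series of comments asking you to improve your question.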

Answer 1 (score: 2)

Probably not the best use of stringr and tidyr, but this can be done in a somewhat readable way in the hadleyverse...

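The logical flow is:

  • Use tidyr::fill on ifelse(V1 == "Inspecting", rowname, NA) to determine the groups.
  • Turn the fields into the ones you want.
  • Use a reshape (dcast) to get the desired format.
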
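library(dplyr)
library(tidyr)
library(reshape2)
library(stringr)

# Aggregation function for dcast: "B" if the group/name cell has any rows, else "U"
is_in <- function(v1part) {
  return(ifelse(length(v1part) > 0, "B", "U"))
}

ab1<- ab %>% 
  # add_rownames() is deprecated in newer dplyr; tibble::rownames_to_column() replaces it
  add_rownames() %>%
  mutate(rowname = ifelse(V1=="Inspecting", rowname, NA),
         # Keep V4 only on header rows, where it holds chr:start-stop
         V4a = ifelse(V4 == "(-)" | V4 == "(+)", NA, V4),

         chr = str_extract_all(V4, "^chr[^:]+", simplify = TRUE)[,1],
         chr = ifelse(chr=="", NA, chr),

         start = str_split_fixed(V4a, ":|-", 3)[,2],
         start = ifelse(start=="", NA, start), 

         stop = str_split_fixed(V4a, ":|-", 3)[,3],
         stop = ifelse(stop=="", NA, stop),

         # Header rows give an empty V1part here; that appears to be the
         # source of the Var.4 column in the output below
         V1part = str_split_fixed(V1, "\\$|_", 3)[,2]) %>%
  # Carry the header's row name and coordinates down to the rows of its block
  fill(rowname, .direction="down") %>% 
  group_by(rowname) %>%
  fill(chr, .direction="down") %>%
  fill(start, .direction="down") %>%
  fill(stop, .direction="down") %>%
  dcast(chr+start+stop ~ V1part, fun.aggregate=is_in)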

> ab1
   chr     start      stop Var.4 ATF3 CEBPB YY1
1 chr1 173244300 173244500     B    B     B   B
2 chr1 173244350 173244550     B    B     B   U

Answer 2 (score: 1)

Not elegant, but it should work (your data has a column with "|"... I named the data frame df):

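# Rows whose second column is not "|" are the "Inspecting sequence ID" headers
cond <- which(df$V2 != "|")
# Add one past the last row so that the final block has an upper bound too
bounds <- c(cond, nrow(df) + 1)
new_df <- data.frame(chr = character(length(cond)),
                     start = character(length(cond)),
                     stop = character(length(cond)),
                     stringsAsFactors = FALSE)

for (i in seq_along(cond)) {
  line <- df[cond[i], ]
  var <- unlist(strsplit(line$V4, split = ":"))
  var2 <- unlist(strsplit(var[2], split = "-"))
  new_df$chr[i] <- var[1]
  new_df$start[i] <- var2[1]
  new_df$stop[i] <- var2[2]
  # Rows of this block: after its header and before the next header
  rows <- cond[i] + seq_len(bounds[i + 1] - cond[i] - 1)
  for (k in rows) {
    # Your code using name <- df$V1[k] (use strsplit again)
    # new_df[i, name] <- ...
  }
}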
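
One way to fill in the placeholder above without an explicit inner loop is sketched below in base R. It is only a sketch under assumptions: the function name presence_table and its optional all_names argument are hypothetical, and the small df is made up to mirror the V1, V2 and V4 columns of the question's sample. Passing all_names fixes the set of output columns in advance, which is what the question asks for with its known list of names.

```r
# Small sample in the same layout as the question (only V1, V2, V4 are used).
df <- data.frame(
  V1 = c("Inspecting", "V$ATF3_Q6", "V$ATF3_Q6", "V$CEBPB_01", "V$YY1_01",
         "Inspecting", "V$ATF3_Q6", "V$CEBPB_01"),
  V2 = c("sequence", "|", "|", "|", "|", "sequence", "|", "|"),
  V4 = c("chr1:173244300-173244500", "(-)", "(-)", "(+)", "(+)",
         "chr1:173244350-173244550", "(+)", "(+)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per "Inspecting" block, B/U per name.
presence_table <- function(df, all_names = NULL) {
  is_header <- df$V1 == "Inspecting"
  block <- cumsum(is_header)                 # block id for every row
  # Split "chr:start-stop" of each header row into three pieces.
  coords <- do.call(rbind, strsplit(df$V4[is_header], "[:-]"))
  # Name = second piece of "V$NAME_suffix".
  name <- vapply(strsplit(df$V1[!is_header], "[$_]"), `[`, character(1), 2)
  if (is.null(all_names)) all_names <- sort(unique(name))
  # Count hits per block and name; the factor levels keep empty cells as 0.
  counts <- table(factor(block[!is_header], levels = seq_len(sum(is_header))),
                  factor(name, levels = all_names))
  flags <- matrix(ifelse(as.vector(counts) > 0, "B", "U"),
                  nrow = nrow(counts), dimnames = list(NULL, all_names))
  data.frame(chr = coords[, 1],
             start = as.integer(coords[, 2]),
             stop = as.integer(coords[, 3]),
             flags, stringsAsFactors = FALSE)
}

out <- presence_table(df)
fixed <- presence_table(df, all_names = c("ATF3", "GATA3"))
```

Here out has the columns chr, start, stop, ATF3, CEBPB and YY1, while fixed contains only the requested ATF3 and GATA3 columns, with U wherever a name never appears in a block.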