R: collapse rows based on contents of two columns

Time: 2015-12-10 01:30:58

Tags: mysql arrays r data.table

I apologize in advance if this question is too specific or involved for this type of forum. I have been a long-time lurker on this site, and this is the first time I haven't been able to solve my issue by looking at previous questions, so I finally decided to post. Please let me know if there is a better place to post this, or if you have advice on making it clearer. Here goes.

I have a data.table with the following structure:

library(data.table)
dt = structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chrX",
"chrX", "chrX", "chrX"), start = c(842326, 855423, 855426, 855739,
153880833, 153880841, 154298086, 154298089), end = c(842327L,
855424L, 855427L, 855740L, 153880834L, 153880842L, 154298087L,
154298090L), meth.diff = c(9.35200555410902, 19.1839617944039,
29.6734426495636, -12.3375577709254, 50.5830043986142, 52.7503561092491,
46.5783738475184, 41.8662800742733), mean_KO = c(9.35200555410902,
19.1839617944039, 32.962962583692, 1.8512250859083, 51.2741224212646,
53.0928367727283, 47.4901932463221, 44.8441659366298), mean_WT = c(0,
0, 3.28951993412841, 14.1887828568337, 0.69111802265039, 0.34248066347919,
0.91181939880374, 2.97788586235646), coverage_KO = c(139L, 55L,
55L, 270L, 195L, 194L, 131L, 131L), coverage_WT = c(120L, 86L,
87L, 444L, 291L, 293L, 181L, 181L)), .Names = c("chr", "start",
"end", "meth.diff", "mean_KO", "mean_WT", "coverage_KO", "coverage_WT"
), class = c("data.table", "data.frame"), row.names = c(NA, -8L
))

These are genomic coordinates with associated values. The file is sorted by chromosome ("chr") (1 through 22, then X, then Y), then by start and end position, so that the first row contains the lowest-numbered start position on chromosome 1, and the rows proceed sequentially through all data points on chromosome 1, then 2, etc. At this point, every single row has a start-end length of 1. After collapsing, the start-end lengths will vary depending on how many rows were collapsed and their distance from the adjacent row.

1st: I would like to collapse adjacent rows into larger start/end ranges based on the following criteria:

  1. The two adjacent rows share the same value for the "chr" column (row 1 "chr" = chr1, and row 2 "chr" = chr1)
  2. The two adjacent rows have "start" coordinates within 500 of one another (if row 1 "start" = 1000 and row 2 "start" <= 1499, collapse these into a single row; if row 1 "start" = 1000 and row 2 "start" = 1500, keep them separate)
  3. The adjacent rows must have the same sign in the "meth.diff" column (i.e. even if "chr" matches and the "start" values are within 500, if row 1 "meth.diff" = +5 and row 2 "meth.diff" = -5, keep the entries separate)
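
The three criteria above can be read as a single "does this row start a new group?" test against the previous row. The following is a minimal sketch, not a definitive solution. It assumes that "within 500" is measured between each row and the row directly above it (so a chain of close rows merges into one group), that the table is already sorted as described, and that a `meth.diff` of exactly 0 counts as its own sign. The table `toy` and the columns `new_group` and `grp` are hypothetical names used only for illustration.

```r
library(data.table)

# Hypothetical toy data with the same column names as the question
toy <- data.table(
  chr       = c("chr1", "chr1", "chr1", "chr2"),
  start     = c(1000, 1499, 1600, 1650),
  meth.diff = c(5, 7, -3, 4)
)

# A row opens a new group when any criterion fails relative to the
# row directly above it; shift() returns the previous row's value.
toy[, new_group := chr != shift(chr) |
                   start - shift(start) >= 500 |
                   sign(meth.diff) != sign(shift(meth.diff))]

# The first row has no predecessor, so its comparison involves NA
toy[1, new_group := TRUE]

# Running count of group openings gives one id per collapsed range
toy[, grp := cumsum(new_group)]

print(toy$grp)  # 1 1 2 3
```

Rows 1 and 2 share a group (same chromosome, 499 apart, same sign), row 3 is split off by the sign change, and row 4 by the chromosome change. The resulting `grp` column could then be used in a `by =` clause to collapse each group.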

2nd: I would like to calculate coverage-weighted averages of the mean_KO/mean_WT columns for each set of collapsed rows, using the coverage_KO/coverage_WT columns as the weights:

Ex: collapse 2 rows,

row 1 mean_1 = 5.0, coverage_1 = 20.

row 2 mean_1 = 40.0, coverage_1 = 45.

weighted avg mean_1 = (((5.0*20)/(20+45)) + ((40.0*45)/(20+45))) = 29.23
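
The arithmetic in this example can be checked in R. This is only a sketch of the calculation with the two example rows typed in by hand; `weighted.mean()` from base R's stats package should give the same value as the manual formula.

```r
# The two example rows from the text
mean_1     <- c(5.0, 40.0)
coverage_1 <- c(20, 45)

# Manual form: sum of (mean * coverage) divided by total coverage
manual <- sum(mean_1 * coverage_1) / sum(coverage_1)

# Base R equivalent
builtin <- weighted.mean(mean_1, w = coverage_1)

print(round(manual, 2))   # 29.23
print(round(builtin, 2))  # 29.23
```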

What I would like the output to look like (except that the collapsed row means would be calculated rather than shown in string form; mean_1 and mean_2 here correspond to mean_KO and mean_WT):

library(data.table)
dt_output = structure(list(chr = c("chr1", "chr1", "chr1", "chrX", "chrX"
), start = c(842326, 855423, 855739, 153880833, 154298086), end = c(842327, 
855427, 855740, 153880842, 154298090), mean_1 = c("9.35", "((19.18*55)/(55+55)) + ((32.96*55)/(55+55))", 
"1.85", "((51.27*195)/(195+194)) + ((53.09*194)/(195+194))", 
"((47.49*131)/(131+131)) + ((44.84*131)/(131+131))"), mean_2 = c("0", 
"((0.00*86)/(86+87)) + ((3.29*87)/(86+87))", "14.19", "((0.69*291)/(291+293)) + ((0.34*293)/(291+293))", 
"((0.91*181)/(181+181)) + ((2.98*181)/(181+181))")), .Names = c("chr", 
"start", "end", "mean_1", "mean_2"), row.names = c(NA, -5L), class = c("data.table", "data.frame"))
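
One possible way to produce an output of this shape is sketched below. It is an illustration under stated assumptions rather than a tested solution for the full dataset: it uses only the first four rows of the question's data with values rounded to two decimals, it compares each row only with the row directly above it, and the names `grp` and `collapsed` are hypothetical.

```r
library(data.table)

# First four rows of the question's data, values rounded for brevity
dt <- data.table(
  chr         = c("chr1", "chr1", "chr1", "chr1"),
  start       = c(842326, 855423, 855426, 855739),
  end         = c(842327L, 855424L, 855427L, 855740L),
  meth.diff   = c(9.35, 19.18, 29.67, -12.34),
  mean_KO     = c(9.35, 19.18, 32.96, 1.85),
  mean_WT     = c(0, 0, 3.29, 14.19),
  coverage_KO = c(139L, 55L, 55L, 270L),
  coverage_WT = c(120L, 86L, 87L, 444L)
)

# Group id: increases whenever a row fails any criterion
# (chromosome, distance < 500, sign) against the previous row
dt[, grp := cumsum(c(TRUE,
       tail(chr, -1) != head(chr, -1) |
       diff(start) >= 500 |
       diff(sign(meth.diff)) != 0))]

# Collapse each group into one row with coverage-weighted means
collapsed <- dt[, .(
  chr    = chr[1],
  start  = min(start),
  end    = max(end),
  mean_1 = weighted.mean(mean_KO, coverage_KO),
  mean_2 = weighted.mean(mean_WT, coverage_WT)
), by = grp][, grp := NULL]

print(collapsed)
```

On these four rows the sketch yields three ranges, starting at 842326, 855423 and 855739, which matches the first three rows of the desired output. On real data it would be worth checking how ties at exactly 500 and `meth.diff` values of 0 should be handled.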

Help with either part 1 or 2 or any advice is appreciated.

I have been using R for most of my data manipulations, but I am open to any language that can provide a solution. Thanks in advance.

0 answers:

No answers