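I'm trying to write a fairly complex iterative matching function, but I'm getting bogged down in ifelse calls and functions that don't work properly. Unfortunately I don't have anyone to bounce ideas off, so any support or ideas would be appreciated.
Each row of my data is an observation with a number of variables; the relevant ones are included in this example. Each observation has an assigned Sample_Name, a Matching_Group that corresponds to a sample name, a measured Time, and a subjective Assigned_idx that was assigned in an earlier part of the data cleaning. Each observed Sample_Name can contain 0-7 observations, but a Matching_Group will always contain 7 observations.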
structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS",
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ",
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ",
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS",
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395,
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6,
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2,
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"))
Sample_Name Matching_Group Time Assigned_idx
A QQ 1.000
A QQ 1.100
A QQ 1.200
A QQ 1.400
A QQ 1.600
B SS 7.203
B SS 7.395
B SS 7.500
B SS 7.600
B SS 7.700
B SS 7.802
QQ QQ 1.000 1
QQ QQ 1.102 2
QQ QQ 1.200 3
QQ QQ 1.300 4
QQ QQ 1.398 5
QQ QQ 1.501 6
QQ QQ 1.600 7
SS SS 7.200 1
SS SS 7.300 2
SS SS 7.400 3
SS SS 7.500 4
SS SS 7.600 5
SS SS 7.700 6
SS SS 7.800 7
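For each observation (row), I want to compute the ratio of its Time to the Time of each row of the corresponding Matching_Group. Each Matching_Group has an assigned unique Time_Ratio value, and the computed ratio must equal that value +/- some tolerance. If the computed ratio matches the predefined ratio for that group, I want to take the Assigned_idx from the matching Matching_Group row and assign it to the observation. If not, repeat the calculation using the same observed Time and the Time in the next row of the Matching_Group. Repeat until every observation has a value in Assigned_idx.
Example: in this dataset, Time_Ratio should equal 1.000 +/- 0.0020 for both Matching_Groups. In my real dataset, a unique Time_Ratio value is specified for each Matching_Group in a separate table. Take row 3, with Time = 1.200: its Matching_Group is QQ. When we compute the ratio against the first observed QQ time, 1.200/1.000 = 1.200 is outside our defined tolerance -> move to the next observed QQ time. 1.200/1.102 = 1.089 ... also outside our tolerance. Finally, 1.200/1.200 = 1.000 is within the tolerance we specified for this Matching_Group. In the Matching_Group row that produced the matching ratio, the Assigned_idx column contains 3. We take that value and write it to the Assigned_idx column of row 3. Then we move on to row 4 and repeat the process.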
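The walk-through above can be sketched in a few lines of base R. This is only an illustration of the rule as described, not a full solution; `ref_times`, `ref_idx`, `obs_time`, `target` and `tol` are names made up for the sketch, and the values are copied from the QQ rows of the example data.

```r
# Reference rows for Matching_Group "QQ" (Time and Assigned_idx from the example data)
ref_times <- c(1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6)
ref_idx   <- 1:7

obs_time <- 1.2    # Time of row 3 (Sample_Name "A")
target   <- 1.000  # expected Time_Ratio for this group
tol      <- 0.002  # allowed deviation from the target

# Ratio of the observed time to every reference time at once
ratios <- obs_time / ref_times

# Position of the first reference row whose ratio is within tolerance
# (NA if no reference row qualifies)
hit <- which(abs(ratios - target) <= tol)[1]

assigned_idx <- ref_idx[hit]
time_ratio   <- ratios[hit]
```

Taking the first element of `which()` mirrors the "try the next row until one matches" description; if nothing is within tolerance, `hit` is `NA` and so are both results.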
Desired result:
Sample_Name Matching_Group Time Assigned_idx Time_Ratio (Sample:Matching)
A QQ 1.000 1 1.0000
A QQ 1.100 2 0.9982
A QQ 1.200 3 1.0000
A QQ 1.400 5 1.0014
A QQ 1.600 7 1.0000
B SS 7.203 1 1.0004
B SS 7.395 3 0.9993
B SS 7.500 4 1.0000
B SS 7.600 5 1.0000
B SS 7.700 6 1.0000
B SS 7.802 7 1.0003
QQ QQ 1.000 1 1.0000
QQ QQ 1.102 2 1.0000
QQ QQ 1.200 3 1.0000
QQ QQ 1.300 4 1.0000
QQ QQ 1.398 5 1.0000
QQ QQ 1.501 6 1.0000
QQ QQ 1.600 7 1.0000
SS SS 7.200 1 1.0000
SS SS 7.300 2 1.0000
SS SS 7.400 3 1.0000
SS SS 7.500 4 1.0000
SS SS 7.600 5 1.0000
SS SS 7.700 6 1.0000
SS SS 7.800 7 1.0000
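I've tried to solve this with dplyr, since I think it should be able to handle what I'm trying to do (maybe purrr is a better fit?). Unfortunately, I can't seem to order my conditions and expressions properly inside ifelse and functions. My attempts have included rather haphazard %>% chains of group_by and mutate with the ratio calculation, data.table::shift, and so on, but I can't get any of them to work with my conditional arguments. Also, in case it's relevant, my real data will have ~50 "names" and ~25 matching groups. I will have a second data source listing the matching group names and their respective ratios, but I haven't included that level of detail in this example.
Honestly, I'm pretty frustrated, and any ideas are appreciated.
Answer 0: (score: 0)
Something like this should work: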
#!/usr/bin/R
a = structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS",
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ",
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ",
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS",
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395,
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6,
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2,
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"));
tol <- 0.002  # allowed deviation of the ratio from 1.0
a$Time_Ratio <- NA_real_
for (i in seq_len(nrow(a))) {
  # use $ and [i] so we get scalars rather than 1x1 tibbles
  mg <- a$Matching_Group[i]
  s_time <- a$Time[i]
  for (j in seq_len(nrow(a))) {
    # reference rows are the ones whose Sample_Name is the matching group
    if (a$Sample_Name[j] == mg) {
      time_ratio <- s_time / a$Time[j]
      if (abs(time_ratio - 1.0) < tol) {
        # the first reference row within tolerance wins
        a$Assigned_idx[i] <- a$Assigned_idx[j]
        a$Time_Ratio[i] <- time_ratio
        break
      }
    }
  }
}
print(a)
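Answer 1: (score: 0)
Update
The first version was clunky; here is a second, cleaner version:
library(tidyverse)
# df is assumed to be the question's data frame (given as `data` at the end of this answer)
thresh <- .002
baseline <- 1.0
We still build compare, but now it has only two rows, one per matching group, where times is a list of all the times for each Matching_Group:
compare <- df %>%
  filter(Sample_Name == Matching_Group) %>%
  group_by(Matching_Group) %>%
  summarise(times = list(Time))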
compare
Matching_Group times
<chr> <list>
1 QQ <dbl [7]>
2 SS <dbl [7]>
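Join df with compare, then use purrr::map() variants to get the ratios and the deltas (from the baseline); the very handy detect_index() then gives us the first match whose delta is below the threshold. (Note: this also addresses your comment about each matching group having a different thresh and baseline. We still use static values here, but every operation assumes these two variables are now columns in df, so in principle they could differ per row or per group.)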
df %>%
  mutate(thresh = thresh,
         baseline = baseline) %>%
  inner_join(compare, by = "Matching_Group") %>%
  mutate(ratios = map2(Time, times, ~ .x / .y),
         deltas = map2(baseline, ratios, ~ abs(.x - .y)),
         Assigned_idx = map2_dbl(deltas, thresh,
                                 ~detect_index(.x, ~ .x < .y, .y))) %>%
  select(-times, -ratios, -deltas)
Output:
Sample_Name Matching_Group Time Assigned_idx thresh baseline
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A QQ 1.00 1. 0.00200 1.
2 A QQ 1.10 2. 0.00200 1.
3 A QQ 1.20 3. 0.00200 1.
4 A QQ 1.40 5. 0.00200 1.
5 A QQ 1.60 7. 0.00200 1.
6 B SS 7.20 1. 0.00200 1.
7 B SS 7.40 3. 0.00200 1.
8 B SS 7.50 4. 0.00200 1.
9 B SS 7.60 5. 0.00200 1.
10 B SS 7.70 6. 0.00200 1.
# ... with 15 more rows
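To make the per-group idea concrete, here is a rough sketch of how a separate lookup table could be joined in. It is not part of the original answer: `group_params` and its values are hypothetical, the data is a small subset of the question's rows, and the lambdas are written as plain functions instead of nested formulas. Like the code above, it returns the position within the group's reference times, which equals Assigned_idx only if the reference rows are stored in index order.

```r
library(dplyr)
library(purrr)

# Small subset of the question's data, to keep the sketch short
df <- tibble(
  Sample_Name    = c("A", "A", "B", "QQ", "QQ", "QQ", "SS", "SS", "SS"),
  Matching_Group = c("QQ", "QQ", "SS", "QQ", "QQ", "QQ", "SS", "SS", "SS"),
  Time           = c(1.1, 1.25, 7.395, 1, 1.102, 1.2, 7.2, 7.3, 7.4)
)

# Hypothetical per-group parameters (table name and values are assumptions)
group_params <- tibble(
  Matching_Group = c("QQ", "SS"),
  baseline       = c(1.0, 1.0),
  thresh         = c(0.002, 0.001)
)

# One row per matching group, with its reference times in a list-column
compare <- df %>%
  filter(Sample_Name == Matching_Group) %>%
  group_by(Matching_Group) %>%
  summarise(times = list(Time))

result <- df %>%
  inner_join(group_params, by = "Matching_Group") %>%
  inner_join(compare, by = "Matching_Group") %>%
  mutate(
    # position of the first reference time whose ratio is within tolerance
    Assigned_idx = pmap_int(
      list(Time, times, baseline, thresh),
      function(obs, ref, target, tol) {
        detect_index(ref, function(r) abs(obs / r - target) < tol)
      }
    ),
    # detect_index() returns 0 when nothing matches; report that as NA
    Assigned_idx = na_if(Assigned_idx, 0L)
  ) %>%
  select(-times)
```

The second A row (Time 1.25) is deliberately out of tolerance for every QQ time, so it ends up as NA rather than silently getting index 0.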
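Original solution
This is a tidyverse solution. The idea is to spread Sample_Name into wide form (i.e. compare), then get the ratios for each row (and evaluate whether they pass the thresh test). After that, it's just a matter of recombining and cleaning up unnecessary variables.
library(stringr)
library(tidyverse)
thresh <- .002
baseline <- 1.0
First, create df by adding name2 to data. It's just a copy of Sample_Name, but with an index value appended:
df <- data %>%
  group_by(Sample_Name) %>%
  mutate(name2 = paste0(Sample_Name, 1:length(Sample_Name))) %>%
  ungroup()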
df
# A tibble: 25 x 5
Sample_Name Matching_Group Time Assigned_idx name2
<chr> <chr> <dbl> <dbl> <chr>
1 A QQ 1.00 NA A1
2 A QQ 1.10 NA A2
3 A QQ 1.20 NA A3
4 A QQ 1.40 NA A4
5 A QQ 1.60 NA A5
6 B SS 7.20 NA B1
...
Now create the compare data frame:
compare <- df %>%
  select(name2, Time) %>%
  spread(name2, value = Time)
compare
# A tibble: 1 x 25
A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 QQ1 QQ2
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1. 1.10 1.20 1.40 1.60 7.20 7.40 7.50 7.60 7.70 7.80 1. 1.10
# ... with 12 more variables: QQ3 <dbl>, QQ4 <dbl>, QQ5 <dbl>, QQ6 <dbl>,
# QQ7 <dbl>, SS1 <dbl>, SS2 <dbl>, SS3 <dbl>, SS4 <dbl>, SS5 <dbl>,
# SS6 <dbl>, SS7 <dbl>
Use purrr::pmap to compute the ratios and compare them against thresh:
matched_df <- df %>%
  pmap(~ compare %>%
         select(starts_with(..2)) %>%
         mutate_all(funs(..3/., which(abs(baseline - ./..3 ) < thresh)[1])) %>%
         select(contains("_"))
  ) %>%
  bind_rows(.)
matched_df
# A tibble: 25 x 28
`QQ1_/` `QQ2_/` `QQ3_/` `QQ4_/` `QQ5_/` `QQ6_/` `QQ7_/` `QQ1_[` `QQ2_[`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1.00 0.907 0.833 0.769 0.715 0.666 0.625 1 NA
2 1.10 0.998 0.917 0.846 0.787 0.733 0.688 NA 1
3 1.20 1.09 1.00 0.923 0.858 0.799 0.750 NA NA
4 1.40 1.27 1.17 1.08 1.00 0.933 0.875 NA NA
5 1.60 1.45 1.33 1.23 1.14 1.07 1.00 NA NA
Finally, bind matched_df to df and clean up.
The key operation that narrows things down to only the correctly matched index is filter(Assigned_idx == matched2). Up to that point, all possible ratios for each Sample_Name to Matching_Group assignment are still present.
bind_cols(df, matched_df) %>%
  select(-name2, -Assigned_idx) %>%
  gather(Assigned_idx, value, -contains("/"), -Sample_Name, -Matching_Group, -Time) %>%
  filter(!is.na(value)) %>%
  gather(matched2, Time_Ratio, -Assigned_idx, -value, -Sample_Name, -Matching_Group, -Time) %>%
  mutate(Assigned_idx = str_replace(Assigned_idx, "_\\[", ""),
         matched2 = str_replace(matched2, "_/", "")) %>%
  filter(Assigned_idx == matched2) %>%
  arrange(Sample_Name) %>%
  select(-value, -matched2) %>%
  mutate(Assigned_idx = str_sub(Assigned_idx, -1),
         Time_Ratio = round(Time_Ratio, 4))
Sample_Name Matching_Group Time Assigned_idx Time_Ratio
1 A QQ 1.000 1 1.0000
2 A QQ 1.100 2 0.9982
3 A QQ 1.200 3 1.0000
4 A QQ 1.400 5 1.0014
5 A QQ 1.600 7 1.0000
6 B SS 7.203 1 1.0004
7 B SS 7.395 3 0.9993
8 B SS 7.500 4 1.0000
...
To all the tidyverse wizards out there: this is not my prettiest solution... happy to learn from any suggestions.
Data:
data <- structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS",
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ",
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ",
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS",
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395,
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6,
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2,
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"))