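I have a data frame (let's call it "df") with a fair number of variables (numeric, logical and character). It represents an experiment in which different cell types were moved from one specific culture medium to another, and the cells were quantified at specific times. The first and second columns hold the name of the "source" medium and the name of the medium the cells were moved to, respectively; the third column is the time at which activity was quantified, the fourth is the cell type, and the fifth is the measured activity, which is the variable of interest.
I have two main questions. The first is whether there is a more "R-esque" way to do what I did to obtain the sixth column, which contains the increase/decrease (in percent) of "Activity" relative to the activity in the previous row, computed per group (each group is defined by Cell.Type, Pre.Medium and Time). That is why its value is NA whenever Time is zero.
Suppose this is my data frame (I simplified it to make my question clearer):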
df <- structure(list(Pre.Medium = c("Medium1", "Medium1", "Medium1",
"Medium2", "Medium2", "Medium2", "Medium1", "Medium1", "Medium1",
"Medium2", "Medium2", "Medium2"), Pos.Medium = c("Medium2", "Medium2",
"Medium2", "Medium1", "Medium1", "Medium1", "Medium2", "Medium2",
"Medium2", "Medium1", "Medium1", "Medium1"), Time = c(0, 2, 4,
0, 2, 4, 0, 2, 4, 0, 2, 4), Cell.Type = c("Cell_A", "Cell_A",
"Cell_A", "Cell_A", "Cell_A", "Cell_A", "Cell_B", "Cell_B", "Cell_B",
"Cell_B", "Cell_B", "Cell_B"), Activity = c(0.5, 1, 2, 2, 1,
0.5, 0.2, 0.8, 0.2, 0.2, 0.2, 0.4), Percent.Increase = c(NA,
100, 100, NA, -50, -50, NA, 300, -75, NA, 0, 100), Primary.Increase = c(NA,
TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA, FALSE, FALSE
), Secondary.Increase = c(NA, FALSE, FALSE, NA, FALSE, FALSE,
NA, FALSE, FALSE, NA, FALSE, TRUE)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L), problems = structure(list(
row = 1L, col = NA_character_, expected = "8 columns", actual = "9 columns",
file = "'new 2'"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), spec = structure(list(cols = list(Pre.Medium = structure(list(), class = c("collector_character",
"collector")), Pos.Medium = structure(list(), class = c("collector_character",
"collector")), Time = structure(list(), class = c("collector_double",
"collector")), Cell.Type = structure(list(), class = c("collector_character",
"collector")), Activity = structure(list(), class = c("collector_double",
"collector")), Percent.Increase = structure(list(), class = c("collector_double",
"collector")), Primary.Increase = structure(list(), class = c("collector_logical",
"collector")), Secondary.Increase = structure(list(), class = c("collector_logical",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
### Pre.Med Pos.Med Time Cell.Type Activity Percent.Increase Primary.Increase Secondary.Increase
### Medium1 Medium2 0 Cell_A 0.5 NA NA NA
### Medium1 Medium2 2 Cell_A 1 100 TRUE FALSE
### Medium1 Medium2 4 Cell_A 2 100 FALSE FALSE
### Medium2 Medium1 0 Cell_A 2 NA NA NA
### Medium2 Medium1 2 Cell_A 1 -50 TRUE FALSE
### Medium2 Medium1 4 Cell_A 0.5 -50 FALSE FALSE
### Medium1 Medium2 0 Cell_B 0.2 NA NA NA
### Medium1 Medium2 2 Cell_B 0.8 300 TRUE FALSE
### Medium1 Medium2 4 Cell_B 0.2 -75 FALSE FALSE
### Medium2 Medium1 0 Cell_B 0.2 NA NA NA
### Medium2 Medium1 2 Cell_B 0.2 0 FALSE FALSE
### Medium2 Medium1 4 Cell_B 0.4 100 FALSE TRUE
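I used group_by and mutate, and then lag, to compute the increase/decrease relative to the previous row. Is there a better way? For my particular case lag is enough, but what if I had more than three time measurements in each "group" and needed to lag further back in time for the calculation? With my approach, at some point I would have to write lag(lag(lag(lag(lag(lag((Activity / lag(Activity))-1)* 100)))) and so on.
The other thing, which I could not figure out at all, is how to turn the columns "Primary.Increase" and "Secondary.Increase" into a "long" dataset with a single column called "Increase.Type", where the value for each group (a combination of Cell.Type, Pre.Med and Time) is the name of the column (Primary.Increase or Secondary.Increase) in which one of the group's members is TRUE. It should look like this: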
df <- structure(list(Pre.Med = c("Medium1", "Medium1", "Medium1", "Medium2",
"Medium2", "Medium2", "Medium1", "Medium1", "Medium1", "Medium2",
"Medium2", "Medium2"), Pos.Med = c("Medium2", "Medium2", "Medium2",
"Medium1", "Medium1", "Medium1", "Medium2", "Medium2", "Medium2",
"Medium1", "Medium1", "Medium1"), Time = c(0, 2, 4, 0, 2, 4,
0, 2, 4, 0, 2, 4), Cell.Type = c("Cell_A", "Cell_A", "Cell_A",
"Cell_A", "Cell_A", "Cell_A", "Cell_B", "Cell_B", "Cell_B", "Cell_B",
"Cell_B", "Cell_B"), Activity = c(0.5, 1, 2, 2, 1, 0.5, 0.2,
0.8, 0.2, 0.2, 0.2, 0.4), Percent.Inc = c(NA, 100, 100, NA, -50,
-50, NA, 300, -75, NA, 0, 100), Increase.Type = c("Primary.Increase",
"Primary.Increase", "Primary.Increase", "Primary.Increase", "Primary.Increase",
"Primary.Increase", "Primary.Increase", "Primary.Increase", "Primary.Increase",
"Secondary.Increase", "Secondary.Increase", "Secondary.Increase"
)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-12L), spec = structure(list(cols = list(Pre.Med = structure(list(), class = c("collector_character",
"collector")), Pos.Med = structure(list(), class = c("collector_character",
"collector")), Time = structure(list(), class = c("collector_double",
"collector")), Cell.Type = structure(list(), class = c("collector_character",
"collector")), Activity = structure(list(), class = c("collector_double",
"collector")), Percent.Inc = structure(list(), class = c("collector_double",
"collector")), Increase.Type = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
### Pre.Med Pos.Med Time Cell.Type Activity Percent.Inc Increase.Type
### Medium1 Medium2 0 Cell_A 0.5 NA Primary.Increase
### Medium1 Medium2 2 Cell_A 1 100 Primary.Increase
### Medium1 Medium2 4 Cell_A 2 100 Primary.Increase
### Medium2 Medium1 0 Cell_A 2 NA Primary.Increase
### Medium2 Medium1 2 Cell_A 1 -50 Primary.Increase
### Medium2 Medium1 4 Cell_A 0.5 -50 Primary.Increase
### Medium1 Medium2 0 Cell_B 0.2 NA Primary.Increase
### Medium1 Medium2 2 Cell_B 0.8 300 Primary.Increase
### Medium1 Medium2 4 Cell_B 0.2 -75 Primary.Increase
### Medium2 Medium1 0 Cell_B 0.2 NA Secondary.Increase
### Medium2 Medium1 2 Cell_B 0.2 0 Secondary.Increase
### Medium2 Medium1 4 Cell_B 0.4 100 Secondary.Increase
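First of all, is there a way to do this at all? I assume there is, but I have not managed it so far :/ I am an undergraduate biology student who is new to R. I love what you can do with it, but I still have a long way to go before I am good at it.
Thank you very much for your help.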
Answer 0 (score: 0)
I'm not sure I understand the first question. If you do the following:
library(dplyr)
df %>%
group_by(Cell.Type, Pre.Medium, Pos.Medium) %>%
arrange(Time, .by_group = TRUE) %>% # remove if Time is always ascending
mutate(Percent.Increase = ((Activity / lag(Activity)) - 1) * 100)
then the calculation of Percent.Increase is vectorized, so it does not matter how long Activity is within each group (see also the explanation at the end).
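As a minimal sketch of that point, the snippet below uses a hypothetical toy tibble (`toy`, with five time points in one group; not part of the question's data) and assumes dplyr is installed. A single `lag()` call shifts the whole vector, so nested `lag()` calls are not needed. If the comparison should go further back, `lag()` accepts an `n` argument; the column name `Percent.Increase.2` is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: one group with five time points instead of three
toy <- tibble(
  Cell.Type = "Cell_A",
  Time      = c(0, 2, 4, 6, 8),
  Activity  = c(0.5, 1, 2, 1, 4)
)

res <- toy %>%
  group_by(Cell.Type) %>%
  arrange(Time, .by_group = TRUE) %>%
  mutate(
    # Change relative to the previous row; one lag() shifts the whole vector
    Percent.Increase = (Activity / lag(Activity) - 1) * 100,
    # Change relative to the row two steps back, via lag()'s n argument
    Percent.Increase.2 = (Activity / lag(Activity, n = 2) - 1) * 100
  ) %>%
  ungroup()

res
```

The first row of each group has no predecessor, so `lag()` yields NA there, which matches the NA at Time zero in the question.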
For the second question, if I understood correctly, you can do this:
df %>%
group_by(Cell.Type, Pre.Medium, Pos.Medium) %>%
mutate(Increase.Type = if (any(Secondary.Increase, na.rm = TRUE)) "Secondary.Increase" else "Primary.Increase") %>%
select(-(Primary.Increase:Secondary.Increase))
# A tibble: 12 x 7
# Groups: Cell.Type, Pre.Medium, Pos.Medium [4]
Pre.Medium Pos.Medium Time Cell.Type Activity Percent.Increase Increase.Type
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 Medium1 Medium2 0 Cell_A 0.5 NA Primary.Increase
2 Medium1 Medium2 2 Cell_A 1 100 Primary.Increase
3 Medium1 Medium2 4 Cell_A 2 100 Primary.Increase
4 Medium2 Medium1 0 Cell_A 2 NA Primary.Increase
5 Medium2 Medium1 2 Cell_A 1 -50 Primary.Increase
6 Medium2 Medium1 4 Cell_A 0.5 -50 Primary.Increase
7 Medium1 Medium2 0 Cell_B 0.2 NA Primary.Increase
8 Medium1 Medium2 2 Cell_B 0.8 300 Primary.Increase
9 Medium1 Medium2 4 Cell_B 0.2 -75 Primary.Increase
10 Medium2 Medium1 0 Cell_B 0.2 NA Secondary.Increase
11 Medium2 Medium1 2 Cell_B 0.2 0 Secondary.Increase
12 Medium2 Medium1 4 Cell_B 0.4 100 Secondary.Increase
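Transformations inside mutate see all the values of a group at once, so any(Secondary.Increase, na.rm = TRUE) receives every element of the group, and if we return just one value, it is recycled to fit the size of the group.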
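The recycling behaviour can be checked on a small example. This is only a sketch with hypothetical names (`toy`, `grp`, `flag`, `n_seen`, `any_true`), assuming dplyr is installed: `length(flag)` shows how many values the expression sees per group, and the single value returned by `any()` is repeated on every row of that group.

```r
library(dplyr)

# Hypothetical toy data: group "a" has three rows, group "b" has two
toy <- tibble(
  grp  = c("a", "a", "a", "b", "b"),
  flag = c(NA, FALSE, TRUE, NA, FALSE)
)

res <- toy %>%
  group_by(grp) %>%
  mutate(
    # Number of values the expression receives: the whole group at once
    n_seen   = length(flag),
    # A single logical per group, recycled to every row of that group
    any_true = any(flag, na.rm = TRUE)
  ) %>%
  ungroup()

res
```

With `na.rm = TRUE`, the NA values are dropped before `any()` is evaluated, so a group containing only NA and FALSE gives FALSE rather than NA.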