Question

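I need a way to filter data based on target_id. One set of 1600 target_id values has no consistent naming, while another set all contain 'comp', so I figured the easiest approach would be to create a new column based on the value of target_id. I have a very large data frame that looks like this (just grabbing random rows to show the gist of it):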

          sample_id           target_id length eff_length est_counts       tpm
159     SRR3884838C             CR1_Mam   2204       2005        0           0
160     SRR3884838C           CYRA11_MM    617        418        0           0
161     SRR3884838C            DERV2a_I   5989       5790       19    0.734541
162     SRR3884838C          DERV2a_LTR    335        136        7     11.5213
1094236 SRR3884878C comp78901_c0_seq3_1   1115        916      113.4   32.3604
1094237 SRR3884878C comp85230_c0_seq1_1   1201       1002      514     134.088
1094238 SRR3884878C comp56944_c0_seq1_1   2484       2285       10.5   1.20115

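I need to create a new column ("class") with a value of 1 for the rows whose target_id contains 'comp' and 0 for all others. Is this possible? The data has 40 samples (SRR3884838 -> SRR3884878), and each sample has the same set of target_ids: one set of non-uniformly named targets, followed by another set that all contain comp. Example (tpm column removed for formatting reasons):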

          sample_id           target_id length eff_length est_counts class
159     SRR3884838C             CR1_Mam   2204       2005        0       0
160     SRR3884838C           CYRA11_MM    617        418        0       0
161     SRR3884838C            DERV2a_I   5989       5790       19       0
162     SRR3884838C          DERV2a_LTR    335        136        7       0
1094236 SRR3884878C comp78901_c0_seq3_1   1115        916      113.4     1
1094237 SRR3884878C comp85230_c0_seq1_1   1201       1002      514       1
1094238 SRR3884878C comp56944_c0_seq1_1   2484       2285       10.5     1

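I tried the merge function: I first created a new data frame whose class column had the correct values for one set of target_ids, expecting (perhaps wrongly) that merging would add the new column wherever a target_id was listed. When I did this, it dropped the eff_length column and messed up the data format. All the examples I have found create new columns from numeric values in other columns, and I don't know how to do it with the string comp. This is what I did: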

total <- merge(dataframeA, dataframeB, by = "target_id")

where data frame A is my original data and data frame B looks like the example above with the class column.

Answer 1

Use:

df$class <- as.integer(grepl('comp', df$target_id))

which gives:

> df
          sample_id           target_id length eff_length est_counts class
159     SRR3884838C             CR1_Mam   2204       2005        0.0     0
160     SRR3884838C           CYRA11_MM    617        418        0.0     0
161     SRR3884838C            DERV2a_I   5989       5790       19.0     0
162     SRR3884838C          DERV2a_LTR    335        136        7.0     0
1094236 SRR3884878C comp78901_c0_seq3_1   1115        916      113.4     1
1094237 SRR3884878C comp85230_c0_seq1_1   1201       1002      514.0     1
1094238 SRR3884878C comp56944_c0_seq1_1   2484       2285       10.5     1
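
As a self-contained sketch of the same idea, the snippet below rebuilds a few rows from the question and separates the two steps: grepl() returns a logical vector, and as.integer() turns it into 0/1. The name df, the helper variable is_comp, and the reduced column set are assumptions made for illustration only.

```r
# Toy data frame with a handful of target_id values taken from the question.
# The name `df` and the reduced column set are illustrative assumptions.
df <- data.frame(
  sample_id = c("SRR3884838C", "SRR3884838C", "SRR3884878C", "SRR3884878C"),
  target_id = c("CR1_Mam", "DERV2a_LTR",
                "comp78901_c0_seq3_1", "comp85230_c0_seq1_1"),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE/FALSE per element; as.integer() maps that to 1/0.
is_comp <- grepl("comp", df$target_id)
df$class <- as.integer(is_comp)

# The same logical vector can subset rows directly, without the helper column.
comp_rows  <- df[is_comp, ]
other_rows <- df[!is_comp, ]
```

If the only goal is filtering, the logical vector alone may be enough; the class column is mainly useful when the flag has to travel with the data.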

Answer 2

How about this?

sample$class <- as.numeric(grepl("^comp", sample$target_id))
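
The difference between the two patterns is the ^ anchor: 'comp' matches anywhere in the string, while '^comp' matches only at the start. A minimal sketch, using a hypothetical identifier (Xcomp_LTR, not from the question's data) to show where the two would disagree:

```r
# Hypothetical identifiers: the last one contains "comp" but does not start with it.
ids <- c("CR1_Mam", "comp78901_c0_seq3_1", "Xcomp_LTR")

anywhere <- as.integer(grepl("comp", ids))    # "comp" at any position
at_start <- as.integer(grepl("^comp", ids))   # only a leading "comp"

# startsWith() (base R 3.3.0 and later) is a regex-free way to test the prefix.
prefix <- as.integer(startsWith(ids, "comp"))
```

If every comp identifier really begins with comp, as in the sample rows, both patterns give the same result; the anchored form only guards against other names that happen to contain the substring. Note also that as.numeric() yields a double column, whereas as.integer() yields an integer one.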
