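Remove rows from a data frame where a cell contains more than one character

Date: 2017-02-13 03:52:17

Tags: r dataframe filter dplyr subset

I have a data frame with 8 columns and many rows. I want to remove the rows in which column 6 or column 7 contains more than one character, and output a data frame in which columns 6 and 7 each contain only a single character.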

DF:

ID  Content_ID  Chromosome  Start   Stop    Reference   Alternate   Length
1299675221  backbone    12  99675221    99675221    GG  T   0
1298583685  backbone    12  98583685    98583685    C   T   0
129833474   backbone    12  9833474     9833474     C   T   0
1297722695  backbone    12  97722695    97722695    A   G   0
1297381269  backbone    12  97381269    97381269    T   C   0
1297081605  backbone    12  97081605    97081605    G   AA  0
1297058068  backbone    12  97058068    97058068    T   C   0
1295891848  backbone    12  95891848    95891848    CCTT    ATA     0
1294164312  backbone    12  94164312    94164312    T   C   0
12940191    backbone    12  940191      940191      T   C   0

Desired output:

ID  Content_ID  Chromosome  Start   Stop    Reference   Alternate   Length
1298583685  backbone    12  98583685    98583685    C   T   0
129833474   backbone    12  9833474     9833474     C   T   0
1297722695  backbone    12  97722695    97722695    A   G   0
1297381269  backbone    12  97381269    97381269    T   C   0
1297058068  backbone    12  97058068    97058068    T   C   0
1294164312  backbone    12  94164312    94164312    T   C   0
12940191    backbone    12  940191      940191      T   C   0

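3 answers:

Answer 0: (score: 3)

We can use lapply to loop over columns 6 and 7 and check whether the number of characters in each cell equals 1, use Reduce with `&` to combine the corresponding elements of the resulting list into a single logical vector, and use that vector to subset the rows of 'df':
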
df[Reduce(`&`, lapply(df[6:7], function(x) nchar(x)==1)),]
#        ID Content_ID Chromosome    Start     Stop Reference Alternate Length
#2  1298583685   backbone         12 98583685 98583685         C         T      0
#3   129833474   backbone         12  9833474  9833474         C         T      0
#4  1297722695   backbone         12 97722695 97722695         A         G      0
#5  1297381269   backbone         12 97381269 97381269         T         C      0
#7  1297058068   backbone         12 97058068 97058068         T         C      0
#9  1294164312   backbone         12 94164312 94164312         T         C      0
#10   12940191   backbone         12   940191   940191         T         C      0
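
To make the intermediate steps visible, here is a minimal sketch that rebuilds the sample data from the question and splits the one-liner into pieces. The names pos, flags and keep are introduced here for illustration only. It assumes Reference and Alternate are character columns; if they were read in as factors, nchar() would need as.character() first.

```r
# Rebuild the question's sample data; stringsAsFactors = FALSE keeps
# Reference and Alternate as character, which nchar() requires.
pos <- c(99675221, 98583685, 9833474, 97722695, 97381269,
         97081605, 97058068, 95891848, 94164312, 940191)
df <- data.frame(
  ID = c(1299675221, 1298583685, 129833474, 1297722695, 1297381269,
         1297081605, 1297058068, 1295891848, 1294164312, 12940191),
  Content_ID = "backbone",
  Chromosome = 12,
  Start = pos,
  Stop = pos,
  Reference = c("GG", "C", "C", "A", "T", "G", "T", "CCTT", "T", "T"),
  Alternate = c("T", "T", "T", "G", "C", "AA", "C", "ATA", "C", "C"),
  Length = 0,
  stringsAsFactors = FALSE
)

# One logical vector per column: TRUE where the cell is a single character.
flags <- lapply(df[6:7], function(x) nchar(x) == 1)

# Reduce(`&`, ...) combines the vectors element-wise, so a row is kept
# only when both columns pass the check.
keep <- Reduce(`&`, flags)

which(keep)  # row numbers kept: 2 3 4 5 7 9 10
result <- df[keep, ]
```

Because Reduce folds over the whole list, the same pattern extends to more than two columns by widening the df[6:7] selection.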

Another option is rowSums:

df[!rowSums(nchar(as.matrix(df[6:7]))!=1),]
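
The rowSums one-liner is compact, so here is a rough step-by-step sketch on a trimmed-down sample (only the two allele columns, five rows picked from the question's data). The names bad, n_bad, keep and kept are illustrative, not part of the original answer.

```r
# Trimmed sample: only the two columns that are being checked.
df <- data.frame(
  Reference = c("GG", "C", "G", "CCTT", "T"),
  Alternate = c("T", "T", "AA", "ATA", "C"),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE marks a cell whose length is NOT 1.
bad <- nchar(as.matrix(df)) != 1

# Number of offending cells per row (TRUE counts as 1).
n_bad <- rowSums(bad)

# !n_bad is TRUE exactly where the count is 0, i.e. no offending cell.
keep <- !n_bad

kept <- df[keep, ]  # rows 2 and 5 remain
```

Negating a count relies on 0 being coerced to FALSE and any positive number to TRUE; writing n_bad == 0 would be an equivalent, more explicit spelling.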

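Answer 1: (score: 2)

Similarly, you can paste the two columns together and keep the rows where the number of characters equals 3: one character from each column plus the single space that paste inserts between them.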

df[nchar(paste(df$Reference, df$Alternate)) == 3,]
           ID Content_ID Chromosome    Start     Stop Reference Alternate Length
2  1298583685   backbone         12 98583685 98583685         C         T      0
3   129833474   backbone         12  9833474  9833474         C         T      0
4  1297722695   backbone         12 97722695 97722695         A         G      0
5  1297381269   backbone         12 97381269 97381269         T         C      0
7  1297058068   backbone         12 97058068 97058068         T         C      0
9  1294164312   backbone         12 94164312 94164312         T         C      0
10   12940191   backbone         12   940191   940191         T         C      0
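
One thing to keep in mind with the paste approach is that it checks the combined length rather than each column separately. The sketch below uses a small made-up sample (the empty-string row is hypothetical and does not appear in the question's data) to show where the two checks can disagree.

```r
# Hypothetical sample: row 2 has an empty Reference and a two-character Alternate.
df <- data.frame(
  Reference = c("C", "", "GG"),
  Alternate = c("T", "AT", "T"),
  stringsAsFactors = FALSE
)

# Combined-length check: "C T", " AT" and "GG T" have 3, 3 and 4 characters.
by_paste <- nchar(paste(df$Reference, df$Alternate)) == 3

# Per-column check: each cell must be exactly one character long.
by_column <- nchar(df$Reference) == 1 & nchar(df$Alternate) == 1

by_paste   # TRUE  TRUE FALSE
by_column  # TRUE FALSE FALSE
```

If empty cells cannot occur in the data, as appears to be the case in the question's sample, both checks select the same rows.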

Answer 2: (score: 1)

Using data.table

It is this simple:
library(data.table)

setDT(df)
df <- df[ nchar(Reference)==1 & nchar(Alternate)==1]
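
Since the question is also tagged dplyr, the same condition could be written with dplyr::filter. This is only a sketch on a trimmed sample (an ID column plus the two allele columns, five rows from the question's data); it assumes the dplyr package is installed, and the name out is illustrative.

```r
library(dplyr)

# Trimmed sample taken from the question's data.
df <- data.frame(
  ID = c(1299675221, 1298583685, 1297081605, 1295891848, 12940191),
  Reference = c("GG", "C", "G", "CCTT", "T"),
  Alternate = c("T", "T", "AA", "ATA", "C"),
  stringsAsFactors = FALSE
)

# filter() keeps the rows where every condition is TRUE;
# conditions separated by commas are combined with a logical AND.
out <- filter(df, nchar(Reference) == 1, nchar(Alternate) == 1)

out$ID  # 1298583685 12940191
```

Unlike setDT, which converts df to a data.table in place, filter returns a new data frame and leaves df unchanged.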