Removing duplicates from data

Time: 2014-02-06 16:58:35

Tags: regex r duplicates

I have loaded 20 csv files with the following calls:

# pattern is a regular expression, not a shell glob
tbl = list.files(pattern = "\\.csv$")
list_of_data = lapply(tbl, read.csv)

I combined all of these tables into one:

library(plyr)  # provides rbind.fill
all_data = do.call(rbind.fill, list_of_data)

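The new table has a column named "Accession". After combining the files, many of the names (accessions) are repeated, and I want to remove all of the duplicates. Another problem is that some of these names are almost identical: the only difference is the dot and number that follow the name.

Let me show you what it looks like: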

AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--

<-- = same sample, different name. These should be treated as one, so the dot and the number should simply be ignored.

I tried this:

library(stringr)  # provides str_extract
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))

Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) : 

2 answers:

Answer 0: (score: 2)

You can use this command to strip the suffix from the values and keep only the first occurrence of each:

subset(transform(all_data, Accession = sub("\\..*", "", Accession)), 
       !duplicated(Accession))

   Accession
1  AT3G26450
2  AT5G44520
3  AT4G24770
4  AT2G37220
5  AT3G02520
6  AT5G05270
7  AT1G32060
8  AT3G52380
9  AT2G43910
10 AT2G19760
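
As for the error in the question: `value = character(0)` suggests the regex call received zero-length input, which is what happens when the column being read does not exist (for example, a different spelling or capitalisation in the csv header), because `all_data$Accession` is then `NULL`. That is only one plausible cause, not a confirmed diagnosis. A minimal sketch that checks the column first and then deduplicates, using a three-row stand-in for `all_data` and a hypothetical `deduped` result name:

```r
# Small stand-in for all_data, built from values in the question's listing.
all_data <- data.frame(
  Accession = c("AT3G26450.1", "AT5G44520.2", "AT3G26450.2"),
  stringsAsFactors = FALSE
)

# Guard: a missing or misspelled column yields NULL, the regex call then
# returns character(0), and that cannot be assigned to a data frame with rows.
stopifnot("Accession" %in% names(all_data))

# Drop the first dot and everything after it, then keep first occurrences only.
all_data$CleanedAccession <- sub("\\..*", "", all_data$Accession)
deduped <- all_data[!duplicated(all_data$CleanedAccession), , drop = FALSE]
```

Running `names(all_data)` on the real data shows which column names were actually read from the files.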

Answer 1: (score: 1)

How about this?
df  <- data.frame( Accession = c("AT3G26450.1",
                   "AT5G44520.2",
                   "AT4G24770.1",
                   "AT2G37220.2",
                   "AT3G02520.1",
                   "AT5G05270.1",
                   "AT1G32060.1",
                   "AT3G52380.1",
                   "AT2G43910.2",
                   "AT2G19760.1",
                   "AT3G26450.2"))

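# Split each accession at the literal dot, keep the part before it,
# and drop rows whose prefix has already been seen.
df[!duplicated(unlist(lapply(strsplit(as.character(df$Accession), 
   ".", fixed = TRUE),  "[", 1))), , drop = FALSE]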
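
The same split-and-keep-the-prefix idea can be written a little more defensively with `vapply`, which guarantees exactly one character key per row. This is only a sketch: `accession_key` and `unique_rows` are hypothetical names introduced here, and the three-row data frame is a shortened version of the sample above.

```r
# Hypothetical helper: return the text before the first literal dot.
accession_key <- function(x) {
  vapply(strsplit(as.character(x), ".", fixed = TRUE),
         function(parts) parts[1], character(1))
}

df <- data.frame(Accession = c("AT3G26450.1", "AT5G44520.2", "AT3G26450.2"))

# Keep the first row for each key; drop = FALSE keeps the result a data frame
# even though it has a single column.
unique_rows <- df[!duplicated(accession_key(df$Accession)), , drop = FALSE]
```

Keeping the key in a separate function leaves the original `Accession` values untouched, so the version suffix is still available for the rows that remain.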