I have loaded 20 CSV files with the following code:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
I then combined all of them into one table:
library(plyr)  # rbind.fill() comes from the plyr package
all_data = do.call(rbind.fill, list_of_data)
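The new table has a column named "Accession". After combining the files, many of the names (Accession values) are repeated, and I would like to remove all the duplicates. Another problem is that some of those names are almost identical: the only difference is that the name is followed by a dot and a number.
Let me show you what it looks like: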
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
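<-- = the same sample under a different name. These should be treated as one, so the dot and the number should simply be ignored.
I tried this: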
library(stringr)  # str_extract() comes from the stringr package
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) :
Answer 0 (score: 2)
You can use this command to strip the suffix and subset the values:
# Remove everything from the first dot onward, then keep the first row per ID
subset(transform(all_data, Accession = sub("\\..*", "", Accession)),
       !duplicated(Accession))
   Accession
1 AT3G26450
2 AT5G44520
3 AT4G24770
4 AT2G37220
5 AT3G02520
6 AT5G05270
7 AT1G32060
8 AT3G52380
9 AT2G43910
10 AT2G19760
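The command above overwrites the column with the stripped IDs. If the original values (including the ".1" / ".2" suffix) need to be kept, a minimal sketch along the same lines is shown below. It assumes a small stand-in vector `acc` in place of the real column, and the helper names `base_id` and `keep` are only illustrative.

```r
# Stand-in for all_data$Accession (sample values taken from the question)
acc <- c("AT3G26450.1", "AT5G44520.2", "AT3G26450.2")

# Drop the first dot and everything after it
base_id <- sub("\\..*", "", acc)

# TRUE for the first occurrence of each base ID, FALSE for later repeats
keep <- !duplicated(base_id)

# Original values, de-duplicated on the stripped ID
acc[keep]
```

With a data frame, the same logical vector can be used as a row index, for example `all_data[keep, , drop = FALSE]`.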
Answer 1 (score: 1)
How about this?
df <- data.frame( Accession = c("AT3G26450.1",
"AT5G44520.2",
"AT4G24770.1",
"AT2G37220.2",
"AT3G02520.1",
"AT5G05270.1",
"AT1G32060.1",
"AT3G52380.1",
"AT2G43910.2",
"AT2G19760.1",
"AT3G26450.2"))
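# Split each ID at the literal dot and keep the part before it
base_id <- sapply(strsplit(as.character(df$Accession), ".", fixed = TRUE), "[", 1)
# Keep the first row for each base ID; drop = FALSE keeps the result a data frame
df[!duplicated(base_id), , drop = FALSE]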
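As for the error in the question, `value = character(0)` means the extraction step returned an empty vector. One possible cause (an assumption, since the real files are not shown) is that `all_data$Accession` is NULL because the column header is spelled or capitalised differently in the CSV files. The sketch below reproduces that situation with a hypothetical lower-case header and guards against it before cleaning.

```r
# Hypothetical stand-in for all_data whose header differs in case
all_data <- data.frame(accession = c("AT3G26450.1", "AT5G44520.2", "AT3G26450.2"),
                       stringsAsFactors = FALSE)

# The column lookup fails silently: this is NULL, so string functions
# applied to it return character(0)
missing_col <- is.null(all_data$Accession)

# Guard: find the column case-insensitively and insist on exactly one match
col <- grep("^accession$", names(all_data), ignore.case = TRUE, value = TRUE)
stopifnot(length(col) == 1)

# Clean and de-duplicate using the column that actually exists
all_data$CleanedAccession <- sub("\\..*", "", all_data[[col]])
deduped <- all_data[!duplicated(all_data$CleanedAccession), , drop = FALSE]
```

Checking `names(all_data)` after the merge is a quick way to confirm which spelling the files really use.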