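This seems like a simple question, but it has been giving me a real headache (this is not homework, but an actual sticking point in my research).
I have a list with 2266 levels. The list looks roughly like this: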
[1] ~/folder1/folder1/a.bin
[2] ~/folder1/folder1/b.bin
[3] ~/folder1/folder1/c.bin
[4] ~/folder1/folder2/a.bin
[5] ~/folder1/folder2/b.bin
[6] ~/folder1/folder2/c.bin
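Explanation: the list holds the file names of binary files that I read with the readBin function. I want to compare every row with every other row, so what I need is two columns containing all unique combinations derived from my single column.
choose(2266, 2) tells me that there are 2566245 ways to pick two items from my single column.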
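As a quick sanity check on the counts, here is a minimal sketch of the arithmetic. The variable names are only illustrative; the numbers follow directly from n = 2266.

```r
n <- 2266

# Unordered pairs: (a, b) and (b, a) count once
unordered <- choose(n, 2)      # 2566245

# Ordered pairs without self-pairs: each unordered pair appears twice
ordered <- n * (n - 1)         # 5132490

# A full all-against-all grid would additionally contain the n self-pairs
full_grid <- n^2               # 5134756
```

So a result with 5132490 rows is exactly twice the number of unordered pairs.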
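expand.grid() seems to get me halfway there, but I get twice as many combinations as I need: two columns with 5132490 rows. That means there are duplicates: 1 + 2 and 2 + 1 are the same to me.
expand.grid.df with unique=TRUE does not seem to help either.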
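To illustrate where the doubling comes from, here is a small sketch on a hypothetical four-element vector (the file names are made up). It builds the grid on indices and keeps only the rows where the first index is smaller than the second, which drops both the mirrored pairs and the self-pairs. This is one possible workaround, not necessarily the most efficient one for 2266 items.

```r
# Hypothetical stand-in for the real file list
v <- c("a.bin", "b.bin", "c.bin", "d.bin")

# All ordered index pairs, including (i, i) and both (i, j) and (j, i)
idx <- expand.grid(i = seq_along(v), j = seq_along(v))
n_all <- nrow(idx)

# Keep each unordered pair exactly once
keep <- idx[idx$i < idx$j, ]
pairs <- data.frame(first  = v[keep$i],
                    second = v[keep$j],
                    stringsAsFactors = FALSE)
n_pairs <- nrow(pairs)
```

Here n_all is length(v)^2 and n_pairs equals choose(length(v), 2).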
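My last idea was to md5-hash each of the 5 million rows and try to detect the duplicates that way.
I am looking for a way to produce two lists that cover the 2566245 combinations of my list, or for some way to remove all the duplicates. I am not absolutely set on using R and have looked into doing the same thing with awk or sed, but without success.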
Answer 0 (score: 2)
I think you are looking for combn rather than expand.grid. Using @Arun's data:
v <- c("~/folder1/folder1/a.bin",
"~/folder1/folder1/b.bin",
"~/folder1/folder1/c.bin",
"~/folder1/folder2/a.bin",
"~/folder1/folder2/b.bin",
"~/folder1/folder2/c.bin")
do.call(rbind, combn(v, 2, simplify = FALSE))
[,1] [,2]
[1,] "~/folder1/folder1/a.bin" "~/folder1/folder1/b.bin"
[2,] "~/folder1/folder1/a.bin" "~/folder1/folder1/c.bin"
[3,] "~/folder1/folder1/a.bin" "~/folder1/folder2/a.bin"
[4,] "~/folder1/folder1/a.bin" "~/folder1/folder2/b.bin"
[5,] "~/folder1/folder1/a.bin" "~/folder1/folder2/c.bin"
[6,] "~/folder1/folder1/b.bin" "~/folder1/folder1/c.bin"
[7,] "~/folder1/folder1/b.bin" "~/folder1/folder2/a.bin"
[8,] "~/folder1/folder1/b.bin" "~/folder1/folder2/b.bin"
[9,] "~/folder1/folder1/b.bin" "~/folder1/folder2/c.bin"
[10,] "~/folder1/folder1/c.bin" "~/folder1/folder2/a.bin"
[11,] "~/folder1/folder1/c.bin" "~/folder1/folder2/b.bin"
[12,] "~/folder1/folder1/c.bin" "~/folder1/folder2/c.bin"
[13,] "~/folder1/folder2/a.bin" "~/folder1/folder2/b.bin"
[14,] "~/folder1/folder2/a.bin" "~/folder1/folder2/c.bin"
[15,] "~/folder1/folder2/b.bin" "~/folder1/folder2/c.bin"
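As a side note, the same two-column matrix can probably be obtained a bit more directly by transposing the default output of combn, which returns one combination per column. A minimal sketch, rebuilding the same six paths with sprintf:

```r
# Rebuild the six example paths
v <- sprintf("~/folder1/folder%d/%s.bin",
             rep(1:2, each = 3),
             rep(c("a", "b", "c"), times = 2))

# Approach from above: one list element per pair, bound row-wise
a <- do.call(rbind, combn(v, 2, simplify = FALSE))

# Alternative: combn() returns a 2 x 15 matrix by default; transpose it
b <- t(combn(v, 2))

same_shape  <- all(dim(a) == dim(b))
same_values <- all(a == b)
```

Both approaches give one row per unordered pair, so either can be used to drive the file comparisons.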
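Edit
I think the path format makes the problem look more complicated than it is. If we use letters instead of file names, for example, we get: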
do.call(rbind, combn(letters[1:4], 2, simplify = FALSE))
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
[3,] "a" "d"
[4,] "b" "c"
[5,] "b" "d"
[6,] "c" "d"
So, as you can see, there are no duplicates.
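If you want to check this programmatically rather than by eye, one possible sketch is to build an order-independent key for each row and look for repeats. The key construction with sort and paste is just one way to do it, and the "|" separator is assumed not to occur in the values.

```r
v <- letters[1:4]
pairs <- t(combn(v, 2))

# Order-independent key: "a|b" for both (a, b) and (b, a)
key <- apply(pairs, 1, function(p) paste(sort(p), collapse = "|"))

n_dup    <- sum(duplicated(key))             # mirrored or repeated pairs
has_self <- any(pairs[, 1] == pairs[, 2])    # pairs of an item with itself
n_rows   <- nrow(pairs)
```

For the letters example this gives no duplicates and no self-pairs, and n_rows matches choose(4, 2).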