我一直在尝试几种不同的方法(聚合,colSum,计数等),但是却很挣扎。
我有一个很大的样本结果数据集(100列,1,000,000行),在这里我已将各个样本名称替换为它们所属类别的名称。 (我需要按类别统计每个数据点的总数。
Sequence position cat1 cat1 cat2 cat2 cat2 cat3 cat3
abfsgdfy a 0 1 0 1 0 1 1
abfsgdfy b 0 0 1 1 1 1 0
dgdtecgd b 1 1 1 0 0 0 0
我知道不希望具有相同的列名,因此我一直在尝试转置数据,然后使用它。但这也没有使我走得太远。
我正在寻找的输出将类似于:
Sequence position cat1 cat2 cat3
abfsgdfy a 1 1 2
abfsgdfy b 0 3 1
dgdtecgd b 2 1 0
如果有帮助,我有一张表格,可将样品名称转换为较大的类别组:
Type Name
cat1 sample1
cat1 sample2
cat2 sample3
cat2 sample4
cat2 sample5
cat3 sample6
cat3 sample7
感谢您的帮助!
答案 0 :(得分:1)
这个怎么样?
library("tidyverse")
df = tibble(c("abfsgdfy", "abfsgdfy", "dgdtecgd"),
c("a", "b", "b"),
c(0, 0, 1),
c(1, 0, 1),
c(0, 1, 1),
c(1, 1, 0),
c(0, 1, 0))
colnames(df) = c("Sequence", "position", "cat1", "cat1", "cat2", "cat2", "cat2")
# rename column names to make them unique:
newcols = c("Sequence", "position", paste0("c", 1:(ncol(df)-2)))
oldcols = colnames(df)
colnames(df) = newcols
# make a crosswalk
col_cross = tibble(oldcols = oldcols,
newcols = newcols)
# gather the new columns, summarize across the old columns, spread
df %>%
gather(key = "newcols", value = "val", -Sequence, -position) %>%
left_join(col_cross) %>%
group_by(Sequence, position, oldcols) %>%
summarize(n = sum(val)) %>%
spread(key = oldcols, value = n, fill = 0)
答案 1 :(得分:0)
另一种tidyverse
可能性:
repair_names(df, prefix = "cat", sep = "_") %>%
rowid_to_column() %>%
gather(var, val, -c(Sequence, position, rowid)) %>%
mutate(temp = sub("\\_.*", "", var)) %>%
arrange(rowid) %>%
group_by(Sequence, position, temp) %>%
summarise(res = sum(val),
rowid = first(rowid)) %>%
spread(temp, res) %>%
select(-rowid)
Sequence position cat1 cat2
<chr> <chr> <dbl> <dbl>
1 abfsgdfy a 1. 1.
2 abfsgdfy b 0. 3.
3 dgdtecgd b 2. 1.
首先,它将列名更改为具有"cat"
前缀和"_"
作为分隔符。其次,它将数据从宽格式转换为长格式。第三,使用sub("\\_.*", "", var)
将字符串放在"_"
之前。然后,将这些值相加。最后,它将数据恢复为宽格式。
使用@svenhalvorson提供的示例数据:
df <- tibble(c("abfsgdfy", "abfsgdfy", "dgdtecgd"),
c("a", "b", "b"),
c(0, 0, 1),
c(1, 0, 1),
c(0, 1, 1),
c(1, 1, 0),
c(0, 1, 0))
colnames(df) <- c("Sequence", "position", "cat1", "cat1", "cat2", "cat2", "cat2")