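I have a data frame (hits_map) that lists genes (rows) against the binding sites found within each gene (columns). The values give the number of sites within each gene, with NA standing for 0.
This is a small subset, as the actual data frame is much larger: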
        AscG Dan.4 IclR.3 MraZ.1
afaE      NA     1     NA      1
afaF      NA    NA     NA     NA
agn43.1    1    NA      1     NA
agn43.2    1    NA     NA     NA
agn43.3    1    NA     NA     NA
chuA      NA    NA     NA      1
csgA       1    NA     NA      1
csgB      NA    NA     NA     NA
csgC      NA    NA     NA     NA
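For each row, I would like to get a list of the binding sites (column names) that have a value, which I can then use to pull the matching rows out of a corresponding data frame, nameseq, to get further information.
At the moment I do this one row at a time with the commands below, using a function remove_zero_cols to drop the 0 values, but I would like to be able to do it for every row by passing in the whole data frame.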
vec <- hits_map[row,]
vec <- remove_zero_cols(vec)
vec <- colnames(vec)
nameseq[nameseq$Name %in% vec,]
Any suggestions on how to approach this?
Answer 0: (score: 1)
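One way is to convert the data frame, row by row, into a single vector and create a logical vector based on the value you are looking for, making sure the NA entries are dealt with (the code below blanks out the NA names that result from them). Then create a vector of repeated column names, the same length as the logical vector, subset it, and convert it back into a matrix.
The rows of that matrix line up with the rows of the original data frame, so you can use the same row index to get the column names for each row: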
> set.seed(1)
> DF = data.frame(first = sample(c(NA,1), 5, T), second = sample(c(NA,1), 5, T),
+ third = sample(c(NA,1), 5, T), fourth = sample(c(NA,1), 5, T),
+ fifth = sample(c(NA,1), 5, T))
> DF
  first second third fourth fifth
1    NA      1    NA     NA     1
2    NA      1    NA      1    NA
3     1      1     1      1     1
4     1      1    NA     NA    NA
5    NA     NA     1      1    NA
> DFvector = as.vector(t(DF))
> DFvector
[1] NA 1 NA NA 1 NA 1 NA 1 NA 1 1 1 1 1 1 1 NA NA NA NA NA 1 1 NA
# Create a repeated vector of column names
> columnNames = rep(colnames(DF), times = nrow(DF))
> myNames = columnNames[as.logical(DFvector)]
> myNames[is.na(myNames)] = ""
> myNames
[1] "" "second" "" "" "fifth" "" "second" "" "fourth" "" "first"
[12] "second" "third" "fourth" "fifth" "first" "second" "" "" "" "" ""
[23] "third" "fourth" ""
# Convert to matrix, by row
> myMatrix = matrix(myNames, ncol = ncol(DF), byrow = T)
# Can group per row, by using assertr package
> library(assertr)
> library(stringr)
> concat = assertr::col_concat(myMatrix[], sep = " ")
> concat
[1] " second fifth" " second fourth " "first second third fourth fifth"
[4] "first second " " third fourth "
> noWS = trimws(concat)
> noWS
[1] "second fifth" "second fourth" "first second third fourth fifth"
[4] "first second" "third fourth"
> noS = gsub(pattern = "\\s+", replacement = " ", x = noWS)
> noS
[1] "second fifth" "second fourth" "first second third fourth fifth"
[4] "first second" "third fourth"
> stringr::str_split(noS, " ", simplify = T)
     [,1]     [,2]     [,3]    [,4]     [,5]
[1,] "second" "fifth"  ""      ""       ""
[2,] "second" "fourth" ""      ""       ""
[3,] "first"  "second" "third" "fourth" "fifth"
[4,] "first"  "second" ""      ""       ""
[5,] "third"  "fourth" ""      ""       ""
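If the padded matrix is not needed, the same idea (a logical mask over the values, then a subset of the column names) can probably be written with base R only. This is a minimal sketch rather than a tested drop-in: DF is retyped by hand to match the printout above so that it does not depend on the random seed, and namesPerRow is a name introduced here for illustration.

```r
# DF typed in by hand to match the printed example above
DF <- data.frame(
  first  = c(NA, NA, 1, 1, NA),
  second = c(1, 1, 1, 1, NA),
  third  = c(NA, NA, 1, NA, 1),
  fourth = c(NA, 1, 1, NA, 1),
  fifth  = c(1, NA, 1, NA, NA)
)

# Logical matrix: TRUE wherever a value is present (not NA)
hasValue <- !is.na(as.matrix(DF))

# One character vector of column names per row; rows may differ in length,
# so the result is a list instead of a padded matrix
namesPerRow <- lapply(seq_len(nrow(hasValue)), function(i) {
  colnames(hasValue)[hasValue[i, ]]
})

namesPerRow[[1]]
# [1] "second" "fifth"
```

Because the result is a list, a row with no values simply gives character(0) and there are no empty strings to clean up afterwards.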
I was hoping someone could post a dplyr / data.table alternative, as this approach is quite tedious.
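Applied to the data from the question, a base R loop over the rows might look like the sketch below. The hits_map subset is copied from the question. The nameseq data frame here is a made-up stand-in: only its Name column comes from the question, and the Info column and its values are placeholders. sites_per_gene and info_per_gene are names introduced for this example.

```r
# Subset of hits_map as shown in the question
hits_map <- as.data.frame(matrix(
  c(NA,  1, NA,  1,
    NA, NA, NA, NA,
     1, NA,  1, NA,
     1, NA, NA, NA,
     1, NA, NA, NA,
    NA, NA, NA,  1,
     1, NA, NA,  1,
    NA, NA, NA, NA,
    NA, NA, NA, NA),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("afaE", "afaF", "agn43.1", "agn43.2", "agn43.3",
      "chuA", "csgA", "csgB", "csgC"),
    c("AscG", "Dan.4", "IclR.3", "MraZ.1")
  )
))

# Hypothetical stand-in for nameseq: the real one only needs a Name column
nameseq <- data.frame(
  Name = c("AscG", "Dan.4", "IclR.3", "MraZ.1"),
  Info = c("placeholder 1", "placeholder 2", "placeholder 3", "placeholder 4")
)

# TRUE where a site count is present and non-zero (NA and 0 both count as "no site")
x <- as.matrix(hits_map)
hasSite <- !is.na(x) & x != 0

# Binding-site names for every gene, as a list named by gene
sites_per_gene <- lapply(seq_len(nrow(hasSite)), function(i) {
  colnames(hasSite)[hasSite[i, ]]
})
names(sites_per_gene) <- rownames(hasSite)

# Matching rows of nameseq for every gene
info_per_gene <- lapply(sites_per_gene, function(sites) {
  nameseq[nameseq$Name %in% sites, , drop = FALSE]
})

sites_per_gene[["afaE"]]
# [1] "Dan.4"  "MraZ.1"
```

Genes without any site, such as afaF, end up with an empty character vector and a zero-row data frame, so they can be filtered out afterwards if they are not wanted.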