I have the following data frame (79,000 rows):
ID P1 P2 P3 P4 P5 P6 P7 P8
1 38005 28002 38005 38005 28002 34002 NA NA
2 28002 28002 28002 38005 28002 NA NA NA
I want to count how many times each number (code) appears in each row of the data frame. So the output would look like this:
38005 appears 3 28002 appears 2 34002 appears 1 NA appears 2
28002 appears 3 38005 appears 1 28002 appears 1 NA appears 3
So far, I have tried to find the most frequent number (code) in each row:
df$frequency <-apply(df,1,function(x) names(which.max(table(x))))
But I don't know how to count the number of times each number (code) appears within a row.
Answer 0 (score: 1)
Using tidyverse and reshape2, you can do the following:
library(tidyverse)
library(reshape2)
df %>%
gather(var, val, -ID) %>% #Transforming the data from wide to long format
group_by(val, ID) %>% #Grouping
summarise(count = n()) %>% #Performing the count
dcast(ID~val, value.var = "count") #Reshaping the data
ID 28002 34002 38005 NA
1 1 2 1 3 2
2 2 4 NA 1 3
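In current tidyr releases gather() is superseded by pivot_longer(), and pivot_wider() can stand in for the reshape2 dcast() step. Below is a minimal sketch of the same per-ID count under that assumption. The data frame is rebuilt from the two sample rows in the question, and the explicit "NA" label is an illustrative choice rather than part of the answer above:

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the two rows shown in the question
df <- data.frame(
  ID = 1:2,
  P1 = c(38005, 28002), P2 = c(28002, 28002), P3 = c(38005, 28002),
  P4 = c(38005, 38005), P5 = c(28002, 28002), P6 = c(34002, NA),
  P7 = c(NA_real_, NA_real_), P8 = c(NA_real_, NA_real_)
)

counts <- df %>%
  pivot_longer(-ID, names_to = "var", values_to = "val") %>%      # wide to long
  mutate(val = if_else(is.na(val), "NA", as.character(val))) %>%  # label missing values explicitly
  count(ID, val, name = "count") %>%                              # occurrences per ID and code
  pivot_wider(names_from = val, values_from = count,
              values_fill = 0)                                    # long to wide, 0 where a code is absent

counts
```

Unlike the dcast() output above, codes that do not occur for an ID show up as 0 instead of NA because of values_fill.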
To show, for each ID, the two non-NA codes with the highest counts:
df %>%
gather(var, val, -ID) %>% #Transforming the data from wide to long format
group_by(val, ID) %>% #Grouping
mutate(temp = n()) %>% #Performing the count
group_by(ID) %>% #Grouping
mutate(temp2 = dense_rank(temp)) %>% #Creating the rank based on count
group_by(ID, val) %>% #Grouping
summarise(temp3 = first(temp2), #Summarising
temp = first(temp)) %>%
arrange(ID, desc(temp3)) %>% #Arranging
na.omit() %>% #Deleting the rows with NA
group_by(ID) %>%
mutate(temp4 = ifelse(temp3 == first(temp3) | temp3 == nth(temp3, 2), 1, 0)) %>% #Identifying the highest and the second highest count
filter(temp4 == 1) %>% #Selecting the highest and the second highest count
dcast(ID~val, value.var = "temp") #Reshaping the data
ID 28002 38005
1 1 2 3
2 2 4 1
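A shorter route to a similar result might use dplyr's slice_max(). This is a different technique from the dense_rank() pipeline above, not a rewrite of it: with ties it keeps every row tied at the cut-off rather than the two highest distinct counts. The sketch stays in long format and assumes the two sample rows from the question:

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the two rows shown in the question
df <- data.frame(
  ID = 1:2,
  P1 = c(38005, 28002), P2 = c(28002, 28002), P3 = c(38005, 28002),
  P4 = c(38005, 38005), P5 = c(28002, 28002), P6 = c(34002, NA),
  P7 = c(NA_real_, NA_real_), P8 = c(NA_real_, NA_real_)
)

top2 <- df %>%
  pivot_longer(-ID, names_to = "var", values_to = "val",
               values_drop_na = TRUE) %>%   # wide to long, dropping NA values
  count(ID, val, name = "count") %>%        # occurrences per ID and code
  group_by(ID) %>%
  slice_max(count, n = 2) %>%               # two most frequent codes per ID (ties are kept)
  ungroup()

top2
```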
Answer 1 (score: 0)
ID <- c("P1","P2","P3","P4","P5","P6","P7","P8","P1","P2","P3","P4","P5","P6","P7","P8") # column label of each value
count <- c("38005","28002","38005","38005","28002","34002","NA","NA","28002","28002","28002","38005","28002","NA","NA","NA")
df <- cbind.data.frame(ID, count)
table(df$count) # overall count of each code; "NA" is stored as a string here, so it is counted too
Use this code to find the counts.
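One caveat when the missing values are real NA rather than the string "NA": table() drops them by default, and its useNA argument brings them back. A small sketch on the first sample row from the question (the vector name x is only for illustration):

```r
# First sample row from the question, with real missing values
x <- c(38005, 28002, 38005, 38005, 28002, 34002, NA, NA)

without_na <- table(x)                   # NA values are silently dropped
with_na    <- table(x, useNA = "ifany")  # adds a count for NA when any are present

without_na
with_na
```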
Answer 2 (score: 0)
I think this is what you are looking for.
sort(table(unlist(df1[-1])), decreasing=TRUE)
# 31002 38005 24003 34002 28002
# 13222 13193 13019 13018 12625
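Here you exclude column 1, which contains the ID, and "unlist" the rest of the data frame into a vector. Then table() counts the occurrences of each value, and you can sort() the result. With the option decreasing=TRUE, the first two values are the two most frequent ones.
If the output gets long because there are many distinct values, you can wrap the code in head(.). The default output length is 6, but you can limit it to 2 by specifying n=2, which gives you exactly the information you need. No packages are required.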
head(sort(table(unlist(df1[-1])), decreasing=TRUE), n=2)
# 31002 38005
# 13222 13193
Data:
set.seed(42) # for sake of reproducibility
df1 <- data.frame(id=1:9750,
matrix(sample(c(38005, 28002, 34002, NA, 24003, 31002), 7.8e4,
replace=TRUE), nrow=9750,
dimnames=list(NULL, paste0("P", 1:8))))
Answer 3 (score: 0)
data.table solution
library(data.table)
#read sample data
dt <- fread( "ID P1 P2 P3 P4 P5 P6 P7 P8
1 38005 28002 38005 38005 28002 34002 NA NA
2 28002 28002 28002 38005 28002 NA NA NA")
#melt
dt.melt <- melt(dt, id = 1, measure = patterns("^P"), na.rm = FALSE)
#and cast
dcast( dt.melt, ID ~ value, fun = length, fill = 0 )
# ID 28002 34002 38005 NA
# 1: 1 2 1 3 2
# 2: 2 4 0 1 3