My data frame has the following structure:
df <- structure(list(NAME1 = c("AAA","CCC","BBB","BBB"),
NAME2 = c("BBB", "AAA","DDD","AAA"),
AMT = c(10, 20, 30, 50)), .Names=c("NAME1","NAME2", "AMT"),
row.names = c("1", "2", "3", "4"), class =("data.frame"))
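I want to create two ID variables (ID1 and ID2) based on the two string columns NAME1 and NAME2. The two columns can share values, so the IDs must be consistent across them. The desired data frame should look like this: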
df <- structure(list(NAME1 = c("AAA","CCC", "BBB", "BBB"),
NAME2 = c("BBB", "AAA","DDD", "AAA"),
ID1 = c(1,3,2,2),
ID2 = c(2,1,4,1),
AMT = c(10,20,30,50)),
.Names = c("NAME1","NAME2","ID1","ID2","AMT"),
row.names = c("1", "2", "3", "4"), class =("data.frame"))
Any suggestions would be much appreciated.
Cheers.
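Answer 0 (score: 1)
Combine the two columns into a single vector, convert it to a factor, and then convert that to numeric. Because both columns pass through the same factor, each name gets the same ID wherever it appears. You can then split the result by the number of rows in df and assign each half back: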
newIDs <- as.numeric(as.factor(c(df$NAME1, df$NAME2)))
df$ID1 <- newIDs[seq_len(nrow(df))]
df$ID2 <- newIDs[-seq_len(nrow(df))]
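As a cross-check of the approach above, the same IDs can be obtained with `match()` against the sorted unique names. This is only a sketch of an equivalent formulation, not part of the original answer; the name `all_names` is introduced here for illustration.

```r
df <- data.frame(NAME1 = c("AAA", "CCC", "BBB", "BBB"),
                 NAME2 = c("BBB", "AAA", "DDD", "AAA"),
                 AMT = c(10, 20, 30, 50),
                 stringsAsFactors = FALSE)

# Shared lookup: every distinct name from both columns, sorted.
# (all_names is an illustrative name, not from the original post.)
all_names <- sort(unique(c(df$NAME1, df$NAME2)))

# match() returns the position of each name in the lookup,
# so a given name maps to the same ID in either column.
df$ID1 <- match(df$NAME1, all_names)
df$ID2 <- match(df$NAME2, all_names)
```

Since `as.factor()` on a character vector also orders its levels by sorting the unique values, both versions should produce the same numbering.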
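Answer 1 (score: 0)
The following answer works and can be adapted to assign IDs across more than one data frame (e.g. df1, df2), provided the variable afact below is created with all the required factor levels. It also uses dplyr to create the new columns.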
library(dplyr)
adf <- structure(list(NAME1 = c("AAA","CCC","BBB","BBB"),
NAME2 = c("BBB", "AAA","DDD","AAA"),
AMT = c(10, 20, 30, 50)), .Names=c("NAME1","NAME2", "AMT"),
row.names = c("1", "2", "3", "4"), class =("data.frame"))
## Create factor based on all unique values.
## include all variables (e.g. NAME1) needed in factor.
afact <- as.factor(unique(sort(c(adf$NAME1, adf$NAME2))))
## Factor level to numeric value.
num.lookup <- function(x) { as.numeric(afact[afact == x])}
# Create the new ID columns using the factor 'afact' and 'num.lookup'
# to assign numeric values consistent across columns.
adf %>%
mutate(ID1 = sapply(NAME1, num.lookup),
ID2 = sapply(NAME2, num.lookup))
# NAME1 NAME2 AMT ID1 ID2
# 1 AAA BBB 10 1 2
# 2 CCC AAA 20 3 1
# 3 BBB DDD 30 2 4
# 4 BBB AAA 50 2 1
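The answer mentions assigning IDs across more than one data frame. A minimal sketch of that idea follows, assuming two hypothetical data frames `df1` and `df2` and a hypothetical helper `add_ids` (none of which appear in the original post). It builds the levels once from every name column and reuses them, using base R instead of dplyr:

```r
# Hypothetical example data; df1 and df2 are illustrative only.
df1 <- data.frame(NAME1 = c("AAA", "CCC"), NAME2 = c("BBB", "AAA"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(NAME1 = c("BBB", "EEE"), NAME2 = c("DDD", "AAA"),
                  stringsAsFactors = FALSE)

# One set of levels built from every name column in both data frames.
all_levels <- sort(unique(c(df1$NAME1, df1$NAME2, df2$NAME1, df2$NAME2)))

# Hypothetical helper: turn each name column into an integer code
# using the shared levels, so IDs agree across data frames.
add_ids <- function(d, lv) {
  d$ID1 <- as.integer(factor(d$NAME1, levels = lv))
  d$ID2 <- as.integer(factor(d$NAME2, levels = lv))
  d
}

df1 <- add_ids(df1, all_levels)
df2 <- add_ids(df2, all_levels)
```

Passing `levels` explicitly to `factor()` is what keeps the numbering stable: a name missing from one data frame still keeps its slot, so "BBB" gets the same ID in both `df1` and `df2`.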