I have two data frames of strings that look like this:
df1
firm_id user_id_source1
"1234" "abc"
"1234" "bcd"
"1234" "cde"
"1234" "def"
"1234" "fgh"
"2345" "zyx"
"2345" "yxw"
"2345" "xwv"
"2345" "wvu"
"2345" "vut"
and
df2
firm_id user_id_source2
"1234" "abc"
"1234" "bcd"
"1234" "ghi"
"1234" "hij"
"2345" "zyx"
"2345" "yxw"
"2345" "utr"
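There are many firms, and each firm has many user IDs. I want to combine the rows where the user IDs match, then append the remaining rows from both sources without losing any data, because it matters which source a user ID came from. Like this: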
result
firm_id user_id_source1 user_id_source2
"1234" "abc" "abc"
"1234" "bcd" "bcd"
"1234" "cde" NA
"1234" "def" NA
"1234" "fgh" NA
"1234" NA "ghi"
"1234" NA "hij"
"2345" "zyx" "zyx"
"2345" "yxw" "yxw"
"2345" "xwv" NA
"2345" "wvu" NA
"2345" "vut" NA
"2345" NA "utr"
I have tried every type of join and merge but cannot figure it out. Any help is much appreciated. Thanks.
Edit: r-schifini's answer below solved this. Thanks. I added
library(dplyr)
df <- df %>% arrange(firm_id, is.na(user_id_source1), is.na(user_id_source2))
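to put the rows in the desired order.
EDIT2: I was already using dplyr, so this was not an issue for me, but I realize that someone down the road might want to stay in base R. Here is how to do the same sort in base R: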
df_ordered <- df[order(df$firm_id, is.na(df$user_id_source1), is.na(df$user_id_source2)),]
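To see why this ordering works: is.na() returns FALSE or TRUE, and FALSE sorts before TRUE, so within each firm the rows with a value in user_id_source1 come first, and among those the rows that also have a value in user_id_source2 come first. A minimal sketch on a made-up three-row data frame (toy and toy_ordered are hypothetical names, not part of my real data):

```r
# Hypothetical toy data: one source2-only row, one source1-only row, one match
toy <- data.frame(
  firm_id = c("1234", "1234", "1234"),
  user_id_source1 = c(NA, "cde", "abc"),
  user_id_source2 = c("ghi", NA, "abc"),
  stringsAsFactors = FALSE
)

# Sort keys per row: (is.na(source1), is.na(source2))
#   row 1: (TRUE,  FALSE)  -> source2 only, sorts last
#   row 2: (FALSE, TRUE)   -> source1 only, sorts second
#   row 3: (FALSE, FALSE)  -> matched, sorts first
toy_ordered <- toy[order(toy$firm_id,
                         is.na(toy$user_id_source1),
                         is.na(toy$user_id_source2)), ]
print(toy_ordered)
```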
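Answer 0 (score: 0)
You need to merge the data frames on two columns: firm_id and user_id (1 and 2). Because the user_id columns have different names and you want to keep both of them, do this: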
df1 = data.frame(firm_id = c('1234','1234','5678','5678'),user_id_1 = c('abc','bcd','abc','xyz'))
df2 = data.frame(firm_id = c('1234','3456','5678'),user_id_2 = c('abc','bcd','xyz'))
# Create a copy of the user_id columns with a common name: 'user'
df1$user = df1$user_id_1
df2$user = df2$user_id_2
# do a full merge (they will be merged by 'firm_id' and 'user')
df = merge(df1, df2, all = TRUE)
# remove the 'user' column
df$user = NULL
Result:
> df1
firm_id user_id_1 user
1 1234 abc abc
2 1234 bcd bcd
3 5678 abc abc
4 5678 xyz xyz
> df2
firm_id user_id_2 user
1 1234 abc abc
2 3456 bcd bcd
3 5678 xyz xyz
> df
firm_id user_id_1 user_id_2
1 1234 abc abc
2 1234 bcd <NA>
3 5678 abc <NA>
4 5678 xyz xyz
5 3456 <NA> bcd
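As a rough check, the same approach can be applied to the data from the question. This is only a sketch: src1, src2, and combined are names made up here, stringsAsFactors = FALSE is an assumption so that the IDs stay plain strings, and the by argument is spelled out only to make the join keys explicit (merge would pick the shared column names by default).

```r
# The question's example data, rebuilt by hand
src1 <- data.frame(
  firm_id = rep(c("1234", "2345"), each = 5),
  user_id_source1 = c("abc", "bcd", "cde", "def", "fgh",
                      "zyx", "yxw", "xwv", "wvu", "vut"),
  stringsAsFactors = FALSE
)
src2 <- data.frame(
  firm_id = c(rep("1234", 4), rep("2345", 3)),
  user_id_source2 = c("abc", "bcd", "ghi", "hij", "zyx", "yxw", "utr"),
  stringsAsFactors = FALSE
)

# Temporary copy of each user id column under a common name
src1$user <- src1$user_id_source1
src2$user <- src2$user_id_source2

# Full outer merge on firm_id and user, then drop the helper column
combined <- merge(src1, src2, by = c("firm_id", "user"), all = TRUE)
combined$user <- NULL

# Count matched rows and rows found in only one source
n_matched <- sum(!is.na(combined$user_id_source1) & !is.na(combined$user_id_source2))
n_only_1  <- sum(is.na(combined$user_id_source2))
n_only_2  <- sum(is.na(combined$user_id_source1))
print(combined)
```

The merged rows come back sorted by the join keys rather than grouped as in the question's desired output, so the ordering step from the question's edit would still be applied afterwards.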