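I have data on how team members from multiple teams rate one another. Each person has their own ID number (StudyID), as well as a team ID and a rater number within the team, as shown below: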
StudyID TeamID CATMERater Rated Rating
(int) (int) (int) (dbl) (dbl)
1 2930 551 1 1 5.000000 #How rater 1 rated 1 (themselves)
2 2938 551 2 1 3.800000 #How rater 2 rated 1
3 2939 551 3 1 5.000000 #How rater 3 rated 1
4 2930 551 1 2 3.666667 #How rater 1 rated 2
5 2938 551 2 2 4.000000 #...
6 2939 551 3 2 3.866667
...
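And so on. I used tidyr to get the data into this format, and I am trying to add a new column containing the StudyID of the person being rated, i.e. the StudyID from the row where TeamID is the same and CATMERater equals Rated. This is what I tried, but it doesn't work because I'm not sure how to refer to the same table: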
edges %>% mutate(RatedStudyID = filter(edges, TeamID == TeamID & Rated == CATMERater))
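I hope this makes sense; I would appreciate any pointers in the right direction. If the answer is left_join, how do I express TeamID == TeamID?
This is what I would like to end up with (mainly the last column):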
StudyID TeamID CATMERater Rated Rating RatedStudyID
(int) (int) (int) (dbl) (dbl)
1 2930 551 1 1 5.000000 2930
2 2938 551 2 1 3.800000 2930
3 2939 551 3 1 5.000000 2930
4 2930 551 1 2 3.666667 2938
5 2938 551 2 2 4.000000 2938
6 2939 551 3 2 3.866667 2938
...
Per @akron's request, here is the dput() output of the data that produces the error:
structure(list(StudyID = c(2930L, 2938L, 2939L, 2930L, 2938L,
2939L, 2930L, 2938L, 2939L, 2930L, 2938L, 2939L, 2930L, 2938L,
2939L, 2930L, 2938L, 2939L, 2920L, 2941L, 2989L, 2920L, 2941L,
2989L, 2920L, 2941L, 2989L, 2920L, 2941L, 2989L, 2920L, 2941L,
2989L, 2920L, 2941L, 2989L, 2922L, 2924L, 2943L, 2922L, 2924L,
2943L, 2922L, 2924L, 2943L, 2922L, 2924L, 2943L, 2922L, 2924L
), TeamID = c(551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L,
551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 552L,
552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L,
552L, 552L, 552L, 552L, 552L, 552L, 553L, 553L, 553L, 553L, 553L,
553L, 553L, 553L, 553L, 553L, 553L, 553L, 553L, 553L), CATMERater = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L,
2L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L), Rated = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6,
6, 6, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 1,
1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5), Rating = c(5, 3.8, 5,
3.66666666666667, 4, 3.86666666666667, 4.53333333333333, 4, 4.8,
NaN, NaN, NaN, NaN, NaN, NaN, NA, NA, NA, 3.93333333333333, 5,
5, 5, 5, 5, 5, 5, 5, NaN, NaN, NaN, NaN, NaN, NaN, NA, NA, NA,
4, 4, 4, 4, 4, 4, 4, 3.86666666666667, 4, NaN, NaN, NaN, NaN,
NaN)), .Names = c("StudyID", "TeamID", "CATMERater", "Rated",
"Rating"), class = c("tbl_df", "data.frame"), row.names = c(NA,
-50L))
Answer 0 (score: 2)
From the comments:
library(dplyr)
x %>%
  group_by(Rated, TeamID) %>%                  # one group per team / rated individual
  filter(any(CATMERater == Rated)) %>%         # keep only groups that contain a self-rating row
  mutate(new = StudyID[CATMERater == Rated])   # copy the self-rater's StudyID to every row of the group
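The new column is created by subsetting within each group; it is the per-group equivalent of x$StudyID[x$CATMERater == x$Rated] on the whole data frame. As long as the condition is true in exactly one row of a group (i.e. the self-rating), that value is assigned to every member of the group.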
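If the groups without a self-rating should be kept rather than filtered out, one possible variant is sketched below. It is not part of the original answer: the toy data frame is made up to mirror the first rows of the question plus one group (Rated == 4) with no self-rating, and it relies on dplyr's first() returning NA for an empty vector.

```r
library(dplyr)

# Toy data modelled on the question; the Rated == 4 group has no self-rating row
edges <- data.frame(
  StudyID    = c(2930L, 2938L, 2939L, 2930L, 2938L, 2939L, 2930L, 2938L, 2939L),
  TeamID     = 551L,
  CATMERater = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  Rated      = c(1, 1, 1, 2, 2, 2, 4, 4, 4)
)

result <- edges %>%
  group_by(TeamID, Rated) %>%
  # first() yields NA when no row in the group satisfies the condition,
  # so every group is kept instead of being filtered out
  mutate(RatedStudyID = first(StudyID[CATMERater == Rated])) %>%
  ungroup()

print(result)
```

With this approach no filter() step is needed, and rows for raters who never rated themselves simply get NA in RatedStudyID.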
Answer 1 (score: 1)
Using data.table:
In the new dataset, some groups have no row in which CATMERater equals Rated, so we can use an if/else condition to return NA for those groups.
library(data.table)
# Give each (Rated, TeamID) group the StudyID of its self-rating row, or NA if there is none
setDT(edges)[, RatedStudyID := if (any(CATMERater == Rated)) StudyID[CATMERater == Rated] else NA_integer_, by = .(Rated, TeamID)]
edges
# StudyID TeamID CATMERater Rated Rating RatedStudyID
#1: 2930 551 1 1 5.000000 2930
#2: 2938 551 2 1 3.800000 2930
#3: 2939 551 3 1 5.000000 2930
#4: 2930 551 1 2 3.666667 2938
#5: 2938 551 2 2 4.000000 2938
#6: 2939 551 3 2 3.866667 2938
Answer 2 (score: 0)
I think you can solve this with a join:
edges %>%
  select(TeamID, Rated = CATMERater, RatedStudyID = StudyID) %>%
  distinct() %>%   # one lookup row per team member, so the join does not duplicate rows
  inner_join(edges, by = c("TeamID", "Rated"))
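On the question of how to say TeamID == TeamID in a left_join: columns with the same name on both sides are listed in `by`, and columns with different names are matched with a named element. The sketch below is only an illustration under assumptions: the toy data and the lookup table name `members` are made up, and left_join is used so that rows without a matching self-rater are kept with NA.

```r
library(dplyr)

# Toy data modelled on the question; nobody in the team has rater number 4
edges <- data.frame(
  StudyID    = c(2930L, 2938L, 2939L, 2930L, 2938L, 2939L, 2930L, 2938L, 2939L),
  TeamID     = 551L,
  CATMERater = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  Rated      = c(1, 1, 1, 2, 2, 2, 4, 4, 4)
)

# Hypothetical lookup table: one row per (team, rater number) with that person's StudyID
members <- edges %>%
  select(TeamID, CATMERater, RatedStudyID = StudyID) %>%
  distinct()

# "TeamID" matches TeamID on both sides; "Rated" = "CATMERater" matches
# edges$Rated against members$CATMERater
result <- edges %>%
  left_join(members, by = c("TeamID", "Rated" = "CATMERater"))

print(result)
```

Compared with inner_join, this keeps all original rows in their original order, which makes it easier to spot team members who are rated but never appear as raters.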