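Merge rows based on 2 columns

Date: 2018-02-05 19:06:50

Tags: r dplyr

I'm fairly new to R and I'm having trouble merging rows based on matching values in multiple columns. I have the following dataset: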

LAST_NAME   FIRST_NAME  INTERVAL    VISIT_DATE  MFQ_1   MFQ_2   MFQ_3   Handedness  ARI_1   ARI_2   ARI_4   ARI_COMPLETED_BY
Doe Jane    Interval 1  1/1/99  4   6   2   Na  Na  Na  Na  Na
Doe Jane    Interval 1  1/1/99  Na  Na  Na  Right-Handed    Na  Na  Na  Na
Doe Jane    Interval 1  1/1/99  Na  Na  Na  Na  4   2   2   Dad
Doe Jane    Interval 2  2/4/04  Na  Na  Na  Right-Handed    Na  Na  Na  Na
Doe Jane    Interval 2  2/4/04  5   6   3   Na  Na  Na  Na  Na 
Doe Jane    Interval 2  2/4/04  Na  Na  Na  Na  4   5   5   Mom
Smith   Joe Interval 1  3/1/01  5   1   7   Na  Na  Na  Na  Na
Smith   Joe Interval 1  3/1/01  Na  Na  Na  Left-Handed Na  Na  Na  Na
Smith   Joe Interval 1  3/1/01  Na  Na  Na  Na  8   8   2   Dad
Smith   Joe Interval 2  5/4/09  Na  Na  Na  Na  8   5   4   Dad
Smith   Joe Interval 2  5/4/09  7   2   8   Na  Na  Na  Na  Na
Smith   Joe Interval 2  5/4/09  Na  Na  Na  Left-Handed Na  Na  Na  Na

I would like to merge the rows based on Name / Interval / Date so that the result looks like this:

LAST_NAME   FIRST_NAME  INTERVAL    VISIT_DATE  MFQ_1   MFQ_2   MFQ_3   Handedness  ARI_1   ARI_2   ARI_4   ARI_COMPLETED_BY
Doe Jane    Interval 1  1/1/99  4   6   2   Right-Handed    4   2   2   Dad
Doe Jane    Interval 2  2/4/04  5   6   3   Right-Handed    4   5   5   Mom
Smith   Joe Interval 1  3/1/01  5   1   7   Left-Handed 8   8   2   Dad
Smith   Joe Interval 2  5/4/09  7   2   8   Left-Handed 8   5   4   Dad

I tried the following code:

CTDB %>% group_by(LAST_NAME:VISIT_DATE) %>% summarise_all(funs(na.omit(.)))

But I get the following error:

Error in mutate_impl(.data, dots) : Evaluation error: NA/NaN argument.
In addition: Warning messages:
1: In LAST_NAME:VISIT_DATE :
  numerical expression has 3326 elements: only the first used
2: In LAST_NAME:VISIT_DATE :
  numerical expression has 3326 elements: only the first used
3: In evalq(LAST_NAME:VISIT_DATE, <environment>) :
  NAs introduced by coercion
4: In evalq(LAST_NAME:VISIT_DATE, <environment>) :
  NAs introduced by coercion

I don't know how to fix this to get the desired result. Any help would be greatly appreciated!

3 Answers:

Answer 0: (score: 1)

You can use group_by_at together with vars(...). (Note that na.omit does not do what you think it does; na.exclude is closer to what you want. If your values are actually NA, you can use i[!is.na(i)] instead.)

library(tidyverse)

# Sample data from the question; INTERVAL is written as Interval_1 / Interval_2
# so that read.table treats it as a single whitespace-separated field
df <- read.table(text="LAST_NAME FIRST_NAME INTERVAL VISIT_DATE MFQ_1 MFQ_2 MFQ_3 Handedness ARI_1 ARI_2 ARI_4 ARI_COMPLETED_BY
Doe Jane Interval_1 1/1/99 4 6 2 Na Na Na Na Na
Doe Jane Interval_1 1/1/99 Na Na Na Right-Handed Na Na Na Na
Doe Jane Interval_1 1/1/99 Na Na Na Na 4 2 2 Dad
Doe Jane Interval_2 2/4/04 Na Na Na Right-Handed Na Na Na Na
Doe Jane Interval_2 2/4/04 5 6 3 Na Na Na Na Na
Doe Jane Interval_2 2/4/04 Na Na Na Na 4 5 5 Mom
Smith Joe Interval_1 3/1/01 5 1 7 Na Na Na Na Na
Smith Joe Interval_1 3/1/01 Na Na Na Left-Handed Na Na Na Na
Smith Joe Interval_1 3/1/01 Na Na Na Na 8 8 2 Dad
Smith Joe Interval_2 5/4/09 Na Na Na Na 8 5 4 Dad
Smith Joe Interval_2 5/4/09 7 2 8 Na Na Na Na Na
Smith Joe Interval_2 5/4/09 Na Na Na Left-Handed Na Na Na Na", header=TRUE, stringsAsFactors=FALSE)

# Group by the four identifying columns and keep the non-"Na" value of every other column
df %>% group_by_at(vars(LAST_NAME:VISIT_DATE)) %>%
  summarise_all(function(i) { i[i!="Na"] })
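In dplyr 1.0.0 and later the scoped verbs (group_by_at, summarise_all) are superseded by across(). The following is only a sketch of what an equivalent call might look like, assuming dplyr >= 1.0.0 is installed. df_small is a hypothetical, cut-down version of the question's data (four rows, two value columns), not the full dataset.

```r
library(dplyr)

# Hypothetical cut-down sample: two people, one interval each, two value columns.
# The literal string "Na" marks a missing value, as in the question.
df_small <- read.table(text = "LAST_NAME FIRST_NAME INTERVAL VISIT_DATE MFQ_1 Handedness
Doe Jane Interval_1 1/1/99 4 Na
Doe Jane Interval_1 1/1/99 Na Right-Handed
Smith Joe Interval_1 3/1/01 5 Na
Smith Joe Interval_1 3/1/01 Na Left-Handed", header = TRUE, stringsAsFactors = FALSE)

# across() accepts the same LAST_NAME:VISIT_DATE column range that vars() does.
# Inside summarise(), across(everything()) skips the grouping columns.
result <- df_small %>%
  group_by(across(LAST_NAME:VISIT_DATE)) %>%
  summarise(across(everything(), ~ .x[.x != "Na"]), .groups = "drop")

print(result)
```

Because MFQ_1 contains the string "Na", read.table reads it as a character column, so the merged values stay character here. Like the original code, this assumes each group has exactly one non-"Na" value per column.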

Answer 1: (score: 0)

First, you need to replace the "Na" strings with explicit NA values:

CTDB[CTDB == "Na"] <- NA

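You also can't use : inside the grouping function, so we list the columns to group by explicitly. Then wrap na.omit() in first(), because na.omit on its own is not an aggregating function and does not tell dplyr how to summarize.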

CTDB %>% group_by(LAST_NAME, FIRST_NAME, INTERVAL, VISIT_DATE) %>% 
  summarize_all(funs(first(na.omit(.))))
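As a small illustration of why : fails inside group_by(): there the expression is evaluated as ordinary R code, where : is the sequence operator and tries to coerce both sides to numbers. The sketch below uses plain base R and two literal values taken from the question's data; the variable names ok and msg are only for illustration.

```r
# ":" is base R's sequence operator; with numbers it builds a range.
ok <- 1:4
print(ok)

# With character values that are not numbers, the coercion yields NA
# and ":" stops with an error. Warnings about coercion are suppressed
# here so that only the error message is captured.
msg <- tryCatch(
  suppressWarnings("Doe":"1/1/99"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The captured message matches the "NA/NaN argument" error in the question, and the "only the first used" warnings there come from : receiving whole columns rather than single values.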

Answer 2: (score: 0)

Using base R:

df[df == "Na"] <- NA
aggregate(df,df[1:4],na.omit)[-(5:8)]
  LAST_NAME FIRST_NAME   INTERVAL VISIT_DATE MFQ_1 MFQ_2 MFQ_3   Handedness ARI_1 ARI_2 ARI_4 ARI_COMPLETED_BY
1       Doe       Jane Interval_1     1/1/99     4     6     2 Right-Handed     4     2     2              Dad
2       Doe       Jane Interval_2     2/4/04     5     6     3 Right-Handed     4     5     5              Mom
3     Smith        Joe Interval_1     3/1/01     5     1     7  Left-Handed     8     8     2              Dad
4     Smith        Joe Interval_2     5/4/09     7     2     8  Left-Handed     8     5     4              Dad
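A possible variation on the base R approach, offered only as a sketch: if the data comes from a text file, the na.strings argument of read.table can turn "Na" into real NA values at import time, and passing only the value columns to aggregate avoids having to drop the duplicated key columns afterwards. txt and df_na below are hypothetical, cut-down stand-ins for the question's data.

```r
# Hypothetical cut-down sample of the question's data
txt <- "LAST_NAME FIRST_NAME INTERVAL VISIT_DATE MFQ_1 Handedness
Doe Jane Interval_1 1/1/99 4 Na
Doe Jane Interval_1 1/1/99 Na Right-Handed
Smith Joe Interval_1 3/1/01 5 Na
Smith Joe Interval_1 3/1/01 Na Left-Handed"

# na.strings = "Na" converts the placeholder to NA while reading,
# so MFQ_1 is read as an integer column instead of character
df_na <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE,
                    na.strings = "Na")

# Aggregate only the value columns (everything except the first four),
# keeping the non-missing entry of each column within each group
merged <- aggregate(df_na[-(1:4)], by = df_na[1:4],
                    FUN = function(i) i[!is.na(i)])

print(merged)
```

This relies on every group having exactly one non-missing value per column, which holds for the sample data shown in the question.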