Error: Data source must be a dictionary (dplyr)

Asked: 2017-08-19 09:37:13

Tags: r error-handling dplyr

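I am new to R and have not found a solution to my problem. I really hope you can help me.

My data frame looks like the following, although the real one has more columns and observations: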

dt <- data.frame(
  hid = c(1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4),
  syear = c(2000, 2001, 2003, 2003, 2003, 2000, 2000, 2001, 2001, 2002, 2002),
  employlvl = c("Full-time", "Part-time", "Part-time", "Unemployed", "Unemployed",
                "Full-time", "Full-time", "Full-time", "Unemployed", "Part-time",
                "Full-time"),
  relhead = c("Head", "Head", "Head", "Partner", "other", "Head",
              "Partner", "Head", "Partner", "Head", "Partner")
)

| hid | syear |  employlvl  |       relhead         |
|-----|-------|-------------|-----------------------|
|  1  | 2000  |  Full-time  |         Head          |
|  2  | 2001  |  Part-time  |         Head          |
|  2  | 2003  |  Part-time  |         Head          |
|  2  | 2003  |  Unemployed |        Partner        |
|  2  | 2003  |  Unemployed |         other         |
|  4  | 2000  |  Full-time  |         Head          |
|  4  | 2000  |  Full-time  |        Partner        |
|  4  | 2001  |  Full-time  |         Head          |
|  4  | 2001  |  Unemployed |        Partner        |
|  4  | 2002  |  Part-time  |         Head          |
|  4  | 2002  |  Full-time  |        Partner        |

I want to create another column that shows the partner's employment level, and I am hoping for the following result:

| hid | syear |  employlvl  |         relhead       |      Partner      |
|-----|-------|-------------|-----------------------|-------------------|
|  1  | 2000  |  Full-time  |         Head          |        NA         |
|  2  | 2001  |  Part-time  |         Head          |        NA         |
|  2  | 2003  |  Part-time  |         Head          |    Unemployed     |
|  2  | 2003  |  Unemployed |       Partner         |        NA         |
|  2  | 2003  |  Unemployed |         other         |        NA         |
|  4  | 2000  |  Full-time  |         Head          |     Full-time     |
|  4  | 2000  |  Full-time  |        Partner        |        NA         |
|  4  | 2001  |  Full-time  |         Head          |    Unemployed     |
|  4  | 2001  |  Unemployed |        Partner        |        NA         |
|  4  | 2002  |  Part-time  |         Head          |     Full-time     |
|  4  | 2002  |  Full-time  |        Partner        |        NA         |

Currently I am using the following code (thanks again to user ycw):

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  group_by(hid, syear) %>%
  filter(n() > 1) %>%
  filter(`relhead` != "Child") %>%
  spread(relhead, employlvl) %>%
  mutate(Relation = "Head") %>%
  rename(`Employment Partner` = Partner) %>%
  select(-Head)

dt3 <- dt %>%
  left_join(dt2, by = c("hid", "syear", "relhead" = "Relation"))

The code works fine on this small dataset. But as soon as I run it on my full data, I get the following:

Error: Data source must be a dictionary

Thank you very much for your help.

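7 answers:

Answer 0 (score: 13)

I just ran into a similar problem with the same error message. After checking my dataset carefully, I found that two columns had the same name. Once I renamed one of them, the code ran without any error.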
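
A quick way to look for such duplicates is sketched below in base R. The data frame `df` and its column names are made up for illustration; `check.names = FALSE` is only there so that `data.frame()` keeps the duplicate name instead of silently repairing it.

```r
# Hypothetical data frame with a repeated column name.
# check.names = FALSE stops data.frame() from renaming the second "x".
df <- data.frame(x = 1:3, y = 4:6, x = 7:9, check.names = FALSE)

# Column names that occur more than once
dup_names <- unique(names(df)[duplicated(names(df))])
print(dup_names)
```

If `dup_names` is empty, duplicated column names are probably not the cause.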

Answer 1 (score: 8)

I got the same error when two columns had the same name. Changing one of the column names with names() <- c(...) did the trick for me.
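
As a rough sketch of that fix, assuming a made-up data frame `df` with a duplicated name, either of the following base R approaches should work. The replacement name `x_partner` is purely illustrative.

```r
# Hypothetical data frame with a duplicated column name
df <- data.frame(x = 1:3, y = 4:6, x = 7:9, check.names = FALSE)

# Option 1: let base R make every name unique by appending a suffix
fixed <- df
names(fixed) <- make.unique(names(fixed))
print(names(fixed))

# Option 2: rename one specific column by position
renamed <- df
names(renamed)[3] <- "x_partner"
print(names(renamed))
```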

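Answer 2 (score: 5)

As mentioned in the other answers, this is caused by non-unique names. I was able to reproduce the error by modifying the third element of relhead in your example:
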
dt <- data.frame(
  hid = c(1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4),
  syear = c(2000, 2001, 2003, 2003, 2003, 2000, 2000, 2001, 2001, 2002, 2002),
  employlvl = c("Full-time", "Part-time", "Part-time", "Unemployed", "Unemployed",
     "Full-time", "Full-time", "Full-time", "Unemployed", "Part-time", 
     "Full-time"),
  relhead = c("Head", "Head", "Employment Partner", "Partner", "other", "Head", 
     "Partner", "Head", "Partner", "Head", "Partner")
) 

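In this case, spread creates the first "Employment Partner" column and rename creates a second one. You should check whether "Employment Partner" or "Relation" (and possibly hid and syear) occur in dt$relhead. The first would give you the error; the second would be overwritten by mutate(Relation = ...).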
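
One way to run that check is sketched here. The small `dt` below is a cut-down stand-in for the real data, and `reserved` simply lists the column names that the pipeline creates or relies on.

```r
# Cut-down stand-in for the real data (only the relhead column matters here)
dt <- data.frame(
  relhead = c("Head", "Head", "Employment Partner", "Partner", "other")
)

# Names that the pipeline creates or already uses as columns
reserved <- c("Employment Partner", "Relation", "hid", "syear")

# Values of relhead that would collide with those names after spread()
clashes <- intersect(reserved, unique(as.character(dt$relhead)))
print(clashes)
```

Any value reported in `clashes` would become a column name after spread() and collide with a column of the same name.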

Minimal reproducible example:

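library(dplyr)
library(tidyr)

# spread() yields columns a1, a2, a3; renaming a3 to a1 then duplicates "a1"
tibble(g = c("a1", "a2", "a3"), i = 1) %>%
  spread(g, i) %>%
  rename(a1 = a3) %>%
  select(-a1)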

Answer 3 (score: 3)

I got the same error message when I inadvertently used two identical new names in the rename() statement of the dplyr package. Compare names(df2) with unique(names(df2)), because you may already have had an identical variable name before.

Answer 4 (score: 1)

If the error only occurs after you run select(-Head), you can work around it by using a base R command to achieve the same thing.

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  group_by(hid, syear) %>%
  filter(n() > 1) %>%
  filter(`relhead` != "Child") %>%
  spread(relhead, employlvl) %>%
  mutate(Relation = "Head") %>%
  rename(`Employment Partner` = Partner)

The part above is the same as the original code. After that, run the following command.

dt2$Head <- NULL

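This is the base R command for deleting the Head column, which is the same thing select(-Head) is trying to do.

Then you can run the rest of the code to join the data frames.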

dt3 <- dt %>%
  left_join(dt2, by = c("hid", "syear", "relhead" = "Relation"))

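Since you did not provide a reproducible example, we cannot figure out what this error message really means, but perhaps this workaround will help you get the task done for now.

Answer 5 (score: 1)

This is caused by calling select(-variable) after rename. I got the same error, and when I removed the rename call and ran the same select(-variable), it worked.

I don't know why this happens, but that is what triggers the error.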

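Answer 6 (score: 0)

I know this is a bit old now, but for anyone interested, the problem (I believe) is the difference in behavior between identically named functions in plyr and dplyr. When you load both, you get unexpected results. I have seen this with group_by and summarise.

In general, I find the best way around this is to use dplyr::select, dplyr::rename, and so on.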
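
A minimal sketch of what the qualified calls look like, assuming dplyr is installed. The data frame `df` and the new name `x` are invented for the example.

```r
library(dplyr)

# Hypothetical data frame for illustration
df <- data.frame(a = 1:3, b = 4:6)

# Namespace-qualified calls always use the dplyr versions,
# regardless of which other packages are attached or in what order.
out <- df %>%
  dplyr::rename(x = a) %>%
  dplyr::select(-b)

print(names(out))
```

Base R's conflicts(detail = TRUE) lists which attached packages mask each other's functions, which can help confirm whether this is what is going on.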

Even better is simply not to use plyr, since dplyr has superseded it by now, but I have some legacy code that uses plyr, so I am reluctant to mess with it.