Question

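I am migrating a database and want to use R to help with the process. As part of the migration, I need to update "item IDs" that have changed. I wrote a function that maps an old ID to its new ID: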

old_to_new <- function(id, df) {
  return (df[which(df$Old == id), ]$New)
}

However, whenever I try to apply it to add a new column to a data frame (loaded from a database table),

library(tidyverse)
library(RODBC)

cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=xxx;database=xxx;uid=xxx;pwd=xxx;")
df <- sqlQuery(cn, "SELECT * FROM [MaintDB_New].[dbo].[Priority]")
ticket_df <- sqlQuery(cn, "SELECT * FROM [MaintDB_New].[dbo].[Tickets]")
ticket_details_df <- sqlQuery(cn, "SELECT * FROM [MaintDB_New].[dbo].[Ticket_Details]")
new_items <- read_csv("./ticket_itm_export_temp.csv", col_names = c("Old", "Name", "New"))

ticket_df_new <- ticket_df %>% mutate(item_id = old_to_new(itemID, new_items))

I get the following error:

Error in `[[<-.data.frame`(`*tmp*`, col, value = c(NA_integer_, NA_integer_,  : 
  replacement has 280 rows, data has 69430
In addition: Warning message:
In df$Old == id :
  longer object length is not a multiple of shorter object length

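What am I doing wrong, and what is the right way to do this? I get a similar error when I try ddplyr.

I am new to R, so I apologize if this is an obvious question.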

Edit: added the data structures:

head(ticket_df)
  ticketID propertyID itemID roomNumber assignedToID isOpen openID latestID
1       11         10      1       <NA>           NA      0     22       23
2       12         17      1       <NA>           NA      0     24      289
3       13         17      1       <NA>           NA      0     25      292
4       14         17     17       <NA>           NA      0     26     4411
5       15         17     68       <NA>           NA      0     27      296
6       16         17     74       <NA>           NA      0     28      294

head(new_items)
    Old Name                    New
  <int> <chr>                 <int>
1   257 Register Cash Drawers   425
2   253 Alarm System            426
3   135 CREDENZA/ ARMOIRE       427
4    55 Back Office PC          428
5   183 Backup All Data         429
6   260 Base Boards             430

Links to dput output: ticket_df and new_items

Answer 1

I (really!) think Gregor's comment about using left_join is the way to go. I'll change a couple of values to force some matches:

new_items$Old[1:2] <- c(17L,74L)

Now the join:

library(dplyr)

ticket_df %>%
  left_join(select(new_items, Old, New), by=c("itemID" = "Old"))
#   ticketID propertyID itemID roomNumber assignedToID isOpen openID latestID New
# 1       11         10      1         NA           NA      0     22       23  NA
# 2       12         17      1         NA           NA      0     24      289  NA
# 3       13         17      1         NA           NA      0     25      292  NA
# 4       14         17     17         NA           NA      0     26     4411 425
# 5       15         17     68         NA           NA      0     27      296  NA
# 6       16         17     74         NA           NA      0     28      294 426

If you are happy with that, reassign:

ticket_df %>%
  left_join(select(new_items, Old, New), by=c("itemID" = "Old")) %>%
  mutate(itemID = if_else(is.na(New), itemID, New)) %>%
  select(-New)
#   ticketID propertyID itemID roomNumber assignedToID isOpen openID latestID
# 1       11         10      1         NA           NA      0     22       23
# 2       12         17      1         NA           NA      0     24      289
# 3       13         17      1         NA           NA      0     25      292
# 4       14         17    425         NA           NA      0     26     4411
# 5       15         17     68         NA           NA      0     27      296
# 6       16         17    426         NA           NA      0     28      294

Alternatively, you can use mutate(itemID = coalesce(New, itemID)) in place of the if_else step (thanks, @Gregor).
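For reference, here is a minimal sketch of the coalesce variant as a full pipeline. The `tickets` and `lookup` tables are made-up stand-ins for ticket_df and new_items (only the relevant columns, with the forced matches from above), not the real data:

```r
library(dplyr)

# Toy stand-ins for the question's tables (illustrative values only).
tickets <- tibble(ticketID = 11:16, itemID = c(1L, 1L, 1L, 17L, 68L, 74L))
lookup  <- tibble(Old = c(17L, 74L), New = c(425L, 426L))

updated <- tickets %>%
  left_join(lookup, by = c("itemID" = "Old")) %>%
  # coalesce() takes the first non-missing value at each position,
  # so rows without a match keep their original itemID.
  mutate(itemID = coalesce(New, itemID)) %>%
  select(-New)

updated$itemID
```

Both arguments to coalesce need to be of a compatible type (here both are integer), which is worth checking when the lookup comes from a CSV.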

However, if you need to use a function (perhaps your problem is more complex, or you need something more generic), then note the following:

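In general, a function used in mutate needs to return a vector of length 1 or of the same length as its input. mutate passes the whole column to the function at once, not one value at a time, so subsetting (as you do with df[which(df$Old == id), ]$New) generally won't work: it compares two vectors of different lengths and returns however many rows happen to match. (If you can guarantee that it always returns length 1, it won't error, but I'm guessing that is not safe.) Similarly, summarize requires (I believe) that the function return length 1.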
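The length problem can be seen with plain vectors. This is only a toy sketch with made-up values (`old`, `new`, and `ids` are hypothetical names, not the question's data); it contrasts subsetting via `which()` with a position lookup via `match()`:

```r
old <- c(257L, 253L, 135L)            # lookup keys (3 values)
new <- c(425L, 426L, 427L)            # replacement IDs
ids <- c(1L, 253L, 68L, 257L, 253L)   # a whole column, as mutate would pass it

# Element-wise comparison with recycling: this does not look up each id,
# and the result does not have one entry per element of ids.
res_subset <- suppressWarnings(new[which(old == ids)])

# match() returns one position (or NA) per element of ids,
# so the result always has the same length as the input.
res_match <- new[match(ids, old)]

length(res_subset)
length(res_match)
```

Without `suppressWarnings`, the first expression raises the same "longer object length is not a multiple of shorter object length" warning shown in the question.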

Here is an idea that is a little sloppy but gets the same result:

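myfunc <- function(id, changes) {
  # Position of each id in the Old column (NA where there is no match)
  ind <- match(id, changes[["Old"]])
  indnonna <- !is.na(ind)
  # Replace only the matched ids; unmatched ids keep their original value
  id[which(indnonna)] <- changes[["New"]][ind[indnonna]]
  id
}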

ticket_df %>%
  mutate(newid = myfunc(itemID, new_items))
#   ticketID propertyID itemID roomNumber assignedToID isOpen openID latestID newid
# 1       11         10      1         NA           NA      0     22       23     1
# 2       12         17      1         NA           NA      0     24      289     1
# 3       13         17      1         NA           NA      0     25      292     1
# 4       14         17     17         NA           NA      0     26     4411   425
# 5       15         17     68         NA           NA      0     27      296    68
# 6       16         17     74         NA           NA      0     28      294   426

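You can obviously assign directly to itemID instead of to another column. I would still discourage the function approach because (1) the join is much more efficient; (2) I would want to work on the function more to find a more robust approach; and (3) it hard-codes the structure of new_items (that is, the specific column names) into the function, whereas a join lets you specify what happens at the point of the join, keeping the code right next to where the structure's elements are used.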
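If a function is still preferred despite these reservations, one way to soften point (3) is to pass the column names as arguments instead of hard-coding them. This is only a sketch: `remap_ids` and its `from`/`to` parameters are hypothetical names introduced here, and the lookup table is made up.

```r
# Hypothetical helper: same match()-based logic as myfunc, but the
# lookup columns are named by the caller rather than fixed in the body.
remap_ids <- function(id, changes, from = "Old", to = "New") {
  ind <- match(id, changes[[from]])
  hit <- !is.na(ind)
  # Overwrite only the ids that have a match; the rest are left unchanged.
  id[hit] <- changes[[to]][ind[hit]]
  id
}

lookup <- data.frame(Old = c(17L, 74L), New = c(425L, 426L))
remapped <- remap_ids(c(1L, 17L, 68L, 74L), lookup)
remapped
```

Inside a pipeline this would be called as mutate(itemID = remap_ids(itemID, new_items)). The join remains the simpler option when the lookup is a plain two-column table.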
