Finding the most common occurrence across a set of variables, by key, in a data.table in R

Time: 2014-11-16 16:54:19

Tags: r dataframe data.table apply

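I am trying to find the most common occurrence across a set of variables, by key, in a data.table in R. Here is a small example of what I am trying to do: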

library(data.table)

mydata <- data.table(mergedName=c("JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE","JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE"),
                     job=c("teacher","teacher","teacher","teacher","teacher","teacher","police","police","police","police","police","police"),
                     from=c("NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG"),
                     misspelled_NYT=c("John Doe", NA, NA, "Mary White", NA, NA,"John_Doe", NA, NA, "Mary*White", NA, NA),
                     misspelled_USAT=c(NA, "JohnDOE", NA, NA, "Mary White", NA, NA, "John Doe", NA, NA, "Mary White", NA),
                     misspelled_BG=c(NA, NA, "John Doe", NA, NA, "Mary-White", NA, NA, "John Doe", NA, NA, "Mary White"))

setkeyv(mydata, cols=c("mergedName","job"))

Here is the data.table object:

> mydata
    mergedName     job from misspelled_NYT misspelled_USAT misspelled_BG
 1:    JOHNDOE teacher  NYT       John Doe              NA            NA
 2:    JOHNDOE teacher USAT             NA         JohnDOE            NA
 3:    JOHNDOE teacher   BG             NA              NA      John Doe
 4:  MARYWHITE teacher  NYT     Mary White              NA            NA
 5:  MARYWHITE teacher USAT             NA      Mary White            NA
 6:  MARYWHITE teacher   BG             NA              NA    Mary-White
 7:    JOHNDOE  police  NYT       John_Doe              NA            NA
 8:    JOHNDOE  police USAT             NA        John Doe            NA
 9:    JOHNDOE  police   BG             NA              NA      John Doe
10:  MARYWHITE  police  NYT     Mary*White              NA            NA
11:  MARYWHITE  police USAT             NA      Mary White            NA
12:  MARYWHITE  police   BG             NA              NA    Mary White

This is my desired output (for each keyed combination of mergedName and job, the most common spelling of the name across the three sources):

    mergedName     job actualSpelling
 1:    JOHNDOE teacher       John Doe
 2:    JOHNDOE teacher       John Doe
 3:    JOHNDOE teacher       John Doe
 4:    JOHNDOE  police       John Doe
 5:    JOHNDOE  police       John Doe
 6:    JOHNDOE  police       John Doe
 7:  MARYWHITE teacher     Mary White
 8:  MARYWHITE teacher     Mary White
 9:  MARYWHITE teacher     Mary White
10:  MARYWHITE  police     Mary White
11:  MARYWHITE  police     Mary White
12:  MARYWHITE  police     Mary White

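I have been able to do this with a data frame in wide form. Below is a small example of the code that does it in wide form. Note: for some reason this only seems to work on larger data frames; on the example below it does not work, even though the code is the same. The output of table() applied across the rows of this data frame is not what I expect: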

mydataWide <- data.frame(mergedName=c("JOHNDOE","MARYWHITE","JOHNDOE","MARYWHITE"),
                         job=c("teacher","police","teacher","police"),
                         misspelled_NYT=c("John Doe", "Mary White", "John_Doe", "Mary*White"),
                         misspelled_USAT=c("JohnDOE", "Mary White", "John Doe", "Mary White"),
                         misspelled_BG=c("John Doe", "Mary-White", "John Doe", "Mary White"),
                         stringsAsFactors=FALSE)

nametable <- apply(mydataWide[,paste("misspelled", c("NYT","USAT","BG"), sep="_")], 1, function(x) sort(table(x), decreasing=TRUE))
mydataWide$actualSpelling <- names(sapply(nametable,`[`, 1) )
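
A likely explanation for the difference is that apply() simplifies its result: when every row yields a table of the same length (here each row has exactly two distinct spellings), it returns a matrix instead of a list, so the following sapply() walks over single counts rather than over tables. A minimal sketch that avoids this by returning one string per row is shown below. The helper name most_common and the variable spell_cols are made up for illustration, and ties are resolved by taking the first maximum.

```r
# Same wide toy data as above
mydataWide <- data.frame(
  mergedName = c("JOHNDOE", "MARYWHITE", "JOHNDOE", "MARYWHITE"),
  job = c("teacher", "police", "teacher", "police"),
  misspelled_NYT = c("John Doe", "Mary White", "John_Doe", "Mary*White"),
  misspelled_USAT = c("JohnDOE", "Mary White", "John Doe", "Mary White"),
  misspelled_BG = c("John Doe", "Mary-White", "John Doe", "Mary White"),
  stringsAsFactors = FALSE
)

spell_cols <- paste("misspelled", c("NYT", "USAT", "BG"), sep = "_")

# Hypothetical helper: name of the most frequent value in x
# (which.max keeps names and returns the first maximum on ties)
most_common <- function(x) names(which.max(table(x)))

# Each call returns a single string, so apply() gives a plain
# character vector with one element per row
mydataWide$actualSpelling <- apply(mydataWide[, spell_cols], 1, most_common)
```

Because the function returns a length-one value per row, the shape of the result no longer depends on how many distinct spellings each row happens to contain.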

1 answer:

Answer 0 (score: 3)

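You can first melt mydata to long form, drop the NA rows with na.omit, and then, grouped by mergedName and job, build a table of the spellings and use which.max to find the most frequent one for actualSpelling. which.max gives a numeric index, which you use to get the name with the maximum frequency.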
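
The steps above can be sketched as follows. This is one possible reading of the description, not necessarily the answerer's exact code; it assumes data.table's own melt method, and the names spell_cols, long, source and spelling are introduced here for illustration only.

```r
library(data.table)

# Same toy data as in the question
mydata <- data.table(
  mergedName = rep(rep(c("JOHNDOE", "MARYWHITE"), each = 3), 2),
  job = rep(c("teacher", "police"), each = 6),
  from = rep(c("NYT", "USAT", "BG"), 4),
  misspelled_NYT = c("John Doe", NA, NA, "Mary White", NA, NA,
                     "John_Doe", NA, NA, "Mary*White", NA, NA),
  misspelled_USAT = c(NA, "JohnDOE", NA, NA, "Mary White", NA,
                      NA, "John Doe", NA, NA, "Mary White", NA),
  misspelled_BG = c(NA, NA, "John Doe", NA, NA, "Mary-White",
                    NA, NA, "John Doe", NA, NA, "Mary White")
)

spell_cols <- grep("^misspelled_", names(mydata), value = TRUE)

# Long form: one row per original row and source column, NA rows dropped
long <- na.omit(melt(mydata,
                     id.vars = c("mergedName", "job", "from"),
                     measure.vars = spell_cols,
                     variable.name = "source",
                     value.name = "spelling"))

# Per (mergedName, job) group: tabulate the spellings and keep the name
# of the most frequent one (first maximum on ties)
long[, actualSpelling := names(which.max(table(spelling))),
     by = .(mergedName, job)]

result <- long[, .(mergedName, job, actualSpelling)]
```

Assigning with := keeps one row per non-missing spelling, which gives the twelve-row shape of the desired output; aggregating with by alone would instead give one row per group.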