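I'm trying to find the most common occurrence among a set of variables, by key, in a data.table in R. Here is a small example of what I'm trying to do: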
library(data.table)
mydata <- data.table(mergedName=c("JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE","JOHNDOE","JOHNDOE","JOHNDOE","MARYWHITE","MARYWHITE","MARYWHITE"),
job=c("teacher","teacher","teacher","teacher","teacher","teacher","police","police","police","police","police","police"),
from=c("NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG","NYT","USAT","BG"),
misspelled_NYT=c("John Doe", NA, NA, "Mary White", NA, NA,"John_Doe", NA, NA, "Mary*White", NA, NA),
misspelled_USAT=c(NA, "JohnDOE", NA, NA, "Mary White", NA, NA, "John Doe", NA, NA, "Mary White", NA),
misspelled_BG=c(NA, NA, "John Doe", NA, NA, "Mary-White", NA, NA, "John Doe", NA, NA, "Mary White"))
setkeyv(mydata, cols=c("mergedName","job"))
Here is the data.table object:
> mydata
    mergedName     job from misspelled_NYT misspelled_USAT misspelled_BG
 1:    JOHNDOE teacher  NYT       John Doe              NA            NA
 2:    JOHNDOE teacher USAT             NA         JohnDOE            NA
 3:    JOHNDOE teacher   BG             NA              NA      John Doe
 4:  MARYWHITE teacher  NYT     Mary White              NA            NA
 5:  MARYWHITE teacher USAT             NA      Mary White            NA
 6:  MARYWHITE teacher   BG             NA              NA    Mary-White
 7:    JOHNDOE  police  NYT       John_Doe              NA            NA
 8:    JOHNDOE  police USAT             NA        John Doe            NA
 9:    JOHNDOE  police   BG             NA              NA      John Doe
10:  MARYWHITE  police  NYT     Mary*White              NA            NA
11:  MARYWHITE  police USAT             NA      Mary White            NA
12:  MARYWHITE  police   BG             NA              NA    Mary White
This is my desired output (for each keyed combination of mergedName and job, the most common spelling of the name across the three sources):
    mergedName     job actualSpelling
 1:    JOHNDOE teacher       John Doe
 2:    JOHNDOE teacher       John Doe
 3:    JOHNDOE teacher       John Doe
 4:    JOHNDOE  police       John Doe
 5:    JOHNDOE  police       John Doe
 6:    JOHNDOE  police       John Doe
 7:  MARYWHITE teacher     Mary White
 8:  MARYWHITE teacher     Mary White
 9:  MARYWHITE teacher     Mary White
10:  MARYWHITE  police     Mary White
11:  MARYWHITE  police     Mary White
12:  MARYWHITE  police     Mary White
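I have been able to do this with a data frame in wide form. Below is a small example of the code that does it in wide form. Note: for some reason this only seems to work on larger data frames; even though the code is identical, it does not work on the example below. The output of table() applied across the rows of this DF is different from what I expect: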
mydataWide <- data.frame(mergedName=c("JOHNDOE","MARYWHITE","JOHNDOE","MARYWHITE"),
job=c("teacher","police","teacher","police"),
misspelled_NYT=c("John Doe", "Mary White", "John_Doe", "Mary*White"),
misspelled_USAT=c("JohnDOE", "Mary White", "John Doe", "Mary White"),
misspelled_BG=c("John Doe", "Mary-White", "John Doe", "Mary White"),
stringsAsFactors=FALSE)
nametable <- apply(mydataWide[,paste("misspelled", c("NYT","USAT","BG"), sep="_")], 1, function(x) sort(table(x), decreasing=TRUE))
mydataWide$actualSpelling <- names(sapply(nametable,`[`, 1) )
Answer 0 (score: 3)
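You can first melt mydata to long form and remove the NA rows with na.omit. Then, grouped by mergedName and job, build a frequency table of the spellings with table and find the position of the highest count with which.max. Use that numeric index to pick the name with the maximum frequency, which gives actualSpelling.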
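The steps described above might look like the following sketch. It rebuilds the question's mydata so that it runs on its own. The names long, counts and mostCommon, and the value column name spelling, are illustrative choices and are not taken from the original answer; the final join back onto mydata is an added assumption about how to get one row per original row, as in the desired output.

```r
library(data.table)

# Same data as in the question, written more compactly
mydata <- data.table(
  mergedName = rep(rep(c("JOHNDOE", "MARYWHITE"), each = 3), times = 2),
  job        = rep(c("teacher", "police"), each = 6),
  from       = rep(c("NYT", "USAT", "BG"), times = 4),
  misspelled_NYT  = c("John Doe", NA, NA, "Mary White", NA, NA,
                      "John_Doe", NA, NA, "Mary*White", NA, NA),
  misspelled_USAT = c(NA, "JohnDOE", NA, NA, "Mary White", NA,
                      NA, "John Doe", NA, NA, "Mary White", NA),
  misspelled_BG   = c(NA, NA, "John Doe", NA, NA, "Mary-White",
                      NA, NA, "John Doe", NA, NA, "Mary White")
)

# Reshape to long form: one row per (original row, misspelled_* column),
# then drop the rows whose spelling is NA
long <- na.omit(melt(
  mydata,
  id.vars      = c("mergedName", "job", "from"),
  measure.vars = grep("^misspelled_", names(mydata), value = TRUE),
  value.name   = "spelling"
))

# For each (mergedName, job) group, count the spellings and keep the
# name of the most frequent one
mostCommon <- long[, {
  counts <- table(spelling)
  .(actualSpelling = names(counts)[which.max(counts)])
}, by = .(mergedName, job)]

# Assumed extra step: copy the result back onto every original row
mydata[mostCommon, on = .(mergedName, job), actualSpelling := i.actualSpelling]

print(mostCommon)
print(mydata[, .(mergedName, job, actualSpelling)])
```

Note that which.max returns the first maximum, and table orders its names by sorting, so when two spellings are equally frequent the one that sorts first wins. If ties need different handling, that rule would have to be replaced.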