I have a list of data frames on which I want to use table(). The list looks like this:
pronouns <- data.frame(pronounciation = c("juː","juː","juː","ju","ju","jə","jə","hɪm","hɪm","hɪm", "həm","ðɛm"), words = c("you","you","you","you","you","you","you","him","him","him","him","them"))
articles <- data.frame(pronounciation = c("ðiː","ði","ði","ðə","ðə","ði","ðə","eɪ","eɪ","æɪ","æɪ","eɪ","eɪ","eɪ","e"), words = c("the","the","the","the","the","the","the","a","a","a","a","a","a","a","a"))
numbers <- data.frame(pronounciation = c("wʌn","wʌn","wʌn","wʌn","wan","wa:n","tuː","tuː","tuː","tuː","tu","tu","tuː","tuː","θɹiː"), words = c("one","one","one","one","one","one","two","two","two","two","two","two","two","two","three"))
ls <- list(pronouns, articles, numbers)
ls[[1]]
pronounciation words
1 juː you
2 juː you
3 juː you
4 ju you
5 ju you
6 jə you
7 jə you
8 hɪm him
9 hɪm him
10 hɪm him
11 həm him
12 ðɛm them
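From this list of data frames, I want to use table() to extract a contingency table of $words, while also selecting the most common pronunciation of each word. The desired result is ls_out: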
pronouns_out <- data.frame(pronounciation = c("juː","hɪm","ðɛm"), words = c("you","him","them"), occurence = c(7,4,1))
articles_out <- data.frame(pronounciation = c("ði","eɪ"), words = c("the","a"), occurence = c(7,8))
numbers_out <- data.frame(pronounciation = c("wʌn","tuː","θɹiː"), words = c("one","two","three"), occurence = c(6,8,1))
ls_out <- list(pronouns_out, articles_out, numbers_out)
ls_out[[1]]
pronounciation words occurence
1 juː you 7
2 hɪm him 4
3 ðɛm them 1
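If two or more pronunciations have the same frequency (like ði and ðə in ls[[2]]), one of them should be picked at random.
Any suggestions on how to do this are very welcome.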
Answer 0 (score: 1)
Using table (and lapply):
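ff = function(pronounce, word)
{
# cross-tabulate: one row per word, one column per pronunciation
tab = table(word, pronounce)
# max.col() finds the most frequent column in each row; ties are broken at random
data.frame(pronounciation = colnames(tab)[max.col(tab, "random")],
words = rownames(tab),
occurences = unname(rowSums(tab))) # total count per word
}
lapply(ls, function(x) ff(x$pronounciation, x$words))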
#[[1]]
# pronounciation words occurences
#1 hɪm him 4
#2 ðɛm them 1
#3 juː you 7
#
#[[2]]
# pronounciation words occurences
#1 eɪ a 8
#2 ði the 7
#
#[[3]]
# pronounciation words occurences
#1 wʌn one 6
#2 θɹiː three 1
#3 tuː two 8
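The tie-breaking behaviour of max.col can be seen on a small table. The following is only a sketch with made-up data: the labels "dhi", "dha", "ei" and "e" are ASCII stand-ins for the IPA strings and are not part of the original data. With "first" or "last" the choice among tied columns is deterministic; with "random" it depends on the random number generator, so set.seed() is needed for a reproducible result.

```r
# Toy data (assumption: ASCII stand-ins for the IPA pronunciations)
word      <- c("the", "the", "the", "the", "a", "a", "a")
pronounce <- c("dhi", "dhi", "dha", "dha", "ei", "ei", "e")

# Rows are words, columns are pronunciations
tab <- table(word, pronounce)

# "the" has a tie between "dha" and "dhi" (2 each); "a" has a clear winner
first_pick  <- colnames(tab)[max.col(tab, "first")]  # leftmost tied column
last_pick   <- colnames(tab)[max.col(tab, "last")]   # rightmost tied column

set.seed(1)                                          # make the random pick reproducible
random_pick <- colnames(tab)[max.col(tab, "random")] # one tied column chosen at random

print(data.frame(words = rownames(tab), first_pick, last_pick, random_pick))
```

For the word without a tie, all three methods should agree; only the tied row can differ between runs.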
Answer 1 (score: 0)
Using the data.table library:
library(data.table)
dtlist <- list(pronouns, articles, numbers)
# convert each data frame to a data.table (setDT works by reference)
invisible(lapply(dtlist, setDT))
# for each data.table in dtlist, count occurrences of each (pronounciation, words) pair
dtlistfreq1 <-
lapply(dtlist, function(x) x[, .(freq = .N), by = .(pronounciation, words)])
# for each data.table in dtlistfreq1, keep the most frequent pair per word
# (which.max() returns the first maximum, so ties are not broken at random)
dtlistfreq2 <-
lapply(dtlistfreq1, function(x) x[, .SD[which.max(freq)], by = .(words)])
Output (the console rendered IPA characters it could not display as ?):
> dtlistfreq2
[[1]]
words pronounciation freq
1: you ju? 3
2: him h?m 4
3: them ð?m 1
[[2]]
words pronounciation freq
1: the ði 3
2: a e? 5
[[3]]
words pronounciation freq
1: one w?n 4
2: two tu? 6
3: three ??i? 1
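Because which.max returns the first maximum, the approach above does not pick at random when two pronunciations are tied. One possible way to add random tie-breaking is sketched below. This is an illustration, not part of the original answer: pick_max_random is a hypothetical helper, and the data uses ASCII stand-ins for the IPA strings.

```r
library(data.table)

# Toy data (assumption: ASCII stand-ins for the IPA pronunciations)
dt <- data.table(
  pronounciation = c("dhi", "dhi", "dha", "dha", "ei", "ei", "e"),
  words          = c("the", "the", "the", "the", "a", "a", "a")
)

# Count each (pronounciation, words) pair
pair_counts <- dt[, .(freq = .N), by = .(pronounciation, words)]

# Hypothetical helper: index of a maximum, chosen at random among ties.
# sample.int() is used instead of sample() because sample(x, 1) treats a
# single number x as the range 1:x.
pick_max_random <- function(freq) {
  ties <- which(freq == max(freq))
  ties[sample.int(length(ties), 1)]
}

# Keep one most-frequent pair per word
best <- pair_counts[, .SD[pick_max_random(freq)], by = words]
print(best)
```

Here freq is still the count of the chosen (word, pronunciation) pair, as in the answer above, not the total count of the word.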
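Answer 2 (score: 0)
Here is a solution using data.table that I think gets what you were originally after, where occurrence is the total number of occurrences of each word, rather than the number of occurrences of each (word, pronunciation) pair: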
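library(data.table)
dtlist <- list(pronouns, articles, numbers)
invisible(lapply(dtlist, setDT))
# most common value of x; ties for the top count are broken at random
common_r <- function(x) {
tab <- sort(table(x), decreasing = TRUE)
n <- sum(tab == max(tab)) # number of values tied for the top count
if (n > 1) names(tab)[ceiling(n * runif(1))] else names(tab)[1]
}
# the source column is spelled "pronounciation"; the output column is renamed
lapply(dtlist, function(x) setcolorder(x[, .(occurrence = .N,
pronunciation = common_r(pronounciation)),
by = words],
c("pronunciation", "words", "occurrence")))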
Output:
[[1]]
pronunciation words occurrence
1: juː you 7
2: hɪm him 4
3: ðɛm them 1
[[2]]
pronunciation words occurrence
1: ði the 7
2: eɪ a 8
[[3]]
pronunciation words occurrence
1: wʌn one 6
2: tuː two 8
3: θɹiː three 1
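Note that I took care to randomize when the most common pronunciation is not unique; if it is always unique (or if you don't care which pronunciation is picked in that case), this can be simplified:
common_r <- function(x) {names(sort(table(x), decreasing = TRUE))[1]}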
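If you don't want three separate list elements for the different word categories, the output can be simplified further by wrapping the lapply call in rbindlist: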
pronunciation words occurrence
1: juː you 7
2: hɪm him 4
3: ðɛm them 1
4: ði the 7
5: eɪ a 8
6: wʌn one 6
7: tuː two 8
8: θɹiː three 1
We could also add a category field to this new data.table indicating which source each row came from.
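A minimal sketch of the combined table with a category field, assuming the per-category results are held in a named list. The list res and its names are illustrative, and only a few rows from the output above are typed in by hand. When the input list has names, rbindlist's idcol argument stores them in the new column.

```r
library(data.table)

# Illustrative per-category results (a subset of the rows shown above),
# held in a named list so the names can become the category values
res <- list(
  pronouns = data.table(pronunciation = c("juː", "hɪm"),
                        words         = c("you", "him"),
                        occurrence    = c(7L, 4L)),
  articles = data.table(pronunciation = c("ði", "eɪ"),
                        words         = c("the", "a"),
                        occurrence    = c(7L, 8L))
)

# Stack the tables; idcol adds a leading column filled with the list names
combined <- rbindlist(res, idcol = "category")
print(combined)
```

With an unnamed list, the same call would fill category with integer positions instead of names.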