Finding the frequency of pairs in a dataset - R

Asked: 2017-02-19 19:45:16

Tags: r data-manipulation data-science

I have the following data:

Name    Event

John    EventA
Anna    EventA
Dave    EventA
Stew    EventB
John    EventB
Anna    EventB
John    EventC
Stew    EventC
Dave    EventC

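I want to find out which people most often attend the same events. For example, for the data above I would like it to return the top 3 most similar pairs: John & Anna, John & Dave, John & Stew.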

I think I need to build a frequency matrix like the following:

Name    John    Anna    Dave     Stew
John     0       2       2        2
Anna     2       0       1        1
Dave     2       1       0        1
Stew     2       1       1        0

and then convert it into something like this:

Pair          Frequency

John Anna         2
John Dave         2
John Stew         2
Anna Dave         1
Anna Stew         1
Dave Stew         1

But I don't know how to do that.

I'm working in R, so if anyone knows a way to do this, it would be a huge help!

2 answers:

Answer 0 (score: 2)

You can use table together with melt from the reshape2 package.

#DATA
df = structure(list(Name = c("John", "Anna", "Dave", "Stew", "John",
"Anna", "John", "Stew", "Dave"), Event = c("EventA", "EventA",
"EventA", "EventB", "EventB", "EventB", "EventC", "EventC", "EventC"
)), .Names = c("Name", "Event"), row.names = c(NA, -9L), class = "data.frame")

#Get Pairwise Frequency
a = table(df) %*% t(table(df))
a
#      Name
#Name   Anna Dave John Stew
#  Anna    2    1    2    1
#  Dave    1    2    2    1
#  John    2    2    3    2
#  Stew    1    1    2    2

#If you want, set diagonal elements to zero (From Karthik's comment)
#diag(a) <- 0

library(reshape2)
output = data.frame(melt(a))
colnames(output) = c("Name1", "Name2", "Value")

#Remove the pair with oneself
output = output[-(which(output$Name1 == output$Name2)),]
output
#   Name1 Name2 Value
#2   Dave  Anna     1
#3   John  Anna     2
#4   Stew  Anna     1
#5   Anna  Dave     1
#7   John  Dave     2
#8   Stew  Dave     1
#9   Anna  John     2
#10  Dave  John     2
#12  Stew  John     2
#13  Anna  Stew     1
#14  Dave  Stew     1
#15  John  Stew     2

#YOU CAN PASTE 'NAME1' and 'NAME2' to a 'PAIR' if necessary
#output$PAIR = apply(output, 1, function(x) paste(sort(x[1:2]), collapse = " "))
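To see why the matrix product counts shared events: table(df) is a 0/1 Name-by-Event matrix, so entry [i, j] of its product with its own transpose sums, over all events, whether both i and j attended. The sketch below checks one entry by hand. It assumes each Name/Event combination appears at most once (duplicate rows would inflate the counts); the names m, shared, john_events and anna_events are illustrative only.

```r
# Same data as in the question, rebuilt so the snippet is self-contained
df <- data.frame(
  Name  = c("John", "Anna", "Dave", "Stew", "John", "Anna", "John", "Stew", "Dave"),
  Event = c("EventA", "EventA", "EventA", "EventB", "EventB", "EventB",
            "EventC", "EventC", "EventC"),
  stringsAsFactors = FALSE
)

# Name-by-Event incidence matrix (1 if the person attended the event)
m <- unclass(table(df))

# Entry [i, j] counts the events attended by both i and j;
# the diagonal holds each person's own number of events
shared <- m %*% t(m)

# Cross-check one entry by hand: events John and Anna have in common
john_events <- df$Event[df$Name == "John"]
anna_events <- df$Event[df$Name == "Anna"]
length(intersect(john_events, anna_events)) == shared["John", "Anna"]
```

The diagonal is a person's total event count rather than a pair frequency, which is why it gets zeroed or dropped before ranking pairs.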

Answer 1 (score: 1)

This seems closer to what you asked for, and it uses only base R functions. Using "df" from @d.b's answer:

x <- as.table(tcrossprod(table(df)))
x[lower.tri(x, diag = TRUE)] <- NA
na.omit(data.frame(x))
#    Name Name.1 Freq
# 5  Anna   Dave    1
# 9  Anna   John    2
# 10 Dave   John    2
# 13 Anna   Stew    1
# 14 Dave   Stew    1
# 15 John   Stew    2

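Setting the lower triangle and the diagonal to NA with lower.tri(x, diag = TRUE) lets us easily drop the values we are not interested in: each pair is kept only once, and pairs of a person with themselves are removed.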
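To get from there to the Pair / Frequency layout in the question and pick the top 3, one possible base R sketch is shown below. It reuses the tcrossprod idea from this answer; the names idx and pairs are illustrative, and ties are left in whatever order order() returns them.

```r
# Same data as in the question, rebuilt so the snippet is self-contained
df <- data.frame(
  Name  = c("John", "Anna", "Dave", "Stew", "John", "Anna", "John", "Stew", "Dave"),
  Event = c("EventA", "EventA", "EventA", "EventB", "EventB", "EventB",
            "EventC", "EventC", "EventC"),
  stringsAsFactors = FALSE
)

# Co-attendance counts as a plain matrix
x <- tcrossprod(unclass(table(df)))

# Row/column positions of the strict upper triangle: each unordered pair once
idx <- which(upper.tri(x), arr.ind = TRUE)

pairs <- data.frame(
  Pair      = paste(rownames(x)[idx[, 1]], colnames(x)[idx[, 2]]),
  Frequency = x[idx],
  stringsAsFactors = FALSE
)

# Sort by descending frequency and keep the three most frequent pairs
pairs <- pairs[order(-pairs$Frequency), ]
head(pairs, 3)
```

For the sample data the three rows returned all involve John and each have a frequency of 2, matching the result the question asks for.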