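Sort only one column after grouping

Time: 2017-12-22 10:01:57

Tags: r dplyr data.table

How can I use dplyr to sort the individual entries of one column within groups? I would like to group by email and then sort each column A-Z within each group, but I can't figure out how to do this without sorting the whole data frame. Thank you very much!

Sample data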

df <- data.frame(
  cleanname = c("Steven Smith", "Rob Tan", 'Zachary', "Matthew"),
  dirtyname = c('rob Tan', 'stevesmith','zach', "Matthew"),
  email = c('hello@email.com', 'hello@email.com', 'email2@email.com', 'email2@email.com')
)

Desired end result

desireddf <- data.frame(
  cleanname = c("Rob Tan", "Steven Smith", "Zachary", "Matthew"),
  dirtyname = c('rob Tan', 'stevesmith','zach', 'Matthew'),
  email = c('hello@email.com', 'hello@email.com', 'email2@email.com', 'email2@email.com')
)

Edit

Thanks to Sotos for pointing out that my problem can be solved with fuzzy name matching.

2 answers:

Answer 0: (score: 1)

You can use the amatch function from the stringdist package:

library(stringdist)
library(dplyr)

# Reorder dirtyname (and email) so each row lines up with its closest cleanname
df %>%
  mutate(dirtyname = dirtyname[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)],
         email = email[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)])

which gives:

     cleanname  dirtyname            email
1 Steven Smith stevesmith  hello@email.com
2      Rob Tan    rob Tan  hello@email.com
3      Zachary       zach email2@email.com
4      Matthew    Matthew email2@email.com

The same logic applied with data.table:

library(data.table)
library(stringdist)

setDT(df)[, `:=` (dirtyname = dirtyname[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)],
                  email = email[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)])]
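
To see what the matching step does, here is a minimal sketch that uses only base R. It is not the answer's method: it swaps in utils::adist(), which computes Levenshtein distance, whereas amatch() defaults to optimal string alignment, so the two can disagree when names contain transposed characters. The vector names clean and dirty and the cutoff variable max_dist are assumptions made for this illustration.

```r
# Names taken from the question's sample data
clean <- c("Steven Smith", "Rob Tan", "Zachary", "Matthew")
dirty <- c("rob Tan", "stevesmith", "zach", "Matthew")

# Edit-distance matrix: rows are clean names, columns are dirty names
d <- adist(tolower(clean), tolower(dirty))

# For each clean name, take the position of the closest dirty name
idx <- apply(d, 1, which.min)

# Discard matches that are further away than the cutoff (assumed value,
# chosen to mirror maxDist = 3 in the answer)
max_dist <- 3
idx[d[cbind(seq_along(idx), idx)] > max_dist] <- NA

# Dirty names reordered to line up with the clean names
dirty[idx]
```

Indexing a column with idx is the same reordering that dirtyname[amatch(...)] performs in the answer.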

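Answer 1: (score: 0)

If the rows of a data frame represent distinct observations, it is not advisable to sort each column separately, because sorting the vectors independently means the rows no longer represent individual observations.

A vector can be sorted in several ways, for example with the order() function.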

dirtyname <- dirtyname[order(dirtyname)]
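
The difference between sorting one column and reordering whole rows can be illustrated with the question's sample data. This is a sketch in base R. The object names broken and ordered are made up for this example, and stringsAsFactors = FALSE is added so the columns are plain character vectors regardless of R version.

```r
df <- data.frame(
  cleanname = c("Steven Smith", "Rob Tan", "Zachary", "Matthew"),
  dirtyname = c("rob Tan", "stevesmith", "zach", "Matthew"),
  email = c("hello@email.com", "hello@email.com", "email2@email.com", "email2@email.com"),
  stringsAsFactors = FALSE
)

# Sorting a single column on its own: the other columns stay where they
# were, so each cleanname ends up paired with a different row's data
broken <- df
broken$cleanname <- sort(broken$cleanname)

# Reordering whole rows by email, then cleanname, keeps each observation intact
ordered <- df[order(df$email, df$cleanname), ]
```

In broken, "Matthew" now sits in a row with hello@email.com, although it originally belonged to email2@email.com. In ordered, every row still carries its original values. With dplyr, arrange(df, email, cleanname) performs the same row reordering.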