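How can I use dplyr to arrange the individual entries of a column within groups? I was hoping I could group by email and then sort each individual column A-Z, but I can't figure out how to do this without sorting the whole data frame. Thank you very much!
Sample data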
df <- data.frame(
cleanname = c("Steven Smith", "Rob Tan", 'Zachary', "Matthew"),
dirtyname = c('rob Tan', 'stevesmith','zach', "Matthew"),
email = c('hello@email.com', 'hello@email.com', 'email2@email.com', 'email2@email.com')
)
Desired end result
desireddf <- data.frame(
cleanname = c("Rob Tan", "Steven Smith", "Zachary", "Matthew"),
dirtyname = c('rob Tan', 'stevesmith','zach', 'Matthew'),
email = c('hello@email.com', 'hello@email.com', 'email2@email.com', 'email2@email.com')
)
Edit
Thanks to Sotos for pointing out that my problem can be solved with fuzzy name matching.
Answer 0 (score: 1)
You can use the amatch function from the stringdist package:
library(dplyr)
library(stringdist)
df %>%
  mutate(dirtyname = dirtyname[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)],
         email = email[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)])
which gives:
     cleanname  dirtyname            email
1 Steven Smith stevesmith  hello@email.com
2      Rob Tan    rob Tan  hello@email.com
3      Zachary       zach email2@email.com
4      Matthew    Matthew email2@email.com
With data.table:
library(data.table)
setDT(df)[, `:=` (dirtyname = dirtyname[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)],
                  email = email[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)])]
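To make the matching step more concrete, here is a rough base R sketch of the same idea. It does not use stringdist: it swaps in adist() from the utils package, which computes generalized Levenshtein distance, whereas amatch() uses a different default metric, so results may differ on other data. The names d and idx are only illustrative, and stringsAsFactors = FALSE is an added assumption so that the columns stay character vectors.

```r
# Sample data from the question, kept as character columns
df <- data.frame(
  cleanname = c("Steven Smith", "Rob Tan", "Zachary", "Matthew"),
  dirtyname = c("rob Tan", "stevesmith", "zach", "Matthew"),
  email = c("hello@email.com", "hello@email.com", "email2@email.com", "email2@email.com"),
  stringsAsFactors = FALSE
)

# Edit distance between every clean name (rows) and every dirty name (columns)
d <- adist(tolower(df$cleanname), tolower(df$dirtyname))

# For each clean name, the position of the closest dirty name
idx <- apply(d, 1, which.min)

# Treat anything further than 3 edits as "no match", mirroring maxDist = 3
idx[d[cbind(seq_along(idx), idx)] > 3] <- NA

# Reorder dirtyname so that each row lines up with its clean name
df$dirtyname <- df$dirtyname[idx]
df
```

Each clean name is paired with its nearest dirty name, which is what indexing dirtyname by the result of amatch() does in the answer above.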
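Answer 1 (score: 0)
If the rows of the data frame represent distinct observations, it is not advisable to sort each column separately: sorting the vectors independently means that a row no longer represents a single observation.
Vectors can be sorted in several ways, for example with the order() function:
df$dirtyname <- df$dirtyname[order(df$dirtyname)]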
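As an illustration of that caveat, the following base R sketch sorts cleanname within each email group using ave() and leaves the other columns alone. The data is the sample from the question; stringsAsFactors = FALSE and the name sorted are assumptions added for the example.

```r
# Sample data from the question, kept as character columns
df <- data.frame(
  cleanname = c("Steven Smith", "Rob Tan", "Zachary", "Matthew"),
  dirtyname = c("rob Tan", "stevesmith", "zach", "Matthew"),
  email = c("hello@email.com", "hello@email.com", "email2@email.com", "email2@email.com"),
  stringsAsFactors = FALSE
)

# Sort cleanname alphabetically inside each email group;
# dirtyname and email keep their original order
sorted <- df
sorted$cleanname <- ave(df$cleanname, df$email, FUN = sort)
sorted
```

In the second group the sorted cleanname column now puts "Matthew" next to "zach", so the rows no longer describe the same person. With dplyr loaded, the same per-group sort could be written as df %>% group_by(email) %>% mutate(cleanname = sort(cleanname)), and it has the same drawback.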