Find duplicates with a grouping variable

Date: 2019-04-15 12:49:55

Tags: r duplicates identify

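I have a df that looks like this:

I guess this can be done with dplyr and duplicated, but I don't know how to handle multiple columns while also distinguishing between the levels of the grouping variable.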

from  to  group
1     2   metro
2     4   metro
3     4   metro
4     5   train
6     1   train
8     7   train

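I want to find the ids (in either from or to) that occur in more than one group.

The expected result for the sample df is: 1 and 4, because they occur in both the metro and the train group.

Thanks in advance!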

3 answers:

Answer 0: (score: 3)

Using base R, we can split the values of the first two columns by group and use intersect to find the values that the groups have in common.
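A minimal sketch of the split/intersect idea described above, not necessarily the answerer's exact code. The names ids, ids_by_group and shared are illustrative, and df1 is rebuilt from the sample data so the snippet runs on its own.

```r
# Sample data from the question, rebuilt so the snippet is self-contained
df1 <- data.frame(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = c("metro", "metro", "metro", "train", "train", "train")
)

# Stack the from and to columns into a single vector of ids
ids <- unlist(df1[1:2], use.names = FALSE)

# Repeat the group labels once per stacked column, then split the ids by group
ids_by_group <- split(ids, rep(df1$group, 2))

# Keep only the ids that appear in every group
shared <- Reduce(intersect, ids_by_group)
shared
```

With exactly two groups, "in every group" and "in more than one group" coincide. With three or more groups, Reduce(intersect, ...) returns only the ids present in all of them, so a counting approach would be needed to catch ids shared by just some of the groups.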

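Answer 1: (score: 1)

We gather the 'from' and 'to' columns into 'long' format, group by 'val', filter the groups of 'val' that have more than one unique 'group' value, and then pull the distinct 'val' elements.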

library(dplyr)
library(tidyr)
df1 %>% 
   gather(key, val, from:to) %>% 
   group_by(val) %>% 
   filter(n_distinct(group) > 1) %>%
   distinct(val) %>%
   pull(val)
#[1] 1 4

Or, using base R, we can build a frequency table with table and read off the ids that occur in more than one group.

out <-  with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"

Data

df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L, 
 4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train", 
 "train", "train")), class = "data.frame", row.names = c(NA, -6L
 ))

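Answer 2: (score: 1)

Use data.table to convert the data to long format and count the unique values. melt converts to long format, and data.table allows filtering in the i part of df1[i, j, by], grouping in the by part, and pulling the values in the j part.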
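A minimal sketch of how the melt-and-count steps described above might look, assuming the data.table package is installed; this is not necessarily the answerer's exact code. The names dt, long, counts, n_groups and shared are illustrative.

```r
library(data.table)

# Sample data from the question as a data.table
dt <- data.table(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = c("metro", "metro", "metro", "train", "train", "train")
)

# Long format: one row per (group, id) pair taken from the from and to columns
long <- melt(dt, id.vars = "group", measure.vars = c("from", "to"),
             variable.name = "key", value.name = "val")

# by: group rows by id; j: count the distinct groups each id appears in
counts <- long[, .(n_groups = uniqueN(group)), by = val]

# i: keep ids seen in more than one group; j: return them as a plain vector
shared <- counts[n_groups > 1, val]
shared
```

Because this counts groups per id rather than intersecting, it should also work with more than two groups.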