Merging rows from the same data frame

Time: 2016-04-06 13:19:25

Tags: r merge

I would like to know whether it is possible to merge different rows of a data frame when they share a common field:

Input:

df = rbind(c("01/01/2016",01:02:30,"100","character(0)","file A"),
           c("02/01/2016",9:02:30,"character(0)", 3, "file A"),
           c("02/01/2016",8:30:30,"200","character(0)","file B"),
           c("03/01/2016",8:25:30,"50","character(0)","file C"),
           c("04/01/2016",17:20:30,"character(0)","600","file B"))

Output:

df = rbind(c(01/01/2016,01:02:30,"100",3,"file A"),
           c(02/01/2016,8:30:30,"200",600,"file B"),
           c(03/01/2016,8:25:30,"50","character(0)","file C"))

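So you can see that we merge rows based on the last value (file A, file B or file C). I need to keep the earliest date. For example, for "file A" we have two dates, 01/01/2016 and 02/01/2016, and we want to keep the earliest one. We never have more than 2 rows to merge per value.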

1 Answer:

Answer 0: (score: 2)

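Based on your comment, you want to find, for each value of the grouping column (the "file A/B/C" column in your case), the first non-missing value of every other column, with the rows ordered by one column.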

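First, you have to clean up your data a bit. Your data loading step is broken because of some misplaced quotes around the timestamps. I am also assuming that you use the character(0) values to denote missing values. If so, use NAs instead. Here are the data initialization and cleaning steps: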

# prepare your data
df = data.frame(V1 = c("01/01/2016 01:02:30","02/01/2016 9:02:30","02/01/2016 8:30:30",
                       "03/01/2016 8:25:30","04/01/2016 17:20:30"),
                V2 = c("100","character(0)","200","50","character(0)"),
                V3 = c("character(0)", "3", "character(0)","character(0)", "600"),
                V4 = c("file A", "file A", "file B", "file C", "file B"))

# replace the character(0)s with NAs as they are missing values
df[df == "character(0)"] <- NA

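# convert character dates to date-times; strptime() returns POSIXlt,
# so wrap it in as.POSIXct(), which is the safer class for a data frame column
df$V1 <- as.POSIXct(strptime(as.character(df[ ,1]), format = "%d/%m/%Y %H:%M:%S"))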

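I have named the columns V1..V4 for you, but you will probably want more descriptive names. To get what you need, you can use the na.locf() function from the zoo package to fill in the missing values of a column. To avoid cross-contaminating data between different values of column V4, I loop over the data. (There may be a better solution...) Here is a function that performs the custom row merge: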

library(zoo)  # provides na.locf()

custom_row_merge <- function(df,
                             sort_by,
                             group_by){

    # sort by group, then by date in increasing order (earliest row first)
    df <- df[order(df[,group_by], df[,sort_by]), ]

    # select the columns to merge
    columns_to_merge <- names(df)[!(names(df) %in% c(sort_by, group_by))]

    # fill data for each unique value of group by column
    for (file_type in unique(df[, group_by])){

        row_indices <- (df[,group_by] == file_type)

        # fill missing values for each column that is not group by or sort by
        for (column_name in columns_to_merge){

            df[row_indices, column_name] <- na.locf(df[row_indices, column_name],
                                                    na.rm = FALSE,
                                                    fromLast = TRUE)
        }    

    }

    # get the first (earliest) occurrence of each file, now with the filled values
    return(df[!duplicated(df[, group_by]), ])

}
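
To see why the function fills with fromLast = TRUE, here is a minimal sketch on the two "file A" rows alone, already sorted by date. The vector names v2 and v3 are made up for this illustration; only na.locf() and its na.rm and fromLast arguments come from zoo.

```r
library(zoo)  # na.locf() comes from the zoo package

# Column values of the two "file A" rows, earliest row first
# (v2 and v3 are illustrative names, not part of the answer's code)
v2 <- c("100", NA)
v3 <- c(NA, "3")

# fromLast = TRUE carries the NEXT observation backward, so the first
# (earliest) row picks up any value that only appears in a later row.
# na.rm = FALSE keeps the vector length unchanged when an NA cannot be filled.
filled_v2 <- na.locf(v2, na.rm = FALSE, fromLast = TRUE)
filled_v3 <- na.locf(v3, na.rm = FALSE, fromLast = TRUE)

print(filled_v2)
print(filled_v3)
```

After the fill, the first element of each vector holds the first non-missing value of its group, which is why keeping only the first row per group is enough.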

Here is the original data frame:

> df
                   V1   V2   V3     V4
1 2016-01-01 01:02:30  100 <NA> file A
2 2016-01-02 09:02:30 <NA>    3 file A
3 2016-01-02 08:30:30  200 <NA> file B
4 2016-01-03 08:25:30   50 <NA> file C
5 2016-01-04 17:20:30 <NA>  600 file B

And here is the data frame produced by the function, which matches what you described in your question:

> custom_row_merge(df, "V1", "V4")
                   V1  V2   V3     V4
1 2016-01-01 01:02:30 100    3 file A
3 2016-01-02 08:30:30 200  600 file B
4 2016-01-03 08:25:30  50 <NA> file C
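
As for the "better solution" hinted at earlier, one possible loop-free variant in base R is sketched below. It is not the approach used above, just an alternative under the same assumptions (missing values are NA, timestamps are already parsed). The helpers first_non_na and merge_group are hypothetical names introduced for this sketch, and the timestamps are typed in directly with tz = "UTC" to keep the example self-contained.

```r
# Same sample data, already cleaned: NA for missing values, POSIXct timestamps
df <- data.frame(
  V1 = as.POSIXct(c("2016-01-01 01:02:30", "2016-01-02 09:02:30",
                    "2016-01-02 08:30:30", "2016-01-03 08:25:30",
                    "2016-01-04 17:20:30"), tz = "UTC"),
  V2 = c("100", NA, "200", "50", NA),
  V3 = c(NA, "3", NA, NA, "600"),
  V4 = c("file A", "file A", "file B", "file C", "file B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing element of a vector (NA if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

# Hypothetical helper: collapse one group to a single row
merge_group <- function(g) {
  out <- g[1, , drop = FALSE]  # earliest row supplies V1 and V4
  for (col in c("V2", "V3")) out[[col]] <- first_non_na(g[[col]])
  out
}

# Sort by group, then by date, so the earliest row of each group comes first
df <- df[order(df$V4, df$V1), ]

# Split by group, collapse each piece, and stack the results
merged <- do.call(rbind, lapply(split(df, df$V4), merge_group))
rownames(merged) <- NULL

print(merged)
```

This avoids the zoo dependency, at the cost of hard-coding the value columns V2 and V3 inside merge_group.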

If you like, you can of course fill the missing values back in with character(0) values.
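
A minimal sketch of that last step, assuming the placeholder you want is the literal string "character(0)" as in your input (not R's zero-length character(0) object, which cannot be stored in a single cell). The merged result is retyped here without the timestamp column to keep the example short, and value_cols is a name made up for this sketch.

```r
# Merged result from above, retyped without the timestamp column for brevity
merged <- data.frame(
  V2 = c("100", "200", "50"),
  V3 = c("3", "600", NA),
  V4 = c("file A", "file B", "file C"),
  stringsAsFactors = FALSE
)

# Columns in which NA should be turned back into the placeholder string
# (value_cols is an illustrative name)
value_cols <- c("V2", "V3")

# replace() swaps the NA entries of each column for the literal string
merged[value_cols] <- lapply(merged[value_cols],
                             function(x) replace(x, is.na(x), "character(0)"))

print(merged)
```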