I would like to know whether it is possible to merge different rows of a data frame if they share a common field:
Input:
df = rbind(c("01/01/2016",01:02:30,"100","character(0)","file A"),
c("02/01/2016",9:02:30,"character(0)", 3, "file A"),
c("02/01/2016",8:30:30,"200","character(0)","file B"),
c("03/01/2016",8:25:30,"50","character(0)","file C"),
c("04/01/2016",17:20:30,"character(0)","600","file B"))
Output:
df = rbind(c(01/01/2016,01:02:30,"100",3,"file A"),
c(02/01/2016,8:30:30,"200",600,"file B"),
c(03/01/2016,8:25:30,"50","character(0)","file C"))
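So, as you can see, we merge rows based on the last value (file A, file B, or file C). I need to keep the earliest date: for example, for "file A" we have two dates, 01/01/2016 and 02/01/2016, and we want to keep the earliest one. We will never merge more than 2 rows per value.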
Answer 0 (score: 2)
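Based on your comments, you want to find, for each value of a grouping column (in your case the "file A/B/C" column), the first non-missing value of every other column, with rows ordered by one column.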
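First, you have to clean up your data a bit. Your data loading step is broken because the quoting around the timestamps is wrong. Also, I assume you intend the character(0) values to represent missing values. If so, use NAs instead. Here is the data initialization and cleaning step: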
# prepare your data
df = data.frame(V1 = c("01/01/2016 01:02:30", "02/01/2016 9:02:30", "02/01/2016 8:30:30",
                       "03/01/2016 8:25:30", "04/01/2016 17:20:30"),
                V2 = c("100", "character(0)", "200", "50", "character(0)"),
                V3 = c("character(0)", "3", "character(0)", "character(0)", "600"),
                V4 = c("file A", "file A", "file B", "file C", "file B"),
                stringsAsFactors = FALSE)  # keep strings as character, not factor
# replace the character(0)s with NAs as they are missing values
df[df == "character(0)"] <- NA
# convert character dates to date-times; POSIXct is a plain vector, which is
# more convenient in a data frame column than the POSIXlt returned by strptime()
df$V1 <- as.POSIXct(strptime(df$V1, format = "%d/%m/%Y %H:%M:%S"))
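To illustrate why the unquoted timestamps in the question are a problem (a small sketch, not part of the original code): R does not read 9:02:30 as a time. It parses it as the sequence expression (9:2):30, which evaluates to the integers 9 through 30 and raises a warning, so the timestamp is lost. Quoting the value keeps it as text that strptime() can parse. The variable names and the UTC time zone below are assumptions made for the demo.

```r
# Unquoted, 9:02:30 is parsed as (9:2):30; only the first element of 9:2 is
# used as the start, so the result is the integer sequence 9, 10, ..., 30.
unquoted <- suppressWarnings(9:02:30)

# Quoted, the same characters stay a string that can be parsed as a date-time.
quoted <- "02/01/2016 9:02:30"
parsed <- as.POSIXct(strptime(quoted, format = "%d/%m/%Y %H:%M:%S", tz = "UTC"))

print(unquoted)                              # a numeric sequence, not a time
print(format(parsed, "%Y-%m-%d %H:%M:%S"))   # the parsed date-time as text
```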
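I have named the columns V1..4 for you, but you may want more descriptive names. To get what you need, you can use the na.locf() function from the zoo package to fill in the missing values of each column. To avoid cross-contamination of data between different values of the V4 column, I loop over the groups. (There may be a better solution...)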
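As a quick illustration of the filling step, here is a minimal sketch of na.locf() applied to the two value columns of the "file A" rows from the example data (the vector names are made up for the demo). With fromLast = TRUE each NA is filled from the next non-missing value after it, and na.rm = FALSE keeps the vector length unchanged, so a trailing NA stays in place.

```r
library(zoo)  # na.locf() comes from the zoo package

# The two "file A" rows, already sorted by date (names are illustrative only)
v2_file_a <- c("100", NA)
v3_file_a <- c(NA, "3")

# Fill backwards: an NA takes the next non-missing value that follows it
v2_filled <- na.locf(v2_file_a, na.rm = FALSE, fromLast = TRUE)
v3_filled <- na.locf(v3_file_a, na.rm = FALSE, fromLast = TRUE)

print(v2_filled)  # first element is still "100"; the trailing NA stays NA
print(v3_filled)  # the leading NA is replaced by "3"
```

After the fill, the first row of each group holds the first non-missing value of every column, which is why keeping only that row is enough.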
Here is a function that performs the custom row merge:
library(zoo)  # provides na.locf()

custom_row_merge <- function(df,
                             sort_by,
                             group_by){
  # sort by group, then by date in increasing order (earliest first)
  df <- df[order(df[, group_by], df[, sort_by]), ]
  # select the columns to merge
  columns_to_merge <- names(df)[!(names(df) %in% c(sort_by, group_by))]
  # fill data for each unique value of group by column
  for (file_type in unique(df[, group_by])){
    row_indices <- (df[, group_by] == file_type)
    # fill missing values for each column that is not group by or sort by
    for (column_name in columns_to_merge){
      df[row_indices, column_name] <- na.locf(df[row_indices, column_name],
                                              na.rm = FALSE,
                                              fromLast = TRUE)
    }
  }
  # get first occurrence of each file, now with the filled values
  return(df[!duplicated(df[, group_by]), ])
}
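Since the loop is only one way to do this, here is a hedged alternative sketch in base R that does not need zoo. It splits the sorted data by group and, for each group, keeps the earliest row while taking the first non-missing value of every other column. The names first_non_na, merge_first_non_na, and example are invented for this illustration and do not come from any package; the sketch assumes the missing values are already NA and the sort column is already a date-time.

```r
# Hypothetical helper: first non-missing element of x, or a typed NA if none.
first_non_na <- function(x) x[which(!is.na(x))[1]]

# Hypothetical alternative to the loop-based merge, using split() and lapply().
merge_first_non_na <- function(df, sort_by, group_by) {
  # sort by group, then by the sort column (earliest first)
  df <- df[order(df[[group_by]], df[[sort_by]]), ]
  value_cols <- setdiff(names(df), c(sort_by, group_by))
  pieces <- lapply(split(df, df[[group_by]]), function(g) {
    out <- g[1, , drop = FALSE]  # earliest row keeps its date and group
    for (col in value_cols) out[[col]] <- first_non_na(g[[col]])
    out
  })
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL  # discard the row names produced by rbind
  result
}

# Example data mirroring the cleaned data frame from the answer
example <- data.frame(
  V1 = as.POSIXct(c("2016-01-01 01:02:30", "2016-01-02 09:02:30",
                    "2016-01-02 08:30:30", "2016-01-03 08:25:30",
                    "2016-01-04 17:20:30"), tz = "UTC"),
  V2 = c("100", NA, "200", "50", NA),
  V3 = c(NA, "3", NA, NA, "600"),
  V4 = c("file A", "file A", "file B", "file C", "file B"),
  stringsAsFactors = FALSE)

result <- merge_first_non_na(example, "V1", "V4")
print(result)
```

Taking the first non-missing value per group gives the same answer as filling backwards and then keeping the first row, but it never touches rows outside the group, so no extra care is needed to avoid values leaking between files.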
Here is the original data frame:
> df
                   V1   V2   V3     V4
1 2016-01-01 01:02:30  100 <NA> file A
2 2016-01-02 09:02:30 <NA>    3 file A
3 2016-01-02 08:30:30  200 <NA> file B
4 2016-01-03 08:25:30   50 <NA> file C
5 2016-01-04 17:20:30 <NA>  600 file B
And here is the one produced by the function, which matches what you described in your question:
> custom_row_merge(df, "V1", "V4")
                   V1  V2   V3     V4
1 2016-01-01 01:02:30 100    3 file A
3 2016-01-02 08:30:30 200  600 file B
4 2016-01-03 08:25:30  50 <NA> file C
If you wish, you can of course fill the missing values with character(0) values afterwards.
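As a rough sketch of that last step, assuming you want the literal string "character(0)" back (a data frame cell cannot hold an actual zero-length vector), the NAs in a character column can be replaced by indexing with is.na(). The merged data frame below is a hand-written stand-in for the function's result, without the date column.

```r
# Stand-in for the merged result (illustrative values copied from the output)
merged <- data.frame(V2 = c("100", "200", "50"),
                     V3 = c("3", "600", NA),
                     V4 = c("file A", "file B", "file C"),
                     stringsAsFactors = FALSE)

# Replace the remaining NAs in V3 with the placeholder string
merged$V3[is.na(merged$V3)] <- "character(0)"

print(merged)
```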