Let's pretend my dataset looks like this:
working_data <- dplyr::data_frame("Date" = c("2015-01-01", "2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-04", "2015-01-04"),
"Time" = c("15:01", "15:01", "21:04", "13:19", "07:15", "07:15", "07:15"),
"SeizureTime" = c("0:10", "0:07", "0:11", "0:04", "0:08", "0:06", "0:07"),
"ET" = c("0:35", "0:35", "0:04", "1:10", "3:35", "3:35", "3:35"),
"ONumber" = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
"TNumber" = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
"CT" = c("a", "a", "b", "a", "b", "b", "b"))
I want to extract the rows that are possible duplicates from this data. This is how I am doing it at the moment:
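library(dplyr)  # provides %>% and filter()
while (nrow(working_data) != 0) {
  # Take the first remaining row as the call to compare against
  target_call <- working_data[1, ]
  working_data <- working_data[-1, ]
  # Collect the remaining rows that match target_call on every key column
  similar_calls <- working_data %>% dplyr::filter(Date == target_call$Date,
                                                  Time == target_call$Time,
                                                  ET == target_call$ET,
                                                  ONumber == target_call$ONumber,
                                                  TNumber == target_call$TNumber)
  # ... run the comparison on target_call and similar_calls here ...
}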
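The first pass through the loop sets target_call to the first row of working_data and similar_calls to the second row. Assuming all goes well, the problem I run into is that once I have run my function on target_call and similar_calls, I never want to see those rows again. So I want to remove from working_data the rows that were pulled into similar_calls.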
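After populating target_call and similar_calls, I need to decide which calls (if any) are the same as target_call, and then determine which of them is the correct call to keep. Once I have picked the correct call, I add it to a new dataset named resolved_calls. If any calls are left over in similar_calls, I need to repeat that analysis and add one of them to resolved_calls.
The best approach I can think of is to split the data into two separate data frames, but I don't know how to do that when I am matching on multiple columns. My only option seems to be a really ugly ifelse statement like:
working_data$Group <- ifelse(working_data$Date == target_call$Date & ... & working_data$TNumber == target_call$TNumber, 1, 0)
similar_calls <- working_data %>% dplyr::filter(Group == 1)
working_data <- working_data %>% dplyr::filter(Group == 0)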
Is there a better way?
Answer 0 (score: 1)
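You haven't really described what you want to do with each group, but let's pretend you just want to grab the first element of each group of similar calls. In that case a function like duplicated works fine: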
working_data[with(working_data, !duplicated(paste(Date, Time, ET, ONumber, TNumber))),]
# Source: local data frame [4 x 7]
#
# Date Time SeizureTime ET ONumber TNumber CT
# (chr) (chr) (chr) (chr) (chr) (chr) (chr)
# 1 2015-01-01 15:01 0:10 0:35 (123)555-1234 (123)555-1234 a
# 2 2015-01-02 21:04 0:11 0:04 (123)555-9999 (123)555-9999 b
# 3 2015-01-03 13:19 0:04 1:10 (000)555-9876 (000)555-9876 a
# 4 2015-01-04 07:15 0:08 3:35 (123)555-1111 (123)555-1111 b
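One caveat with pasting the key columns together is that, in principle, two different rows could produce the same pasted string. As a minimal sketch (not code from the question; key_cols, keys, first_of_group and in_dup_group are illustrative names), duplicated can also be called on the data frame of key columns directly, and combining it with fromLast = TRUE marks every row that belongs to a group with more than one member:

```r
# Same example data as in the question, built with base R only
working_data <- data.frame(
  Date = c("2015-01-01", "2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-04", "2015-01-04"),
  Time = c("15:01", "15:01", "21:04", "13:19", "07:15", "07:15", "07:15"),
  SeizureTime = c("0:10", "0:07", "0:11", "0:04", "0:08", "0:06", "0:07"),
  ET = c("0:35", "0:35", "0:04", "1:10", "3:35", "3:35", "3:35"),
  ONumber = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
  TNumber = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
  CT = c("a", "a", "b", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Columns that define "the same call" (illustrative name)
key_cols <- c("Date", "Time", "ET", "ONumber", "TNumber")
keys <- working_data[, key_cols]

# First row of every group of similar calls
first_of_group <- working_data[!duplicated(keys), ]

# Every row that shares its key with at least one other row
in_dup_group <- working_data[duplicated(keys) | duplicated(keys, fromLast = TRUE), ]
```

With the example data this leaves four rows in first_of_group and five rows in in_dup_group (original rows 1, 2, 5, 6 and 7).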
In dplyr syntax, you can use group_by to group by the corresponding columns, and then use filter with row_number to grab the first instance in each group:
working_data %>%
group_by(Date, Time, ET, ONumber, TNumber) %>%
filter(row_number() == 1)
# Source: local data frame [4 x 7]
# Groups: Date, Time, ET, ONumber, TNumber [4]
#
# Date Time SeizureTime ET ONumber TNumber CT
# (chr) (chr) (chr) (chr) (chr) (chr) (chr)
# 1 2015-01-01 15:01 0:10 0:35 (123)555-1234 (123)555-1234 a
# 2 2015-01-02 21:04 0:11 0:04 (123)555-9999 (123)555-9999 b
# 3 2015-01-03 13:19 0:04 1:10 (000)555-9876 (000)555-9876 a
# 4 2015-01-04 07:15 0:08 3:35 (123)555-1111 (123)555-1111 b
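For the specific case of keeping the first row per group, dplyr also provides distinct. A minimal sketch, assuming a dplyr version in which distinct accepts the .keep_all argument (first_calls is only an illustrative name):

```r
library(dplyr)

# Same example data as in the question
working_data <- data.frame(
  Date = c("2015-01-01", "2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-04", "2015-01-04"),
  Time = c("15:01", "15:01", "21:04", "13:19", "07:15", "07:15", "07:15"),
  SeizureTime = c("0:10", "0:07", "0:11", "0:04", "0:08", "0:06", "0:07"),
  ET = c("0:35", "0:35", "0:04", "1:10", "3:35", "3:35", "3:35"),
  ONumber = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
  TNumber = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
  CT = c("a", "a", "b", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Keep the first row for each combination of the key columns;
# .keep_all = TRUE retains the non-key columns (SeizureTime, CT) as well
first_calls <- working_data %>%
  distinct(Date, Time, ET, ONumber, TNumber, .keep_all = TRUE)
```

Unlike the group_by version, the result is not left grouped, so no ungroup step is needed before further processing.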
If you want to do more involved processing of each group, you can use group_by followed by summarize to aggregate the groups in different ways:
# Take text data in format mm:ss and return the number of seconds
secs <- function(x) {
spl <- strsplit(x, ":")
60*as.numeric(sapply(spl, "[", 1)) + as.numeric(sapply(spl, "[", 2))
}
working_data %>%
group_by(Date, Time, ET, ONumber, TNumber) %>%
summarize(meanSeizure=mean(secs(SeizureTime)))
# Source: local data frame [4 x 6]
# Groups: Date, Time, ET, ONumber [?]
#
# Date Time ET ONumber TNumber meanSeizure
# (chr) (chr) (chr) (chr) (chr) (dbl)
# 1 2015-01-01 15:01 0:35 (123)555-1234 (123)555-1234 8.5
# 2 2015-01-02 21:04 0:04 (123)555-9999 (123)555-9999 11.0
# 3 2015-01-03 13:19 1:10 (000)555-9876 (000)555-9876 4.0
# 4 2015-01-04 07:15 3:35 (123)555-1111 (123)555-1111 7.0
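If the per-group decision really has to stay inside a loop as in the question, the two-way split on several columns can be sketched with dplyr's semi_join and anti_join instead of a long ifelse. This is only a sketch under assumptions: key_cols is an illustrative name, similar_calls here includes target_call itself, and "keep the first row" is a placeholder for whatever rule actually decides which call is correct.

```r
library(dplyr)

# Same example data as in the question
working_data <- data.frame(
  Date = c("2015-01-01", "2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-04", "2015-01-04"),
  Time = c("15:01", "15:01", "21:04", "13:19", "07:15", "07:15", "07:15"),
  SeizureTime = c("0:10", "0:07", "0:11", "0:04", "0:08", "0:06", "0:07"),
  ET = c("0:35", "0:35", "0:04", "1:10", "3:35", "3:35", "3:35"),
  ONumber = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
  TNumber = c("(123)555-1234", "(123)555-1234", "(123)555-9999", "(000)555-9876", "(123)555-1111", "(123)555-1111", "(123)555-1111"),
  CT = c("a", "a", "b", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Columns that define "the same call" (illustrative name)
key_cols <- c("Date", "Time", "ET", "ONumber", "TNumber")

# Start with an empty data frame that has the same columns
resolved_calls <- working_data[0, ]

while (nrow(working_data) > 0) {
  target_call <- working_data[1, ]
  # Rows matching target_call on every key column (target_call included)
  similar_calls <- semi_join(working_data, target_call, by = key_cols)
  # Everything else; the matched group is never visited again
  working_data <- anti_join(working_data, target_call, by = key_cols)
  # Placeholder rule (assumption): keep the first call of the group
  chosen <- similar_calls[1, ]
  resolved_calls <- bind_rows(resolved_calls, chosen)
}
```

Because each pass removes a whole group from working_data, the loop runs once per group rather than once per row, and it ends with working_data empty and one row per group in resolved_calls.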