Question

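I have some sales data that gets corrected after it is recorded at the point of sale. The data set still contains the original erroneous record, followed by a copy of that record with a negative price. How can I conditionally delete these observations, along the lines of "if Price < 0, delete that observation and the corresponding observation where Price = Price * -1, Date = Date, Type = Type, Weight = Weight"?

The data are structured as follows: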

Date      Type     Weight     Price
5/5/16    A        15         34
5/5/16    A        15         -34
5/5/16    B        15         43

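A further problem is that the corrected errors are not the only duplicates: the Type, Weight and Price entries can also have multiple valid duplicates. For example, ten items of type A, each weighing 10 lb, could be sold on the same date for $34 each. I added a column that numbers the repeats of each absolute price:

test2 <- test %>% dplyr::group_by(Date, Type, Weight, ABS_Price) %>% dplyr::mutate(replicate = seq(n()))

So how do I code "if an observation has Price < 0, then delete the observation where replicate = x - 1"?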

Answer 1

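A simple dplyr solution. Group the rows by the combination of keys that defines a duplicate (note that we can apply a transformation to a column on the fly), then keep only the singleton groups.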

library(dplyr)

with.dups <- read.csv(...)
without.dups <- with.dups %>% 
    group_by(Date, Type, Weight, abs(Price)) %>% 
    filter(n()==1) %>%
    as.data.frame  # you can omit this part if you don't need to transform the resulting tibble table to a vanilla data.frame

Test data:

Date,Type,Weight,Price
5/5/16,A,15,34
5/5/16,A,15,-34
5/5/16,B,15,43

Test output:

    Date Type Weight Price abs(Price)
1 5/5/16    B     15    43         43
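As a self-contained sketch of the same idea, the test data can be read from an inline string instead of a file. The column name AbsPrice is an assumption introduced here so the helper column can be dropped again before returning the result.

```r
library(dplyr)

# Inline copy of the test data above, so the example runs without a file
with.dups <- read.csv(text = "Date,Type,Weight,Price
5/5/16,A,15,34
5/5/16,A,15,-34
5/5/16,B,15,43", stringsAsFactors = FALSE)

without.dups <- with.dups %>%
  # AbsPrice is a hypothetical helper name for the on-the-fly grouping key
  group_by(Date, Type, Weight, AbsPrice = abs(Price)) %>%
  filter(n() == 1) %>%      # keep only groups that contain a single row
  ungroup() %>%
  select(-AbsPrice) %>%     # drop the helper column again
  as.data.frame()
```

Note that this keeps a row only when its group has exactly one member, so a group that also contains valid repeated sales would be dropped entirely.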

Answer 2

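I added one more row to your example to capture the possible edge case of two transactions with matching keys, where we might want to remove only the first match.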

df <- read.table(
  header = T, 
  stringsAsFactors = F,
  text = "Date      Type     Weight     Price
5/5/16    A        15         34
5/5/16    A        15         34
5/5/16    A        15         -34
5/5/16    B        15         43")

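My approach is to look for a match where everything is the same (including the transaction number among records with those key values) but the price has the opposite sign. If there is one, cut the row: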

library(dplyr)
df2 <- df %>%
  group_by(Date, Type, Weight, Price) %>%
  mutate(repeat_count = row_number()) %>%
  ungroup()

left_join(df2,
          df2 %>% mutate(Price = -Price, cut_flag = FALSE)) %>%
  filter(is.na(cut_flag)) %>%
  select(-cut_flag)

# A tibble: 2 x 5
  Date   Type  Weight Price repeat_count
  <chr>  <chr>  <int> <int>        <int>
1 5/5/16 A         15    34            2
2 5/5/16 B         15    43            1
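The same cut can probably be written with dplyr's anti_join, which keeps the rows of the first table that have no match in the second. The sketch below rebuilds the data so it runs on its own; keys and kept are names introduced here for illustration.

```r
library(dplyr)

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Date Type Weight Price
5/5/16 A 15 34
5/5/16 A 15 34
5/5/16 A 15 -34
5/5/16 B 15 43")

# Number repeated transactions within each exact (Date, Type, Weight, Price) key
df2 <- df %>%
  group_by(Date, Type, Weight, Price) %>%
  mutate(repeat_count = row_number()) %>%
  ungroup()

# Hypothetical helper: the columns that must match for a row to be cut
keys <- c("Date", "Type", "Weight", "Price", "repeat_count")

# Drop every row that has a sign-flipped twin with the same repeat_count
kept <- anti_join(df2, df2 %>% mutate(Price = -Price), by = keys)
```

Spelling out the join columns in by also avoids relying on the natural join picking the intended columns.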

Answer 3

We can do this in base R using duplicated. Using @Jon Spring's data:

df[!((duplicated(df[1:3]) | duplicated(df[1:3], fromLast = TRUE)) & 
     (duplicated(abs(df$Price)) | duplicated(abs(df$Price), fromLast = TRUE))), ]

#    Date Type Weight Price
#4 5/5/16    B     15    43

This assumes that the Date, Type and Weight columns are at positions 1:3. If the positions are not fixed, you can also select the columns by name:

df[!((duplicated(df[c("Date", "Type", "Weight")]) | 
      duplicated(df[c("Date", "Type", "Weight")], fromLast = TRUE)) & 
      (duplicated(abs(df$Price)) | duplicated(abs(df$Price), fromLast = TRUE))), ]
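The two duplicated conditions are evaluated independently, so on the four-row data every A row is dropped, including the valid repeat. If only the matched pair should go, one possible base R sketch numbers the repeats of each exact record and removes a row only when a sign-flipped record with the same number exists. The names key, occ, id, neg_id and result are assumptions introduced here, and the approach assumes no record has a price of exactly 0.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Date Type Weight Price
5/5/16 A 15 34
5/5/16 A 15 34
5/5/16 A 15 -34
5/5/16 B 15 43")

# Hypothetical helper: one string per (Date, Type, Weight) combination
key <- paste(df$Date, df$Type, df$Weight, sep = "|")

# Occurrence number of each row within its (key, Price) group
occ <- ave(seq_len(nrow(df)), key, df$Price, FUN = seq_along)

# Identifier of each row, and the identifier its sign-flipped twin would have
id     <- paste(key, df$Price, occ, sep = "|")
neg_id <- paste(key, -df$Price, occ, sep = "|")

# Keep rows whose identifier is not the flipped identifier of some other row
result <- df[!(id %in% neg_id), ]
```

On this data the first 34 record and the -34 record cancel each other, leaving the second 34 record and the B record.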

Answer 4

Slightly different from Ronak's answer, but built on a similar premise, using which():

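df$Price <- abs(df$Price)  # take absolute value, making all entries positive

dups <- which(duplicated(df))  # find positions of duplicates, where all columns match

# drop each duplicate and the row just before it; skip when nothing matched,
# because df[-integer(0), ] would return zero rows
newdf <- if (length(dups) > 0) df[-c(dups - 1, dups), ] else df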

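This removes every duplicated row together with the row immediately before it, so it relies on each correction sitting directly after the record it reverses.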

How can I write R code to remove duplicated rows where one observation is the negative of its duplicate?
