Question

Consider the following data frame with columns "id" and "x", where each id is repeated four times. The data are as follows:

df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                "x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))

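The question is how to subset the data frame according to the following criteria:

(1) Keep all entries for an id if its values in column x contain no 3, or if the only 3 is the last value.

(2) For an id with more than one 3 in column x, keep all values up to and including the first 3 and drop the remaining 3s. The expected output is shown below: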

data.frame("id"=c(1,1,1,1,2,2,3,3,3,3,4,4,4),
           "x"=c(2,2,1,1,2,3,1,2,2,3,2,2,3))

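I am familiar with using the 'filter' function from the dplyr package to subset data, but the complexity of the criteria above has me stuck in this particular case. Any help would be greatly appreciated.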

Answer 1

Here is a solution that creates a couple of helper columns to do the filtering:

library(dplyr)

df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
               "x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))

df %>%
  group_by(id) %>%                                    # for each id
  mutate(num_threes = sum(x == 3),                    # count number of 3s
         flag = ifelse(unique(num_threes) > 0,        # if there is a 3
                        min(row_number()[x == 3]),    # keep the row of the first 3
                        0)) %>%                       # otherwise put a 0
  filter(num_threes == 0 | row_number() <= flag) %>%  # keep ids with no 3s or up to first 3
  ungroup() %>%
  select(-num_threes, -flag)                          # remove helper columns

# # A tibble: 13 x 2
#      id     x
#   <dbl> <dbl>
# 1     1     2
# 2     1     2
# 3     1     1
# 4     1     1
# 5     2     2
# 6     2     3
# 7     3     1
# 8     3     2
# 9     3     2
# 10    3     3
# 11    4     2
# 12    4     2
# 13    4     3
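
As a more compact variant of the same idea, the helper columns can be replaced by a running count of 3s within each id. This is only a sketch, not part of the original answer: it assumes the intended rule is "keep every row up to and including the first 3 of each id", and the name `result` is chosen here for illustration.

```r
library(dplyr)

df <- data.frame("id" = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                 "x"  = c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))

result <- df %>%
  group_by(id) %>%
  # cumsum(x == 3) counts the 3s seen so far within the id;
  # lag() shifts it by one row, giving the number of 3s strictly before each row.
  # A row is kept only if no 3 has occurred before it.
  filter(lag(cumsum(x == 3), default = 0L) == 0) %>%
  ungroup()

print(result)
```

On the sample data this keeps the same 13 rows as the output above.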

Answer 2

This worked for me:

Data

df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                "x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))

Code

library(dplyr)
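df <- mutate(df, before = lag(x))   # previous row's x (NA for the first row)

df$condition1 <- 1                  # 1 = keep the row

# drop a 3 that directly follows another 3
# (note: lag() is not grouped by id here, which is fine for this data)
df$condition1[df$x == 3 & df$before == 3] <- 0
final_df <- df[df$condition1 == 1, 1:2]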

Result

Answer 3

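One idea is to pick out the rows with x==3 and call unique() on them. Then append those unique rows (a single 3 per id) to the rest of the data frame, and finally sort the rows back into their original order.

Here is a base R solution implementing this idea: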

res <- (r <- with(df,rbind(df[x!=3,],unique(df[x==3,]))))[order(as.numeric(rownames(r))),]
rownames(res) <- seq(nrow(res))

Result

Data

df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
               "x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
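
The one-liner in this answer can also be unpacked into explicit steps. The following is a hedged base R sketch rather than the answer's own method: it uses `ave()` to count, per id, how many 3s occur before each row, and it assumes the goal is to drop every row that comes after the first 3 of its id. The names `threes_before` and `res2` are made up for this example. Unlike the `unique()` approach, it would also drop non-3 values that follow the first 3, which makes no difference for this data.

```r
df <- data.frame("id" = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                 "x"  = c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))

# 1 where x is 3, 0 otherwise
is_three <- as.numeric(df$x == 3)

# Within each id: running count of 3s, minus the current row's own
# contribution, i.e. the number of 3s strictly before the row.
threes_before <- ave(is_three, df$id, FUN = function(v) cumsum(v) - v)

# Keep rows that have no earlier 3 within their id, then renumber the rows.
res2 <- df[threes_before == 0, ]
rownames(res2) <- NULL

print(res2)
```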

Subsetting a data frame based on multiple conditions for removing rows

3 Answers: