Question

I'm trying to filter out multiple rows of unwanted data in R, but I don't know how to do it.

The data I'm using looks a bit like this:

  Category     Item Shop1 Shop2 Shop3
1    Fruit   Apples     4     6     0
2    Fruit  Oranges     0     2     7
3      Veg Potatoes     0     0     0
4      Veg   Onions     0     0     0
5      Veg  Carrots     0     0     0
6    Dairy  Yoghurt     0     0     0
7    Dairy     Milk     0     1     0
8    Dairy   Cheese     0     0     0

I only want to keep the categories in which at least one item has a positive value in at least one shop.

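In this case, I want to get rid of all the Veg rows, because no shop sells any vegetables. I want to keep all the Fruit rows, and I want to keep all the Dairy rows, even those with a value of zero in every shop, because one of the Dairy rows has a value greater than 0.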

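I tried using colSums after group_by(Category), hoping it would sum the contents of one category at a time, but it did not work. I also tried adding a rowSums column at the end and filtering based on the frequency, but that way I can only filter out individual rows, not rows based on the whole category.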

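I can filter out individual rows whose values are all zero (row 3, for example). My difficulty is keeping rows such as 6 and 8, where every shop has a value of zero, because other Dairy rows have values greater than zero.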

Answer 1

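1) subset / ave. rowSums(...) > 0 has one element per row, and that element is TRUE if the row contains a nonzero value. This assumes that negative values are not possible. (If negative values can occur, use rowSums(DF[-1:-2]^2) > 0 instead.) It also assumes that the shops are all the columns after the first two, so it works for any number of shops. ave then produces TRUE for every row of a group in which any of those values is TRUE, and subset keeps only those rows. No packages are used.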

subset(DF, ave(rowSums(DF[-1:-2]) > 0, Category, FUN = any))

giving:

  Category    Item Shop1 Shop2 Shop3
1    Fruit  Apples     4     6     0
2    Fruit Oranges     0     2     7
6    Dairy Yoghurt     0     0     0
7    Dairy    Milk     0     1     0
8    Dairy  Cheese     0     0     0
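To see how (1) works, the intermediate values can be inspected one step at a time. This is only a sketch: the names shop_cols, row_has_sale, group_has_sale and neg are illustrative and not part of the answer, and DF is rebuilt inline so that the block runs on its own.

```r
# Rebuild the example data (same values as in the question)
DF <- data.frame(
  Category = c("Fruit", "Fruit", "Veg", "Veg", "Veg", "Dairy", "Dairy", "Dairy"),
  Item = c("Apples", "Oranges", "Potatoes", "Onions", "Carrots",
           "Yoghurt", "Milk", "Cheese"),
  Shop1 = c(4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Shop2 = c(6L, 2L, 0L, 0L, 0L, 0L, 1L, 0L),
  Shop3 = c(0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

shop_cols <- DF[-1:-2]                    # every column after the first two
row_has_sale <- rowSums(shop_cols) > 0    # one logical value per row

# ave() applies any() within each Category and copies the
# single result back to every row of that group
group_has_sale <- ave(row_has_sale, DF$Category, FUN = any)

result <- DF[group_has_sale, ]            # same rows that subset() keeps

# Why squaring helps if negatives can occur (hypothetical values):
neg <- data.frame(Shop1 = 3, Shop2 = -3)
rowSums(neg) > 0      # FALSE, because the values cancel
rowSums(neg^2) > 0    # TRUE, because squares cannot cancel
```

Rows 6 and 8 survive because group_has_sale is TRUE for every Dairy row, even though row_has_sale is FALSE for those two rows.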

1a) If you don't mind hard-coding the shops, a variation of this is:

subset(DF, ave(Shop1 + Shop2 + Shop3 > 0, Category, FUN = any))

2）dplyr

library(dplyr)
DF %>% group_by(Category) %>% filter(any(Shop1, Shop2, Shop3)) %>% ungroup

giving:

# A tibble: 5 x 5
  Category    Item Shop1 Shop2 Shop3
    <fctr>  <fctr> <int> <int> <int>
1    Fruit  Apples     4     6     0
2    Fruit Oranges     0     2     7
3    Dairy Yoghurt     0     0     0
4    Dairy    Milk     0     1     0
5    Dairy  Cheese     0     0     0
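Calling any() directly on integer columns relies on R treating nonzero values as TRUE, and base R emits a coercion warning when it does so. A possible variation, assuming sales are never negative, spells out the comparison instead. This is a sketch rather than the answer's own code; the name kept is illustrative and DF is rebuilt inline.

```r
library(dplyr)

# Rebuild the example data (same values as in the question)
DF <- data.frame(
  Category = c("Fruit", "Fruit", "Veg", "Veg", "Veg", "Dairy", "Dairy", "Dairy"),
  Item = c("Apples", "Oranges", "Potatoes", "Onions", "Carrots",
           "Yoghurt", "Milk", "Cheese"),
  Shop1 = c(4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Shop2 = c(6L, 2L, 0L, 0L, 0L, 0L, 1L, 0L),
  Shop3 = c(0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

kept <- DF %>%
  group_by(Category) %>%
  # TRUE for the whole group if any row has a positive value in any shop
  filter(any(Shop1 > 0 | Shop2 > 0 | Shop3 > 0)) %>%
  ungroup()
```

Because any() returns a single value per group, filter() keeps or drops each category as a whole.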

3) Filter / split. Another base solution is:

do.call("rbind", Filter(function(x) any(x[-1:-2]), split(DF, DF$Category)))

giving:

        Category    Item Shop1 Shop2 Shop3
Dairy.6    Dairy Yoghurt     0     0     0
Dairy.7    Dairy    Milk     0     1     0
Dairy.8    Dairy  Cheese     0     0     0
Fruit.1    Fruit  Apples     4     6     0
Fruit.2    Fruit Oranges     0     2     7

4) dplyr / tidyr. Use gather to convert the data to long form, with one row per value, and then filter the groups using any. Finally, convert back to wide form.

library(dplyr)
library(tidyr)
DF %>% 
  gather(shop, value, -(Category:Item)) %>% 
  group_by(Category) %>% 
  filter(any(value)) %>% 
  ungroup %>% 
  spread(shop, value)

giving:

# A tibble: 5 x 5
  Category    Item Shop1 Shop2 Shop3
*   <fctr>  <fctr> <int> <int> <int>
1    Dairy  Cheese     0     0     0
2    Dairy    Milk     0     1     0
3    Dairy Yoghurt     0     0     0
4    Fruit  Apples     4     6     0
5    Fruit Oranges     0     2     7

Note: The input in reproducible form is:

Lines <- "  Category     Item Shop1 Shop2 Shop3
1    Fruit   Apples     4     6     0
2    Fruit  Oranges     0     2     7
3      Veg Potatoes     0     0     0
4      Veg   Onions     0     0     0
5      Veg  Carrots     0     0     0
6    Dairy  Yoghurt     0     0     0
7    Dairy     Milk     0     1     0
8    Dairy   Cheese     0     0     0"

DF <- read.table(text = Lines)
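read.table needs no header argument here because the header line has one fewer field than the data lines, so it treats the first line as column names and the first column as row names. The small sketch below illustrates this with a hypothetical two-row input; txt and small are illustrative names.

```r
# Header line has 3 fields, data lines have 4 (row label + 3 values)
txt <- "  Category   Item Shop1
1    Fruit Apples     4
2      Veg Onions     0"

small <- read.table(text = txt)

names(small)      # "Category" "Item" "Shop1"
rownames(small)   # "1" "2"
```

From R 4.0.0 onward, read.table reads text columns as character by default rather than as factors, so the <fctr> column types in the printed outputs of this answer reflect an older R version.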

Answer 2

Here is a method in base R using rowSums, ave, and [.

dat[ave(rowSums(dat[grep("Shop", names(dat))]), dat$Category, FUN=max) > 0,]

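rowSums computes the total sales in each row across the shop variables (which are selected using grep). The resulting vector is fed to ave, grouped by dat$Category, which returns the maximum sales within each group. Finally, the original data.frame is subset according to whether that maximum is positive.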

This returns

  Category    Item Shop1 Shop2 Shop3
1    Fruit  Apples     4     6     0
2    Fruit Oranges     0     2     7
6    Dairy Yoghurt     0     0     0
7    Dairy    Milk     0     1     0
8    Dairy  Cheese     0     0     0
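The intermediate vectors make the logic easier to follow. The sketch below breaks the one-liner into steps; shop_cols, sales, group_max and res are illustrative names, and dat is rebuilt here with plain character columns rather than factors.

```r
# Rebuild the example data (same values as in the question)
dat <- data.frame(
  Category = c("Fruit", "Fruit", "Veg", "Veg", "Veg", "Dairy", "Dairy", "Dairy"),
  Item = c("Apples", "Oranges", "Potatoes", "Onions", "Carrots",
           "Yoghurt", "Milk", "Cheese"),
  Shop1 = c(4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Shop2 = c(6L, 2L, 0L, 0L, 0L, 0L, 1L, 0L),
  Shop3 = c(0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

shop_cols <- grep("Shop", names(dat))    # column positions: 3 4 5
sales <- rowSums(dat[shop_cols])         # per row:   10  9 0 0 0 0 1 0
# Largest row total within each category, repeated for every row
group_max <- ave(sales, dat$Category, FUN = max)   # 10 10 0 0 0 1 1 1
res <- dat[group_max > 0, ]              # rows 1, 2, 6, 7, 8
```

As with the rowSums approach in Answer 1, this relies on sales being non-negative, since negative and positive values in the same row could cancel.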

Data

dat <- structure(list(Category = structure(c(2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L), .Label = c("Dairy", "Fruit", "Veg"), class = "factor"), Item = structure(c(1L, 6L, 7L, 5L, 2L, 8L, 4L, 3L), .Label = c("Apples", "Carrots", "Cheese", "Milk", "Onions", "Oranges", "Potatoes", "Yoghurt"), class = "factor"), Shop1 = c(4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Shop2 = c(6L, 2L, 0L, 0L, 0L, 0L, 1L, 0L ), Shop3 = c(0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Category", "Item", "Shop1", "Shop2", "Shop3"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))

Filter rows in R based on values in multiple rows
