Question

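I have a very simple problem, but I don't know how to get the desired result.

I have a data.frame with several columns, and I would like to grep for a value in four of them to get a subset of the data.frame.

Here is a dummy example:

>df
V1  V2           V3           V4           V5
 a  abc|ccc|ggg  ttt|ccc|shg  yyy|lmn|trs  abc|ggt|hgy
 b  atc|cjc|ggg  ttt|ccc|shg  abc|lmn|trs  abc|opq|sss
 c  auc|chc|ggg  abc|ccc|shg  gtc|lmn|trs  hyt|lki|ddd
 d  aoc|cfc|ggg  ttt|ccc|shg  yyy|lmn|trs  rmn|wde|tre

I would like to subset the data.frame based on the pattern abc in columns V2, V3, V4, V5.

I know I can do it for one column:

df2 <- df[grep('abc', df$V2),]

But how can I get this result using multiple columns?

>df2
V1  V2           V3           V4           V5
 a  abc|ccc|ggg  ttt|ccc|shg  yyy|lmn|trs  abc|ggt|hgy
 b  atc|cjc|ggg  ttt|ccc|shg  abc|lmn|trs  abc|opq|sss
 c  auc|chc|ggg  abc|ccc|shg  gtc|lmn|trs  hyt|lki|ddd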


I don't want to end up with extra columns as in the question "grep one pattern over multiple columns"; I want to subset the data.frame based on the pattern.

Thanks

Answer 1

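Just use sapply(), which applies grep() to each column. The resulting values then have to be unlisted, de-duplicated, and sorted to give you the row indices.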

df1[sort(unique(unlist(sapply(df1, function(x) grep('abc', x))))), ]

#   V1          V2          V3          V4          V5
# 1  a abc|ccc|ggg ttt|ccc|shg yyy|lmn|trs abc|ggt|hgy
# 2  b atc|cjc|ggg ttt|ccc|shg abc|lmn|trs abc|opq|sss
# 3  c auc|chc|ggg abc|ccc|shg gtc|lmn|trs hyt|lki|ddd
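
The one-liner above can be unpacked step by step. The following is a minimal sketch, not the answerer's exact code: it rebuilds the sample data with read.table() and swaps sapply() for lapply(), an assumption made here so that the intermediate result is always a list, even if every column happened to have the same number of matches (in which case sapply() would simplify to a matrix). The names hits, rows and res are illustrative.

```r
# Sample data, same values as used in the answers
Lines <- "V1 V2 V3 V4 V5
a abc|ccc|ggg ttt|ccc|shg yyy|lmn|trs abc|ggt|hgy
b atc|cjc|ggg ttt|ccc|shg abc|lmn|trs abc|opq|sss
c auc|chc|ggg abc|ccc|shg gtc|lmn|trs hyt|lki|ddd
d aoc|cfc|ggg ttt|ccc|shg yyy|lmn|trs rmn|wde|tre"
df1 <- read.table(text = Lines, header = TRUE, stringsAsFactors = FALSE)

# Step 1: for each column, the indices of the rows whose value matches "abc"
hits <- lapply(df1, function(x) grep("abc", x))

# Step 2: flatten the list, drop duplicate indices, and sort them
rows <- sort(unique(unlist(hits)))

# Step 3: subset the data.frame by those row indices
res <- df1[rows, ]
print(res)
```

Because V1 is searched as well, this only matches the question's intent as long as the pattern cannot occur in V1; restricting the input to df1[-1] would avoid that.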

Data

df1 <- structure(list(V1 = structure(1:4, .Label = c("a", "b", "c", 
"d"), class = "factor"), V2 = structure(c(1L, 3L, 4L, 2L), .Label = c("abc|ccc|ggg", 
"aoc|cfc|ggg", "atc|cjc|ggg", "auc|chc|ggg"), class = "factor"), 
    V3 = structure(c(2L, 2L, 1L, 2L), .Label = c("abc|ccc|shg", 
    "ttt|ccc|shg"), class = "factor"), V4 = structure(c(3L, 1L, 
    2L, 3L), .Label = c("abc|lmn|trs", "gtc|lmn|trs", "yyy|lmn|trs"
    ), class = "factor"), V5 = structure(1:4, .Label = c("abc|ggt|hgy", 
    "abc|opq|sss", "hyt|lki|ddd", "rmn|wde|tre"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

Answer 2

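We can use sapply to loop over the columns. For each column it returns a logical vector indicating whether the pattern "abc" is present in each element. We then keep the rows that contain at least one "abc".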

cols <- c("V2", "V3", "V4", "V5")
df[rowSums(sapply(df[cols], function(x) grepl("abc", x))) > 0, ]

#   V1          V2          V3          V4          V5
#1   a abc|ccc|ggg ttt|ccc|shg yyy|lmn|trs abc|ggt|hgy
#2   b atc|cjc|ggg ttt|ccc|shg abc|lmn|trs abc|opq|sss
#3   c auc|chc|ggg abc|ccc|shg gtc|lmn|trs hyt|lki|ddd
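
To see why rowSums(...) > 0 picks the right rows, it can help to look at the intermediate logical matrix. This is a minimal sketch, assuming the four-row sample data used throughout this thread is stored in df; the names m and keep are illustrative.

```r
# Sample data, same values as used in the answers
Lines <- "V1 V2 V3 V4 V5
a abc|ccc|ggg ttt|ccc|shg yyy|lmn|trs abc|ggt|hgy
b atc|cjc|ggg ttt|ccc|shg abc|lmn|trs abc|opq|sss
c auc|chc|ggg abc|ccc|shg gtc|lmn|trs hyt|lki|ddd
d aoc|cfc|ggg ttt|ccc|shg yyy|lmn|trs rmn|wde|tre"
df <- read.table(text = Lines, header = TRUE, stringsAsFactors = FALSE)

cols <- c("V2", "V3", "V4", "V5")

# One matrix column per searched column, one matrix row per row of df;
# a cell is TRUE where that value contains "abc"
m <- sapply(df[cols], function(x) grepl("abc", x))
print(m)

# A row is kept when at least one of its cells is TRUE
keep <- rowSums(m) > 0
print(df[keep, ])
```

Note that sapply() only returns a matrix here because df has more than one row; for a single-row data.frame it would return a plain vector and rowSums() would fail.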

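I'm not really a data.table expert, but following the same logic we can do:

library(data.table)
dt <- as.data.table(df)  # convert the data.frame used above
dt[rowSums(dt[, lapply(.SD, function(x) grepl("abc", x))]) > 0, ]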


#   V1          V2          V3          V4          V5
#1:  a abc|ccc|ggg ttt|ccc|shg yyy|lmn|trs abc|ggt|hgy
#2:  b atc|cjc|ggg ttt|ccc|shg abc|lmn|trs abc|opq|sss
#3:  c auc|chc|ggg abc|ccc|shg gtc|lmn|trs hyt|lki|ddd

Answer 3

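Here are some approaches.

In the first one, sapply applies grepl with the indicated pattern to each column of df1 except V1, returning a logical matrix with one row per row of df1. rowSums then finds which rows contain a TRUE. Finally, we subset by that.

In the second one, we paste the specified columns of df1 together row by row, then run grepl on the result and subset.

The third is the same as the second, but uses data.table.

The fourth uses Reduce to work column by column.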

# 1
df1[ rowSums(sapply(df1[-1], grepl, pattern = "abc")) > 0, ]

# 2
df1[grepl("abc", do.call("paste", c(df1[-1]))), ]

# 3
library(data.table)
dt1 <- as.data.table(df1)
dt1[grepl("abc", do.call("paste", dt1[, -1]))]

# 4
df1[Reduce(function(x, y) x | grepl("abc", y), init = FALSE, df1), ]
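
As a quick sanity check, the base R variants (1, 2 and 4) can be compared on the sample data. This is a sketch that assumes the four-row sample data used throughout this thread; r1, r2 and r4 are illustrative names.

```r
# Sample data, same values as used in the answers
Lines <- "V1 V2 V3 V4 V5
a abc|ccc|ggg ttt|ccc|shg yyy|lmn|trs abc|ggt|hgy
b atc|cjc|ggg ttt|ccc|shg abc|lmn|trs abc|opq|sss
c auc|chc|ggg abc|ccc|shg gtc|lmn|trs hyt|lki|ddd
d aoc|cfc|ggg ttt|ccc|shg yyy|lmn|trs rmn|wde|tre"
df1 <- read.table(text = Lines, header = TRUE, stringsAsFactors = FALSE)

# Variant 1: logical matrix, then rows with at least one TRUE
r1 <- df1[rowSums(sapply(df1[-1], grepl, pattern = "abc")) > 0, ]

# Variant 2: paste the columns into one string per row, then grepl once
r2 <- df1[grepl("abc", do.call("paste", c(df1[-1]))), ]

# Variant 4: accumulate a logical OR across the columns
r4 <- df1[Reduce(function(x, y) x | grepl("abc", y), init = FALSE, df1), ]

# All three should select the same rows
print(identical(r1$V1, r2$V1))
print(identical(r1$V1, r4$V1))
```

One caveat about variant 2: the columns are pasted with a space between them, so a pattern that itself contains a space could match across a column boundary. Variant 4 also searches V1, which is harmless here only because V1 never contains the pattern.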

Note

The input in reproducible form is:

Lines <- "V1  V2           V3           V4           V5
 a  abc|ccc|ggg  ttt|ccc|shg  yyy|lmn|trs  abc|ggt|hgy
 b  atc|cjc|ggg  ttt|ccc|shg  abc|lmn|trs  abc|opq|sss
 c  auc|chc|ggg  abc|ccc|shg  gtc|lmn|trs  hyt|lki|ddd
 d  aoc|cfc|ggg  ttt|ccc|shg  yyy|lmn|trs  rmn|wde|tre"
df1 <- read.table(text = Lines, header = TRUE, as.is = TRUE)

Answer 4

You can try this with dplyr:

library(dplyr)
df1 %>% filter_at(vars(V2:V5), any_vars(grepl("abc", .)))

If you want something faster than grepl(), use stringi::stri_detect_fixed(), which searches for a fixed string instead of a regular expression:

big_df1 <- bind_rows(replicate(10e5, df1, simplify = FALSE))

mbm <- microbenchmark::microbenchmark(
  grepl = big_df1 %>% 
    filter_at(
      vars(V2:V5), 
      any_vars(grepl("abc", .))),
  stringi = big_df1 %>% 
    filter_at(
      vars(V2:V5), 
      any_vars(stringi::stri_detect_fixed(., "abc"))),
  times = 5L
)

which gives:

#Unit: milliseconds
#    expr       min        lq      mean    median        uq      max neval
#   grepl 2603.2713 2613.4157 2665.3730 2646.4757 2709.6653 2754.037     5
# stringi  823.3735  832.9813  888.5228  901.2059  911.8805  973.173     5
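
If adding a dependency is not desirable, base R's grepl() also accepts fixed = TRUE, which treats the pattern as a literal string rather than a regular expression. The sketch below only illustrates the difference in matching behavior on the sample data; it makes no claim about relative speed, which would have to be measured. has_fixed is a hypothetical helper name, not part of any package.

```r
# Sample data, same values as used in the answers
Lines <- "V1 V2 V3 V4 V5
a abc|ccc|ggg ttt|ccc|shg yyy|lmn|trs abc|ggt|hgy
b atc|cjc|ggg ttt|ccc|shg abc|lmn|trs abc|opq|sss
c auc|chc|ggg abc|ccc|shg gtc|lmn|trs hyt|lki|ddd
d aoc|cfc|ggg ttt|ccc|shg yyy|lmn|trs rmn|wde|tre"
df1 <- read.table(text = Lines, header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for each row where any of the given columns
# contains `pattern` as a literal (non-regex) substring
has_fixed <- function(data, cols, pattern) {
  rowSums(sapply(data[cols], grepl, pattern = pattern, fixed = TRUE)) > 0
}

cols <- c("V2", "V3", "V4", "V5")
res_fixed <- df1[has_fixed(df1, cols, "abc"), ]
print(res_fixed)

# With fixed = TRUE, "|" is an ordinary character, so "abc|ccc" only matches
# values containing that exact text
lit <- has_fixed(df1, cols, "abc|ccc")

# As a regular expression, "abc|ccc" means "abc" OR "ccc"
reg <- rowSums(sapply(df1[cols], grepl, pattern = "abc|ccc")) > 0

print(lit)
print(reg)
```

For a plain pattern such as "abc" both modes select the same rows; the difference only shows up when the pattern contains regex metacharacters, as the data here does with "|".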

grep across different columns in R
