data.table

Time: 2019-07-11 19:07:24

Tags: r dplyr data.table

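I have been using the tidyverse heavily, but for some projects I need the speed of data.table. So far I understand most of the DT syntax, but I would like to drop unused factor levels in a data.table without using mutate_if.

With dplyr, I can use mutate_if(dataframe, is.factor, droplevels) and that's it. However, I can't find a way to do this with data.table.

I tried dataframe[, (.SD) := droplevels(.SD), .SDcols = sapply(dataframe, is.factor)], adapting this answer.

It throws the following error: Error in `[.data.table`(DT_, , `:=`((.SD), droplevels(.SD)), .SDcols = sapply(DT_, : LHS of := isn't column names ('character') or positions ('integer' or 'numeric')

I would like to get the same result as mutate_if without using the tidyverse.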

Update

I accepted G. Grothendieck's answer because its code is closest to what I expected.

The example he used is this:

library(data.table)
DT <- data.table(a = 1:5, 
                 b = factor(1:5, levels = 1:10), 
                 c = factor(6:10, levels = 1:10))

The data I used in this example is as follows:

set.seed(42)
DT1 = data.table(
  A = LETTERS[1:10],
  B = c(1:10),
  C = factor(sample(LETTERS, 10), levels = LETTERS),
  D = factor(sample(LETTERS, 10), levels = LETTERS)
)

The columns of interest are:

> DT1[, C]
 [1] Q E A J D R Z O G V
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
> DT1[, D]
 [1] Y E N T R O C I D Z
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

The result is:

# with base
DT1 = droplevels(DT1)

# or by reference
DT1[, (names(DT1)) := droplevels(.SD)]

With the following output:

> DT1[, C]
 [1] Q E A J D R Z O G V
Levels: A D E G J O Q R V Z
> DT1[, D]
 [1] Y E N T R O C I D Z
Levels: C D E I N O R T Y Z
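
The two variants above differ in whether a new object is created. As a minimal sketch (using a shortened version of the data above, and data.table's address() helper to compare memory locations), the by-reference form should keep the same object, while reassigning the result of droplevels() should yield a different one:

```r
library(data.table)

set.seed(42)
DT1 <- data.table(
  A = LETTERS[1:10],
  C = factor(sample(LETTERS, 10), levels = LETTERS)
)

addr_before <- address(DT1)

# Update by reference: the existing object is modified in place.
DT1[, (names(DT1)) := droplevels(.SD)]
addr_after <- address(DT1)

# Base-style call: droplevels() returns a new object.
DT_new <- droplevels(DT1)

identical(addr_before, addr_after)       # TRUE: still the same object
identical(addr_before, address(DT_new))  # FALSE: a separate object
nlevels(DT1$C)                           # 10: one level per sampled letter
```

The level count is 10 regardless of the R version, because sample() draws 10 distinct letters; the specific letters drawn may differ between R versions.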

Thanks, everyone, for the quick answers!

5 Answers:

Answer 0 (score: 5)

Using the data in the Note at the end:

# by reference
DT[, (names(DT)) := droplevels(.SD)]

# or, assigning the result to a new object
DT <- droplevels(DT)

Check:

levels(DT$b)
## [1] "1" "2" "3" "4" "5"

levels(DT$c)
## [1] "6"  "7"  "8"  "9"  "10"

If droplevels in the question is only an example, and the real function you are using has no data.frame method, then use code like this instead:

wx <- which(sapply(DT, is.factor))
DT[, (wx) := lapply(.SD, droplevels), .SDcols = wx]
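
The attempt in the question failed because the left-hand side of := must be column names or positions, whereas (.SD) evaluates to a data.table. The pattern above can be wrapped into a small mutate_if-style function. This is only a sketch: mutate_if_dt is a hypothetical name, not part of data.table, and it assumes the predicate returns a single TRUE or FALSE per column.

```r
library(data.table)

# Hypothetical helper (not part of data.table): apply `fun` to every column
# for which `predicate` returns TRUE, modifying `DT` by reference.
mutate_if_dt <- function(DT, predicate, fun) {
  cols <- names(DT)[vapply(DT, predicate, logical(1))]
  if (length(cols) > 0L) {
    DT[, (cols) := lapply(.SD, fun), .SDcols = cols]
  }
  invisible(DT)
}

DT <- data.table(a = 1:5,
                 b = factor(1:5, levels = 1:10),
                 c = factor(6:10, levels = 1:10))

mutate_if_dt(DT, is.factor, droplevels)

levels(DT$b)  # "1" "2" "3" "4" "5"
levels(DT$c)  # "6" "7" "8" "9" "10"
```

Because := works by reference, the caller's DT is updated without reassigning the return value.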

Note

library(data.table)
DT <- data.table(a = 1:5, 
                 b = factor(1:5, levels = 1:10), 
                 c = factor(6:10, levels = 1:10))

Update

Simplified.

Answer 1 (score: 4)

This is not a data.table solution, but it can be done with base R's rapply:

## data
data("iris")
## add dummy level
levels(iris$Species) <- c(levels(iris$Species), "dummy")
str(iris)
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 4 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

iris2 <- rapply(iris, f = droplevels, classes = "factor", how = "replace")
str(iris2)
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Answer 2 (score: 3)

Another option using set()

Input data

library(data.table)
DT <- as.data.table(iris)
DT[, Species := as.factor(Species)]
DT <- DT[Species == "setosa"]

DT[, levels(Species)]
#[1] "setosa"     "versicolor" "virginica"

Get the names of the factor columns and replace them by reference

cols <- DT[, names(Filter(is.factor, .SD))]
for(j in cols) {
  set(DT, j = j, value = droplevels(DT[[j]]))
}
# could also be written as a one-liner - thanks to @MattSummersgill
# for(j in cols) set(DT, j = j, value = droplevels(DT[[j]]))

which gives

DT[, levels(Species)]
#[1] "setosa"
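
"By reference" here means that any other name bound to the same data.table sees the change as well. The following sketch illustrates this with an alias and an explicit copy(); the variable names alias and backup are only illustrative:

```r
library(data.table)

DT <- as.data.table(iris)[Species == "setosa"]

alias  <- DT        # another name for the same object, no copy is made
backup <- copy(DT)  # an independent deep copy

# Loop over the positions of the factor columns and replace each in place.
for (j in which(vapply(DT, is.factor, logical(1)))) {
  set(DT, j = j, value = droplevels(DT[[j]]))
}

nlevels(DT$Species)      # 1
nlevels(alias$Species)   # 1, the alias sees the in-place change
nlevels(backup$Species)  # 3, the copy keeps the original levels
```

If the original levels are still needed afterwards, take a copy() before running the loop.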

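Answer 3 (score: 2)

To add to my comment: you could try table.express, although its examples should be updated, since they can be simplified. Here is an example equivalent to mutate_if: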

library(data.table)
library(table.express)

data("iris")

DT <- as.data.table(iris)

DT %>%
  start_expr %>%
  mutate(Species = as.factor(Species)) %>%
  mutate_sd(is.factor(.COL), droplevels) %>%
  end_expr

But do check the whole vignette: some verbs are eager and some are lazy.

Answer 4 (score: 1)

How about this?
