Efficiently introducing new levels in a factor vector

Time: 2017-05-26 20:39:28

Tags: r performance vector na categorical-data

I have a long vector of class factor that contains NA values.

# simple example
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))

For modeling purposes, I want to replace these NA values with a new factor level (e.g., 'Unknown') and set this level as the reference level.

Since the replacement level is not an existing level of the factor, simple replacement doesn't work:

# this won't work, since the replacement value is not an existing level of the factor
x[is.na(x)] <- '?'
x # returns: [1] <NA> A    B    C    <NA> -- the NAs remain
# this doesn't work either:
replace(x, NA,'?')

I came up with a couple of solutions, but both are somewhat ugly and surprisingly slow.

f1 <- function(x, uRep='?'){
  # convert to character, replace NAs with Unknown, and convert back to factor
  stopifnot(is.factor(x))
  newLevels <- c(uRep,levels(x))
  x <- as.character(x)
  x[is.na(x)] <- uRep
  factor(x, levels=newLevels)
}

f2 <- function(x, uRep='?'){
  # add new level for Unknown, replace NAs with Unknown, and make Unknown first level
  stopifnot(is.factor(x))
  levels(x) <- c(levels(x),uRep)
  x[is.na(x)] <- uRep
  relevel(x, ref=uRep)
}

f3 <- function(x, uRep='?'){ # thanks to @HongOoi
  y <- addNA(x)
  levels(y)[length(levels(y))] <- uRep
  relevel(y, ref=uRep)
}

#test
f1(x) # works
f2(x) # works
f3(x) # works

Solution #2 only edits the (relatively small) set of levels, plus one arithmetic operation to relevel. I expected it to be faster than #1, which converts to character and back to factor.

However, #2 is twice as slow on a benchmark vector of 10K elements with 10% NAs.

x <- sample(factor(c(LETTERS[1:10],NA),levels=LETTERS[1:10]),10000,replace=TRUE)
library(microbenchmark)
microbenchmark(f1(x),f2(x),f3(x),times=500L) 
# Unit: microseconds
# expr     min       lq     mean   median        uq      max neval
# f1(x) 271.981 278.1825 322.4701 313.0360  360.7175  609.393   500
# f2(x) 651.728 703.2595 768.6756 747.9480  825.7800 1517.707   500
# f3(x) 808.246 883.2980 966.2374 927.5585 1061.1975 1779.424   500

Solution #3, my wrapper around the built-in addNA (mentioned in the answer below), is slower than either. addNA does some extra checking for NA values, puts the new level last (requiring me to relevel), and names it NA (which then requires renaming by index before releveling, since the NA level is hard to access -- relevel(addNA(x), ref=NA_character_) doesn't work).

Is there a more efficient way to write this, or am I just out of luck?
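As an aside, one lower-level sketch (my own assumption, not from the original post): since a factor is stored internally as an integer vector of level codes, the NA slots can be filled directly with a new first code, avoiding the character round-trip entirely. The helper name f4 is hypothetical:

```r
# Hypothetical helper (an assumption, not from the post): manipulate the
# integer codes underlying the factor directly.
f4 <- function(x, uRep = '?') {
  stopifnot(is.factor(x))
  codes <- as.integer(x) + 1L   # shift existing codes to make room for uRep at position 1
  codes[is.na(codes)] <- 1L     # NA slots get the code of the new reference level
  structure(codes, levels = c(uRep, levels(x)), class = "factor")
}

x <- factor(c(NA, 'A', 'B', 'C', NA), levels = c('A', 'B', 'C'))
f4(x)  # [1] ? A B C ?  with levels ? A B C
```

Because the new level is prepended, it is already the reference level and no relevel call is needed.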

2 answers:

Answer 0: (score: 2)

If you want a pre-made solution, you can use fct_explicit_na followed by fct_relevel from the forcats package. It's slower than the f1 function, but it still runs in a fraction of a second on a vector of length 100,000:

library(forcats)
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))

x
[1] <NA> A    B    C    <NA>
Levels: A B C

x <- fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown")

x
[1] Unknown A       B       C       Unknown
Levels: Unknown A B C
x <- sample(factor(c(LETTERS[1:10],NA), levels=LETTERS[1:10]), 1e5, replace=TRUE)

microbenchmark(forcats = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown"),
               f1 = f1(x), 
               unit="ms", times=100L)

Timings on a vector of length 100,000:

Unit: milliseconds
    expr      min        lq      mean    median        uq      max neval cld
 forcats 7.624158 10.634761 15.303339 12.162105 15.513846 250.0516   100   b
      f1 3.568801  4.226087  8.085532  5.321338  5.995522 235.2449   100   a

Answer 1: (score: 1)

There is a built-in function, addNA.

From ?factor:

addNA(x, ifany = FALSE)
addNA modifies a factor by turning NA into an extra level (so that NA values are counted in tables, for instance).
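A minimal illustration of addNA's documented behavior, using the question's example vector:

```r
# addNA turns NA into an explicit factor level, appended last.
x <- factor(c(NA, 'A', 'B', 'C', NA), levels = c('A', 'B', 'C'))
y <- addNA(x)
levels(y)  # "A" "B" "C" NA -- the new level is literally NA and sits last
table(y)   # NA values are now counted: A=1 B=1 C=1 NA=2
```

Note that the new level sits last and is named NA itself, which is why the question's f3 wrapper renames it by index before releveling.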