I have a long vector of class factor that contains NA values.
# simple example
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
For modeling purposes, I want to replace these NA values with a new factor level (e.g., "Unknown") and set that level as the reference level.
Since the replacement value is not an existing level of the factor, simple replacement doesn't work:
# this won't work, since the replacement value is not an existing level of the factor
x[is.na(x)] <- '?'
x # returns: [1] <NA> A B C <NA> -- the NAs remain
# this doesn't work either:
replace(x, NA,'?')
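replace() won't help, because it performs the same factor subassignment internally (x[list] <- values); even with the correct index argument, a value outside the existing levels is converted to NA with a warning:
replace(x, is.na(x), '?')
# Warning message: invalid factor level, NA generated
# [1] <NA> A    B    C    <NA>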
I came up with a couple of solutions, but both are somewhat ugly and surprisingly slow.
f1 <- function(x, uRep='?'){
  # convert to character, replace NAs with Unknown, and convert back to factor
  stopifnot(is.factor(x))
  newLevels <- c(uRep, levels(x))
  x <- as.character(x)
  x[is.na(x)] <- uRep
  factor(x, levels=newLevels)
}
f2 <- function(x, uRep='?'){
  # add new level for Unknown, replace NAs with Unknown, and make Unknown first level
  stopifnot(is.factor(x))
  levels(x) <- c(levels(x), uRep)
  x[is.na(x)] <- uRep
  relevel(x, ref=uRep)
}
f3 <- function(x, uRep='?'){ # thanks to @HongOoi
  # addNA() appends NA as an explicit last level; rename it, then make it the reference
  y <- addNA(x)
  levels(y)[length(levels(y))] <- uRep
  relevel(y, ref=uRep)
}
#test
f1(x) # works
f2(x) # works
f3(x) # works
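For comparison, here is a minimal sketch (not part of the original post) of a fourth approach that edits the underlying integer codes directly, skipping both the character round-trip and relevel(); it assumes x is a plain factor with no existing uRep level:
f4 <- function(x, uRep='?'){
  stopifnot(is.factor(x))
  codes <- as.integer(x) + 1L   # shift existing codes up by one, freeing code 1
  codes[is.na(codes)] <- 1L     # NAs become the new first level
  structure(codes, levels = c(uRep, levels(x)), class = "factor")
}
f4(x) # works: uRep becomes the first (reference) level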
Solution #2 edits only the (relatively small) set of levels, plus one extra operation to relevel. I expected it to be faster than #1, which converts to character and back to factor.
However, #2 is twice as slow on a benchmark vector of 10K elements with 10% NAs.
x <- sample(factor(c(LETTERS[1:10],NA),levels=LETTERS[1:10]),10000,replace=TRUE)
library(microbenchmark)
microbenchmark(f1(x),f2(x),f3(x),times=500L)
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(x) 271.981 278.1825 322.4701 313.0360 360.7175 609.393 500
# f2(x) 651.728 703.2595 768.6756 747.9480 825.7800 1517.707 500
# f3(x) 808.246 883.2980 966.2374 927.5585 1061.1975 1779.424 500
Solution #3, my wrapper around the built-in addNA (mentioned in the answer below), is slower than either of them. addNA does some extra checks for NA values, sets the new level as the last one (requiring me to relevel it), and names it NA (which then has to be renamed by index before releveling, since an NA level is hard to get at: relevel(addNA(x), ref=NA_character_) doesn't work).
Is there a more efficient way to write this, or am I just out of luck?
Answer 0 (score: 2)
If you need a ready-made solution, you can use fct_explicit_na followed by fct_relevel from the forcats package. It is slower than your f1 function, but it still runs in a fraction of a second on a vector of length 100,000:
library(forcats)
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
x
[1] <NA> A    B    C    <NA>
Levels: A B C
x = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown")
x
[1] Unknown A       B       C       Unknown
Levels: Unknown A B C
x <- sample(factor(c(LETTERS[1:10],NA), levels=LETTERS[1:10]), 1e5, replace=TRUE)
microbenchmark(forcats = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown"),
               f1 = f1(x), unit="ms", times=100L)
Timings on a vector of length 100,000:
Unit: milliseconds
expr min lq mean median uq max neval cld
forcats 7.624158 10.634761 15.303339 12.162105 15.513846 250.0516 100 b
f1 3.568801 4.226087 8.085532 5.321338 5.995522 235.2449 100 a
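(A note beyond the original answer: recent forcats releases deprecate fct_explicit_na in favor of fct_na_value_to_level, so on a current install the equivalent call would be, assuming forcats >= 1.0.0:)
# assuming forcats >= 1.0.0, where fct_explicit_na is deprecated
x <- fct_relevel(fct_na_value_to_level(x, level = "Unknown"), "Unknown")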
Answer 1 (score: 1)
There is a built-in function, addNA.
From ?factor:
addNA(x, ifany = FALSE)
addNA modifies a factor by turning NA into an extra level (so that NA values are counted in tables, for instance).
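A minimal illustration of that behavior, assuming the small example x from the top of the question:
y <- addNA(x)
levels(y)  # "A" "B" "C" NA -- NA is now an explicit (last) level
table(y)   # the NA values now get their own count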