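Extracting data from Elasticsearch into R with the elastic package and loading it into a data frame fails because the hits do not expand to the same length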

Time: 2016-11-21 12:57:43

Tags: r elasticsearch

I am extracting data from Elasticsearch as follows:

> packageVersion("elastic")
[1] '0.7.8'
# data extract  
body <- list(query=list(range=list(timestamp=list(gte="2016-10-13",  lte="2016-10-15"))))  
b3 <- Search(index="myIndex",  
        sort=c("timestamp:desc"),   
        fields=c('timestamp','A','B','C','D','E','F','G'),   
        body=body,  
        size=3)  

The first and second elements are extracted fine (edited to save space):
$hits$hits[[1]]$fields: F, E, B, G, C, A, D, timestamp
$hits$hits[[2]]$fields: F, E, B, G, C, A, D, timestamp

The third element is not fully extracted:
$hits$hits[[3]]$fields: C, A, B, D, timestamp

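I followed this post to convert the list into a data frame:
Convert in R output of package Elastic (nested list?) to data.frame or JSON
The first and second elements load perfectly. The third element loads incorrectly because it was not extracted with all of its fields, which leads to the following error: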

# (optional) verify that all hits expand to the same length
# (should be true for data intended to be in a table format)
stopifnot(
  sapply(
    b3$hits$hits,
    function(x) {!(length(unlist(x)) - length(unlist(b3$hits$hits[[1]])))}
  )
)
Error: sapply(b3$hits$hits, function(x) { .... are not all TRUE

# load into the dataframe
# count number of columns, use unlist() to convert 
# nested lists to a vector, use the first hit as proxy
nColumns <- length(unlist(b3$hits$hits[[1]]))

# fetch column names ... as above
nNames <- names(unlist(b3$hits$hits[[1]]))

# unlist all hits and convert to matrix with ncol Columns, don't forget  byrow=TRUE!
df.b3 <- data.frame(matrix(unlist(b3$hits$hits), ncol=nColumns, byrow=TRUE))

Warning message:
In matrix(unlist(b3$hits$hits), ncol = nColumns, byrow = TRUE) :
data length [33] is not a sub-multiple or multiple of the number of columns  [12]
>

Note: some records in variables D, E, F and G contain empty (NULL) and '-' values. I suspect this may be what causes the extraction problem.
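That suspicion can be checked offline, without a cluster. The sketch below is only an illustration: `make_hit` and `mock_hits` are hypothetical stand-ins for `b3$hits$hits` with made-up values, and it assumes each hit carries `_index`, `_type`, `_id`, a NULL `_score` (because of the sort), the requested `fields`, and one `sort` value. Under that assumed layout, hits with missing fields unlist to fewer elements, and the counts line up with the 33 and 12 in the warning above.

```r
# Hypothetical stand-in for one element of b3$hits$hits (values are made up).
make_hit <- function(id, fields) {
  list(
    `_index` = "myIndex",
    `_type`  = "doc",
    `_id`    = id,
    `_score` = NULL,                 # assumed NULL because the query sorts
    fields   = lapply(fields, list), # assumed: each field arrives as a 1-element list
    sort     = list(1476489600000)
  )
}

full <- list(F = "f", E = "e", B = "b", G = "g",
             C = "c", A = "a", D = "d", timestamp = "2016-10-15")
partial <- full[c("C", "A", "B", "D", "timestamp")]  # E, F, G absent

mock_hits <- list(make_hit("1", full), make_hit("2", full), make_hit("3", partial))

# unlist() drops NULL entries, so each hit's length depends on which fields it has.
hit_lengths <- vapply(mock_hits, function(x) length(unlist(x)), integer(1))
hit_lengths        # one length per hit
sum(hit_lengths)   # total number of values that matrix() would receive

# Fields present in the first hit but absent from the third.
missing_in_third <- setdiff(names(mock_hits[[1]]$fields),
                            names(mock_hits[[3]]$fields))
missing_in_third
```

Because the third hit is shorter, filling a matrix by row with `ncol` taken from the first hit shifts every later value into the wrong column, so the hits need to be aligned by field name rather than by position.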

If any of you have run into a similar problem and found a solution, I would appreciate some feedback. Many thanks.

1 answer:

Answer 0: (score: 1)

Author of elastic here.

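We don't try to coerce the output to a data.frame, because the output can vary so much that we would run into errors often. We do, however, let you pass an option through to jsonlite to coerce to a data.frame (via the asdf parameter, short for "as data.frame"), since that should never fail.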

If you are dealing with list output, I would use either dplyr or data.table.

Reproducible example:

library(elastic)
if (!index_exists("shakespeare")) {
  shakespeare <- system.file("examples", "shakespeare_data.json", package = "elastic")
  docs_bulk(shakespeare)
}
res <- Search(index="shakespeare", fields=c('play_name','speaker'))
out <- lapply(res$hits$hits, function(x) unlist(x$fields, FALSE))

dplyr

library(dplyr)
bind_rows(out)
#> # A tibble: 10 × 2
#>    play_name       speaker
#>        <chr>         <chr>
#> 1   Henry IV              
#> 2   Henry IV KING HENRY IV
#> 3   Henry IV KING HENRY IV
#> 4   Henry IV KING HENRY IV
#> 5   Henry IV KING HENRY IV
#> 6   Henry IV KING HENRY IV
#> 7   Henry IV KING HENRY IV
#> 8   Henry IV KING HENRY IV
#> 9   Henry IV  WESTMORELAND
#> 10  Henry IV  WESTMORELAND

data.table

library(data.table)
rbindlist(out, fill = TRUE, use.names = TRUE)
#>    play_name       speaker
#> 1:  Henry IV              
#> 2:  Henry IV KING HENRY IV
#> 3:  Henry IV KING HENRY IV
#> 4:  Henry IV KING HENRY IV
#> 5:  Henry IV KING HENRY IV
#> 6:  Henry IV KING HENRY IV
#> 7:  Henry IV KING HENRY IV
#> 8:  Henry IV KING HENRY IV
#> 9:  Henry IV  WESTMORELAND
#> 10:  Henry IV  WESTMORELAND
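The Shakespeare documents all have both fields, so the output above does not show the ragged case from the question. Here is a minimal sketch of that case, assuming a hand-built list `out_ragged` (a hypothetical name, shaped like `out`, with made-up values) in which the third hit lacks field B:

```r
library(data.table)

# Hypothetical per-hit field lists; the third hit has no B.
out_ragged <- list(
  list(A = "a1", B = "b1", C = "c1"),
  list(A = "a2", B = "b2", C = "c2"),
  list(A = "a3", C = "c3")
)

# fill = TRUE pads fields that a hit does not have with NA;
# use.names = TRUE matches columns by name instead of by position.
dt <- rbindlist(out_ragged, fill = TRUE, use.names = TRUE)
dt
```

The missing B in the third row comes back as NA instead of shifting C into the wrong column.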

Alternatively, use the asdf parameter, which internally tells jsonlite::fromJSON to parse to a data.frame where possible.

res <- Search(index="shakespeare", fields=c('play_name','speaker'), asdf = TRUE)
res$hits$hits$fields
#>    play_name       speaker
#> 1   Henry IV              
#> 2   Henry IV KING HENRY IV
#> 3   Henry IV KING HENRY IV
#> 4   Henry IV KING HENRY IV
#> 5   Henry IV KING HENRY IV
#> 6   Henry IV KING HENRY IV
#> 7   Henry IV KING HENRY IV
#> 8   Henry IV KING HENRY IV
#> 9   Henry IV  WESTMORELAND
#> 10  Henry IV  WESTMORELAND
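If neither package is at hand, the same padding can be done in base R. This is a rough sketch and not part of elastic: `out_ragged` and `to_row` are hypothetical names, the values are made up, and it assumes every field holds a single character value.

```r
# Hypothetical per-hit field lists; the second hit has no B.
out_ragged <- list(
  list(A = "a1", B = "b1", C = "c1"),
  list(A = "a2", C = "c2"),
  list(A = "a3", B = "b3", C = "c3")
)

# Union of all field names seen across the hits, in order of first appearance.
all_names <- unique(unlist(lapply(out_ragged, names)))

# Align one hit to the full set of names; names it lacks index to NA.
to_row <- function(x) {
  v <- unlist(x)[all_names]
  names(v) <- all_names
  v
}

# vapply() returns one column per hit, so transpose to get one row per hit.
mat <- t(vapply(out_ragged, to_row, character(length(all_names))))
df <- as.data.frame(mat, stringsAsFactors = FALSE)
df
```

Every column comes out as character here, so numeric or date fields would still need converting afterwards.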

Using:

  • R v3.3.2
  • OSX
  • elastic v0.7.8.9000
  • Elasticsearch v2.3.4