I am trying to use the readtext package, from the quanteda family of packages, in R. Session info:
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] C/C/C/C/C/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm.plugin.webmining_1.3 XML_3.98-1.7 readtext_0.50 RoogleVision_0.0.1.1
[5] outliers_0.14 stringdist_0.9.4.4 ltm_1.0-0 polycor_0.7-9
[9] msm_1.6.4 MASS_7.3-47 psych_1.7.5 WriteXLS_4.0.0
[13] plyr_1.8.4 metafor_2.0-0 Matrix_1.2-9 metaSEM_0.9.14
[17] OpenMx_2.7.12 xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8
[21] readxl_1.0.0 quanteda_0.9.9-65 koRpus.lang.nl_0.01-3 koRpus_0.11-1
[25] sylly_0.1-1 jsonlite_1.5 httr_1.2.1
loaded via a namespace (and not attached):
[1] sylly.ru_0.1-1 splines_3.4.0 ellipse_0.3-8 RcppParallel_4.3.20 shiny_1.0.3
[6] sylly.it_0.1-1 expm_0.999-2 sylly.es_0.1-1 cellranger_1.1.0 slam_0.1-40
[11] yaml_2.1.14 backports_1.1.0 lattice_0.20-35 digest_0.6.12 googleAuthR_0.5.1
[16] colorspace_1.3-2 htmltools_0.3.6 httpuv_1.3.3 tm_0.7-1 devtools_1.13.2
[21] xtable_1.8-2 mvtnorm_1.0-6 scales_0.4.1 tibble_1.3.3 openssl_0.9.6
[26] ggplot2_2.2.1 withr_1.0.2 lazyeval_0.2.0 NLP_0.1-10 mnormt_1.5-5
[31] RJSONIO_1.3-0 survival_2.41-3 magrittr_1.5 mime_0.5 memoise_1.1.0
[36] evaluate_0.10 boilerpipeR_1.3 nlme_3.1-131 foreign_0.8-67 rsconnect_0.8
[41] tools_3.4.0 data.table_1.10.4 stringr_1.2.0 munsell_0.4.3 compiler_3.4.0
[46] rlang_0.1.1 grid_3.4.0 RCurl_1.95-4.8 bitops_1.0-6 rmarkdown_1.5
[51] gtable_0.2.0 curl_2.6 R6_2.2.2 sylly.en_0.1-1 knitr_1.16
[56] fastmatch_1.1-0 sylly.fr_0.1-1 rprojroot_1.2 stringi_1.1.5 parallel_3.4.0
[61] sylly.de_0.1-1 Rcpp_0.12.11
With it I am parsing more than 7,000 txt files, and I receive the following warning.
Warning message:
In (function (..., deparse.level = 1)  :
  number of columns of result is not a multiple of vector length (arg 2030)
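How can I find out which txt file is causing the warning?
The verbose option does not report anything when the warning occurs. For your information, when I parse just two files I get the output below (by the way, if I parse only one document at a time, the warning is not shown).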
Reading texts from /Users/OS/surfdrive/Competenties/Data step-1/BinnenlandsBestuur/1982/9-12/Office Lens 20170308-102311.jpg.txt
Reading texts from /Users/OS/surfdrive/Competenties/Data step-1/BinnenlandsBestuur/1983/Office Lens 20170308-103518.jpg.txt
, using glob pattern ... reading (txt) file: Office Lens 20170308-102311.jpg.txt
, using glob pattern ... reading (txt) file: Office Lens 20170308-103518.jpg.txt
read 2 documents.
Warning messages:
1: In (function (..., deparse.level = 1)  :
  number of columns of result is not a multiple of vector length (arg 2)
2: In if (verbosity == 2 & nchar(msg) > 70) pad <- paste0("\n", pad) :
  the condition has length > 1 and only the first element will be used
Thank you, Peter
PS. If this information is not sufficient, I will post a reproducible example on the GitHub page.
Answer 0 (score: 0)
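You can use purrr to find the files whose columns do not match what you expect.
First, let's create some demo data in which one file has different column names from the other three...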
library(tidyverse)
library(purrr)
library(stringr)
old_wd <- getwd()
setwd(tempdir())
demo_data <- tibble(
  x = rnorm(327),
  y = rnorm(327),
  z = rnorm(327)
)
write_csv(demo_data, "demo1.csv")
write_csv(demo_data, "demo2.csv")
write_csv(demo_data, "demo3.csv")
bad_data <- tibble(
  x = rnorm(327),
  y = rnorm(327),
  z = rnorm(327),
  extra_column = rnorm(327)
)
write_csv(bad_data, "demo4.csv")
Now define what the column names should be. For this example, the correct names are x, y, and z.
correct_names <- c("x", "y", "z")
This function reads a csv and checks whether its column names match correct_names.
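get_csv_names <- function(path) {
  # identical() compares the whole name vector at once, so a file with extra
  # or missing columns gives FALSE without a recycling warning from `==`
  c(path, identical(names(read_csv(path)), correct_names))
}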
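A TRUE/FALSE flag says that a file is off, but not how. As a minimal sketch in base R, the helper below (describe_mismatch is a hypothetical name, not part of the original answer) reports which expected names are missing and which unexpected ones are present:

```r
correct_names <- c("x", "y", "z")

# Hypothetical helper: compare the names found in a file against the
# expected names and list the differences in both directions.
describe_mismatch <- function(found, expected = correct_names) {
  list(
    missing    = setdiff(expected, found),  # expected but absent
    unexpected = setdiff(found, expected)   # present but not expected
  )
}

# Example with the header of the "bad" demo file
res <- describe_mismatch(c("x", "y", "z", "extra_column"))
print(res)
```

Note that setdiff() ignores order, so this complements rather than replaces an exact comparison of the name vectors.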
I am assuming you want to process all the csv files in the working directory. Otherwise, change the value of files below...
# Anchor the pattern so only names ending in ".csv" match; an unescaped "."
# would match any character
files <- list.files(pattern = "\\.csv$")
Now just map get_csv_names over files. Note that demo4.csv has the value FALSE, which means its column names do not match the ones you specified in correct_names...
map(files, get_csv_names)
# [[1]]
# [1] "demo1.csv" "TRUE"
#
# [[2]]
# [1] "demo2.csv" "TRUE"
#
# [[3]]
# [1] "demo3.csv" "TRUE"
#
# [[4]]
# [1] "demo4.csv" "FALSE"
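With thousands of files, scanning a long list of TRUE and FALSE values by eye is impractical. A minimal sketch of one alternative, assuming purrr and readr are installed: map_lgl() returns a plain logical vector, which can be used to keep only the mismatching file names. The directory and helper names here (demo_dir, has_correct_names) are illustrative, not from the original answer.

```r
library(purrr)
library(readr)

correct_names <- c("x", "y", "z")

# Work in a throwaway directory so the working directory is left untouched.
demo_dir <- file.path(tempdir(), "csv_name_check")
dir.create(demo_dir, showWarnings = FALSE)

good <- data.frame(x = 1:3, y = 4:6, z = 7:9)
bad  <- cbind(good, extra_column = 10:12)
write_csv(good, file.path(demo_dir, "demo1.csv"))
write_csv(bad,  file.path(demo_dir, "demo4.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# TRUE when the header matches exactly (same names, same order).
has_correct_names <- function(path) {
  identical(names(read_csv(path, col_types = cols())), correct_names)
}

ok <- map_lgl(files, has_correct_names)

# Keep only the files that failed the check.
mismatched <- basename(files[!ok])
print(mismatched)
```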
Since we changed the working directory at the start, it is a good idea to reset it at the end.
setwd(old_wd)
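The question itself is about txt files where no single file warns on its own, but a pair of files does. One generic way to narrow this down, sketched here in base R under that assumption, is to read each file together with a fixed reference file and record which pairs raise a warning. The names raises_warning, find_warning_files and toy_reader are hypothetical; toy_reader is only a stand-in that reproduces the same rbind warning by splitting paths of different depth, and it is not a claim about how readtext works internally.

```r
# TRUE if calling read_fun(files) raises at least one warning.
raises_warning <- function(files, read_fun) {
  tryCatch({
    read_fun(files)
    FALSE
  }, warning = function(w) TRUE)
}

# Pair every file with a reference file (the first one) and return the
# files whose pair triggers a warning.
find_warning_files <- function(files, read_fun) {
  reference <- files[1]
  others <- files[-1]
  flagged <- vapply(
    others,
    function(f) raises_warning(c(reference, f), read_fun),
    logical(1)
  )
  others[flagged]
}

# Toy stand-in for the real reader (an assumption for illustration only):
# split each path on "/" and row-bind the pieces, which warns when the
# pieces have incompatible lengths.
toy_reader <- function(paths) {
  do.call(rbind, strsplit(paths, "/", fixed = TRUE))
}

paths <- c("1982/a.txt", "1982/b.txt", "1983/9-12/c.txt", "1984/d.txt")
suspects <- find_warning_files(paths, toy_reader)
print(suspects)

# With the real data one might pass something like
#   function(f) readtext::readtext(f)
# as read_fun (untested here).
```

If the reference file is itself the odd one out, every pair will warn; in that case rerunning with a different reference file should make this visible.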