Question

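I have a file of data that I need to import into a data frame, but the file is laid out very badly.

The file I'm importing is a list of 344-character strings (32 columns, 445k rows). Each column occupies a specific range of character positions.

Column 1 is character positions 1:2

Column 2 is character positions 3:6

Column 3 is character positions 7:20, and so on.

Example data: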

the.data <- list("32154The street", "12546The clouds", "23236The jungle")

I need it to look like

col1   col2   col3
 32    154    The street
 12    546    The clouds
 23    236    The jungle

What I've tried:

substr(the.data, 1,2)
substr(the.data, 3,6)
substr(the.data, 7,20)

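and then binding the results together.

I would like to find a better solution.

I also tried inserting commas at the right character positions, exporting to CSV and re-importing (or using textConnection), but ran into problems there.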

Answer 1

readr, part of the tidyverse, can read fixed-width data.

library('tidyverse')

read_fwf(paste(the.data, collapse='\n'), fwf_widths(c(2,3,15)))
#> # A tibble: 3 x 3
#>      X1    X2         X3
#>   <int> <int>      <chr>
#> 1    32   154 The street
#> 2    12   546 The clouds
#> 3    23   236 The jungle
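
When the layout is given as start and end positions, as in the question, readr's fwf_positions may be more convenient than widths. The following is a minimal sketch on the sample strings; the positions 1-2, 3-5 and 6-15 and the column names are assumptions chosen to fit the three-column example, and the real file would use its own 32 start/end pairs.

```r
library(readr)

the.data <- list("32154The street", "12546The clouds", "23236The jungle")

# Start/end positions for the sample strings (assumed for illustration).
# For the real file these would follow its own layout, one pair per column.
positions <- fwf_positions(
  start     = c(1, 3, 6),
  end       = c(2, 5, 15),
  col_names = c("col1", "col2", "col3")
)

# A string containing newlines is treated as literal data, not a file path
result <- read_fwf(paste(unlist(the.data), collapse = "\n"), positions)
```

For the actual 445k-row file, the path to the file would presumably be passed in place of the pasted string.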

Answer 2

One option is to use sub to insert a delimiter and then read the result with read.csv/read.table:

read.csv(text = sub("^(\\d{2})(\\d{3})(.*)", "\\1,\\2,\\3", unlist(the.data)),
         header = FALSE, col.names = paste0("col", 1:3), stringsAsFactors = FALSE)
#   col1 col2       col3
# 1   32  154 The street
# 2   12  546 The clouds
# 3   23  236 The jungle
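
The regex approach may get unwieldy for the 32-column file in the question, since back-references in sub only go from \\1 to \\9, and any comma already present in the data would be read as a delimiter. A base R sketch that sidesteps both issues, assuming the widths 2, 3 and 10 of the sample strings, is read.fwf from the utils package:

```r
the.data <- list("32154The street", "12546The clouds", "23236The jungle")

# Column widths for the sample strings (assumed; the real file has 32 columns)
widths <- c(2, 3, 10)

# Feed the strings to read.fwf through a text connection, one string per line
con <- textConnection(unlist(the.data))
result <- read.fwf(con, widths = widths,
                   col.names = paste0("col", 1:3),
                   stringsAsFactors = FALSE)
close(con)
```

With a file on disk, the file path can be given directly instead of a text connection.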

Or we can use separate, splitting by position:

library(dplyr)
library(tidyr)
unlist(the.data) %>%
    as_tibble %>%
    separate(value, into = paste0("col", 1:3), sep = c(2, 5))
# A tibble: 3 x 3
#   col1  col2  col3
#*  <chr> <chr> <chr>
#1  32    154   The street
#2  12    546   The clouds
#3  23    236   The jungle
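
Newer tidyr releases also provide separate_wider_position, which takes named widths instead of split points. This is a sketch that assumes tidyr 1.3.0 or later is installed; the widths 2, 3 and 10 and the column names are again assumptions that fit only the sample strings.

```r
library(tibble)
library(tidyr)

the.data <- list("32154The street", "12546The clouds", "23236The jungle")

# One-column tibble holding the raw fixed-width strings
raw <- tibble(value = unlist(the.data))

# Named widths: each name becomes a column, each number its character width.
# The widths must add up to the string length (15 here), otherwise
# separate_wider_position() stops with an error by default.
result <- separate_wider_position(
  raw, value,
  widths = c(col1 = 2, col2 = 3, col3 = 10)
)
```

All resulting columns are character, so numeric columns would still need converting, for example with as.integer.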

Answer 3

Something like this?


r - Importing data where columns are character positions
