Process a variable-space-delimited file into 2 columns

Time: 2016-01-16 16:02:42

Tags: r

For whatever reason, the data is provided in the following format:

0001 This is text for 0001
0002 This has spaces in between
0003 Yet this is only supposed to be two columns
0009 Why didn't they just comma delimit you may ask?
0010 Or even use quotations?
001  Who knows
0012 But now I'm here with his file
0013 And hoping someone has an elegant solution?

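So the above is supposed to be two columns. What I want is one column with the first entry of each line, i.e. 0001, 0002, 0003, 0009, 0010, 001, 0012, 0013, and a second column with everything else.

4 Answers:

Answer 0: (score: 5)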

You can use the separate function from the tidyr package (promoting my comment to an answer). You specify the two column names, and with the extra = "merge" argument you make sure that everything after the first space is put into the second column:

library(tidyr)
separate(mydf, V1, c("nr","text"), sep = " ", extra = "merge")
# or:
mydf %>% separate(V1, c("nr","text"), sep = " ", extra = "merge")

which gives:

    nr                                           text
1 0001                          This is text for 0001
2 0002                     This has spaces in between
3 0003    Yet this is only supposed to be two columns
4 0009 Why didnt they just comma delimit you may ask?
5 0010                        Or even use quotations?
6  001                                      Who knows
7 0012                  But now Im here with his file
8 0013    And hoping someone has an elegant solution?

Data used: mydf is a data frame with a single column V1 that holds the raw lines.
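
The construction of mydf is not shown in this answer. A minimal sketch of one way it might be built, assuming the raw lines are read with readLines into a one-column data frame; the names tmp and raw_lines, the temporary file, and the three sample lines are illustrative assumptions, not the answerer's actual code:

```r
# Write a few sample lines to a temporary file that stands in for the real input.
tmp <- tempfile()
writeLines(c("0001 This is text for 0001",
             "0002 This has spaces in between",
             "001  Who knows"), con = tmp)

# Read every line as one string, without any attempt at splitting.
raw_lines <- readLines(tmp)

# One character column named V1, matching the column name used with separate().
mydf <- data.frame(V1 = raw_lines, stringsAsFactors = FALSE)
str(mydf)
```

Reading whole lines first keeps the splitting rule in one place, so the same mydf can be passed to any of the approaches in this thread.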

Answer 1: (score: 3)

I would recommend the input.file function from the "iotools" package.

Usage looks like this:

library(iotools)
input.file("yourfile.txt", formatter = dstrsplit, nsep = " ", col_types = "character")

Here is an example. (For illustration, I created a dummy temporary file in my workspace.)

x <- tempfile()
writeLines(c("0001 This is text for 0001",
             "0002 This has spaces in between",
             "0003 Yet this is only supposed to be two columns",
             "0009 Why didn't they just comma delimit you may ask?",
             "0010 Or even use quotations?",
             "001  Who knows",
             "0012 But now I'm here with his file",
             "0013 And hoping someone has an elegant solution?"), con = x)

library(iotools)
input.file(x, formatter = dstrsplit, nsep = " ", col_types = "character")
#   rowindex                                              V1
# 1     0001                           This is text for 0001
# 2     0002                      This has spaces in between
# 3     0003     Yet this is only supposed to be two columns
# 4     0009 Why didn't they just comma delimit you may ask?
# 5     0010                         Or even use quotations?
# 6      001                                       Who knows
# 7     0012                  But now I'm here with his file
# 8     0013     And hoping someone has an elegant solution?

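Elegant enough? ;-)

Update 1

If you have already read the data in as a single-column data.frame (as in @Jaap's answer), you can still benefit from the speed of "iotools" by calling the formatter directly instead of calling it through the input.file function.

In other words, use: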

dstrsplit(as.character(mydf$V1), nsep = " ", col_types = "character")

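Update 2

If anyone is interested, I benchmarked the solutions proposed by Jaap and akrun against the "iotools" approach. You can find the results at this Gist. Summary: whether working with a file on disk or with a column already in memory, "iotools" was the best performer. I did not test tomtom's solution because it would need further processing beyond what is in their answer.

Answer 2: (score: 0)

You may want to use something like the following (for example inside an lapply loop):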

unlist(strsplit(gsub("([0-9]{1,}) ","\\1~",x), "~" ))

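What it does: gsub keeps whatever matches between the parentheses () and makes it available as the backreference \\1. [0-9] matches any digit, and the {1,} after it allows one or more occurrences. So you first replace the space between the number and the text with a tilde (or any other character that does not occur in the text), and then strsplit on that character.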
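
Note that gsub replaces every run of digits that is followed by a space, so a line whose text contains a number followed by a space would be split into more than two pieces. A hedged base R sketch of a variant anchored to the start of the line, so that only the first run of whitespace acts as the delimiter; the helper name split_first and the sample vector x are hypothetical, not part of this answer:

```r
# Split each line into the leading token and everything after the first gap.
# Lines that contain no whitespace at all are returned unchanged in both columns.
split_first <- function(x) {
  data.frame(
    nr   = sub("^(\\S+)\\s+.*$", "\\1", x),  # keep only the leading token
    text = sub("^\\S+\\s+", "", x),          # drop the leading token and the gap
    stringsAsFactors = FALSE
  )
}

x <- c("0010 Or even use 2 quotations?", "001  Who knows")
split_first(x)
```

Because both patterns start with ^, digits further along in the text are left alone, and the double space in "001  Who knows" is consumed as a single gap.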

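Answer 3: (score: 0)

We can use tstrsplit from data.table. We convert the "data.frame" to a "data.table" (setDT(mydf)) and apply tstrsplit to the "V1" column, splitting on the whitespace that follows a digit (a regex lookbehind).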

library(data.table)
res <- setDT(mydf)[, tstrsplit(V1, "(?<=\\d)\\s+", perl=TRUE)]
res
#     V1                                             V2
#1: 0001                          This is text for 0001
#2: 0002                     This has spaces in between
#3: 0003    Yet this is only supposed to be two columns
#4: 0009 Why didnt they just comma delimit you may ask?
#5: 0010                        Or even use quotations?
#6:  001                                      Who knows
#7: 0012                  But now Im here with his file
#8: 0013    And hoping someone has an elegant solution?

If needed, the column names can be changed with setnames:

setnames(res, c("nr", "text"))