For whatever reason, the data is provided in the following format:
0001 This is text for 0001
0002 This has spaces in between
0003 Yet this is only supposed to be two columns
0009 Why didn't they just comma delimit you may ask?
0010 Or even use quotations?
001 Who knows
0012 But now I'm here with his file
0013 And hoping someone has an elegant solution?
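So the above should be two columns. What I want is one column holding the first entry of each line, i.e. 0001, 0002, 0003, 0009, 0010, 001, 0012, 0013, and a second column holding everything else.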
Answer 0 (score: 5)
You can use the separate function from the tidyr package (promoting my comment to an answer). You specify the two column names, and the extra = "merge" argument ensures that everything after the first space is put into the second column:
library(tidyr)
separate(mydf, V1, c("nr","text"), sep = " ", extra = "merge")
# or:
mydf %>% separate(V1, c("nr","text"), sep = " ", extra = "merge")
You get:
nr text
1 0001 This is text for 0001
2 0002 This has spaces in between
3 0003 Yet this is only supposed to be two columns
4 0009 Why didnt they just comma delimit you may ask?
5 0010 Or even use quotations?
6 001 Who knows
7 0012 But now Im here with his file
8 0013 And hoping someone has an elegant solution?
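For a self-contained run, here is a minimal sketch. It assumes mydf is a data.frame with a single character column V1 holding the raw lines. The three sample rows are taken from the question; the way mydf is constructed here is an assumption, not the data used in the answer.

```r
library(tidyr)

# Hypothetical reconstruction of the input: one character column named V1
mydf <- data.frame(
  V1 = c("0001 This is text for 0001",
         "0002 This has spaces in between",
         "001 Who knows"),
  stringsAsFactors = FALSE
)

# Split at the first space only; extra = "merge" keeps the remaining
# spaces inside the second column instead of dropping the extra pieces
res <- separate(mydf, V1, c("nr", "text"), sep = " ", extra = "merge")
res
```

Because convert is left at its default, nr stays a character column, so leading zeros such as 0001 are preserved.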
Answer 1 (score: 3)
I would recommend the input.file function from the "iotools" package.
Usage would be something like:
library(iotools)
input.file("yourfile.txt", formatter = dstrsplit, nsep = " ", col_types = "character")
Here is an example. (For illustration, I've created a dummy temporary file in my workspace.)
x <- tempfile()
writeLines(c("0001 This is text for 0001",
"0002 This has spaces in between",
"0003 Yet this is only supposed to be two columns",
"0009 Why didn't they just comma delimit you may ask?",
"0010 Or even use quotations?",
"001 Who knows",
"0012 But now I'm here with his file",
"0013 And hoping someone has an elegant solution?"), con = x)
library(iotools)
input.file(x, formatter = dstrsplit, nsep = " ", col_types = "character")
# rowindex V1
# 1 0001 This is text for 0001
# 2 0002 This has spaces in between
# 3 0003 Yet this is only supposed to be two columns
# 4 0009 Why didn't they just comma delimit you may ask?
# 5 0010 Or even use quotations?
# 6 001 Who knows
# 7 0012 But now I'm here with his file
# 8 0013 And hoping someone has an elegant solution?
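Elegant enough? ;-)
If you have already read the data in as a single-column data.frame (as in @Jaap's answer), you can still benefit from the speed of "iotools" by using the formatter directly, rather than calling it inside the input.file function.
In other words, use: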
dstrsplit(as.character(mydf$V1), nsep = " ", col_types = "character")
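If anyone is interested, I benchmarked the solutions proposed by Jaap and akrun against the "iotools" approach. You can find the results in this Gist. In summary: whether working with a file on disk or with a column already in memory, "iotools" was the best performer. I did not test tomtom's solution because it would require further processing beyond what is in their answer.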
Answer 2 (score: 0)
You might want to use something like the following (e.g. in an lapply loop):
unlist(strsplit(gsub("([0-9]{1,}) ","\\1~",x), "~" ))
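Here is what it does: gsub captures whatever sits between the parentheses ( and ) and makes it available in the replacement as \\1. [0-9] matches any digit, and the {1,} after it allows one or more occurrences. So you first replace the space between the number and the text with a tilde (or any other character that does not occur in the text), and then strsplit on that character.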
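One caveat: as written, gsub replaces every "digits followed by a space" match, so a number inside the text (e.g. "room 12 is") would also get a tilde. A minimal base R sketch of the same idea, assuming the text contains no tilde and every line starts with the number; it swaps in sub with a ^ anchor so that only the leading number is affected:

```r
x <- c("0001 This is text for 0001",
       "001 Who knows",
       "0012 But now I'm here with his file")

# sub() replaces only the first match; ^ ties it to the leading number
marked <- sub("^([0-9]+) ", "\\1~", x)

# Split on the literal tilde (fixed = TRUE avoids regex interpretation)
parts <- strsplit(marked, "~", fixed = TRUE)

# Bind the two-element pieces into a two-column data.frame
res <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(res) <- c("nr", "text")
res
```

The column names nr and text are borrowed from the first answer for consistency; they are not part of this answer's original code.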
Answer 3 (score: 0)
We can use tstrsplit from data.table. We convert the "data.frame" to a "data.table" (setDT(mydf)) and use tstrsplit on the "V1" column, splitting at the whitespace that follows a digit (a regex lookbehind).
library(data.table)
res <- setDT(mydf)[, tstrsplit(V1, "(?<=\\d)\\s+", perl=TRUE)]
res
# V1 V2
#1: 0001 This is text for 0001
#2: 0002 This has spaces in between
#3: 0003 Yet this is only supposed to be two columns
#4: 0009 Why didnt they just comma delimit you may ask?
#5: 0010 Or even use quotations?
#6: 001 Who knows
#7: 0012 But now Im here with his file
#8: 0013 And hoping someone has an elegant solution?
If needed, the column names can be changed with setnames:
setnames(res, c("nr", "text"))
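Note that the lookbehind pattern matches every run of whitespace that follows a digit, so a line whose text contains something like "12 apples" would be split into more than two pieces. If that is a concern, a minimal base R sketch that splits strictly at the first space could look like the following. It is an alternative technique, not part of the data.table approach above, and it assumes every line contains at least one space:

```r
x <- c("0001 This is text for 0001",
       "0003 Yet this is only supposed to be 2 columns",
       "001 Who knows")

# Position of the first space in each string
first_space <- regexpr(" ", x, fixed = TRUE)

# Everything before the first space, and everything after it
res <- data.frame(
  nr   = substr(x, 1, first_space - 1),
  text = substr(x, first_space + 1, nchar(x)),
  stringsAsFactors = FALSE
)
res
```

The second sample line is a made-up variation of the question's data (with "2" written as a digit) to show that digits inside the text do not trigger an extra split.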