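I have a very long log of telephone records as a text file, and I am trying to read it into R, but it is really not working. The text has a structure, but it is definitely not a table. I suspect readLines or scan would work. Here is an example: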
TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:56
blay blay blah who knows what, but anyway it may have a comma
TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:58
blay blay blah who knows what
TheInstitute 5467
telephone line 412552999 x 4999
bump phone line 4125527777
datetime 2011110516 12:59
blay blay blah who knows what
TheInstitute 5467
telephone line 4125526987 x 4567
bump phone line 4125527777
datetime 2011110516 13:51
blay blay blah who knows what, but anyway it may have a comma
TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 14:56
blay blay blah who knows what
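How can I do this in R? I have tried tricks with scan, paste, and strsplit, but I am spinning my wheels. I probably have to put it into a list, since a list can handle unequal numbers of elements. I would like all records to have the same number of fields, and for records that lack a field (here, the bump phone line), I would like NA as the value in that field. I would appreciate help even with just getting started. From there I can experiment on my own.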
Answer 0 (score: 15)
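When using multi.line = TRUE in the scan function, records should end with two end-of-lines. I did this with textConnection wrapped around the text, but you would use a valid file name: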
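# txt is assumed to hold the whole log as one character string;
# with a real file, pass its name instead of textConnection(txt).
inp <- scan(textConnection(txt), multi.line=TRUE,
            what=list(place="character", tline1="character",
                      cline1="character", cline2="character", cline3="character"),
            sep="\n")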
Read 5 records
> str(as.data.frame(inp))
'data.frame': 5 obs. of 5 variables:
$ place : Factor w/ 1 level "TheInstitute 5467": 1 1 1 1 1
$ tline1: Factor w/ 2 levels " telephone line 4125526987 x 4567",..: 1 1 2 1 1
$ cline1: Factor w/ 4 levels " bump phone line 4125527777",..: 2 3 1 1 4
$ cline2: Factor w/ 4 levels " blay blay blah who knows what",..: 2 1 3 4 1
$ cline3: Factor w/ 3 levels ""," blay blay blah who knows what",..: 1 1 2 3 1
> as.data.frame(inp)
place tline1
1 TheInstitute 5467 telephone line 4125526987 x 4567
2 TheInstitute 5467 telephone line 4125526987 x 4567
3 TheInstitute 5467 telephone line 412552999 x 4999
4 TheInstitute 5467 telephone line 4125526987 x 4567
5 TheInstitute 5467 telephone line 4125526987 x 4567
cline1
1 datetime 2011110516 12:56
2 datetime 2011110516 12:58
3 bump phone line 4125527777
4 bump phone line 4125527777
5 datetime 2011110516 14:56
cline2
1 blay blay blah who knows what, but anyway it may have a comma
2 blay blay blah who knows what
3 datetime 2011110516 12:59
4 datetime 2011110516 13:51
5 blay blay blah who knows what
cline3
1
2
3 blay blay blah who knows what
4 blay blay blah who knows what, but anyway it may have a comma
5
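Note that in the output above, the optional bump phone line shifts the later fields to the right instead of leaving an NA. A possible way to get one column per field, with NA where the bump line is absent, is to read the lines and match each one by its prefix. The following is only a sketch, not part of the original answer: it assumes every record begins with a line starting "TheInstitute" and that the free-text note is the last line of each record. The helper `pick` and the column names are made up for illustration.

```r
# Two records from the question, as a character vector of lines
# (readLines("yourfile.txt") would produce a vector of the same shape).
lines <- c(
  "TheInstitute 5467",
  "telephone line 4125526987 x 4567",
  "datetime 2011110516 12:56",
  "blay blay blah who knows what, but anyway it may have a comma",
  "TheInstitute 5467",
  "telephone line 412552999 x 4999",
  "bump phone line 4125527777",
  "datetime 2011110516 12:59",
  "blay blay blah who knows what"
)

lines <- trimws(lines)            # remove any leading/trailing whitespace
lines <- lines[nzchar(lines)]     # drop blank separator lines, if present

# Assumption: each record starts with a line beginning "TheInstitute".
rec_id  <- cumsum(grepl("^TheInstitute", lines))
records <- split(lines, rec_id)

# Hypothetical helper: first line of a record matching `pattern`, else NA.
pick <- function(rec, pattern) {
  hit <- grep(pattern, rec, value = TRUE)
  if (length(hit) > 0) hit[1] else NA_character_
}

# Build one row per record. Assumption: the note is the record's last line.
parsed <- do.call(rbind, lapply(records, function(rec) {
  data.frame(
    place     = pick(rec, "^TheInstitute"),
    telephone = pick(rec, "^telephone line"),
    bump      = pick(rec, "^bump phone line"),
    datetime  = pick(rec, "^datetime"),
    note      = rec[length(rec)],
    stringsAsFactors = FALSE
  )
}))

print(parsed)
```

Records without a bump phone line get NA in the `bump` column, so every row has the same five fields. If the real log has other optional lines, each would need its own prefix pattern.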