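R - Split one column of multi-value strings into a data frame

Time: 2016-03-31 08:32:03

Tags: r

I'm fairly new to R, so please go easy on me.

I'm trying to work through this problem.

I have a huge block of text that I scraped from a website. It looks like this (I've changed some of the information for privacy):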

> theText99 

499737 2016-03-31 10:37:29 00:00:32 SALES WORD INITIATIVE 160915 123456789101 
Person Name Completed\n499731 2016-03-31 10:36:50 00:13:50 SALES NON WORD 
INITIATIVE 160915 1234567891013 Woman Name Completed\n499726 2016-03-31 
10:36:29 00:07:57 SALES NON WORD INITIATIVE 160915 123456789101 Someone Berry 
Completed\n499672 2016-03-31 10:29:13 00:00:09 SALES WORD INITIATIVE 160915 
123456789101 Person Carr Completed\n499654 2016-03-31 10:27:16 00:00:09 SALES 
WORD INITIATIVE 160915 123456789101 Person Carr Completed\n499609 2016-03-31 
10:18:36 00:11:06 SALES WORD INITIATIVE 160915 123456789101 Person Carr 
Completed\n499601 2016-03-31 10:16:29 00:10:34 SALES WORD INITIATIVE 160915 
123456789101 FirstName Kang Completed\n499568 2016-03-31 10:10:39 00:02:31 
SALES NON WORD INITIATIVE 160915 123456789101 Person Carr Completed\n499548 
2016-03-31 10:06:40 00:07:15 SALES WORD INITIATIVE 160915 1234567891011 Pat 
Laugh Completed\n499508 2016-03-31 09:56:34 00:02:51 SALES WORD INITIATIVE 
160915 123456789101 Mark LastName Completed\n499499 2016-03-31 09:54:33 
00:00:08 SALES WORD INITIATIVE 160915 123456789101 Woman Name 
Completed\n499490 2016-03-31 09:53:04 00:04:28 SALES WORD INITIATIVE 160915 
123456789101 Person Name Completed

My goal is to parse this data into a data frame.

I got this far:

> library(stringr)
> t <- str_split(theText99, "\\n")

This gives a nice set of separate lines of text:

[1] "499737 2016-03-31 10:37:29 00:00:32 SALES THING INITIATIVE 160915 123456789101 First Name Completed"
[2] "499731 2016-03-31 10:36:50 00:13:50 SALES THINGY INITIATIVE 160915 123456789101 Chelsea Hello Completed"
[3] "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Completed"
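
Note that `str_split` returns a list with one element per input string, so the individual lines sit inside its first element. A minimal sketch of the same split with base R's `strsplit`, assuming the records are separated by real newline characters (the two-record `sample_text` is made up for illustration):

```r
# Hypothetical two-record sample standing in for theText99
sample_text <- paste(
  "499737 2016-03-31 10:37:29 00:00:32 SALES WORD INITIATIVE 160915 123456789101 Person Name Completed",
  "499731 2016-03-31 10:36:50 00:13:50 SALES NON WORD INITIATIVE 160915 123456789101 Woman Name Completed",
  sep = "\n"
)

# strsplit() returns a list; [[1]] pulls out the character vector of lines
call_lines <- strsplit(sample_text, "\n", fixed = TRUE)[[1]]
length(call_lines)
```

Working with the character vector directly avoids the detour through `as.data.frame()` on a list.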

I put it into a data frame, thinking I was getting somewhere:

> x <- as.data.frame(t)
> t <- x[1,] # To Test on the first row
> library(stringi)
> library(stringr)
> t <- as.character(t)
> callId <- str_extract(t, "^[0-9]{6}")
> callId
[1] "499737"
> callDate <- str_extract(t, "[0-9\\-]{10}")
> callDate
[1] "2016-03-31"
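> callTimes <- str_extract_all(t, "[0-9]{2}:[0-9]{2}:[0-9]{2}")[[1]] # both HH:MM:SS fields, in order
> callTime <- callTimes[1]
> callDuration <- callTimes[2]
> callDuration
[1] "00:00:32"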
> callInitiative <- str_extract(t, "([A-Z]...+[A-Z]+...[0-9]+)")
> callInitiative
[1] "SALES THING INITIATIVE 160915"
> phoneNumber <- str_extract(t, "(\\d){7,}")
> phoneNumber
[1] "123456789101"
> agentName <- str_extract(t, "([A-Z][a-z]+ [A-Z][a-z]+)")
> agentName
[1] "FirstName LastName"

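Who knows whether this code will hold up... the lengths of some of the fields change often.

My problem: the last piece of text on each line changes often.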

For example:

[3] "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Completed"

[3] "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Still To Be Decided"

[3] "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Looking For Another Source"

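What is the best way to split out all of this information?

I suspect I may be doing too much work splitting the string... is there a better way?

Most of the items have a fixed length: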

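499726 - 6 digits
2016-03-31 - always the same date format
10:36:29 - always the same time format
00:07:57 - always the same time format
SALES THINGY INITIATIVE 160915 - this varies, but it is always upper-case TEXT followed by a number
123456789101 - phone number, stays the same length
Nice Name - the person's name: first name, last name
Completed - this field varies, from 1 word to 5 words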

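Any suggestions would be much appreciated.

Thanks!

Edit

Here is how I would like the information to go into columns:

Example string: 499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Completed

Columns: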

df <- data.frame(callID = 499726,
callDate = "2016-03-31",
callTime = "10:36:29",
callDuration = "00:07:57",
callInitiative = "SALES THINGY INITIATIVE 160915", 
phoneNumber = "123456789101",
agentName = "Nice Name",
callStatus = "Completed") 
## Remember, the data in this column could be anything from 'Completed' to
## 'Awaiting More Info' to 'Call Back Tomorrow' to 'Is Unaware of Anything
## We're Saying' (etc.). From a string-splitting perspective, this is
## the last one that's giving me issues.

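1 Answer:

Answer 0 (score: 0)

Solved

x <- str_locate(t, agentName) # start and end positions of the agent name within t
callStatus <- substr(t, x[2] + 2, nchar(t)) # everything after the name and the space that follows it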
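
An alternative that does not need the name to be located first is to describe the whole line with one anchored regular expression and let the last capture group take whatever text remains. The following is a minimal base R sketch rather than a tested solution for the real data. The `parse_calls` helper and its pattern are assumptions built from the field layout described in the question: a 6-digit ID, a date, two HH:MM:SS fields, an upper-case initiative ending in a number, a phone number of 7 or more digits, a name of exactly two capitalised words, then a free-form status.

```r
# Hypothetical helper: parse each call-log line into one data frame row.
parse_calls <- function(call_lines) {
  pattern <- paste0(
    "^([0-9]{6}) ",                       # callID
    "([0-9]{4}-[0-9]{2}-[0-9]{2}) ",      # callDate
    "([0-9]{2}:[0-9]{2}:[0-9]{2}) ",      # callTime
    "([0-9]{2}:[0-9]{2}:[0-9]{2}) ",      # callDuration
    "([A-Z][A-Z ]*[0-9]+) ",              # callInitiative: upper-case words, then a number
    "([0-9]{7,}) ",                       # phoneNumber
    "([A-Z][a-z]+ [A-Z][a-z]+) ",         # agentName: assumes two capitalised words
    "(.+)$"                               # callStatus: whatever is left on the line
  )
  m <- regmatches(call_lines, regexec(pattern, call_lines, perl = TRUE))

  # A successful match has 9 elements: the full match plus 8 capture groups
  ok <- lengths(m) == 9L
  if (!any(ok)) stop("no lines matched the pattern")
  if (!all(ok)) warning(sum(!ok), " line(s) did not match and were dropped")

  fields <- do.call(rbind, m[ok])
  data.frame(
    callID         = fields[, 2],
    callDate       = fields[, 3],
    callTime       = fields[, 4],
    callDuration   = fields[, 5],
    callInitiative = fields[, 6],
    phoneNumber    = fields[, 7],
    agentName      = fields[, 8],
    callStatus     = fields[, 9],
    stringsAsFactors = FALSE
  )
}

# Made-up sample lines in the format shown in the question
sample_lines <- c(
  "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Completed",
  "499731 2016-03-31 10:36:50 00:13:50 SALES NON WORD INITIATIVE 160915 123456789101 Woman Name Awaiting More Info"
)
calls <- parse_calls(sample_lines)
calls$callStatus
```

Because the status is captured as "everything after the name", it can be one word or five without changing the pattern. The weak point is the name: agents with three-part names, hyphens or apostrophes would not match the two-word assumption and would be reported in the warning.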