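I'm fairly new to R, so please go easy on me.
I'm trying to solve the following problem.
I have a huge block of text that I scraped from a website. It looks like this (I've changed some of the information for privacy):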
> theText99
499737 2016-03-31 10:37:29 00:00:32 SALES WORD INITIATIVE 160915 123456789101
Person Name Completed\n499731 2016-03-31 10:36:50 00:13:50 SALES NON WORD
INITIATIVE 160915 1234567891013 Woman Name Completed\n499726 2016-03-31
10:36:29 00:07:57 SALES NON WORD INITIATIVE 160915 123456789101 Someone Berry
Completed\n499672 2016-03-31 10:29:13 00:00:09 SALES WORD INITIATIVE 160915
123456789101 Person Carr Completed\n499654 2016-03-31 10:27:16 00:00:09 SALES
WORD INITIATIVE 160915 123456789101 Person Carr Completed\n499609 2016-03-31
10:18:36 00:11:06 SALES WORD INITIATIVE 160915 123456789101 Person Carr
Completed\n499601 2016-03-31 10:16:29 00:10:34 SALES WORD INITIATIVE 160915
123456789101 FirstName Kang Completed\n499568 2016-03-31 10:10:39 00:02:31
SALES NON WORD INITIATIVE 160915 123456789101 Person Carr Completed\n499548
2016-03-31 10:06:40 00:07:15 SALES WORD INITIATIVE 160915 1234567891011 Pat
Laugh Completed\n499508 2016-03-31 09:56:34 00:02:51 SALES WORD INITIATIVE
160915 123456789101 Mark LastName Completed\n499499 2016-03-31 09:54:33
00:00:08 SALES WORD INITIATIVE 160915 123456789101 Woman Name
Completed\n499490 2016-03-31 09:53:04 00:04:28 SALES WORD INITIATIVE 160915
123456789101 Person Name Completed
My goal is to parse this data into a data frame.
This is how far I've got:
> library(stringr)
> t <- str_split(theText99, "\\n")
This gives a nice set of one-line strings:
[1] "499737 2016-03-31 10:37:29 00:00:32 SALES THING INITIATIVE 160915 123456789101 First Name Completed"
[2] "499731 2016-03-31 10:36:50 00:13:50 SALES THINGY INITIATIVE 160915 123456789101 Chelsea Hello Completed"
[3] "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Completed"
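One detail of the split step is worth spelling out: `str_split()` returns a list with one element per input string, which is why the result needs unpacking before individual records can be used. A minimal sketch, assuming a hypothetical two-record `sample_text` in place of the real scraped block:

```r
library(stringr)

# Hypothetical two-record sample standing in for the scraped text block.
sample_text <- paste(
  "499737 2016-03-31 10:37:29 00:00:32 SALES WORD INITIATIVE 160915 123456789101 Person Name Completed",
  "499731 2016-03-31 10:36:50 00:13:50 SALES NON WORD INITIATIVE 160915 1234567891013 Woman Name Completed",
  sep = "\n"
)

# str_split() returns a list (one element per input string),
# so [[1]] extracts the character vector of records.
records <- str_split(sample_text, "\\n")[[1]]

length(records)   # number of records found
records[1]        # first record as a plain string
```

Working with a plain character vector means the later `str_extract()` calls can run on every record at once instead of on one row at a time.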
I put it into a data frame, thinking I was getting somewhere:
> x <- as.data.frame(t)
> t <- x[1,] # To Test on the first row
> library(stringi)
> library(stringr)
> t <- as.character(t)
> callId <- str_extract(t, "^[0-9]{6}")
> callId
[1] "499737"
> callDate <- str_extract(t, "[0-9\\-]{10}")
> callDate
[1] "2016-03-31"
> callDuration <- str_extract(t, "[0-9\\:?]{8}")
> callDuration
[1] "10:37:29"
> callInitiative <- str_extract(t, "([A-Z]...+[A-Z]+...[0-9]+)")
> callInitiative
[1] "SALES BLAHBLAH INITIATIVE 160915"
> phoneNumber <- str_extract(t, "(\\d){7,}")
> phoneNumber
[1] 123456789101
> agentName <- str_extract(t, "([A-Z][a-z]+ [A-Z][a-z]+)")
> agentName
[1] "FirstName LastName"
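Who knows whether this code will hold up... the length of some of the fields changes a lot.
My problem: the last piece of text on each line changes frequently.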
For example:
[3] "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Completed"
[3] "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Still To Be Decided"
[3] "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Looking For Another Source"
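What is the best way to split out all of this information?
I suspect I'm doing too much work splitting strings... is there a better approach?
Most of the items have a fixed length or format:
499726 - 6 digits
2016-03-31 - always the same date format
10:36:29 - always the same time format
00:07:57 - always the same time format
SALES THINGY INITIATIVE 160915 - this varies, but it is always uppercase text followed by a number
123456789101 - phone number, stays the same length
Nice Name - the person's name: first name, last name
Completed - this field varies, from 1 word to 5 words
Any suggestions would be greatly appreciated.
Thanks!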
Edit
I'm looking to get the information into columns.
Example string: 499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Completed
Columns:
df <- data.frame(callID = 499726,
callDate = "2016-03-31",
callTime = "10:36:29",
callDuration = "00:07:57",
callInitiative = "SALES THINGY INITIATIVE 160915",
phoneNumber = "123456789101",
agentName = "Nice Name",
callStatus = "Completed")
## Remember, the data in this column could be anything from 'Completed' to
## 'Awaiting More Info' to 'Call Back Tomorrow' to 'Is Unaware of Anything
## We're Saying' (etc.). From a string-splitting perspective, this is
## the last one that's giving me issues.
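The field layout described above can also be written as a single regular expression with one capture group per column. The following is only a sketch under stated assumptions: the pattern, the sample `records`, and the two-word agent name are illustrative guesses based on the examples in this post, and names containing hyphens or apostrophes would not match.

```r
library(stringr)

# Hypothetical sample records modelled on the examples in the question.
records <- c(
  "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Completed",
  "499731 2016-03-31 10:36:50 00:13:50 SALES NON WORD INITIATIVE 160915 1234567891013 Woman Name Call Back Tomorrow"
)

# One capture group per target column (assumed layout).
pattern <- paste0(
  "^(\\d{6}) ",                        # callID: 6 digits
  "(\\d{4}-\\d{2}-\\d{2}) ",           # callDate
  "(\\d{2}:\\d{2}:\\d{2}) ",           # callTime
  "(\\d{2}:\\d{2}:\\d{2}) ",           # callDuration
  "([A-Z ]+ \\d+) ",                   # callInitiative: uppercase words, then a number
  "(\\d{7,}) ",                        # phoneNumber: 7 or more digits
  "([A-Z][A-Za-z]+ [A-Z][A-Za-z]+) ",  # agentName: assumed to be exactly two words
  "(.+)$"                              # callStatus: everything that remains
)

# str_match() returns a character matrix: column 1 is the full match,
# the remaining columns are the capture groups.
m <- str_match(records, pattern)
df <- as.data.frame(m[, -1, drop = FALSE], stringsAsFactors = FALSE)
names(df) <- c("callID", "callDate", "callTime", "callDuration",
               "callInitiative", "phoneNumber", "agentName", "callStatus")
df
```

Because the status group is simply "whatever is left", it does not matter whether the status is one word or five. Records that do not fit the pattern come back as rows of `NA`, which makes them easy to spot.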
Answer 0 (score: 0)
Solved
x <- str_locate(t, agentName)
callStatus <- substr(t, (x[2] + 2), nchar(t))
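The same idea (find where the agent name ends, then take the rest of the string) can be applied to every record at once rather than to a single row. A minimal sketch, assuming hypothetical sample `records` and reusing the name pattern from the question:

```r
library(stringr)

# Hypothetical sample records; the status text differs in length.
records <- c(
  "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Completed",
  "499726 2016-03-31 10:36:29 00:07:57 SALES THINGY INITIATIVE 160915 123456789101 Nice Name Still To Be Decided"
)

# Locate the first "Capitalised Capitalised" pair in each record.
# str_locate() returns a matrix with "start" and "end" columns.
loc <- str_locate(records, "[A-Z][a-z]+ [A-Z][a-z]+")

# Skip the space after the name (+2) and keep everything up to the end.
callStatus <- str_sub(records, loc[, "end"] + 2)
callStatus
```

This relies on the agent name being the first mixed-case two-word run in the line, which holds for the samples shown because the initiative text is all uppercase.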