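Edit: I have trimmed the question down to something smaller.
I am trying to aggregate a data frame into the form below, but I am stuck.
This is ISDN log output from a phone system, so calls that happen at the same time are interleaved throughout the log. The calls are incoming, not outgoing.
The data frame looks like this: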
"V1" "V2""V3""V4" "V5" "V6" "V7" "V8"
"1" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189056:" "Oct 2 00:00:01.326 AEDST: ISDN Se0/0/0:15 Q931: RX <- SETUP pd = 8 callref = 0x174E "
"2" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189057:" " Bearer Capability i = 0x8090A3 "
"3" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189058:" " Standard = CCITT "
"4" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189059:" " Transfer Capability = Speech "
"5" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189060:" " Transfer Mode = Circuit "
"6" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189061:" " Transfer Rate = 64 kbit/s "
"7" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189062:" " Channel ID i = 0xA1839B "
"8" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189063:" " Preferred, Channel 27 "
"9" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189064:" " Calling Party Number i = 0x2183, '00123456789' "
"10" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189065:" " Plan:ISDN, Type:National "
"11" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189066:" " Called Party Number i = 0xC1, '0123456' "
"12" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189067:" " Plan:ISDN, Type:Subscriber(local) "
"13" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189068:" " Sending Complete"
"14" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189069:" "Oct 2 00:00:01.334 AEDST: ISDN Se0/0/0:15 Q931: TX -> CALL_PROC pd = 8 callref = 0x974E "
"15" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189070:" " Channel ID i = 0xA9839B "
"16" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189071:" " Exclusive, Channel 27"
"17" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189072:" "Oct 2 00:00:01.350 AEDST: ISDN Se0/0/0:15 Q931: TX -> ALERTING pd = 8 callref = 0x974E "
"18" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189073:" " Progress Ind i = 0x8088 - In-band info or appropriate now available "
"19" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189074:" "Oct 2 00:00:01.358 AEDST: ISDN Se0/0/0:15 Q931: TX -> CONNECT pd = 8 callref = 0x974E"
"20" "Oct" "" "2" "00:00:02" "10.20.5.31" "82189075:" "Oct 2 00:00:01.382 AEDST: ISDN Se0/0/0:15 Q931: RX <- CONNECT_ACK pd = 8 callref = 0x174E"
"21" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488302:" "Oct 2 00:00:18.210 AEDST: ISDN Se0/0/0:15 Q931: TX -> DISCONNECT pd = 8 callref = 0x9AC7 "
"22" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488303:" " Cause i = 0x8090 - Normal call clearing"
"23" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488304:" "Oct 2 00:00:18.290 AEDST: ISDN Se0/0/0:15 Q931: RX <- RELEASE pd = 8 callref = 0x1AC7"
"24" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488305:" "Oct 2 00:00:18.314 AEDST: ISDN Se0/0/0:15 Q931: TX -> RELEASE_COMP pd = 8 callref = 0x9AC7"
"25" "Oct" "" "2" "00:00:21" "10.20.5.31" "82189076:" "Oct 2 00:00:21.053 AEDST: ISDN Se0/1/0:15 Q931: RX <- SETUP pd = 8 callref = 0x093A "
I would like the data set to look like this:
"V1" "V2""V3""V4" "V5" "V6" "V7" "UniqueId" "V8"
"1" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189056:" "0x174E" "Oct 2 00:00:01.326 AEDST: ISDN Se0/0/0:15 Q931: RX <- SETUP pd = 8 callref = 0x174E "
"2" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189057:" "0x174E" " Bearer Capability i = 0x8090A3 "
"3" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189058:" "0x174E" " Standard = CCITT "
....
"21" "Oct" "" "2" "00:00:19" "10.20.5.30" "81488302:" "0x9AC7" "Oct 2 00:00:18.210 AEDST: ISDN Se0/0/0:15 Q931: TX -> DISCONNECT pd = 8 callref = 0x9AC7 "
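To reiterate:
The call reference, shown in the log as callref (e.g. 0x174E), is the only way to identify a unique call within this data set. It becomes the new column (UniqueId) in the requested data frame.
Every row that follows gets the same callref id in the new column, until a row is reached that states the same callref again or a different callref.
Bonus points to anyone who can collapse these rows into a single row each time a callref appears. Note that this can happen in several different states (a row that contains a callref may also contain TX -> CALL_PROC, TX -> ALERTING, TX -> CONNECT, RX <- CONNECT_ACK and several others).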
For example, here I have merged the V8 column of rows 1, 2 and 3, since they belong to the same callref:
"V1" "V2""V3""V4" "V5" "V6" "V7" "UniqueId" "V8"
"1" "Oct" "" "2" "00:00:01" "10.20.5.31" "82189056:" "0x174E" "Oct 2 00:00:01.326 AEDST: ISDN Se0/0/0:15 Q931: RX <- SETUP pd = 8 callref = 0x174E \n Bearer Capability i = 0x8090A3 \n Standard = CCITT"
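Thanks for any answers.
Answer 0 (score: 1)
So this answer is a bit messy, but I did my best.
You can skip my read.fwf, since you did the same thing with str_split. I just wanted to get the data into a workable format.
I first read in the data, splitting it into a few columns: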
# Fixed-width read; the widths match how the log was saved in ex.csv
example1 <- read.fwf("ex.csv", widths = c(1, 6, 10, 10, 10, 1000), strip.white = TRUE)
Convert everything to character strings rather than factors, drop the first (header) row, then rename the columns.
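library(dplyr)
library(tidyr)

example <- example1 %>%
  mutate(across(everything(), as.character)) %>%  # characters, not factors
  slice(-1) %>%                                   # drop the header row
  select(Date = 2,                                # keep columns 2 to 6 under readable names
         Time = 3,
         IP = 4,
         id = 5,
         Description = 6)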
Then I flag the rows where a callref occurs and group the rows into blocks of text, each block starting at a callref row.
example <- example %>%
  mutate(callref = as.integer(grepl("callref", Description)),  # 1 on rows that carry a callref
         group = cumsum(callref))                              # block id, increases at every callref row
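To make the grouping step concrete, here is a minimal base-R sketch on a few shortened log lines. The `description` vector is a made-up stand-in for the Description column, not the real file, and the sketch assumes every block starts with a row containing `callref`, so a running count of those rows can serve as the block id:

```r
# Hypothetical, shortened stand-in for the Description column
description <- c(
  "Q931: RX <- SETUP pd = 8 callref = 0x174E",
  " Bearer Capability i = 0x8090A3",
  " Standard = CCITT",
  "Q931: TX -> CALL_PROC pd = 8 callref = 0x974E",
  " Channel ID i = 0xA9839B"
)

# TRUE on the rows that open a new block
has_callref <- grepl("callref", description, fixed = TRUE)

# Running count of callref rows: every row inherits the id of the
# most recent callref row above it (rows before the first one get 0)
group <- cumsum(has_callref)

print(data.frame(group, has_callref, description))
```

Rows that appear before the first callref end up in group 0, so they can be filtered out or inspected separately.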
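After grouping the example df, I summarise the text by pasting the Description values together within each group. I think this is the main thing you are trying to do?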
example2 <- example %>%
  group_by(group) %>%
  summarise(text = paste(Description, collapse = "*"))
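The same collapse can be tried without dplyr. This is only a sketch on a hypothetical `description` vector; it uses "\n" as the separator because that is what the desired output in the question shows, whereas the code above uses "*":

```r
# Hypothetical, shortened stand-in for the Description column
description <- c(
  "Q931: RX <- SETUP pd = 8 callref = 0x174E",
  " Bearer Capability i = 0x8090A3",
  " Standard = CCITT",
  "Q931: TX -> CALL_PROC pd = 8 callref = 0x974E",
  " Channel ID i = 0xA9839B"
)

# Block id: running count of rows that mention a callref
group <- cumsum(grepl("callref", description, fixed = TRUE))

# Split the trimmed lines by block and paste each block into one string
collapsed <- vapply(
  split(trimws(description), group),
  paste,
  character(1),
  collapse = "\n"
)

print(collapsed)
```

The result is a named character vector with one element per block, which could be joined back onto the callref rows by its names.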
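After that I join it back onto the main example df and use separate to split out some of the important information. This way we get RX_TX as well as the callref id. If needed, you can split out any other important information; I would then suggest using tidyr's spread function (pivot_wider in newer tidyr versions) to turn that information into columns so you can clean it further for analysis.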
example3 <- example %>%
  filter(callref == 1) %>%
  left_join(example2, by = "group") %>%
  select(-Description) %>%
  rename(Description = text) %>%
  separate(Description, into = c("firstpart", "RX_TX"), sep = "Q931: ") %>%
  separate(RX_TX, into = c("RX_TX", "Info"), sep = "pd = 8") %>%
  mutate(RX_TX = trimws(RX_TX),
         # pull out just the hex id instead of relying on a fixed-width substr()
         Call_Ref = sub(".*callref *= *(0x[0-9A-Fa-f]+).*", "\\1", Info))
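The question also asks for a UniqueId column that repeats the callref on every row of its block rather than collapsing the rows. A rough base-R sketch of that fill-down, again on a made-up `description` vector and assuming the id always appears in the form `callref = 0x...`:

```r
# Hypothetical, shortened stand-in for the Description column
description <- c(
  "Q931: RX <- SETUP pd = 8 callref = 0x174E",
  " Bearer Capability i = 0x8090A3",
  " Standard = CCITT",
  "Q931: TX -> CALL_PROC pd = 8 callref = 0x974E",
  " Channel ID i = 0xA9839B"
)

pattern <- "callref *= *(0x[0-9A-Fa-f]+)"
has_callref <- grepl(pattern, description)

# Hex ids taken from the rows that carry one, in order of appearance
ids <- sub(paste0(".*", pattern, ".*"), "\\1", description[has_callref])

# Block number per row; 0 means "before the first callref"
block <- cumsum(has_callref)

# Look up each row's id by block; the leading NA covers block 0
unique_id <- c(NA_character_, ids)[block + 1]

print(data.frame(UniqueId = unique_id, Description = description))
```

The ids are kept exactly as logged, so the RX and TX sides in the sample (0x174E and 0x974E) stay distinct, as they do in the desired output in the question.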