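I have a txt file that contains a copy of a patient's bill. From the billing information, I want to collect the entries for one particular drug given to the patient.
The text file holds all of the patient's information by date, and the bills are listed by date of purchase (he is an inpatient, so there are many bills).
So far, I have extracted the billing lines for that one drug using the following code.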
library(readr)
library(dplyr)
data = grep("CAR016", readLines("ip.txt"), value = TRUE)%>% as.data.frame
head(data)
str(data)
The output is as follows:
> head(data)
.
1 4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET
2 5 15/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET
3 6 16/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET
4 7 18/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 Suji
5 8 19/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 NISHAN
6 9 20/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 mam
> str(data)
'data.frame': 38 obs. of 1 variable:
$ .: Factor w/ 38 levels " 4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET",..: 1 2 3 4 5 6 7 8 9 10 ...
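As you can see, the output has 38 rows but only one variable. I now need to split these rows into columns (10 columns).
I have already used the stringr package to collapse the extra white space, but I don't know how to proceed with the splitting after that.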
library(readr)
library(stringr)
data = grep("CAR016", readLines("ip.txt"), value = TRUE)
# str_replace_all() is vectorised, so no explicit loop is needed
data = str_replace_all(data, pattern = "\\s+", replacement = " ")
head(data)
> head(data)
[1] " 4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET"
[2] " 5 15/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET"
[3] " 6 16/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET"
[4] " 7 18/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 Suji"
[5] " 8 19/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 NISHAN"
[6] " 9 20/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 mam"
Any hints would be greatly appreciated.
Thanks in advance.
Answer 0 (score: 1)
If the file has a fixed format (as in the example), one option is to use tidyr::extract with a regex that picks out the 10 columns:
library(tidyverse)
grep("CAR016", readLines("ip.txt"), value = TRUE)%>%
as.data.frame() %>% # Assuming 10 columns will be part of data
extract(., ., paste("Col",1:10,sep="_"),
regex = "(^\\d+)\\s(\\d{2}/\\d{2}/\\d{4})\\s([:alnum:]+)\\s+([A-Z :]+)\\s+(\\w+)\\s+([0-9.]+)\\s+(\\d+)\\s+([:alnum:]+)\\s+([0-9.]+)\\s+(.*$)")
Result:
# Col_1 Col_2 Col_3 Col_4 Col_5 Col_6 Col_7 Col_8 Col_9 Col_10
# 1 4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET
# 2 5 15/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET
# 3 6 16/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET
# 4 7 18/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 Suji
# 5 8 19/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 NISHAN
# 6 9 20/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 mam
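Regex explanation:
The regex needs 10 groups, one for each of the 10 columns that tidyr::extract is asked to create.
(^\\d+) -- Group 1: starts with one or more digits
\\s -- a single whitespace character
(\\d{2}/\\d{2}/\\d{4}) -- Group 2: the date
\\s -- a single whitespace character
([:alnum:]+) -- Group 3: one or more consecutive alphanumeric characters
\\s+ -- one or more whitespace characters
([A-Z :]+) -- Group 4: one or more uppercase letters, colons, or spaces
\\s+ -- one or more whitespace characters
(\\w+) -- Group 5: one or more word characters
\\s+ -- one or more whitespace characters
([0-9.]+) -- Group 6: digits with a decimal point
\\s+ -- one or more whitespace characters
(\\d+) -- Group 7: one or more digits
\\s+ -- one or more whitespace characters
([:alnum:]+) -- Group 8: one or more consecutive alphanumeric characters
\\s+ -- one or more whitespace characters
([0-9.]+) -- Group 9: digits with a decimal point
\\s+ -- one or more whitespace characters
(.*$) -- Group 10: everything that is left, up to the end of the line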
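For readers who want to try the ten-group idea without tidyverse, here is a minimal base R sketch. It is not the code from the answer: the two sample lines are copied from the question's output, and the pattern is a slightly more tolerant variant (an assumption on my part) that allows leading and repeated spaces and writes the POSIX class as [[:alnum:]], the form base R's regex documentation uses.

```r
# Two sample lines copied from the question's output.
lines <- c(
  " 4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET",
  " 7 18/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 Suji"
)

# The same ten capture groups as described above, with \\s+ between all
# fields and optional leading whitespace (illustrative variant, see lead-in).
pattern <- paste0(
  "^\\s*(\\d+)\\s+(\\d{2}/\\d{2}/\\d{4})\\s+([[:alnum:]]+)\\s+",
  "([A-Z :]+)\\s+(\\w+)\\s+([0-9.]+)\\s+(\\d+)\\s+",
  "([[:alnum:]]+)\\s+([0-9.]+)\\s+(.*)$"
)

# regexec() returns the full match plus one entry per capture group.
matches <- regmatches(lines, regexec(pattern, lines, perl = TRUE))

# Drop element 1 (the full match) and stack the ten groups row by row.
# This assumes every line matches; a non-matching line yields character(0).
parts <- do.call(rbind, lapply(matches, function(m) m[-1]))
billing <- as.data.frame(parts, stringsAsFactors = FALSE)
names(billing) <- paste("Col", 1:10, sep = "_")
print(billing)
```

Because [A-Z :]+ is greedy, the engine backtracks until the remaining groups fit, which is why the description ends up as a single field even though it contains spaces.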
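Edited: Option #2
As the OP requested, runs of multiple spaces are first replaced with a single space. After that, tidyr::separate can split the column on the space separator (sep = " ") into a fixed number of columns. Finally, columns 4 to 8 need to be joined back together with unite. The solution is as follows: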
library(tidyverse)
data <-
grep("CAR016", readLines("d:\\ip.txt"), value = TRUE)%>%
as.data.frame() %>% rename(., V1 = .) %>%
mutate(V1 = gsub("\\s+", " ",V1)) %>%
separate("V1", sprintf("Col_%02d",1:14), sep = " ") %>%
unite(V1_04, c("Col_04", "Col_05", "Col_06", "Col_07", "Col_08"), sep = " ")
data
# Col_01 Col_02 Col_03 V1_04 Col_09 Col_10 Col_11 Col_12 Col_13 Col_14
# 1 4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET
# 2 5 15/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET
# 3 6 16/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET
# 4 7 18/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 Suji
# 5 8 19/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 NISHAN
# 6 9 20/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 mam
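The split-then-rejoin idea from Option #2 can also be sketched in base R. The helper name split_billing_line is hypothetical, and the sketch rests on two assumptions taken from the sample rows only: the first three and the last six fields never contain spaces, and everything in between is the description. Under those assumptions the description may have any number of words.

```r
# Two sample lines copied from the question's output.
lines <- c(
  " 4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET",
  " 8 19/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 NISHAN"
)

# Hypothetical helper: split one line on whitespace, keep the first three
# and last six tokens as they are, and re-join the middle tokens.
split_billing_line <- function(line) {
  tokens <- strsplit(trimws(line), "\\s+")[[1]]
  n <- length(tokens)
  stopifnot(n >= 10)  # fewer tokens would mean a field is missing
  c(tokens[1:3],
    paste(tokens[4:(n - 6)], collapse = " "),
    tokens[(n - 5):n])
}

parts <- do.call(rbind, lapply(lines, split_billing_line))
billing <- as.data.frame(parts, stringsAsFactors = FALSE)
names(billing) <- sprintf("Col_%02d", 1:10)
print(billing)
```

Counting tokens from both ends avoids hard-coding 14 pieces, so a line with a shorter or longer description still lands in 10 columns. It would break if the last field (the name) ever contained a space.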