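How do I split lines from a txt file into columns (CSV) in R?

Asked: 2018-06-11 17:34:21

Tags: r csv dataframe text split

I have a txt file that contains a copy of a patient's bills. From this billing information I want to collect the details of one specific drug given to the patient.

The text file contains all of the patient's information by date, and the bills are listed by purchase date (he is an inpatient, so there are many bills).

So far, I have extracted the billing lines for one specific drug with the following code.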

library(readr)
library(dplyr)
data = grep("CAR016", readLines("ip.txt"), value = TRUE)%>% as.data.frame
head(data)
str(data)

The output is as follows:

> head(data)
                                                                                                                 .
1      4 14/03/2018 CAR016     CARDIAC MONITOR : PER DAY                 OTH         750.00  1 GEN     750.00 SGET
2      5 15/03/2018 CAR016     CARDIAC MONITOR : PER DAY                 OTH         750.00  1 GEN     750.00 SGET
3      6 16/03/2018 CAR016     CARDIAC MONITOR : PER DAY                 OTH         750.00  1 GEN     750.00 SGET
4      7 18/03/2018 CAR016     CARDIAC MONITOR : PER DAY                 OTH         750.00  1 GEN     750.00 Suji
5    8 19/03/2018 CAR016     CARDIAC MONITOR : PER DAY                 OTH         750.00  1 GEN     750.00 NISHAN
6       9 20/03/2018 CAR016     CARDIAC MONITOR : PER DAY                 OTH         750.00  1 GEN     750.00 mam
> str(data)
'data.frame':   38 obs. of  1 variable:
 $ .: Factor w/ 38 levels "   4 14/03/2018 CAR016     CARDIAC MONITOR : PER DAY                 OTH         750.00  1 GEN     750.00 SGET",..: 1 2 3 4 5 6 7 8 9 10 ...

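As you can see, the output has 38 rows but only a single variable. I now need to split each row into columns (10 columns).

How can I do that?

Update

I have collapsed the extra whitespace using the stringr package, but I don't know how to proceed with the split after that.

Code: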

library(readr)
library(stringr)

data = grep("CAR016", readLines("ip.txt"), value = TRUE) 


# str_replace_all is vectorized, so no loop is needed
data = str_replace_all(data, pattern = "\\s+", replacement = " ")

head(data)

Output:

> head(data)
[1] " 4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET"  
[2] " 5 15/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET"  
[3] " 6 16/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 SGET"  
[4] " 7 18/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 Suji"  
[5] " 8 19/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 NISHAN"
[6] " 9 20/03/2018 CAR016 CARDIAC MONITOR : PER DAY OTH 750.00 1 GEN 750.00 mam" 

Any hints would be greatly appreciated.

Thanks in advance.

1 Answer:

Answer 0: (score: 1)

If the file has a fixed format (as in the example), one option is to use tidyr::extract with a regex that captures the 10 columns:

library(tidyverse)
# Keep only the CAR016 lines and put them in a one-column data frame
lines <- grep("CAR016", readLines("ip.txt"), value = TRUE)
data.frame(line = lines, stringsAsFactors = FALSE) %>%  # Assuming 10 columns will be part of data
  extract(line, paste("Col", 1:10, sep = "_"),
    regex = "^\\s*(\\d+)\\s+(\\d{2}/\\d{2}/\\d{4})\\s+([[:alnum:]]+)\\s+([A-Z :]+)\\s+(\\w+)\\s+([0-9.]+)\\s+(\\d+)\\s+([[:alnum:]]+)\\s+([0-9.]+)\\s+(.*$)")

Result:

#   Col_1      Col_2  Col_3                                     Col_4 Col_5  Col_6 Col_7 Col_8  Col_9 Col_10
# 1     4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY                   OTH 750.00     1   GEN 750.00   SGET
# 2     5 15/03/2018 CAR016 CARDIAC MONITOR : PER DAY                   OTH 750.00     1   GEN 750.00   SGET
# 3     6 16/03/2018 CAR016 CARDIAC MONITOR : PER DAY                   OTH 750.00     1   GEN 750.00   SGET
# 4     7 18/03/2018 CAR016 CARDIAC MONITOR : PER DAY                   OTH 750.00     1   GEN 750.00   Suji
# 5     8 19/03/2018 CAR016 CARDIAC MONITOR : PER DAY                   OTH 750.00     1   GEN 750.00 NISHAN
# 6     9 20/03/2018 CAR016 CARDIAC MONITOR : PER DAY                   OTH 750.00     1   GEN 750.00    mam

Regex explanation:

We need 10 capture groups, one for each of the 10 columns that tidyr::extract is asked to create.

^\\s*                   -- Start of line, then any leading whitespace
(\\d+)                  -- Group 1 : one or more digits
\\s+                    -- one or more whitespace characters
(\\d{2}/\\d{2}/\\d{4})  -- Group 2 : date (dd/mm/yyyy)
\\s+                    -- one or more whitespace characters
([[:alnum:]]+)          -- Group 3 : one or more consecutive alphanumeric characters
\\s+                    -- one or more whitespace characters
([A-Z :]+)              -- Group 4 : one or more upper-case letters, ':' or spaces
\\s+                    -- one or more whitespace characters
(\\w+)                  -- Group 5 : one or more word characters
\\s+                    -- one or more whitespace characters
([0-9.]+)               -- Group 6 : digits and '.'
\\s+
(\\d+)                  -- Group 7 : one or more digits
\\s+
([[:alnum:]]+)          -- Group 8 : one or more consecutive alphanumeric characters
\\s+
([0-9.]+)               -- Group 9 : digits and '.'
\\s+
(.*$)                   -- Group 10 : everything that is left, up to the end of the line
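To try the pattern without the original file, it can be run on a couple of sample lines copied from the question. This is only a minimal sketch in base R (regexec with perl = TRUE, available since R 3.3.0) rather than tidyr; the names `lines`, `pattern`, `fields` and `parsed` are made up for the illustration, and the final trimming step is an addition that removes the padding group 4 picks up.

```r
# Two sample lines copied from the question's output (leading spaces kept on purpose)
lines <- c(
  "      4 14/03/2018 CAR016     CARDIAC MONITOR : PER DAY                 OTH         750.00  1 GEN     750.00 SGET",
  "    8 19/03/2018 CAR016     CARDIAC MONITOR : PER DAY                 OTH         750.00  1 GEN     750.00 NISHAN"
)

# Ten capture groups, one per column
pattern <- paste0(
  "^\\s*(\\d+)\\s+(\\d{2}/\\d{2}/\\d{4})\\s+([[:alnum:]]+)\\s+([A-Z :]+)\\s+",
  "(\\w+)\\s+([0-9.]+)\\s+(\\d+)\\s+([[:alnum:]]+)\\s+([0-9.]+)\\s+(.*$)"
)

# regexec returns the full match plus one entry per capture group
m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))

# Drop the full match (first element) and stack the groups row by row
fields <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(fields) <- paste("Col", 1:10, sep = "_")

parsed <- as.data.frame(fields, stringsAsFactors = FALSE)

# Group 4 is greedy and keeps trailing padding, so trim every column
parsed[] <- lapply(parsed, trimws)
print(parsed)
```

A line that does not match the pattern yields no groups and is silently dropped by the rbind step, so comparing nrow(parsed) with length(lines) is a cheap sanity check.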

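Edited: Option #2

As the OP requested, runs of multiple spaces are first replaced with a single space. After that, tidyr::separate (with a fixed number of columns) can split the line on the space delimiter (sep = " "). Finally, columns 4 through 8 have to be joined back together with unite, since they all belong to the description. This assumes the description always consists of exactly five tokens. The solution is: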

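library(tidyverse)

data <-
  grep("CAR016", readLines("d:\\ip.txt"), value = TRUE) %>%
  data.frame(V1 = ., stringsAsFactors = FALSE) %>%
  # Collapse runs of whitespace and drop leading/trailing blanks
  mutate(V1 = trimws(gsub("\\s+", " ", V1))) %>%
  # Each cleaned line has 14 space-separated tokens
  separate("V1", sprintf("Col_%02d", 1:14), sep = " ") %>%
  # Tokens 4 to 8 together form the description
  unite(V1_04, c("Col_04", "Col_05", "Col_06", "Col_07", "Col_08"), sep = " ")
data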
#   Col_01     Col_02 Col_03                     V1_04 Col_09 Col_10 Col_11 Col_12 Col_13 Col_14
# 1      4 14/03/2018 CAR016 CARDIAC MONITOR : PER DAY    OTH 750.00      1    GEN 750.00   SGET
# 2      5 15/03/2018 CAR016 CARDIAC MONITOR : PER DAY    OTH 750.00      1    GEN 750.00   SGET
# 3      6 16/03/2018 CAR016 CARDIAC MONITOR : PER DAY    OTH 750.00      1    GEN 750.00   SGET
# 4      7 18/03/2018 CAR016 CARDIAC MONITOR : PER DAY    OTH 750.00      1    GEN 750.00   Suji
# 5      8 19/03/2018 CAR016 CARDIAC MONITOR : PER DAY    OTH 750.00      1    GEN 750.00 NISHAN
# 6      9 20/03/2018 CAR016 CARDIAC MONITOR : PER DAY    OTH 750.00      1    GEN 750.00    mam
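Since the title asks for a CSV, a possible last step is to convert the column types and write the file. The following is a rough sketch in base R, not part of the original answer: `parsed` is a hypothetical stand-in for a 10-column result like the one from the first option, the output file name `car016.csv` is made up, and which columns are treated as numbers or dates is an assumption about what they mean.

```r
# Hypothetical stand-in for a 10-column result (everything is character at first)
parsed <- data.frame(
  Col_1  = c("4", "5"),
  Col_2  = c("14/03/2018", "15/03/2018"),
  Col_3  = c("CAR016", "CAR016"),
  Col_4  = c("CARDIAC MONITOR : PER DAY", "CARDIAC MONITOR : PER DAY"),
  Col_5  = c("OTH", "OTH"),
  Col_6  = c("750.00", "750.00"),
  Col_7  = c("1", "1"),
  Col_8  = c("GEN", "GEN"),
  Col_9  = c("750.00", "750.00"),
  Col_10 = c("SGET", "SGET"),
  stringsAsFactors = FALSE
)

# Convert the columns that look like counters, dates and amounts (an assumption)
parsed$Col_1 <- as.integer(parsed$Col_1)
parsed$Col_2 <- as.Date(parsed$Col_2, format = "%d/%m/%Y")
num_cols <- c("Col_6", "Col_7", "Col_9")
parsed[num_cols] <- lapply(parsed[num_cols], as.numeric)

# Write to a temporary location; swap in a real path as needed
out_file <- file.path(tempdir(), "car016.csv")
write.csv(parsed, out_file, row.names = FALSE)

# Read the file back to confirm there is one column per field
check <- read.csv(out_file, stringsAsFactors = FALSE)
str(check)
```

Passing row.names = FALSE keeps R's row numbers out of the file, so the CSV contains only the ten data columns.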