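For a research project, we dumped a large SAP database to CSV files. The delimiter is a comma (","). The problem is that one column stores free text that can itself contain commas, and this breaks my data import. Only that one column contains these extra commas.
I have already tried reading the whole file as a string and then splitting the lines with str_split(). I think a more suitable approach would be to use a regular expression.
A "regular" record looks like this: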
010,0040,0000399500,2018,KX,01/17/2015 00:00:00,01/17/2015 00:00:00,,ZAR,,2,,40,S,S,13860.00,VOUCHERS 126,,1000,0004301410,,0000669010,,,,0.000,,,0,0.00,ZAR,VOUCHERS,20180117,,
A "broken" record looks like this. CELL,PARKING,AIRFARE is the content of a single cell, but it gets split into three:
010,0040,0000399500,2018,KX,01/17/2015 00:00:00,01/23/2015 00:00:00,,ZAR,,2,,40,S,S,482.46,CELL,PARKING,AIRFARE,,1000,0004300010,,0000682110,,,,0.000,,,0,0.00,ZAR,CELL PARKING,20180123,,
My reproducible code snippet is very limited:
mydata = read.delim("SAP_input_file.csv", sep = ",")
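Before repairing anything, it can help to confirm which rows are affected. The following is only a minimal sketch on hypothetical toy rows (5 expected fields, with the third holding free text); the names `rows`, `expected` and `bad` are illustrative and not part of the original data. It counts the commas in each line and flags every line whose field count differs from the expected one.

```r
# Hypothetical toy rows: 5 fields expected, the third is free text
rows <- c("a,1,VOUCHERS 126,x,",
          "b,2,CELL,PARKING,AIRFARE,y,")
expected <- 5L  # assumed number of fields per record

# Number of fields = number of commas + 1
n_commas <- lengths(regmatches(rows, gregexpr(",", rows, fixed = TRUE)))
n_fields <- n_commas + 1L

# Indices of the lines that do not have the expected number of fields
bad <- which(n_fields != expected)
print(n_fields)
print(bad)
```

For the real file, the same idea would apply to the result of readLines() with expected set to 35.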
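Answer 0: (score: 2)
Here are two options.
1) gsubfn. Using input from the Note at the end, assume that each line has 35 fields and that the 17th field is the one that may be problematic. The 17th field can contain any number of commas, including zero. Build a pattern that matches such lines, with capture groups (i.e. parentheses) around each field: 16 comma-free fields, then the free-text field, then 18 more comma-free fields. Then use read.pattern from gsubfn with that pattern to read the data.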
library(gsubfn)
pat <- paste0("^", strrep("([^,]*),", 16), "(.*)", strrep(",([^,]*)", 18), "$")
read.pattern(text = input, pat = pat)
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 10 40 399500 2018 KX 01/17/2015 00:00:00 01/17/2015 00:00:00 NA ZAR NA 2
2 10 40 399500 2018 KX 01/17/2015 00:00:00 01/23/2015 00:00:00 NA ZAR NA 2
V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1 NA 40 S S 13860.00 VOUCHERS 126 NA 1000 4301410 NA 669010 NA
2 NA 40 S S 482.46 CELL,PARKING,AIRFARE NA 1000 4300010 NA 682110 NA
V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35
1 NA NA 0 NA NA 0 0 ZAR VOUCHERS 20180117 NA NA
2 NA NA 0 NA NA 0 0 ZAR CELL PARKING 20180123 NA NA
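To see why the pattern isolates the free-text field, here is a small sketch that builds the same kind of pattern for an arbitrary field count and checks it with base R's regexec() and regmatches() instead of read.pattern. The helper `make_pattern` and the toy rows are hypothetical and only serve as an illustration; `make_pattern(35, 17)` reproduces the pattern used above.

```r
# Hypothetical helper: a pattern for n_fields fields in which only
# field number `free` is allowed to contain commas
make_pattern <- function(n_fields, free) {
  paste0("^",
         strrep("([^,]*),", free - 1),         # comma-free fields before it
         "(.*)",                               # the free-text field (greedy)
         strrep(",([^,]*)", n_fields - free),  # comma-free fields after it
         "$")
}

# Toy rows: 5 fields, the third one is free text
rows <- c("a,1,VOUCHERS 126,x,",
          "b,2,CELL,PARKING,AIRFARE,y,")
pat5 <- make_pattern(5, 3)

# regexec() returns the full match followed by the capture groups
m <- regmatches(rows, regexec(pat5, rows))
fields <- t(sapply(m, function(z) z[-1]))  # drop the full match
print(fields[, 3])
```

Because the fields on both sides are restricted to `[^,]*`, the greedy `(.*)` in the middle is forced to absorb exactly the surplus commas.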
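2) Base R. This solution uses only base R. We replace the first 16 commas with semicolons, then replace the last 18 commas with semicolons, and then read the result using the semicolon as the separator. This assumes the data contain no semicolons.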
ss <- input
for(i in 1:16) ss <- sub(",", ";", ss)
for(i in 1:18) ss <- sub("(.*),", "\\1;", ss)
read.table(text = ss, sep = ";")
Note: the input used in both solutions is defined as follows.
s1 <- "010,0040,0000399500,2018,KX,01/17/2015 00:00:00,01/17/2015 00:00:00,,ZAR,,2,,40,S,S,13860.00,VOUCHERS 126,,1000,0004301410,,0000669010,,,,0.000,,,0,0.00,ZAR,VOUCHERS,20180117,,"
s2 <- "010,0040,0000399500,2018,KX,01/17/2015 00:00:00,01/23/2015 00:00:00,,ZAR,,2,,40,S,S,482.46,CELL,PARKING,AIRFARE,,1000,0004300010,,0000682110,,,,0.000,,,0,0.00,ZAR,CELL PARKING,20180123,,"
input <- c(s1, s2)
Update: replaced the original solution with the shorter one in (1). The original solution, simplified, is now (2).
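Both outputs above lose the leading zeros (010 is read as 10) because the columns are converted to numbers. If the codes need to stay intact, one possible variation of the base R approach is to read every column as character. This is a sketch on hypothetical toy rows with 5 fields (2 before and 2 after the free-text field), and it assumes the data contain no semicolons.

```r
# Toy rows (hypothetical): 5 fields, the third one is free text
rows <- c("010,0040,VOUCHERS 126,x,",
          "010,0040,CELL,PARKING,AIRFARE,y,")

ss <- rows
# Replace the first 2 commas with semicolons
for (i in 1:2) ss <- sub(",", ";", ss, fixed = TRUE)
# Replace the last 2 commas: the greedy (.*) makes sub() hit the final comma
for (i in 1:2) ss <- sub("(.*),", "\\1;", ss)

# Read everything as text so that leading zeros survive;
# quote and comment characters are disabled to keep free text untouched
df <- read.table(text = ss, sep = ";", colClasses = "character",
                 quote = "", comment.char = "")
print(df)
```

For the real data the loop bounds would be 16 and 18, as in option (2).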
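Answer 1: (score: 0)
Perhaps a regular expression can help. However, my code is not general: it works for your specific example, where three whole words are separated by commas. But maybe you can extend the logic to fit your data :)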
x <- "010,0040,0000399500,2018,KX,01/17/2015 00:00:00,01/23/2015 00:00:00,,ZAR,,2,,40,S,S,482.46,CELL,PARKING,AIRFARE,,1000,0004300010,,0000682110,,,,0.000,,,0,0.00,ZAR,CELL PARKING,20180123,,"
library(stringr)
# regex to find three words separated by commas
pattern <- "[a-zA-Z]+,[a-zA-Z]+,[a-zA-Z]+"
# extract the pattern and replace commas with space
correct_substring <- str_extract(x, pattern) %>%
  str_replace_all(",", " ")
# Insert the manipulated string into the original string
new_string <- str_replace_all(x, pattern, correct_substring)
# Now we can split the string by commas
str_split(new_string, pattern = ",")
Result
[[1]]
[1] "010" "0040" "0000399500" "2018" "KX"
[6] "01/17/2015 00:00:00" "01/23/2015 00:00:00" "" "ZAR" ""
[11] "2" "" "40" "S" "S"
[16] "482.46" "CELL PARKING AIRFARE" "" "1000" "0004300010"
[21] "" "0000682110" "" "" ""
[26] "0.000" "" "" "0" "0.00"
[31] "ZAR" "CELL PARKING" "20180123" "" ""
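One way the logic might be extended to any number of commas is to split on every comma and then glue the surplus pieces back together at the known position of the free-text field. The sketch below uses base R only; the helper `merge_free_field`, its arguments and the toy rows are hypothetical, and it assumes the total number of fields and the position of the free-text field are known.

```r
# Hypothetical helper: split a line on commas, then merge the pieces that
# belong to the free-text field at position `free`
merge_free_field <- function(line, n_fields, free, collapse = ",") {
  # strsplit() drops one trailing empty field, so append a comma first
  parts <- strsplit(paste0(line, ","), ",", fixed = TRUE)[[1]]
  extra <- length(parts) - n_fields   # surplus pieces from embedded commas
  stopifnot(extra >= 0)
  merged <- paste(parts[free:(free + extra)], collapse = collapse)
  c(head(parts, free - 1), merged, tail(parts, n_fields - free))
}

# Toy rows: 5 fields, the third one is free text
rows <- c("a,1,VOUCHERS 126,x,",
          "b,2,CELL,PARKING,AIRFARE,y,")
fields <- do.call(rbind, lapply(rows, merge_free_field, n_fields = 5, free = 3))
print(fields)
```

Passing collapse = " " would give the space-separated form used in this answer; for the data in the question the call would use n_fields = 35 and free = 17.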