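While downloading the 2019 per diem and tickets database from the Transparency Portal at [http://www.portaltransparencia.gov.br/download-de-dados/viagens][1], I found errors in the column separators: some rows contain more ";" than others. This is how I checked in the terminal: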
cat 2019_Viagem.csv | awk -F ";" '{print NF-1}' | sort | uniq -c
How can I remove every row that contains more than 15 separators and save the new database in .csv format for statistical analysis?
This is my initial code:
library("tidyverse")
library("readr")
library("data.table")
library("stringr")
library("lubridate")
#unzip("2019_20190630_Viagens.zip")
options(datatable.fread.input.cmd.message=FALSE)
Diaria2019_Via <- "iconv -f ISO-8859-1 -t UTF-8 2019_Viagem.csv"
Diaria2019 <- data.table::fread(Diaria2019_Via,dec = ",")
Warning messages:
1: In data.table::fread(Diaria2019_Via, dec = ",") :
Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In data.table::fread(Diaria2019_Via, dec = ",") :
Stopped early on line 7378. Expected 16 fields but found 18. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"0000000000015769552";"Realizada";"53000";"Ministério do Desenvolvimento Regional";"53000";"Ministério do Desenvolvimento Regional - Unidades com vínculo direto";"***.043.57*-**";"ARMIN AUGUSTO BRAUN";"";"20190115";"20190116";"São Paulo/SP";"Representar a Secretaria Nacional de Proteção e Defesa Civil - SEDEC, no Seminário "Proteção e Defesa Civil Aplicada", onde Ministrará palestra sobre "Apoio Federal na Resposta a Desastres"; participará reunião com pessoal do Hospital Albert Eins>>
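The messages above suggest using quote = "" and fill = TRUE, but neither worked. The code below reads the per diem value column incorrectly, and I cannot convert that column to numeric.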
Diaria_2019 <- read_delim("2019_Viagem.csv",
";", escape_double = FALSE, locale = locale(decimal_mark = ".",encoding = "ISO-8859-1"),
trim_ws = TRUE)
The code below was meant as a first step toward removing rows with more than 15 ";" separators, but it still does not work!
teste <- readLines("2019_Viagem.csv")
count <- str_count(teste, ';')
teste <- teste[count==15]
write.csv2(teste,"plan2019.csv",row.names = FALSE)
Diaria2019_Via <- "iconv -f ISO-8859-1 -t UTF-8 plan2019.csv"
Diaria2019 <- data.table::fread(Diaria2019_Via, dec = ",")
Answer 0 (score: 1)
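Read the input with 'readLines', apply a regular expression to each line to count the separators, drop the lines that have more than 15 separators, and then read the cleaned input with 'read_delim'.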
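The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name filter_by_separator_count and its arguments are hypothetical, the file is assumed to be ISO-8859-1 with one record per line, and numbers are assumed to use a comma as decimal mark (as the dec = "," in the question suggests). The kept lines are written with writeLines rather than write.csv2, because write.csv2 on a character vector wraps each whole line in quotes as a single field, which is probably why the attempt in the question could not be read back.

```r
library(readr)
library(stringr)

# Hypothetical helper (not from the original answer): keep only the lines
# that contain exactly `n_sep` separators, save them, and read the result.
filter_by_separator_count <- function(path_in, path_out, n_sep = 15, sep = ";") {
  # Read the raw lines and convert them from ISO-8859-1 to UTF-8.
  lines <- iconv(readLines(path_in, warn = FALSE),
                 from = "ISO-8859-1", to = "UTF-8")

  # Count literal separators per line; fixed() avoids regex interpretation.
  n_found <- str_count(lines, fixed(sep))

  # Keep well-formed lines only (the header has n_sep separators as well).
  kept <- lines[n_found == n_sep]
  message(length(lines) - length(kept), " line(s) dropped")

  # writeLines() writes the text as-is, so no extra quoting is added.
  writeLines(kept, path_out, useBytes = TRUE)

  # Assumption: comma as decimal mark and dot as grouping mark.
  read_delim(path_out, delim = sep,
             locale = locale(decimal_mark = ",", grouping_mark = ".",
                             encoding = "UTF-8"),
             trim_ws = TRUE)
}
```

It might then be called as Diaria2019 <- filter_by_separator_count("2019_Viagem.csv", "plan2019.csv"). Note that the dropped rows are lost rather than repaired, and that a kept line with stray inner quotes may still confuse the parser; passing quote = "" to read_delim is one thing to try in that case.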