How to clean a database and read it directly in R

Asked: 2019-07-20 02:23:57

Tags: r dataframe awk

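When I downloaded the 2019 daily allowances and tickets database from the Transparency Portal at [http://www.portaltransparencia.gov.br/download-de-dados/viagens][1], I found errors in the column separators: some lines contain more ";" characters than others. This can be checked in the terminal by counting the separators per line: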

    cat 2019_Viagem.csv | awk -F ";" '{print NF-1}' | sort | uniq -c

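How can I remove every line that contains more than 15 separators and save the new database in .csv format for statistical analysis?

This is my initial code: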

    library("tidyverse")
    library("readr")
    library("data.table")
    library("stringr")
    library("lubridate")
    #unzip("2019_20190630_Viagens.zip")
    options(datatable.fread.input.cmd.message=FALSE)
    Diaria2019_Via <- "iconv -f ISO-8859-1 -t UTF-8 2019_Viagem.csv"
    Diaria2019 <- data.table::fread(Diaria2019_Via,dec = ",")


    Warning messages:
    1: In data.table::fread(Diaria2019_Via, dec = ",") :
      Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
    2: In data.table::fread(Diaria2019_Via, dec = ",") :
      Stopped early on line 7378. Expected 16 fields but found 18. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"0000000000015769552";"Realizada";"53000";"Ministério do Desenvolvimento Regional";"53000";"Ministério do Desenvolvimento Regional - Unidades com vínculo direto";"***.043.57*-**";"ARMIN AUGUSTO BRAUN";"";"20190115";"20190116";"São Paulo/SP";"Representar a Secretaria Nacional de Proteção e Defesa Civil - SEDEC, no Seminário "Proteção e Defesa Civil Aplicada", onde Ministrará palestra sobre "Apoio Federal na Resposta a Desastres"; participará reunião com pessoal do Hospital Albert Eins>>

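The messages above suggest using quote = "" or fill (I tried fill = NULL), but neither worked. The code below reads the daily-value column incorrectly, and I cannot convert that column to numeric.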

    Diaria_2019 <- read_delim("2019_Viagem.csv", 
                              ";", escape_double = FALSE, locale = locale(decimal_mark = ".",encoding = "ISO-8859-1"), 
                              trim_ws = TRUE)

The code below is a first attempt at dropping the lines with more than 15 ";" separators, but it still does not work!

    teste <- readLines("2019_Viagem.csv")
    count <- str_count(teste, ';')
    teste <- teste[count==15]
    write.csv2(teste,"plan2019.csv",row.names = FALSE)
    Diaria2019_Via <- "iconv -f ISO-8859-1 -t UTF-8 plan2019.csv"
    Diaria2019 <- data.table::fread(Diaria2019_Via, dec = ",")

1 Answer:

Answer 0 (score: 1)

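Read the input with 'readLines', apply a regular expression to each line to count the separators, drop the lines that have more than 15 separators, and then read the cleaned input with 'read_delim'.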
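A minimal sketch of those steps follows; it is one possible way to do this, not a tested solution for this exact file. The helper names `count_separators` and `keep_well_formed` and the output file name `2019_Viagem_clean.csv` are hypothetical. It assumes the file is ISO-8859-1 encoded and uses "," as the decimal mark, as the question's own code suggests. The cleaned lines are written with `writeLines`, because `write.csv2` applied to a character vector treats each line as a single quoted cell, which is likely why the attempt in the question failed.

```r
# Count occurrences of a literal separator in each line (base R only).
count_separators <- function(lines, sep = ";") {
  lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
}

# Keep only the lines that contain exactly `n_sep` separators.
keep_well_formed <- function(lines, sep = ";", n_sep = 15) {
  lines[count_separators(lines, sep) == n_sep]
}

# File names are assumptions; adjust them to the local paths.
in_path  <- "2019_Viagem.csv"
out_path <- "2019_Viagem_clean.csv"

if (file.exists(in_path) && requireNamespace("readr", quietly = TRUE)) {
  # Read as ISO-8859-1 so accented characters are converted correctly.
  con <- file(in_path, open = "r", encoding = "ISO-8859-1")
  raw_lines <- readLines(con, warn = FALSE)
  close(con)

  # The header line is kept as long as it also has 15 separators.
  clean_lines <- keep_well_formed(raw_lines, sep = ";", n_sep = 15)

  # Write the surviving lines unchanged, re-encoded as UTF-8.
  out <- file(out_path, open = "w", encoding = "UTF-8")
  writeLines(clean_lines, out)
  close(out)

  # Parse the cleaned file; "," as decimal mark lets readr guess
  # numeric columns instead of leaving them as text.
  Diaria_2019 <- readr::read_delim(
    out_path,
    delim = ";",
    locale = readr::locale(decimal_mark = ",", grouping_mark = ".",
                           encoding = "UTF-8"),
    trim_ws = TRUE
  )
}
```

Counting raw ";" characters also counts separators that appear inside quoted text, which is the intent here, since those are the lines that break the 16-field layout. Lines that have exactly 15 separators but contain stray double quotes may still trigger parsing warnings and would need a separate check.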