在R

时间:2016-07-29 11:52:29

标签: r csv import data-processing data-cleaning

我是一名向空间数据分析迁移的城市规划师。我并没有忘记R和编程,但由于我没有经过适当的训练,我的技能有时会受到限制。

目前我正在尝试分析大约50个包含有关公共拍卖的财务数据的CSV文件,这些数据的长度为60000到300000行,共有39个字段。这些文件是罗马尼亚国家公共拍卖系统的出口,这是一个形式平台。

问题是某些行在地址字段中间被CRLF行结尾打破。我怀疑当人们在表格中输入他们的地址时,他们会将其从其他多行文件中复制/粘贴。

Find& Replace无法解决此问题,因为这也会替换该行末尾的正确CRLF

作为一个例子,数据的格式是这样的,每行之后有一个CRLF(他们使用^作为分隔符):

Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; 
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^

为了正确处理数据,我需要像这样读取CSV,只删除那些断行的CRLF - Find& Replace不能这样做:

Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1 Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^

我找到了一个可能的解决方案(Is there a way in R to join broken lines of csv file?),但需要进行一些调整以满足我的需求。最终结果是下面的代码挂起并且没有到达进程的末尾,即使是在小样本文件上也是如此。

我对上述帖子中接受的解决方案代码进行了更改:

dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "^", fixed = TRUE)) # extract variable names
nvar <- length(varnames)

k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))

while(k <= length(dat)){
  k <- k + 1
  if(dat[k] == "") {k <- k + 1
  print(paste("data line", k, "is an empty string"))
  if(k > length(dat)) {break}
  }
  temp <- dat[k]
  # checks if there are enough commas or if the line was broken
  while(length(gregexpr("^", temp)[[1]]) < nvar-1){
    k <- k + 1
    temp <- paste0(temp, dat[k])
  }
  temp <- unlist(strsplit(temp, "^"))
  message(k)
  dat1 <- rbind(dat1, temp)
}

dat1 = dat1[-1,] # delete the empty initial row    

在分隔符之间计算字段似乎是一个很好的解决方案,但我无法找到一个很好的方法来做到这一点,我的R编程技能显然不够。

那么有没有办法在R中修复这种类型的损坏的CSV文件?

可以在此处访问工作文件示例:http://data.gv.ro/dataset/4a4903c4-b1e3-46d1-82a5-238287f9496c/resource/c6abc0ef-3efb-4aef-bc0a-411f8cab2a28/download/contracte-2007.csv

感谢您提供任何帮助!

2 个答案:

答案 0 :(得分:1)

麻烦似乎是^是一个特殊的角色。如果单步执行代码,您将看到有627个变量而不是39个。它使每个字符成为变量。试试这个:

dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "\\^"))  # extract variable names
nvar <- length(varnames)

k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))

while(k <= length(dat)){
  k <- k + 1
  #if(dat[k] == "") {k <- k + 1
  #print(paste("data line", k, "is an empty string"))
  if(k > length(dat)) {break}
  #}
  temp <- dat[k]
  # checks if there are enough commas or if the line was broken
  while(length(gregexpr("\\^", temp)[[1]]) < nvar-1){
    k <- k + 1
    temp <- paste0(temp, dat[k])
  }
  temp <- unlist(strsplit(temp, "\\^"))
  message(k)
  dat1 <- rbind(dat1, temp)
}

dat1 = dat1[-1,] # delete the empty initial row    

很抱歉错过了您和我的代码的差异。你不想要fixed = true。将其更改为上面的内容可以为您提供:

> varnames
 [1] "Castigator"                       "CastigatorCUI"                    "CastigatorTara"                  
 [4] "CastigatorLocalitate"             "CastigatorAdresa"                 "Tip"                             
 [7] "TipContract"                      "TipProcedura"                     "AutoritateContractanta"          
[10] "AutoritateContractantaCUI"        "TipAC"                            "TipActivitateAC"                 
[13] "NumarAnuntAtribuire"              "DataAnuntAtribuire"               "TipIncheiereContract"            
[16] "TipCriteriiAtribuire"             "CuLicitatieElectronica"           "NumarOfertePrimite"              
[19] "Subcontractat"                    "NumarContract"                    "DataContract"                    
[22] "TitluContract"                    "Valoare"                          "Moneda"                          
[25] "ValoareRON"                       "ValoareEUR"                       "CPVCodeID"                       
[28] "CPVCode"                          "NumarAnuntParticipare"            "DataAnuntParticipare"            
[31] "ValoareEstimataParticipare"       "MonedaValoareEstimataParticipare" "FonduriComunitare"               
[34] "TipFinantare"                     "TipLegislatieID"                  "FondEuropean"                    
[37] "ContractPeriodic"                 "DepoziteGarantii"                 "ModalitatiFinantare" 

答案 1 :(得分:0)

我们可以通过检查它是否以数字字段结尾来确定每条记录的最后一行。然后使用cumsum我们可以使用1,2,3,...来标记同一记录中的行。最后将它们粘贴在一起

cnt <- count.fields(textConnection(L), sep = "^") - 1
g <- rev(cumsum(rev(cumsum(cnt) %% 4 == 0)))

上述内容适用于问题中显示的数据,但如果实际数据与您显示的数据存在差异,则可能需要进行一些修改,具体取决于这些差异。

注意:如果每条记录中总共有四个^字符,请尝试将标记为##的行替换为:

Lines <- "Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; 
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^"

L <- readLines(textConnection(Lines))

cnt <- count.fields(textConnection(L), sep = "^") - 1   # 38 4 34 4 34
g <- rev(cumsum(rev(cumsum(cnt) %% 38 == 0)))
g <- max(g) - g + 1   # 1 2 2 3 3
L2 <- tapply(L, g, paste, collapse = " ")
DF <- read.table(text = L2, sep = "^")
dim(DF)
## [1]  3 39

更新问题已更改为提供新的样本数据。请注意,发布的答案适用于它但当然您需要将38替换为38,因为新数据每个记录有38个分隔符,而旧数据有4个。此外,旧数据有一个标题而新数据不是这样我们有删除了用于删除标题的-1次出现。这是一个自包含的示例,可以复制并粘贴到R。

comment.char = "", quote = ""

示例数据不包含注释字符(#)或单引号或双引号,但如果确实包含这些是数据的一部分,则将count.fields添加到read.table和{{1}}需要打电话。