I just wrote a 4D (four-level nested) for loop. I'm sure there is a better way to do this

Asked: 2018-03-16 20:09:42

Tags: r loops vector

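I'm sure I'm not using, or not understanding, some of R's vectorization features. This code works, but as you might guess it is very slow.

The code identifies city, state, postal code and/or country names in three different dirty address columns and parses them into separate city/state/country/postal columns. To do that I got the geonames dataset of world cities, and I'm using grepl to slowly and painfully iterate over several address columns and several columns of geonames data.

I know R is built for working with large datasets and that iterating like this is probably very, very wrong, but I'm not sure how else to implement something like this. Once you've finished laughing at the monster I've created, I'd appreciate some advice on how to fix it.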
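For context on what "vectorized" means here, a minimal sketch: grepl() accepts a whole character vector as x, so one call can test a pattern against every address at once. The addr_text vector below is a made-up stand-in, not the real data.

```r
# Hypothetical stand-in for one address column
addr_text <- c("1 Salmon Ln, Akutan, AK 99553", "Atlanta, GA")

# One call checks the pattern against every element; no loop over rows needed
hits <- grepl("AKUTAN", toupper(addr_text), fixed = TRUE)
print(hits)
```

The result has one TRUE/FALSE per address, which is what would let the outer loop over the dirty rows go away.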

addresses <- c('addressColumn1', 'addressColumn2', 'addressColumn3')
matches = 0

#loop through each row of the dirty dataset
for(i in seq_len(nrow(dirty_data))) {

  #Used to move on to next row if a match is found
  stopLooking = FALSE

  #loop through each of the address columns in the dirty dataset
  for(addr in addresses){

    #Loop through each row of the geonames data
    for(z in 1:nrow(geonames)) {

      #Reset the count for each geonames row so hits from different rows are not added together
      matches=0

      #Skip iteration if a match has been found
      if(stopLooking)
        break

      #For each data point in geonames such as city/state/country/zip
      for(place in names(geonames)) {

        #Run grepl() once per field; the result can be NA when the geonames value is missing
        found = grepl(pattern=toupper(geonames[z, place]), x=toupper(dirty_data[i ,addr]), fixed=TRUE)

        #if city or state or zip or country name/code is found in address
        #isTRUE() treats an NA result as no match
        if(isTRUE(found)) {
          matches = matches + 1
        }

        #If all the geonames columns have been checked, determine if match found
        if(place == names(geonames)[length(names(geonames))]) {
          if(matches > 1) {
            print(paste('Match found z=', z, ' i=', i, ' addr=', addr, sep=''))
            dirty_data[i, paste(addr, 'address_city', sep='_')] <- geonames[z, 'place_name']
            dirty_data[i, paste(addr, 'address_state', sep='_')] <- geonames[z, 'admin_name1']
            dirty_data[i, paste(addr, 'address_postal_code', sep='_')] <- geonames[z, 'postal_code']
            dirty_data[i, paste(addr, 'address_country', sep='_')] <- geonames[z, 'country_name']
            stopLooking = TRUE
          }
        }
      }
    }
  }
}

Sample geonames data:

country_code <- c('US', 'US', 'US', 'US', 'US', 'US')
postal_code <- c(99553, 99571, 99583, 99612, 99661, 99546)
place_name <- c('Akutan', 'Cold Bay', 'False Pass', 'King Cove', 'Sand Point', 'Adak')
admin_code1 <- c('AK', 'AK', 'AK', 'AK', 'AK', 'AK')
admin_name1 <- c('Alaska', 'Alaska', 'Alaska', 'Alaska', 'Alaska', 'Alaska')
country_name <- c('United States', 'United States', 'United States', 'United States', 'United States', 'United States')

geonames <- data.frame(country_code, postal_code, place_name, admin_code1, admin_name1, country_name)

Sample dirty data:

addressColumn1 <- c('1 Salmon Ln, Akutan, AK 99553', 'Sandon Rd, Stone ST15 8XS, UK',',,,,,,London')
addressColumn2 <- c(' Mexico', 'Holger-Börner-Platz 1, 34119 Kassel, Germany', 'Atlanta, GA')
addressColumn3 <- c('London', 'Iceland', '123 Madeup Street, Minneapolis')

dirty_data <- data.frame(addressColumn1, addressColumn2, addressColumn3)
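One possible way to drop the loop over the dirty rows, as a rough sketch rather than a tested solution: count, for every address and every geonames row, how many geonames fields appear in the address, then take the first geonames row with at least two hits. count_hits() is a hypothetical helper name, and the data below is a cut-down copy of the samples above with a single address column. It still calls grepl() once per geonames value, so only the loop over dirty rows disappears.

```r
# Cut-down copies of the sample data (stringsAsFactors = FALSE keeps text as character)
geonames <- data.frame(
  country_code = c("US", "US"),
  postal_code  = c(99553, 99571),
  place_name   = c("Akutan", "Cold Bay"),
  admin_code1  = c("AK", "AK"),
  admin_name1  = c("Alaska", "Alaska"),
  country_name = c("United States", "United States"),
  stringsAsFactors = FALSE
)
dirty_data <- data.frame(
  addressColumn1 = c("1 Salmon Ln, Akutan, AK 99553",
                     "Sandon Rd, Stone ST15 8XS, UK",
                     ",,,,,,London"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a matrix with one row per address and one
# column per geonames row, holding the number of geonames fields found
count_hits <- function(addr, geo) {
  addr <- toupper(addr)
  counts <- matrix(0L, nrow = length(addr), ncol = nrow(geo))
  for (place in names(geo)) {
    patterns <- toupper(as.character(geo[[place]]))
    found <- vapply(patterns, function(p) {
      # A missing geonames value never counts as a match
      if (is.na(p)) rep(FALSE, length(addr)) else grepl(p, addr, fixed = TRUE)
    }, logical(length(addr)), USE.NAMES = FALSE)
    counts <- counts + matrix(found, nrow = length(addr))
  }
  counts
}

counts <- count_hits(dirty_data$addressColumn1, geonames)

# First geonames row with two or more matching fields, NA when there is none
best <- apply(counts >= 2, 1, function(row) which(row)[1])

# Fill one output column for the matched rows only
matched <- !is.na(best)
dirty_data$addressColumn1_address_city <- NA_character_
dirty_data$addressColumn1_address_city[matched] <- geonames$place_name[best[matched]]
print(dirty_data)
```

With the full geonames table the counts matrix has one column per geonames row, so memory use grows with nrow(dirty_data) * nrow(geonames); processing the geonames rows in chunks would be one way around that.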

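Sample output:

The code above works and produces the desired output, which fills in columns of the existing data frame. I've added more detail below.

The matched information is written to the columns at the bottom of the innermost for loop (the output columns are named column name + _address_city, and so on).

For example, dirty_data[1, 'addressColumn1'] == '1 Salmon Ln, Akutan, AK 99553' is matched by geonames[1, 'postal_code'] == 99553, geonames[1, 'place_name'] == 'Akutan' and geonames[1, 'admin_code1'] == 'AK'.

Because that is two or more matches, the columns are filled in (the column names are built with the paste command; they are written out here as they actually appear, for simplicity):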

dirty_data[1, 'addressColumn1_address_city'] <- 'Akutan'
dirty_data[1, 'addressColumn1_address_state'] <- 'Alaska'
dirty_data[1, 'addressColumn1_address_postal_code'] <- 99553
dirty_data[1, 'addressColumn1_address_country'] <- 'United States'
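As a small sketch of how those assignments could be written in one step: paste() can build all the output column names for an address column at once, and a single data frame assignment can fill them. The suffixes are the ones used in the loop code; the one-row geonames and two-row dirty_data below are trimmed-down stand-ins for the samples.

```r
# Trimmed-down stand-ins for the sample data
geonames <- data.frame(
  postal_code  = 99553,
  place_name   = "Akutan",
  admin_name1  = "Alaska",
  country_name = "United States",
  stringsAsFactors = FALSE
)
dirty_data <- data.frame(
  addressColumn1 = c("1 Salmon Ln, Akutan, AK 99553", "Atlanta, GA"),
  stringsAsFactors = FALSE
)

addr <- "addressColumn1"
# geonames source columns and the matching output suffixes, in the same order
src      <- c("place_name", "admin_name1", "postal_code", "country_name")
suffixes <- c("address_city", "address_state", "address_postal_code", "address_country")
new_cols <- paste(addr, suffixes, sep = "_")

# Assign all four values for row 1 at once; the new columns are created
# on the fly and rows without a match are left as NA
dirty_data[1, new_cols] <- geonames[1, src]
print(dirty_data)
```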

0 answers:

No answers yet