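I'm sure I'm not using (or not understanding) some of R's vectorization features. This code works, but as you can probably guess, it is extremely slow.
The code identifies city, state, postal code and/or country names in 3 different dirty address columns and parses them into separate city/state/country/postal columns. To do that I got the geonames dataset of world cities, and I'm using grepl to slowly and painfully loop over several address columns and several columns of geonames data.
I know R is built for working with large datasets and that iterating like this is probably very, very wrong, but I'm not sure how else to implement something like this. Once you've finished laughing at the monster I've created, I'd appreciate some suggestions on how to fix it.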
addresses <- c('addressColumn1', 'addressColumn2', 'addressColumn3')
# Loop through each row of the dirty dataset
for (i in seq_len(nrow(dirty_data))) {
  # Used to move on to the next row once a match is found
  stopLooking <- FALSE
  # Loop through each of the address columns in the dirty dataset
  for (addr in addresses) {
    # Loop through each row of the geonames data
    for (z in seq_len(nrow(geonames))) {
      # Skip the remaining geonames rows if a match has been found
      if (stopLooking)
        break
      # Count hits per geonames row so hits from earlier rows do not carry over
      matches <- 0
      # For each data point in geonames such as city/state/country/zip
      for (place in names(geonames)) {
        # TRUE if the city, state, zip or country name/code occurs in the address
        hit <- grepl(pattern = toupper(geonames[z, place]), x = toupper(dirty_data[i, addr]), fixed = TRUE)
        # grepl() returns NA for an NA pattern; treat that as no match
        if (!is.na(hit) && hit) {
          matches <- matches + 1
        }
      }
      # All geonames columns have been checked: two or more hits count as a match
      if (matches > 1) {
        print(paste('Match found z=', z, ' i=', i, ' addr=', addr, sep = ''))
        dirty_data[i, paste(addr, 'address_city', sep = '_')] <- as.character(geonames[z, 'place_name'])
        dirty_data[i, paste(addr, 'address_state', sep = '_')] <- as.character(geonames[z, 'admin_name1'])
        dirty_data[i, paste(addr, 'address_postal_code', sep = '_')] <- geonames[z, 'postal_code']
        dirty_data[i, paste(addr, 'address_country', sep = '_')] <- as.character(geonames[z, 'country_name'])
        stopLooking <- TRUE
      }
    }
  }
}
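One direction the innermost work could take, as a rough sketch and not a definitive fix: for a single address, test each distinct geonames value only once per column and then map the result back to every geonames row, instead of calling grepl for every row/column pair. The names `count_matches` and `geo` are hypothetical and introduced only for illustration, and `geo` is just two rows of the sample data. Unlike the loop, this sketch also ignores empty strings, not only NA values.

```r
# Hypothetical helper: for one address string, count how many fields of each
# geonames row occur in it as a case-insensitive fixed substring.
count_matches <- function(address, geo) {
  addr <- toupper(address)
  counts <- integer(nrow(geo))
  for (col in names(geo)) {
    values <- toupper(as.character(geo[[col]]))
    # Distinct, non-missing, non-empty values of this column
    uniq <- unique(values[!is.na(values) & nzchar(values)])
    # Search for each distinct value once; repeated values such as 'US' or
    # 'ALASKA' are looked up afterwards instead of being searched for again.
    found <- vapply(uniq, function(p) grepl(p, addr, fixed = TRUE), logical(1))
    counts <- counts + (values %in% uniq[found])
  }
  counts
}

# Two rows taken from the sample geonames data (assumed layout)
geo <- data.frame(
  country_code = c('US', 'US'),
  postal_code  = c(99553, 99571),
  place_name   = c('Akutan', 'Cold Bay'),
  admin_code1  = c('AK', 'AK'),
  admin_name1  = c('Alaska', 'Alaska'),
  country_name = c('United States', 'United States'),
  stringsAsFactors = FALSE
)

counts <- count_matches('1 Salmon Ln, Akutan, AK 99553', geo)
# First geonames row with two or more hits, or NA if there is none
best_row <- which(counts > 1)[1]
```

With these two rows, `counts` is 3 for Akutan and 1 for Cold Bay (only 'AK' matches), so `best_row` is 1. The outer loops over dirty rows and address columns would still be needed around this.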
Sample geonames data:
country_code <- c('US', 'US', 'US', 'US', 'US', 'US')
postal_code <- c(99553, 99571, 99583, 99612, 99661, 99546)
place_name <- c('Akutan', 'Cold Bay', 'False Pass', 'King Cove', 'Sand Point', 'Adak')
admin_code1 <- c('AK', 'AK', 'AK', 'AK', 'AK', 'AK')
admin_name1 <- c('Alaska', 'Alaska', 'Alaska', 'Alaska', 'Alaska', 'Alaska')
country_name <- c('United States', 'United States', 'United States', 'United States', 'United States', 'United States')
geonames <- data.frame(country_code, postal_code, place_name, admin_code1, admin_name1, country_name, stringsAsFactors = FALSE)
Sample dirty data:
addressColumn1 <- c('1 Salmon Ln, Akutan, AK 99553', 'Sandon Rd, Stone ST15 8XS, UK', ',,,,,,London')
addressColumn2 <- c(' Mexico', 'Holger-Börner-Platz 1, 34119 Kassel, Germany', 'Atlanta, GA')
addressColumn3 <- c('London', 'Iceland', '123 Madeup Street, Minneapolis')
dirty_data <- data.frame(addressColumn1, addressColumn2, addressColumn3, stringsAsFactors = FALSE)
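Sample output:
The code above works and produces the desired output, which populates columns in the existing data frame. I've added more detail below.
The columns are filled in with the matched information near the bottom of the nested for loops (they are named as the address column name + _address_city, etc.).
For example, dirty_data[1, 'addressColumn1'] == '1 Salmon Ln, Akutan, AK 99553'
matches geonames[1, 'postal_code'] == 99553,
geonames[1, 'place_name'] == 'Akutan',
and geonames[1, 'admin_code1'] == 'AK'.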
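As a quick sanity check on that example, here is a minimal sketch of the same substring test applied to one geonames row. The `fields` vector is simply row 1 of the sample geonames data typed out by hand, and `hits` is a name introduced for illustration.

```r
# Row 1 of the sample geonames data, written out by hand (illustrative only)
fields <- c(country_code = 'US', postal_code = '99553', place_name = 'Akutan',
            admin_code1 = 'AK', admin_name1 = 'Alaska',
            country_name = 'United States')
address <- toupper('1 Salmon Ln, Akutan, AK 99553')

# Same case-insensitive fixed-substring test the loop performs
hits <- vapply(toupper(fields),
               function(p) grepl(p, address, fixed = TRUE),
               logical(1))

names(hits)[hits]  # the fields that were found in the address
sum(hits)          # number of matching fields
```

Here postal_code, place_name and admin_code1 are found, giving 3 hits; 'US', 'Alaska' and 'United States' do not occur in the address.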
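Since that is two or more matches, it populates the columns (the column names are built with paste; they are written out in full here to keep things simple):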
dirty_data[1, 'addressColumn1_address_city'] <- 'Akutan'
dirty_data[1, 'addressColumn1_address_state'] <- 'Alaska'
dirty_data[1, 'addressColumn1_address_postal_code'] <- 99553
dirty_data[1, 'addressColumn1_address_country'] <- 'United States'
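For reference, the names of the columns the loop writes come from the paste calls in the code. A small sketch, where `suffixes` and `new_cols` are names introduced only for illustration:

```r
# Suffixes used by the paste() calls in the loop
suffixes <- c('address_city', 'address_state',
              'address_postal_code', 'address_country')

# paste() is vectorised, so one call builds all four names for a column
new_cols <- paste('addressColumn1', suffixes, sep = '_')
print(new_cols)
```

This yields addressColumn1_address_city, addressColumn1_address_state, addressColumn1_address_postal_code and addressColumn1_address_country; the same pattern applies to addressColumn2 and addressColumn3.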