Geocoding IP addresses in R

Time: 2017-08-14 12:40:04

Tags: r ip rcurl geocode

I put together this short piece of code to geocode IP addresses automatically using freegeoip.net (15,000 queries per hour by default; excellent service!):

> library(RCurl)
Loading required package: bitops
> ip.lst = 
c("193.198.38.10","91.93.52.105","134.76.194.180","46.183.103.8")
> q = do.call(rbind, lapply(ip.lst, function(x){ 
  try( data.frame(t(strsplit(getURI(paste0("freegeoip.net/csv/", x)), ",")[[1]]), stringsAsFactors = FALSE) ) 
}))
> names(q) = c("ip","country_code","country_name","region_code","region_name","city","zip_code","time_zone","latitude","longitude","metro_code")
> str(q)
'data.frame':   4 obs. of  11 variables:
$ ip          : chr  "193.198.38.10" "91.93.52.105" "134.76.194.180" "46.183.103.8"
$ country_code: chr  "HR" "TR" "DE" "DE"
$ country_name: chr  "Croatia" "Turkey" "Germany" "Germany"
$ region_code : chr  "" "06" "NI" ""
$ region_name : chr  "" "Ankara" "Lower Saxony" ""
$ city        : chr  "" "Ankara" "Gottingen" ""
$ zip_code    : chr  "" "06450" "37079" ""
$ time_zone   : chr  "Europe/Zagreb" "Europe/Istanbul" "Europe/Berlin" ""
$ latitude    : chr  "45.1667" "39.9230" "51.5333" "51.2993"
$ longitude   : chr  "15.5000" "32.8378" "9.9333" "9.4910"
$ metro_code  : chr  "0\r\n" "0\r\n" "0\r\n" "0\r\n"

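In three lines of code you get the coordinates for all the IPs, including city, country, and region codes. I wonder whether this can be parallelized so that it runs faster? Geocoding more than 10,000 IPs could otherwise take hours.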
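One side note on the `str()` output above: every column comes back as character, and `metro_code` still carries the trailing `"\r\n"` from the CSV response. The following is a minimal cleanup sketch, not part of the original code. It rebuilds a small slice of `q` by hand (values copied from the output above) so that it runs without network access, and the choice of which columns to treat as numeric is an assumption.

```r
# Rebuild a small slice of the result by hand
# (values copied from the str() output above; no network call needed)
q <- data.frame(
  ip         = c("193.198.38.10", "91.93.52.105"),
  latitude   = c("45.1667", "39.9230"),
  longitude  = c("15.5000", "32.8378"),
  metro_code = c("0\r\n", "0\r\n"),
  stringsAsFactors = FALSE
)

# Strip leading/trailing whitespace, including the trailing CRLF, from every column
q[] <- lapply(q, trimws)

# Convert the numeric-looking columns from character to numeric
# (assumption: these three columns are the ones that should be numeric)
num_cols <- c("latitude", "longitude", "metro_code")
q[num_cols] <- lapply(q[num_cols], as.numeric)

str(q)
```

The same two steps should apply unchanged to the full 11-column data frame once the names have been set.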

2 Answers:

Answer 0: (score: 3)

library(rgeolocate)

ip_lst = c("193.198.38.10", "91.93.52.105", "134.76.194.180", "46.183.103.8")

maxmind(ip_lst, "~/Data/GeoLite2-City.mmdb", 
        fields=c("country_code", "country_name", "region_name", "city_name", 
                 "timezone", "latitude", "longitude"))

##   country_code country_name            region_name  city_name        timezone latitude longitude
## 1           HR      Croatia                   <NA>       <NA>   Europe/Zagreb  45.1667   15.5000
## 2           TR       Turkey               Istanbul   Istanbul Europe/Istanbul  41.0186   28.9647
## 3           DE      Germany           Lower Saxony Bilshausen   Europe/Berlin  51.6167   10.1667
## 4           DE      Germany North Rhine-Westphalia     Aachen   Europe/Berlin  50.7787    6.1085

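The package includes instructions for getting the necessary data files. Some of the fields you are pulling are very inaccurate (more so than any geoip vendor would like to admit). If you do need fields that are not available, file an issue and we will add them.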

Answer 1: (score: 2)

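I found multidplyr to be a great package for making parallel server calls. This is the best guide I have found, and I highly recommend reading the whole article to better understand how the package works: http://www.business-science.io/code-tools/2016/12/18/multidplyr.html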

library("devtools")
devtools::install_github("hadley/multidplyr")
library(parallel)
library(multidplyr)
library(RCurl)
library(tidyverse)

# Convert your example into a function
get_ip <- function(ip) {
  do.call(rbind, lapply(ip, function(x) {
    try(data.frame(t(strsplit(getURI(
      paste0("freegeoip.net/csv/", x)
    ), ",")[[1]]), stringsAsFactors = FALSE))
  })) %>% nest(X1:X11)
}

# Made ip.lst into a Tibble to make it work better with dplyr
ip.lst =
  tibble(
    ip = c(
      "193.198.38.10",
      "91.93.52.105",
      "134.76.194.180",
      "46.183.103.8",
      "193.198.38.10",
      "91.93.52.105",
      "134.76.194.180",
      "46.183.103.8"
    )
  )

# Create a cluster based on how many cores your machine has
cl <- detectCores()
cluster <- create_cluster(cores = cl)

# Create a partitioned tibble
by_group  <- partition(ip.lst, cluster = cluster)

# Send libraries and the function get_ip() to each cluster
by_group %>%
  cluster_library("tidyverse") %>%
  cluster_library("RCurl") %>%
  cluster_assign_value("get_ip", get_ip)

# Send parallel requests to the website and parse the results
q <- by_group %>%
  do(get_ip(.$ip)) %>% 
  collect() %>% 
  unnest() %>% 
  tbl_df() %>% 
  select(-PARTITION_ID)

# Set names of the results
names(q) = c(
  "ip",
  "country_code",
  "country_name",
  "region_code",
  "region_name",
  "city",
  "zip_code",
  "time_zone",
  "latitude",
  "longitude",
  "metro_code"
)
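
For comparison, the same fan-out can probably be sketched with only the `parallel` package that ships with R, without multidplyr. This is a rough sketch under stated assumptions, not a tested replacement for the code above: `lookup_ip()` is a hypothetical stand-in that does no network I/O so the example runs offline, and the worker count of 2 is arbitrary. In real use its body would be the `getURI()`/`strsplit()` call from the question, and the service's hourly query limit would still apply.

```r
library(parallel)

# Hypothetical stand-in for the real HTTP lookup, so the sketch runs offline.
# In practice the body would be the getURI()/strsplit() call from the question.
lookup_ip <- function(ip) {
  data.frame(ip = ip, n_chars = nchar(ip), stringsAsFactors = FALSE)
}

ips <- c("193.198.38.10", "91.93.52.105", "134.76.194.180", "46.183.103.8")

# Two workers is an arbitrary choice; network-bound work is not tied to core count
cl <- makeCluster(2)

# parLapply() splits `ips` across the workers and returns results in input order
res <- parLapply(cl, ips, lookup_ip)

# Always shut the workers down when finished
stopCluster(cl)

# Bind the per-IP one-row data frames into a single data frame
out <- do.call(rbind, res)
print(out)
```

If the real lookup uses RCurl, each worker would also need the package loaded, for example with `clusterEvalQ(cl, library(RCurl))` before the `parLapply()` call.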