Question

I have a table column containing a set of IP addresses, and I need to find the region/continent for each one, as shown below.

------------------------------------------------------
ip_address      |    region
------------------------------------------------------
217.100.34.222  |   North Holland

For this, I downloaded an IP-Country-Region-City database from ip2location.com, but its table and values look like this.

-----------------------------------------------------
ip_from  | ip_to  | country_code  |  country_name  | region_name  |  city_name
-----------------------------------------------------
16777216 | 16777471 | AU          | Australia      | Queensland   | Brisbane

How can I convert my ip_address column to a decimal number, as used in the ip2location database, and retrieve the data from it? Or is there a better way to retrieve the geolocation from an IP address?

Thanks.

Answer 1

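Is there a better way to retrieve geolocation from IP addresses using SparkSQL?

Option 1:

One way is the approach Databricks describes for advertising analytics. See the full article: an-illustrated-guide-to-advertising-analytics.html

Call a web service directly from Spark: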

from urllib.request import urlopen

# Obtain the distinct IP addresses from the accesslog table
ipaddresses = sqlContext.sql(
    "select distinct ip1 from accesslog where ip1 is not null").rdd

# getCCA2: obtains the two-letter country code for an IP address
def getCCA2(ip):
    url = 'http://freegeoip.net/csv/' + ip
    response = urlopen(url).read().decode('utf-8')
    return response.split(",")[1]

# Map each distinct IP address to its two-letter country code
mappedIPs = ipaddresses.map(lambda x: (x[0], getCCA2(x[0])))

The two-letter country codes can be expanded later through a lookup.
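As a minimal sketch of that lookup step, a plain dictionary keyed by country code may be enough. The names COUNTRY_NAMES and expand_country_code, the three entries, and the sample pairs are hypothetical; a real table would cover every ISO 3166-1 alpha-2 code.

```python
# Hypothetical lookup table (illustrative subset, not a complete list)
COUNTRY_NAMES = {
    "AU": "Australia",
    "NL": "Netherlands",
    "US": "United States",
}

def expand_country_code(code):
    """Return the country name for a two-letter code, or 'Unknown' if absent."""
    return COUNTRY_NAMES.get(code.strip().upper(), "Unknown")

# Sample (ip, country_code) pairs shaped like the mappedIPs output (assumed data)
mapped = [("217.100.34.222", "NL"), ("1.0.0.1", "AU")]
expanded = [(ip, cc, expand_country_code(cc)) for ip, cc in mapped]
```

In Spark, the same dictionary could presumably be broadcast and applied inside a map over the RDD.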

Option 2: A Hive table approach, sketched in Scala pseudocode (instead of the web service approach).

Load the downloaded data into a Hive table.

// Load the IP addresses and the country-to-IP mapping from Hive
val ipsdf = hiveContext.sql("select ip from iptable")
val countriesWithIp = hiveContext.sql("select countryname, ip from countriesWithIPs")

// Inner join on the ip column to attach a country name to each address
val countriesWithIpAddrMapped = ipsdf.join(countriesWithIp, ipsdf("ip") === countriesWithIp("ip"), "inner")

countriesWithIpAddrMapped.show()
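On the conversion the question asks about: the ip_from and ip_to values in the ip2location table appear to be the IPv4 address read as a 32-bit integer (a.b.c.d becomes a*256^3 + b*256^2 + c*256 + d). Below is a rough sketch in plain Python, assuming that encoding. The helper names and the second range row are made up for illustration; only the Queensland row comes from the question.

```python
import bisect
import ipaddress

def ip_to_int(ip):
    """Convert a dotted IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))

# Rows in the ip2location layout: (ip_from, ip_to, region_name), sorted by ip_from.
# The second row is a hypothetical range chosen to cover the question's sample IP.
RANGES = [
    (16777216, 16777471, "Queensland"),
    (3647218176, 3647218431, "North Holland"),
]
STARTS = [row[0] for row in RANGES]

def lookup_region(ip):
    """Find the range containing ip via binary search; None if no range matches."""
    n = ip_to_int(ip)
    i = bisect.bisect_right(STARTS, n) - 1
    if i >= 0 and n <= RANGES[i][1]:
        return RANGES[i][2]
    return None
```

With the converted value in a column, the match could also be expressed in SparkSQL as a join on a BETWEEN ip_from AND ip_to condition rather than an equality on the raw address.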
