Question

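I am writing a MapReduce job to analyze web logs. My code maps IP addresses to geographic locations, and I am using the Maxmind Geo API (https://github.com/maxmind/geoip-api-java) for this. The code creates a LookupService, which needs the database file containing the IP-to-location mappings. I am trying to pass this database file to the tasks through the distributed cache, and I have tried two different approaches.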

Case 1:

Running the job with the file passed from HDFS, but it always throws an error saying "FILE NOT FOUND".

sudo -u hdfs hadoop jar \
 WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
/user/hdfs/GeoLiteCity.dat

OR

sudo -u hdfs hadoop jar \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
hdfs://sandbox.hortonworks.com:8020/user/hdfs/GeoLiteCity.dat

Driver class code:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.addCacheFile(new Path(args[2]).toUri());

Mapper class code:

public void setup(Context context) throws IOException
{
URI[] uriList = context.getCacheFiles();
Path database_path = new Path(uriList[0].toString());
LookupService cl = new LookupService(database_path.toString(),
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

Case 2:

Running the code with the file passed from the local file system via the -files option. Error: a NullPointerException on the LookupService line, cl = new LookupService(database_path)

sudo -u hdfs hadoop jar  \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
com.prithvi.mapreduce.logprocessing.ipgeo.GeoLocationDatasetDriver \
-files /tmp/jobs/GeoLiteCity.dat /user/hdfs/input /user/hdfs/out_put \
GeoLiteCity.dat

Driver code:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
String dbfile = args[2];
conf.set("maxmind.geo.database.file", dbfile);

Mapper code:

public void setup(Context context) throws IOException
{
  Configuration conf = context.getConfiguration();
  String database_path = conf.get("maxmind.geo.database.file");
  LookupService cl = new LookupService(database_path,
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

I need this database file to be available on all task trackers for the job to work. Can anyone suggest the correct way to do this?

Answer 1

Try doing it like this:

From the driver, specify the location of the file in HDFS using the Job object:

job.addCacheFile(new URI("hdfs://localhost:8020/GeoLite2-City.mmdb#GeoLite2-City.mmdb"));

Here, the part after # is the alias (symbolic link) that Hadoop creates for the file in the task's working directory.

After that, you can access the file from the Mapper in the setup() method, using the alias as a local file name:

@Override
protected void setup(Context context) {
  File file = new File("GeoLite2-City.mmdb");
}
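
To make the role of the # alias concrete, here is a minimal sketch in plain Java (no Hadoop classes) of how a cache URI could be turned into the local name to open in setup(). The class CacheLinkName and the method localName are hypothetical names introduced for illustration. The fallback to the last path component when there is no # fragment is an assumption based on how Hadoop 2 (YARN) names the link; check it against your Hadoop version.

```java
import java.net.URI;

public class CacheLinkName {

    // Hypothetical helper: returns the name under which a cache file is
    // expected to appear in the task's working directory.
    // If the URI has a "#alias" fragment, the alias is used; otherwise we
    // assume (Hadoop 2 / YARN behavior) the last path component is used.
    static String localName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = cacheUri.getPath();
        if (path == null || path.isEmpty()) {
            throw new IllegalArgumentException("URI has no path: " + cacheUri);
        }
        // Strip everything up to and including the last '/'.
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        URI withAlias = URI.create(
                "hdfs://localhost:8020/GeoLite2-City.mmdb#GeoLite2-City.mmdb");
        URI withoutAlias = URI.create(
                "hdfs://sandbox.hortonworks.com:8020/user/hdfs/GeoLiteCity.dat");

        System.out.println(localName(withAlias));
        System.out.println(localName(withoutAlias));
    }
}
```

In the mapper from the question, the value returned for context.getCacheFiles()[0] would be passed to new LookupService(...) in place of the full HDFS URI string. LookupService opens a local file, so handing it an hdfs:// path is a likely cause of the "FILE NOT FOUND" error in Case 1.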

Here is an example:

Driver code: http://goo.gl/COqysa
Mapper code: http://goo.gl/0SbQQP

Accessing the Maxmind Geo API in Hadoop using the distributed cache
