Question

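I have a lot of Avro files stored in Azure Blob storage that contain IP addresses (web logs). I want to map the IPs to locations. How can I do this with Azure Data Lake Analytics (ADLA)?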

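Right now I have a Spark job that uses the MaxMind IP database and a Java library. The library reads a 113 MB .mmdb file containing all the IP-to-location data to perform the lookup. I am now investigating whether this job can be moved to ADLA.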

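MaxMind also provides a C# library, so that part is no problem. However, it is not obvious to me how to handle the large .mmdb file, which has to be read once and then used for lookups. Obviously, reading the file for every IP lookup would not be fast. How can this (and similar cases) be handled in ADLA, or is ADLA not suited to this kind of operation?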

If I were running this as a normal program, I would do the lookup like this:

using (var reader = new Reader("GeoIP2-City.mmdb"))
{
    foreach(var ip in ips)
    {
        var data = reader.Find<Dictionary<string, object>>(ip);
        ...
    }
}

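The MaxMind database is available here: https://dev.maxmind.com/geoip/geoip2/downloadable/ (note that I have already purchased the database I am currently using), and the C# library for reading it is here: https://github.com/maxmind/MaxMind-DB-Reader-dotnet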

Answer 1

You can do this with U-SQL's DEPLOY RESOURCE statement and a UDO (user-defined operator).

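First, upload the file to your Data Lake Store. Then use DEPLOY RESOURCE to tell the U-SQL system to copy that file to every vertex the script runs on. Your script then uses C# code to read the file.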

DEPLOY RESOURCE "/helloworld.txt";

@departments =
  SELECT * 
  FROM (VALUES
      (31, "Sales"),
      (33, "Engineering"),
      (34, "Clerical"),
      (35, "Marketing")
    ) AS D( DepID, DepName );


@departments =
     PROCESS @departments
     PRODUCE DepID int,
             DepName string,
             HelloWorld string
     USING new Demo.HelloWorldProcessor();

OUTPUT @departments 
    TO "/departments.tsv"
    USING Outputters.Tsv();

Here is the U-SQL processor UDO.

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace Demo
{
    [SqlUserDefinedProcessor]
    public class HelloWorldProcessor : IProcessor
    {
        private string hw;

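        public HelloWorldProcessor()
        {
            // The file deployed with DEPLOY RESOURCE is available in the vertex's
            // working directory, so it is read by its bare file name, once, when
            // the processor is constructed rather than once per row.
            this.hw = System.IO.File.ReadAllText("helloworld.txt");
        }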

        public override IRow Process(IRow input, IUpdatableRow output)
        {
            output.Set<int>("DepID", input.Get<int>("DepID"));
            output.Set<string>("DepName", input.Get<string>("DepName"));
            output.Set<string>("HelloWorld", hw);
            return output.AsReadOnly();
        }
    }
}
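
The same pattern might carry over to the MaxMind case roughly as follows. This is only a sketch under several assumptions, not a tested implementation: the script deploys the database with something like DEPLOY RESOURCE "/GeoIP2-City.mmdb" (the path is hypothetical), the MaxMind.Db assembly has been registered in the ADLA catalog and referenced from the script, the input rowset has a string column named IP, and the record for an address contains a "country" map with an "iso_code" entry. The class name GeoIpProcessor and the CountryCode column are made up for illustration.

```csharp
using Microsoft.Analytics.Interfaces;
using MaxMind.Db;
using System.Collections.Generic;
using System.Net;

namespace Demo
{
    // Hypothetical processor: copies the IP column and adds a CountryCode column.
    [SqlUserDefinedProcessor]
    public class GeoIpProcessor : IProcessor
    {
        // Opened once per processor instance and reused for every row it handles.
        private readonly Reader reader;

        public GeoIpProcessor()
        {
            // Assumes DEPLOY RESOURCE copied the .mmdb file into the vertex's
            // working directory, so it can be opened by its bare file name.
            this.reader = new Reader("GeoIP2-City.mmdb");
        }

        public override IRow Process(IRow input, IUpdatableRow output)
        {
            string ip = input.Get<string>("IP");
            output.Set<string>("IP", ip);
            output.Set<string>("CountryCode", LookupCountry(ip));
            return output.AsReadOnly();
        }

        // Returns the ISO country code, or null if the address cannot be parsed
        // or the record does not have the assumed "country" / "iso_code" layout.
        private string LookupCountry(string ip)
        {
            IPAddress address;
            if (!IPAddress.TryParse(ip, out address)) return null;

            var data = this.reader.Find<Dictionary<string, object>>(address);
            if (data == null) return null;

            object country;
            if (!data.TryGetValue("country", out country)) return null;

            var countryMap = country as Dictionary<string, object>;
            if (countryMap == null) return null;

            object iso;
            return countryMap.TryGetValue("iso_code", out iso) ? iso as string : null;
        }
    }
}
```

In the script, the PROCESS statement would then name the new class, for example PRODUCE IP string, CountryCode string USING new Demo.GeoIpProcessor(). The point of the design is the one the question raises: the 113 MB file is opened in the constructor, not inside Process, so the cost of loading it is paid per processor instance instead of per IP lookup.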

Mapping IPs to locations using Azure Data Lake Analytics
