Question

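We are working on a POC for HDInsight. I am new to this technology. What we are doing is trying to send some data to Azure and write some Hive queries. We have the first part working: we can push some test data to an Azure blob using AzCopy. (I know there are Azure tables and Azure queues too.) But for the POC, Azure blob is fine.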

We can talk to this blob using Visual Studio. However, we also want to check out HDInsight and its MapReduce capabilities.

With this background, here are a few questions:

 1. Do I need to copy data from Azure Blob to anywhere else to write
    Hive queries in Ambari, or can Ambari talk directly to data stored
    in Azure Blob?
 2. Is this the right way to process data (keep the data in
    Azure Blob, and use HDInsight/Ambari to process it)?
 3. If point 2 is correct, that means HDInsight is used only for
    parallel processing with its MapReduce feature. Is this correct?

Thanks a lot for any insights.

Answer 1

Yes, HDInsight can read data stored in Blob storage. Examples:

https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-linux-tutorial-get-started

https://blogs.msdn.microsoft.com/azuredatalake/2017/04/06/azure-hdinsight-3-6-five-things-that-will-make-data-developer-happy/
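To make the "query it in place" idea concrete, here is a minimal Python sketch that builds a `wasbs://` URI for a blob folder and the HiveQL for an external table pointing at it. The container, account, table, and column names are all made up for illustration, and it assumes the files are comma-delimited text and that the storage account is attached to the cluster. The resulting statement could then be pasted into the Hive view in Ambari or run with any Hive client.

```python
def wasbs_uri(container, account, path=""):
    """Build a wasbs:// URI for a folder in an Azure Storage blob container."""
    return "wasbs://{}@{}.blob.core.windows.net/{}".format(
        container, account, path.lstrip("/")
    )


def external_table_ddl(table, columns, location):
    """Return HiveQL that maps an external table onto delimited text files.

    `columns` is a list of (name, hive_type) pairs. The comma delimiter and
    TEXTFILE format are assumptions about the uploaded test data.
    """
    cols = ",\n  ".join("{} {}".format(name, hive_type) for name, hive_type in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {} (\n  {}\n)\n".format(table, cols)
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        + "STORED AS TEXTFILE\n"
        + "LOCATION '{}';".format(location)
    )


# Hypothetical names: replace with your own container, account, and schema.
location = wasbs_uri("poc-data", "mystorageacct", "/sales/")
ddl = external_table_ddl("sales_raw", [("id", "INT"), ("amount", "DOUBLE")], location)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the table definition and leaves the files in the blob container untouched, which suits a POC where the data is uploaded separately with AzCopy.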

Yes, depending on your purpose, you can use Spark, MR, Pig, or Hive to process the data. A good starting point is https://www.edx.org/course/processing-big-data-with-hadoop-in-azure-hdinsight

3: Yes, it processes the data using a distributed framework such as Spark, MapReduce, Hive, or Pig.

HDInsight and Hive queries
