Question

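My cluster needs to read some input files located in my Azure storage. I submit my .jar to the cluster through Livy, but the job always dies because it cannot find my files -> User class threw exception: java.io.FileNotFoundException. What am I missing? I don't want to open the files with sc.textFile, because that turns them into RDDs and I need to keep their original structure.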

- Use keyword searches to produce a few thousand examples of your two categories.
- Put those sentences in a file, each with a label, in the OpenNLP format (label, a space, the sentence, a newline).
- Train a classifier with the OpenNLP DocumentClassifier; I recommend stemming for one of your feature generators.
- Once you have the model, use it in Java to classify each sentence.
- Keep track of the scores and quarantine the low ones (you will have ambiguous classes, I'm sure).
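
The file format and the quarantine step in the list above can be sketched in plain Scala. This is only an illustration: `LabeledSentences`, `Scored`, `toTrainingLine`, and `quarantine` are hypothetical names, no OpenNLP calls are made, and the threshold is something you would have to tune yourself.

```scala
object LabeledSentences {
  // A classified sentence with the best label and its score (hypothetical type).
  final case class Scored(sentence: String, label: String, score: Double)

  /** Builds one training line: label, a single space, then the sentence on one line. */
  def toTrainingLine(label: String, sentence: String): String = {
    require(label.nonEmpty && !label.exists(_.isWhitespace), "label must be a single token")
    // Collapse embedded newlines and repeated spaces so each example stays on one line.
    val clean = sentence.replaceAll("\\s+", " ").trim
    s"$label $clean"
  }

  /** Splits results into (accepted, quarantined) using a minimum score. */
  def quarantine(results: Seq[Scored], minScore: Double): (Seq[Scored], Seq[Scored]) =
    results.partition(_.score >= minScore)
}
```

Writing the training file is then a matter of mapping `toTrainingLine` over your labeled examples and joining the lines with newlines.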

I think I'm trying to read from the wrong place, or in the wrong way. Any ideas?

Thanks!

Answer 1

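From your description, my understanding is that you want to load plain-text files from Azure Storage using Scala running on HDInsight.

In my experience, there are two ways to do this.

- Use the Azure Storage SDK for Java from Scala to get the content of the text blob. See the tutorial How to use Blob storage from Java; rewriting its sample code in Scala should be straightforward.
- Use the Hadoop FileSystem API, backed by the Hadoop Azure Support library, to load the file data. See the Hadoop wiki example at https://wiki.apache.org/hadoop/HadoopDfsReadWriteExample as a starting point for writing the code in Scala.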
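
A rough sketch of the second option follows. It assumes the hadoop-azure connector and the storage credentials are already configured on the cluster, which is typically the case on HDInsight for the cluster's default storage account. `WasbRead` and `readText` are illustrative names, not part of any library. One possible cause of the FileNotFoundException is opening a wasb/wasbs path with java.io or scala.io.Source, which only look at the local disk of the node running the code; going through the Hadoop FileSystem API lets the URI scheme pick the right storage backend.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

object WasbRead {
  /** Reads a whole file through the Hadoop FileSystem API and returns it as a UTF-8 string. */
  def readText(uri: String, conf: Configuration = new Configuration()): String = {
    val path = new Path(uri)
    // The URI scheme (wasbs://, hdfs://, file://) selects the FileSystem implementation.
    val fs = path.getFileSystem(conf)
    val in = fs.open(path)
    try {
      val out = new ByteArrayOutputStream()
      // Copy the stream into memory; `false` leaves closing to the finally block.
      IOUtils.copyBytes(in, out, 4096, false)
      new String(out.toByteArray, StandardCharsets.UTF_8)
    } finally {
      in.close()
    }
  }
}
```

Inside a Spark job you would probably pass `sc.hadoopConfiguration` so the cluster's storage settings are picked up, for example `WasbRead.readText("wasbs://<container>@<account>.blob.core.windows.net/<path>", sc.hadoopConfiguration)`, where the container, account, and path are placeholders. This reads the entire file into driver memory, so it only suits files that fit there.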

Reading files in Azure wasbs from a Scala application

1 answer: