How do I extract nouns from German text using the StanfordNLP tools in Scala?

Asked: 2016-12-07 14:51:37

Tags: scala maven nlp stanford-nlp

I want to extract the nouns from German text with the StanfordNLP tools, so I added the dependency for the German models.

My dependencies:

    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.6.0</version>
    </dependency>

    <!-- BEGIN: NLP for German text -->
    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.6.0</version>
      <classifier>models-german</classifier>
    </dependency>
    <!-- END: NLP for German text -->

    <!-- Other dependencies -->

In my Scala class I want to extract the nouns from tweets. Here is the relevant code snippet:

    // Start a new processor to use the NLP tools
    val proc: Processor = new FastNLPProcessor

    // TODO: Explain the val doc
    val doc = proc.annotate(text)

    // String that collects the keywords we want to extract (e.g. nouns, whose tags start with "N")
    var keywords: String = ""

    // Iterate through each sentence
    for (sentence <- doc.sentences) {

      // i - the POS tag of the current word in the sentence
      // x - the position of that word in the sentence
      // E.g. for tweet text :: "new scala update xyz" -> first iteration: i = tag of "new" ; x = 0
      for ((i, x) <- sentence.tags.get.view.zipWithIndex) {
        // Noun tags start with "N"
        if (i.toString().startsWith("N")) {
          // Append the noun at position x to the keywords
          keywords = keywords + " " + sentence.words.array(x)
          // Print the current state of the keyword string
          println(keywords)
        }
      }
    }

But it only works for English text. These are my imports:

import org.clulab.processors.Processor
import org.clulab.processors.fastnlp.FastNLPProcessor
import play.api.libs.json._
import scala.util.parsing.json.JSONObject
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.mongodb.casbah.Imports._
import com.mongodb.casbah.MongoConnection
import com.mongodb.casbah.commons.MongoDBObject
import org.clulab.struct.DirectedGraphEdgeIterator

The complete Scala class:

object KeywordExtractor {

  // Creates a connection to the MongoDB client
  val mongoConn = MongoClient("localhost", 27017)

  // Names the DB where the data should be saved
  val mongoDB = mongoConn("dbtest")

  // Defines the collection in which the text should be stored
  val mongoColl = mongoDB("testcollection")

  def extractKey(tweet: String) = {

    // tweet is sent as a Json-String. jsonObject -> stores the sent tweet as a Json Object
    val jsonObject = Json.parse(tweet)

    // text String to store the text from the tweet
    var text = ""
    try {
      // try to parse the text from the jsonObject into the text variable
      text = (jsonObject \ "text").as[String]
    } catch {
      case e: JsResultException => println("Limit reached")
    }

    // Start a new processor to use the NLP tools
    val proc: Processor = new FastNLPProcessor

    // TODO: Explain the val doc
    val doc = proc.annotate(text)

    // String that collects the keywords we want to extract (e.g. nouns, whose tags start with "N")
    var keywords: String = ""

    // Iterate through each sentence
    for (sentence <- doc.sentences) {

      // i - the POS tag of the current word in the sentence
      // x - the position of that word in the sentence
      // E.g. for tweet text :: "new scala update xyz" -> first iteration: i = tag of "new" ; x = 0
      for ((i, x) <- sentence.tags.get.view.zipWithIndex) {
        // Noun tags start with "N"
        if (i.toString().startsWith("N")) {
          // Append the noun at position x to the keywords
          keywords = keywords + " " + sentence.words.array(x)
          // Print the current state of the keyword string
          println(keywords)
        }
      }
    }

    // Create a new MongoDB builder
    val builder = MongoDBObject.newBuilder

    // Add the keywords to the builder (this becomes a BSON document in which "text" is the key for the keywords value)
    builder += "text" -> keywords

    // TODO: .result not clear
    val newObj = builder.result

    // Insert the new object into MongoDB
    mongoColl.insert(newObj)

    // Clear the doc to avoid an out-of-memory error (doc allocates a lot of memory)
    doc.clear()
  }
}

1 answer:

Answer 0 (score: 0)

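As of August 2016, the CLU Lab Processors library does not appear to support multiple languages (see this issue). The authors have not prioritized adding new languages, but they are interested in adding them if someone else has the time.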

Note that you can tell the vanilla CoreNLP POS tagger to use the German models by changing the properties of the StanfordCoreNLP object.
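
A minimal sketch of that approach, assuming CoreNLP 3.6.0 with the models-german jar from the question on the classpath. The object name GermanNounExtractor and its two methods are hypothetical, and the pos.model path is my assumption about where the German tagger sits in that jar, so check it against the StanfordCoreNLP-german.properties file bundled there before relying on it:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// Hypothetical helper object; not part of CoreNLP or of the question's code.
object GermanNounExtractor {

  private val props = new Properties()
  // Only the annotators needed for POS tagging
  props.setProperty("annotators", "tokenize, ssplit, pos")
  // Use the German tokenizer settings
  props.setProperty("tokenize.language", "de")
  // Assumed model path inside the models-german jar; verify it for your version
  props.setProperty("pos.model",
    "edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger")

  // Loading the models is expensive, so the pipeline is built once, on first use
  private lazy val pipeline = new StanfordCoreNLP(props)

  // Same check as in the question: noun tags start with "N"
  def isNounTag(tag: String): Boolean = tag != null && tag.startsWith("N")

  // Returns the surface forms of all tokens whose POS tag starts with "N"
  def extractNouns(text: String): List[String] = {
    val document = new Annotation(text)
    pipeline.annotate(document)
    val nouns = for {
      sentence <- document.get(classOf[SentencesAnnotation]).asScala
      token    <- sentence.get(classOf[TokensAnnotation]).asScala
      if isNounTag(token.tag())
    } yield token.word()
    nouns.toList
  }
}
```

With something like this, GermanNounExtractor.extractNouns(text) could stand in for the FastNLPProcessor loop in extractKey. The German tagger uses the STTS tag set, where NN marks common nouns and NE marks proper nouns, so the startsWith("N") check from the question should carry over.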