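I want to extract the nouns from German text using the Stanford NLP tools. To do that, I added the dependency for the German models.
My dependencies: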
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
</dependency>
<!--BEGIN: NLP For German Text -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
    <classifier>models-german</classifier>
</dependency>
<!--END: NLP For German Text -->
<!-- Other dependencies -->
In my Scala class I want to extract the nouns of a tweet. This is the relevant code snippet:
// Start a new processor to use the NLP tools
val proc: Processor = new FastNLPProcessor
// TODO: Explain the val doc
val doc = proc.annotate(text)
// Is a String where the keywords are stored that we want to extract (e.g. Nouns - "N"; etc.)
var keywords: String = ""
// Iterate through each sentence
for (sentence <- doc.sentences) {
  // i - contains the POS tag of each word of the sentence in the current loop
  // x - saves the position of the word in the current loop
  // E.g. for tweet text :: "new scala update xyz" -> first loop: i = tag of "new" ; x = 0
  for ((i, x) <- sentence.tags.get.view.zipWithIndex) {
    // "N" - is the prefix of the noun tags
    if (i.toString().startsWith("N")) {
      // Append to keywords the noun of a text
      keywords = keywords + " " + sentence.words.array(x)
      // Print the current state of the keyword string
      println(keywords)
    }
  }
}
But it only works for English text. These are my imports:
import org.clulab.processors.Processor
import org.clulab.processors.fastnlp.FastNLPProcessor
import play.api.libs.json._
import scala.util.parsing.json.JSONObject
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.mongodb.casbah.Imports._
import com.mongodb.casbah.MongoConnection
import com.mongodb.casbah.commons.MongoDBObject
import org.clulab.struct.DirectedGraphEdgeIterator
The complete Scala class:
object KeywordExtractor {
  // Creates a connection to the MongoDB client
  val mongoConn = MongoClient("localhost", 27017)
  // Names the DB where the data should be saved
  val mongoDB = mongoConn("dbtest")
  // Defines the collection in which the text should be stored
  val mongoColl = mongoDB("testcollection")

  def extractKey(tweet: String) = {
    // tweet is sent as a JSON string. jsonObject -> stores the sent tweet as a JSON object
    val jsonObject = Json.parse(tweet)
    // text String to store the text from the tweet
    var text = ""
    try {
      // try to parse the text from the jsonObject into the text variable
      text = (jsonObject \ "text").as[String]
    } catch {
      case e: JsResultException => println("Limit reached")
    }
    // Start a new processor to use the NLP tools
    val proc: Processor = new FastNLPProcessor
    // TODO: Explain the val doc
    val doc = proc.annotate(text)
    // Is a String where the keywords are stored that we want to extract (e.g. Nouns - "N"; etc.)
    var keywords: String = ""
    // Iterate through each sentence
    for (sentence <- doc.sentences) {
      // i - contains the POS tag of each word of the sentence in the current loop
      // x - saves the position of the word in the current loop
      // E.g. for tweet text :: "new scala update xyz" -> first loop: i = tag of "new" ; x = 0
      for ((i, x) <- sentence.tags.get.view.zipWithIndex) {
        // "N" - is the prefix of the noun tags
        if (i.toString().startsWith("N")) {
          // Append to keywords the noun of a text
          keywords = keywords + " " + sentence.words.array(x)
          // Print the current state of the keyword string
          println(keywords)
        }
      }
    }
    // Create a new MongoDB builder
    val builder = MongoDBObject.newBuilder
    // Creates a MongoDB object to store the keywords (a BSON document is created and "text" is the key for the value keywords)
    builder += "text" -> keywords
    // TODO: .result not clear
    val newObj = builder.result
    // Insert the new object into MongoDB
    mongoColl.insert(newObj)
    // Clear the memory of the doc to avoid an out-of-memory error (doc allocates a lot of memory)
    doc.clear()
  }
}
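For reference, the noun filter in the loop above can be written as a small pure function over words and their tags, which makes it easy to check how the "starts with N" test behaves for German tags without loading any models. This is only a sketch: `NounFilter`, `nounsOf` and `keywords` are hypothetical names, and it assumes the German tagger emits STTS-style tags, where common nouns are `NN` and proper nouns are `NE` (so both still start with "N", like the Penn Treebank `NN*` tags for English).

```scala
object NounFilter {
  // Keep only the words whose POS tag starts with "N".
  // Assumption: noun tags start with "N" (Penn Treebank NN*, STTS NN/NE).
  def nounsOf(words: Seq[String], tags: Seq[String]): Seq[String] =
    words.zip(tags).collect { case (word, tag) if tag.startsWith("N") => word }

  // Join the nouns with single spaces (no leading space, unlike the
  // string concatenation in the loop above).
  def keywords(words: Seq[String], tags: Seq[String]): String =
    nounsOf(words, tags).mkString(" ")
}
```

With hand-written tags (not tagger output), `NounFilter.keywords(Seq("Berlin", "ist", "eine", "Stadt"), Seq("NE", "VAFIN", "ART", "NN"))` yields `"Berlin Stadt"`.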
Answer 0 (score: 0)
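As of August 2016, the CLU Lab Processors library does not appear to support multiple languages (see this issue). The authors are not prioritizing new languages, but they would be interested in having one added if someone else has the time.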
Note that you can tell the vanilla CoreNLP POS tagger to use the German models by changing the properties of the StanfordCoreNLP object.
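A minimal sketch of what that could look like from Scala, using CoreNLP directly instead of CLU Lab Processors. The object name `GermanNouns` and its members are hypothetical, and the tagger path and the `tokenize.language` value are assumptions about what the 3.6.0 `models-german` jar contains; check them against the jar on your classpath before relying on them.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

object GermanNouns {
  // Assumption: path of the German tagger inside the models-german jar.
  val GermanTagger = "edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger"

  // Properties that point the POS annotator at the German model.
  def germanProps(): Properties = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, pos")
    props.setProperty("tokenize.language", "de") // assumption, see lead-in
    props.setProperty("pos.model", GermanTagger)
    props
  }

  // Loading the models is expensive, so build the pipeline once and reuse it.
  lazy val pipeline = new StanfordCoreNLP(germanProps())

  // Return every token whose POS tag starts with "N".
  def nouns(text: String): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    for {
      sentence <- doc.get(classOf[CoreAnnotations.SentencesAnnotation]).asScala.toList
      token    <- sentence.get(classOf[CoreAnnotations.TokensAnnotation]).asScala.toList
      if token.get(classOf[CoreAnnotations.PartOfSpeechAnnotation]).startsWith("N")
    } yield token.word()
  }
}
```

In `extractKey`, the tagging loop could then be replaced by something like `val keywords = GermanNouns.nouns(text).mkString(" ")`. The pipeline is kept in a `lazy val` so the models are loaded once rather than once per tweet.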