A few days ago I was working on a Java server that stores a bunch of data and identifies its language, so I decided to use LingPipe for the task. But after training the code and evaluating it with two languages (English and Spanish) I ran into a problem: I can never get Spanish text recognized, while I get correct results for English and French.
The tutorial I followed for this task: http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
The steps I took to accomplish the task: Steps to train the language classifier
~1. First, unpack the English and Spanish corpora into a folder named leipzig, as shown below (note: the metadata and sentences come from http://wortschatz.uni-leipzig.de/en/download):
leipzig //Main folder
1M sentences //Folder with data of the last trial
eng_news_2015_1M
eng_news_2015_1M.tar.gz
spa-hn_web_2015_1M
spa-hn_web_2015_1M.tar.gz
ClassifyLang.java //Custom program to try the trained code
dist //Folder
eng_news_2015_300K.tar.gz //unpackaged english sentences
spa-hn_web_2015_300K.tar.gz //unpackaged spanish sentences
EvalLanguageId.java
langid-leipzig.classifier //trained code
lingpipe-4.1.2.jar
munged //Folder
eng //folder containing the sentences.txt for english
sentences.txt
spa //folder containing the sentences.txt for spanish
sentences.txt
Munge.java
TrainLanguageId.java
unpacked //Folder
eng_news_2015_300K //Folder with the english metadata
eng_news_2015_300K-co_n.txt
eng_news_2015_300K-co_s.txt
eng_news_2015_300K-import.sql
eng_news_2015_300K-inv_so.txt
eng_news_2015_300K-inv_w.txt
eng_news_2015_300K-sources.txt
eng_news_2015_300K-words.txt
sentences.txt
spa-hn_web_2015_300K //Folder with the spanish metadata
sentences.txt
spa-hn_web_2015_300K-co_n.txt
spa-hn_web_2015_300K-co_s.txt
spa-hn_web_2015_300K-import.sql
spa-hn_web_2015_300K-inv_so.txt
spa-hn_web_2015_300K-inv_w.txt
spa-hn_web_2015_300K-sources.txt
spa-hn_web_2015_300K-words.txt
~2. Second, unpack the compressed language metadata into the unpacked folder:
unpacked //Folder
eng_news_2015_300K //Folder with the english metadata
eng_news_2015_300K-co_n.txt
eng_news_2015_300K-co_s.txt
eng_news_2015_300K-import.sql
eng_news_2015_300K-inv_so.txt
eng_news_2015_300K-inv_w.txt
eng_news_2015_300K-sources.txt
eng_news_2015_300K-words.txt
sentences.txt
spa-hn_web_2015_300K //Folder with the spanish metadata
sentences.txt
spa-hn_web_2015_300K-co_n.txt
spa-hn_web_2015_300K-co_s.txt
spa-hn_web_2015_300K-import.sql
spa-hn_web_2015_300K-inv_so.txt
spa-hn_web_2015_300K-inv_w.txt
spa-hn_web_2015_300K-sources.txt
spa-hn_web_2015_300K-words.txt
~3. Then munge the sentences of each language to remove line numbers and tabs, and to replace newlines with a single space character. The output is written uniformly in UTF-8 Unicode encoding (note: Munge.java from the LingPipe site).
/-----------------Command line----------------------------------------------/
javac -cp lingpipe-4.1.2.jar: Munge.java
java -cp lingpipe-4.1.2.jar: Munge /home/samuel/leipzig/unpacked /home/samuel/leipzig/munged
----------------------------------------Results-----------------------------
spa
reading from=/home/samuel/leipzig/unpacked/spa-hn_web_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/spa/spa.txt charset=utf-8
total length=43267166
eng
reading from=/home/samuel/leipzig/unpacked/eng_news_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/eng/eng.txt charset=utf-8
total length=35847257
/---------------------------------------------------------------/
<---------------------------------Folder------------------------------------->
munged //Folder
eng //folder containing the sentences.txt for english
sentences.txt
spa //folder containing the sentences.txt for spanish
sentences.txt
<-------------------------------------------------------------------------->
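What Munge.java does to each sentences.txt boils down to the following. This is my own minimal sketch, not the tutorial's Munge.java; it assumes the Leipzig format where every line starts with a line number followed by a tab:

```java
import java.util.Arrays;
import java.util.List;

public class MiniMunge {

    // Strip the leading "number<TAB>" prefix from each line and join the
    // sentences with single spaces, like the tutorial's Munge.java does.
    // The real program would then write the result with UTF-8 encoding.
    static String munge(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            String sentence = (tab >= 0) ? line.substring(tab + 1) : line;
            if (sb.length() > 0) sb.append(' ');
            sb.append(sentence.trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1\tFirst sentence.", "2\tSecond sentence.");
        System.out.println(munge(lines)); // First sentence. Second sentence.
    }
}
```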
~4. Next we train the language classifier (note: TrainLanguageId.java from the LingPipe LanguageId tutorial).
/---------------Command line--------------------------------------------/
javac -cp lingpipe-4.1.2.jar: TrainLanguageId.java
java -cp lingpipe-4.1.2.jar: TrainLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 5
-----------------------------------Results-----------------------------------
nGram=100000 numChars=5
Training category=eng
Training category=spa
Compiling model to file=/home/samuel/leipzig/langid-leipzig.classifier
/----------------------------------------------------------------------------/
~5. We evaluated our trained model with the following results, which show a problem in the confusion matrix (note: EvalLanguageId.java from the LingPipe LanguageId tutorial).
/------------------------Command line---------------------------------/
javac -cp lingpipe-4.1.2.jar: EvalLanguageId.java
java -cp lingpipe-4.1.2.jar: EvalLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 50 1000
-------------------------------Results-------------------------------------
Reading classifier from file=/home/samuel/leipzig/langid-leipzig.classifier
Evaluating category=eng
Evaluating category=spa
TEST RESULTS
BASE CLASSIFIER EVALUATION
Categories=[eng, spa]
Total Count=2000
Total Correct=1000
Total Accuracy=0.5
95% Confidence Interval=0.5 +/- 0.02191346617949794
Confusion Matrix
reference \ response
,eng,spa
eng,1000,0 <---------- not diagonal sampling
spa,1000,0
Macro-averaged Precision=NaN
Macro-averaged Recall=0.5
Macro-averaged F=NaN
Micro-averaged Results
the following symmetries are expected:
TP=TN, FN=FP
PosRef=PosResp=NegRef=NegResp
Acc=Prec=Rec=F
Total=4000
True Positive=1000
False Negative=1000
False Positive=1000
True Negative=1000
Positive Reference=2000
Positive Response=2000
Negative Reference=2000
Negative Response=2000
Accuracy=0.5
Recall=0.5
Precision=0.5
Rejection Recall=0.5
Rejection Precision=0.5
F(1)=0.5
Fowlkes-Mallows=2000.0
Jaccard Coefficient=0.3333333333333333
Yule's Q=0.0
Yule's Y=0.0
Reference Likelihood=0.5
Response Likelihood=0.5
Random Accuracy=0.5
Random Accuracy Unbiased=0.5
kappa=0.0
kappa Unbiased=0.0
kappa No Prevalence=0.0
chi Squared=0.0
phi Squared=0.0
Accuracy Deviation=0.007905694150420948
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence =0.0
Reference Entropy=1.0
Response Entropy=NaN
Cross Entropy=Infinity
Joint Entropy=1.0
Conditional Entropy=0.0
Mutual Information=0.0
Kullback-Liebler Divergence=Infinity
chi Squared=NaN
chi-Squared Degrees of Freedom=1
phi Squared=NaN
Cramer's V=NaN
lambda A=0.0
lambda B=NaN
ONE VERSUS ALL EVALUATIONS BY CATEGORY
CATEGORY[0]=eng VERSUS ALL
First-Best Precision/Recall Evaluation
Total=2000
True Positive=1000
False Negative=0
False Positive=1000
True Negative=0
Positive Reference=1000
Positive Response=2000
Negative Reference=1000
Negative Response=0
Accuracy=0.5
Recall=1.0
Precision=0.5
Rejection Recall=0.0
Rejection Precision=NaN
F(1)=0.6666666666666666
Fowlkes-Mallows=1414.2135623730949
Jaccard Coefficient=0.5
Yule's Q=NaN
Yule's Y=NaN
Reference Likelihood=0.5
Response Likelihood=1.0
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence=0.0
chi Squared=NaN
phi Squared=NaN
Accuracy Deviation=0.011180339887498949
CATEGORY[1]=spa VERSUS ALL
First-Best Precision/Recall Evaluation
Total=2000
True Positive=0
False Negative=1000
False Positive=0
True Negative=1000
Positive Reference=1000
Positive Response=0
Negative Reference=1000
Negative Response=2000
Accuracy=0.5
Recall=0.0
Precision=NaN
Rejection Recall=1.0
Rejection Precision=0.5
F(1)=NaN
Fowlkes-Mallows=NaN
Jaccard Coefficient=0.0
Yule's Q=NaN
Yule's Y=NaN
Reference Likelihood=0.5
Response Likelihood=0.0
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence=0.0
chi Squared=NaN
phi Squared=NaN
Accuracy Deviation=0.011180339887498949
/-----------------------------------------------------------------------/
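Reading the confusion matrix above: all 1000 Spanish test slices were classified as eng, so the 0.5 accuracy is just the English half. As a sanity check on the reported numbers (my own arithmetic, not tutorial code), the accuracy and the 95% confidence interval follow directly from the matrix counts:

```java
public class EvalCheck {
    public static void main(String[] args) {
        // Counts from the confusion matrix: eng->eng = 1000, spa->eng = 1000
        int correct = 1000;          // only the English slices are right
        int total = 2000;            // 1000 eng + 1000 spa test cases
        double p = (double) correct / total;

        // Normal-approximation 95% CI half-width: 1.96 * sqrt(p * (1 - p) / n)
        double halfWidth = 1.96 * Math.sqrt(p * (1 - p) / total);

        System.out.println("Accuracy = " + p);                   // 0.5
        System.out.println("95% CI half-width = " + halfWidth);  // ~0.0219, matches the report
    }
}
```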
~6. Then we tried a real test with Spanish text:
/-------------------Command line----------------------------------/
javac -cp lingpipe-4.1.2.jar: ClassifyLang.java
java -cp lingpipe-4.1.2.jar: ClassifyLang
/-------------------------------------------------------------------------/
<---------------------------------Result------------------------------------>
Text: Yo soy una persona increíble y muy inteligente, me admiro a mi mismo lo que me hace sentir ansiedad de lo que viene, por que es algo grandioso lleno de cosas buenas y de ahora en adelante estaré enfocado y optimista aunque tengo que aclarar que no lo haré por querer algo, sino por que es mi pasión.
Best Language: eng <------------- Wrong Result
<----------------------------------------------------------------------->
The code of ClassifyLang.java:
import com.aliasi.classify.Classification;
import com.aliasi.classify.LMClassifier;
import com.aliasi.util.AbstractExternalizable;

import java.io.File;
import java.io.IOException;

public class ClassifyLang {

    public static String text = "Yo soy una persona increíble y muy inteligente, me admiro a mi mismo"
            + " estoy ansioso de lo que viene, por que es algo grandioso lleno de cosas buenas"
            + " y de ahora en adelante estaré enfocado y optimista"
            + " aunque tengo que aclarar que no lo haré por querer algo, sino por que no es difícil serlo. ";

    private static final File MODEL_DIR
            = new File("/home/samuel/leipzig/langid-leipzig.classifier");

    public static void main(String[] args) {
        System.out.println("Text: " + text);

        LMClassifier<?, ?> classifier;
        try {
            // Read the model compiled by TrainLanguageId
            classifier = (LMClassifier<?, ?>) AbstractExternalizable.readObject(MODEL_DIR);
        } catch (IOException | ClassNotFoundException ex) {
            System.out.println("Problem with the model: " + ex);
            return; // stop here instead of calling classify() on a null reference
        }

        Classification classification = classifier.classify(text);
        System.out.println("Best Language: " + classification.bestCategory());
    }
}
~7. I also tried the 1M-sentence files but got the same result, and changing the n-gram number made no difference either. I would greatly appreciate your help.
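One thing I keep coming back to (my own diagnostic sketch, not from the tutorial): the munge step above reports reading sentences.txt with charset=iso-8859-1. If the Leipzig Spanish file is actually UTF-8, every accented character is mis-decoded during munging, so the model would be trained on mojibake while my test text contains proper accents. The snippet below shows what that mis-decoding does to a Spanish word:

```java
import java.nio.charset.StandardCharsets;

public class CharsetCheck {
    public static void main(String[] args) {
        String original = "increíble";

        // Bytes as they would sit in a UTF-8 encoded sentences.txt
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // What the munge step sees when it decodes those bytes as ISO-8859-1
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(misread);                  // mojibake ("Ã" instead of "í")
        System.out.println(misread.equals(original)); // false
    }
}
```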