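TweetNLP provides a tokenizer and a part-of-speech tagger for tweets, which is really cool. Now I'm wondering whether I can go one step further and expand acronyms. For example, when I receive the tweet "ikr", I'd like to look it up and get "I know, right?". I suppose I could write my own dictionary, but it seems like one should already exist?
Answer 0 (score: 1)
Download StanfordNLP from their website or use it as a Maven dependency. I use version 3.3.1: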
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.3.1</version>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.3.1</version>
<classifier>models</classifier>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-parser</artifactId>
<version>3.3.1</version>
<classifier>models</classifier>
</dependency>
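Then set up the pipeline properties in your code:
// Requires: import java.util.Properties; import edu.stanford.nlp.pipeline.StanfordCoreNLP;
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
// Use the GATE Twitter POS model instead of the default one.
props.setProperty("pos.model", "gate-EN-twitter.model");
// Property values must be strings; a Boolean stored via put() is not returned by getProperty().
props.setProperty("dcoref.score", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
and run the POS tagger.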
Answer 1 (score: 0)
I'm not aware of such a corpus, but you can collect the information you need from these sites: http://www.allacronyms.com/twitter/topic http://www.abbreviations.com/acronyms/TWITTER
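Answer 2 (score: 0)
So what I ended up doing was using StanfordNLP together with the GATE Twitter model.
Sample tweet:
ikr smh he asked fir yo last name so he can add u on fb lololol
Results without gate-EN-twitter.model: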
word: ikr :: pos: NN :: ne:O
word: smh :: pos: NN :: ne:O
word: he :: pos: PRP :: ne:O
word: asked :: pos: VBD :: ne:O
word: fir :: pos: NNP :: ne:O
word: yo :: pos: NNP :: ne:O
word: last :: pos: JJ :: ne:O
word: name :: pos: NN :: ne:O
word: so :: pos: IN :: ne:O
word: he :: pos: PRP :: ne:O
word: can :: pos: MD :: ne:O
word: add :: pos: VB :: ne:O
word: u :: pos: NN :: ne:O
word: on :: pos: IN :: ne:O
word: fb :: pos: NN :: ne:O
word: lololol :: pos: NN :: ne:O
Results with gate-EN-twitter.model:
word: ikr :: pos: UH :: ne:O
word: smh :: pos: UH :: ne:O
word: he :: pos: PRP :: ne:O
word: asked :: pos: VBD :: ne:O
word: fir :: pos: IN :: ne:O
word: yo :: pos: PRP$ :: ne:O
word: last :: pos: JJ :: ne:O
word: name :: pos: NN :: ne:O
word: so :: pos: IN :: ne:O
word: he :: pos: PRP :: ne:O
word: can :: pos: MD :: ne:O
word: add :: pos: VB :: ne:O
word: u :: pos: PRP :: ne:O
word: on :: pos: IN :: ne:O
word: fb :: pos: NNP :: ne:O
word: lololol :: pos: UH :: ne:O
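Now I can identify slang by looking for tokens tagged UH and checking them against my custom dictionary.
I still wonder why something like this doesn't already exist, but it solves my problem for now.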
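The lookup step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the tagger output is assumed to be available as two parallel lists (words and POS tags), and the class name SlangExpander, the method expand, and the two dictionary entries are hypothetical and not part of StanfordNLP or GATE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of StanfordNLP or GATE.
final class SlangExpander {

    // Illustrative entries only; a real custom dictionary would be much larger.
    private static final Map<String, String> DICTIONARY = new HashMap<>();
    static {
        DICTIONARY.put("ikr", "I know, right?");
        DICTIONARY.put("smh", "shaking my head");
    }

    /**
     * Replaces tokens tagged UH that have a dictionary entry with their
     * expansion. All other tokens are passed through unchanged.
     * The two lists are assumed to be parallel (same length, same order).
     */
    static List<String> expand(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            String expansion = null;
            // Only tokens the tagger marked as UH are treated as slang candidates.
            if ("UH".equals(tags.get(i))) {
                expansion = DICTIONARY.get(word.toLowerCase(Locale.ROOT));
            }
            result.add(expansion != null ? expansion : word);
        }
        return result;
    }

    public static void main(String[] args) {
        // Words and tags copied from a few lines of the tagged output above.
        List<String> words = Arrays.asList("ikr", "smh", "he", "lololol");
        List<String> tags = Arrays.asList("UH", "UH", "PRP", "UH");
        System.out.println(expand(words, tags));
    }
}
```

Restricting the lookup to UH tokens keeps ordinary words out of the dictionary check. A UH token with no entry, such as "lololol" here, is simply left as it is.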