Getting the full text of acronyms in TweetNLP

Time: 2014-07-16 21:16:42

Tags: nlp stanford-nlp tweets

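TweetNLP provides a tokenizer and a part-of-speech tagger for tweets, which is really cool. Now I'm wondering whether I can go one step further and expand acronyms. For example, when I get the tweet "ikr", I'd like to be able to look it up and get "I know, right?". I suppose I could write my own dictionary, but it seems like one should already exist?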

3 answers:

Answer 0 (score: 1)

Download StanfordNLP from their website or use it as a Maven dependency. I used version 3.3.1:

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.3.1</version>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.3.1</version>
        <classifier>models</classifier>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-parser</artifactId>
        <version>3.3.1</version>
        <classifier>models</classifier>
    </dependency>

Download the GATE Twitter model (gate-EN-twitter.model).

Point the pipeline properties at it:

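    import java.util.Properties;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    // Use the GATE Twitter model for POS tagging instead of the default model.
    props.setProperty("pos.model", "gate-EN-twitter.model");
    props.setProperty("dcoref.score", "true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);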

Then run the POS tagger.

Answer 1 (score: 0)

I'm not aware of such a corpus, but you can get the information you need from these sites: http://www.allacronyms.com/twitter/topic http://www.abbreviations.com/acronyms/TWITTER
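If you collect entries from sites like these by hand, one simple option is to keep them in a tab-separated file and load that into a map. The sketch below is only an illustration: the file format (`acronym<TAB>expansion`, one entry per line) and the class name `AcronymDictionary` are assumptions, not something TweetNLP or those sites provide.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not part of TweetNLP or Stanford CoreNLP.
final class AcronymDictionary {

    // Reads "acronym<TAB>expansion" lines. Lines without a tab, or with an
    // empty acronym or expansion, are skipped. Keys are lower-cased.
    static Map<String, String> load(Reader source) throws IOException {
        Map<String, String> entries = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab <= 0) {
                    continue;
                }
                String acronym = line.substring(0, tab).trim().toLowerCase(Locale.ROOT);
                String expansion = line.substring(tab + 1).trim();
                if (!acronym.isEmpty() && !expansion.isEmpty()) {
                    entries.put(acronym, expansion);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Inline sample data standing in for a real file.
        String sample = "ikr\tI know, right?\nsmh\tshaking my head\n";
        Map<String, String> dict = load(new StringReader(sample));
        System.out.println(dict.get("ikr"));
    }
}
```

For a real file you would pass something like a `FileReader` instead of the `StringReader` used here.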

Answer 2 (score: 0)

So what I ended up doing was using StanfordNLP together with the GATE Twitter model.

Sample tweet:

ikr smh he asked fir yo last name so he can add u on fb lololol

Results without gate-EN-twitter.model:

word: ikr :: pos: NN :: ne:O
word: smh :: pos: NN :: ne:O
word: he :: pos: PRP :: ne:O
word: asked :: pos: VBD :: ne:O
word: fir :: pos: NNP :: ne:O
word: yo :: pos: NNP :: ne:O
word: last :: pos: JJ :: ne:O
word: name :: pos: NN :: ne:O
word: so :: pos: IN :: ne:O
word: he :: pos: PRP :: ne:O
word: can :: pos: MD :: ne:O
word: add :: pos: VB :: ne:O
word: u :: pos: NN :: ne:O
word: on :: pos: IN :: ne:O
word: fb :: pos: NN :: ne:O
word: lololol :: pos: NN :: ne:O

Results with gate-EN-twitter.model:

word: ikr :: pos: UH :: ne:O
word: smh :: pos: UH :: ne:O
word: he :: pos: PRP :: ne:O
word: asked :: pos: VBD :: ne:O
word: fir :: pos: IN :: ne:O
word: yo :: pos: PRP$ :: ne:O
word: last :: pos: JJ :: ne:O
word: name :: pos: NN :: ne:O
word: so :: pos: IN :: ne:O
word: he :: pos: PRP :: ne:O
word: can :: pos: MD :: ne:O
word: add :: pos: VB :: ne:O
word: u :: pos: PRP :: ne:O
word: on :: pos: IN :: ne:O
word: fb :: pos: NNP :: ne:O
word: lololol :: pos: UH :: ne:O

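Now I can identify slang by looking for the UH tag and check those tokens against my custom dictionary.

I still wonder why something like this doesn't already exist, but it solves my problem for now.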
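The "filter on UH, then look up" step could be sketched roughly as follows. This is a standalone illustration, not the code actually used above: the class name `SlangExpander` is made up, the dictionary entries are sample data, and the input is assumed to be lines in the same `word: ... :: pos: ... :: ne:...` format as the output shown earlier.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Stanford CoreNLP or GATE.
final class SlangExpander {

    // Matches lines such as "word: ikr :: pos: UH :: ne:O".
    private static final Pattern TAGGED_LINE =
            Pattern.compile("word:\\s*(\\S+)\\s*::\\s*pos:\\s*(\\S+)\\s*::\\s*ne:\\s*(\\S+)");

    // Returns word -> expansion for every UH-tagged word found in the dictionary,
    // in the order the words appear in the input.
    static Map<String, String> expand(List<String> taggedLines, Map<String, String> dictionary) {
        Map<String, String> expansions = new LinkedHashMap<>();
        for (String line : taggedLines) {
            Matcher m = TAGGED_LINE.matcher(line.trim());
            if (!m.matches() || !"UH".equals(m.group(2))) {
                continue; // not a tagged line, or not tagged as an interjection
            }
            String word = m.group(1).toLowerCase(Locale.ROOT);
            String expansion = dictionary.get(word);
            if (expansion != null) {
                expansions.put(word, expansion);
            }
        }
        return expansions;
    }

    public static void main(String[] args) {
        // Sample custom dictionary (assumed entries).
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("ikr", "I know, right?");
        dictionary.put("smh", "shaking my head");

        List<String> tagged = Arrays.asList(
                "word: ikr :: pos: UH :: ne:O",
                "word: he :: pos: PRP :: ne:O",
                "word: lololol :: pos: UH :: ne:O");

        System.out.println(expand(tagged, dictionary));
    }
}
```

Here `lololol` is tagged UH but has no dictionary entry, so it is simply left out; words with other tags, such as `u` (PRP) or `fb` (NNP), would need a separate lookup rule.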