POS for languages other than English

Time: 2016-12-09 22:52:42

Tags: python nltk

I'm pretty new to nltk.

import nltk
sentence = "I'm not sure!"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

This lets me tag a sentence with its parts of speech. But what steps are involved in doing this for other languages?

Update

I'm interested in starting with Spanish.

Update 2

import nltk
from nltk.tokenize import word_tokenize

# Lowercase the training words so they match the lowercased input tokens below.
training_set = [[(w.lower(), t) for w, t in s] for s in nltk.corpus.conll2002.tagged_sents('esp.train')]

# The bigram tagger falls back to the unigram tagger for contexts it has not seen.
unigram_tagger = nltk.UnigramTagger(training_set)
bigram_tagger = nltk.BigramTagger(training_set, backoff=unigram_tagger)

tokens = [token.lower() for token in word_tokenize("El Congreso no podrá hacer ninguna ley con respecto al establecimiento de la religión, ni prohibiendo la libre práctica de la misma; ni limitando la libertad de expresión, ni de prensa; ni el derecho a la asamblea pacífica de las personas, ni de solicitar al gobierno una compensación de agravios.")]

# Tag the tokens with the trained tagger.
tagged = bigram_tagger.tag(tokens)

Yields:

1 answer:

Answer 0: (score: 2)

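Afaik nltk has no ready-to-use taggers or parsers for any language other than English. Such tools do exist outside nltk, and you can download and use them.

nltk does provide the tools to train your own Spanish tagger, using one of the Spanish tagged corpora as training material. For example, you could follow the nltk instructions for building a tagger, but use conll2002.tagged_sents("esp.train") as the training data. It only has about 250K words, so you won't get great performance, but it should get you started. (Of course, you may be able to find a larger tagged corpus to train on.)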
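To make the training step concrete, here is a minimal sketch of a backoff chain built from nltk's DefaultTagger, UnigramTagger and BigramTagger. The helper names train_backoff_tagger and token_accuracy, the toy sentences, and the fallback tag "NC" (assumed here to be a reasonable guess for unknown Spanish words) are illustrative assumptions, not something from the nltk instructions.

```python
import nltk


def train_backoff_tagger(tagged_sents, default_tag="NC"):
    """Train bigram -> unigram -> default taggers on (word, tag) sentences.

    default_tag is an assumption: it is the tag given to words the model
    has never seen, so pick whatever is most common in your corpus.
    """
    # Lowercase the words so the model matches lowercased input tokens.
    train = [[(w.lower(), t) for w, t in sent] for sent in tagged_sents]
    default = nltk.DefaultTagger(default_tag)
    unigram = nltk.UnigramTagger(train, backoff=default)
    return nltk.BigramTagger(train, backoff=unigram)


def token_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w.lower() for w, _ in sent]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


# Tiny hand-made sample so the sketch runs without downloading a corpus.
toy_sents = [
    [("El", "DA"), ("gato", "NC"), ("duerme", "VMI")],
    [("La", "DA"), ("casa", "NC")],
]
toy_tagger = train_backoff_tagger(toy_sents)
print(toy_tagger.tag(["el", "gato", "duerme"]))
```

For real use you would presumably replace toy_sents with nltk.corpus.conll2002.tagged_sents("esp.train") (after nltk.download("conll2002")) and pass a held-out set such as conll2002.tagged_sents("esp.testa") to token_accuracy to see how far 250K words gets you.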