Question

我正在尝试对单词进行预处理，以删除诸如“ un”和“ re”之类的通用前缀，但是所有nltk的通用词干似乎都完全忽略了前缀：

{
    "manifest_version" : 2,
    "name" : "Weather",
    "description" : "Weather",
    "version" : "1.0",
    "browser_action" : {
        "default_popup" : "popup.html"
    },

    "icons" : {
        "16" : "weather.png",
        "48" : "weather.png",
        "120" : "weather.png"
    },

    "permissions" : [
        "tabs" , "<all_urls>"
    ],

    "content_scripts": [
        {

        "js" : ["popup.js"]
        }
    ]
}

除去通用前缀和后缀不是茎提取器工作的一部分吗？还有另一个可靠地做到这一点的词干提取器吗？

Answer 1

您是对的。大多数词干仅词干后缀。实际上，马丁·波特（Martin Porter）的原始论文标题为：

Porter，M。“一种用于后缀剥离的算法。”计划14.3（1980）：130-137。

在NLTK中唯一带有前缀词干的词干可能是阿拉伯词干：

但是，如果我们看一下这个prefix_replace函数，它只是删除了旧的前缀，并用新的前缀代替。

def prefix_replace(original, old, new):
    """
     Replaces the old prefix of the original string by a new suffix
    :param original: string
    :param old: string
    :param new: string
    :return: string
    """
    return new + original[len(old):]

但是我们可以做得更好！

首先，您是否有需要处理的语言的前缀和替换列表？

让我们使用（不幸的）事实上的语言英语，并做一些语言学工作来找出英语的前缀：

https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes

无需太多工作，您就可以在源于NLTK的后缀之前编写前缀词干函数，例如

import re
from nltk.stem import PorterStemmer

# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
"anti": "",    # e.g. anti-goverment, anti-racist, anti-war
"auto": "",    # e.g. autobiography, automobile
"de": "",      # e.g. de-classify, decontaminate, demotivate
"dis": "",     # e.g. disagree, displeasure, disqualify
"down": "",    # e.g. downgrade, downhearted
"extra": "",   # e.g. extraordinary, extraterrestrial
"hyper": "",   # e.g. hyperactive, hypertension
"il": "",     # e.g. illegal
"im": "",     # e.g. impossible
"in": "",     # e.g. insecure
"ir": "",     # e.g. irregular
"inter": "",  # e.g. interactive, international
"mega": "",   # e.g. megabyte, mega-deal, megaton
"mid": "",    # e.g. midday, midnight, mid-October
"mis": "",    # e.g. misaligned, mislead, misspelt
"non": "",    # e.g. non-payment, non-smoking
"over": "",  # e.g. overcook, overcharge, overrate
"out": "",    # e.g. outdo, out-perform, outrun
"post": "",   # e.g. post-election, post-warn
"pre": "",    # e.g. prehistoric, pre-war
"pro": "",    # e.g. pro-communist, pro-democracy
"re": "",     # e.g. reconsider, redo, rewrite
"semi": "",   # e.g. semicircle, semi-retired
"sub": "",    # e.g. submarine, sub-Saharan
"super": "",   # e.g. super-hero, supermodel
"tele": "",    # e.g. television, telephathic
"trans": "",   # e.g. transatlantic, transfer
"ultra": "",   # e.g. ultra-compact, ultrasound
"un": "",      # e.g. under-cook, underestimate
"up": "",      # e.g. upgrade, uphill
}

porter = PorterStemmer()

def stem_prefix(word, prefixes):
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitution made.
        # Allow dash in between prefix and root. 
        word, nsub = re.subn("{}[\-]?".format(prefix), "", word)
        if nsub > 0:
            return word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes))


word = "extraordinary"
porter_english_plus(word)

现在我们有了一个简单的前缀词干分析器，我们可以做得更好吗？

# E.g. this is not satisfactory:
>>> porter_english_plus("united")
"ited"

如果我们在词干词干之前检查词干词是否出现在某些列表中怎么办？

import re

from nltk.corpus import words
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
"anti": "",    # e.g. anti-goverment, anti-racist, anti-war
"auto": "",    # e.g. autobiography, automobile
"de": "",      # e.g. de-classify, decontaminate, demotivate
"dis": "",     # e.g. disagree, displeasure, disqualify
"down": "",    # e.g. downgrade, downhearted
"extra": "",   # e.g. extraordinary, extraterrestrial
"hyper": "",   # e.g. hyperactive, hypertension
"il": "",     # e.g. illegal
"im": "",     # e.g. impossible
"in": "",     # e.g. insecure
"ir": "",     # e.g. irregular
"inter": "",  # e.g. interactive, international
"mega": "",   # e.g. megabyte, mega-deal, megaton
"mid": "",    # e.g. midday, midnight, mid-October
"mis": "",    # e.g. misaligned, mislead, misspelt
"non": "",    # e.g. non-payment, non-smoking
"over": "",  # e.g. overcook, overcharge, overrate
"out": "",    # e.g. outdo, out-perform, outrun
"post": "",   # e.g. post-election, post-warn
"pre": "",    # e.g. prehistoric, pre-war
"pro": "",    # e.g. pro-communist, pro-democracy
"re": "",     # e.g. reconsider, redo, rewrite
"semi": "",   # e.g. semicircle, semi-retired
"sub": "",    # e.g. submarine, sub-Saharan
"super": "",   # e.g. super-hero, supermodel
"tele": "",    # e.g. television, telephathic
"trans": "",   # e.g. transatlantic, transfer
"ultra": "",   # e.g. ultra-compact, ultrasound
"un": "",      # e.g. under-cook, underestimate
"up": "",      # e.g. upgrade, uphill
}

porter = PorterStemmer()

whitelist = list(wn.words()) + words.words()

def stem_prefix(word, prefixes, roots):
    original_word = word
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitution made.
        # Allow dash in between prefix and root. 
        word, nsub = re.subn("{}[\-]?".format(prefix), "", word)
        if nsub > 0 and word in roots:
            return word
    return original_word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes, whitelist))

我们解决了以下问题：不删除前缀，导致无意义的词根，例如

>>> stem_prefix("united", english_prefixes, whitelist)
"united"

但是搬运工的茎仍然可以去掉后缀-ed，尤其是后缀>>> porter_english_plus("united") "unit"。当目标是在数据中保留语言上合理的单位时：

{{1}}

因此，根据任务的不同，有时使用引理比词干分析器更有益。

另请参阅：

Answer 2

如果您的列表包含400,000多个英语单词，以及645个前缀的列表。

https://www.dictionary.com/e/affixes/

https://raw.githubusercontent.com/dwyl/english-words/master/words.txt

def englishWords():
    with open(r'C:\Program Files (x86)\MyJournal\Images\American English\EnglishWords.txt') as word_file:
        return set(word.strip().lower() for word in word_file)  


def is_english_word(word, english_words):
    return word.lower() in english_words


def removePref(word):
    prefs = ['a','ab','abs','ac','acanth','acantho','acous','acr','acro','ad','aden','adeno','adren','adreno','aer','aero','af','ag','al','all','allo','alti','alto','am','amb','ambi','amphi','amyl','amylo','an','ana','andr','andro','anem','anemo','ant','ante','anth','anthrop','anthropo','anti','ap','api','apo','aqua','aqui','arbor','arbori','arch','archae','archaeo','arche','archeo','archi','arteri','arterio','arthr','arthro','as','aster','astr','astro','at','atmo','audio','auto','avi','az','azo','bacci','bacteri','bacterio','bar','baro','bath','batho','bathy','be','bi','biblio','bio','bis','blephar','blepharo','bracchio','brachy','brevi','bronch','bronchi','bronchio','broncho','caco','calci','cardio','carpo','cat','cata','cath','cato','cen','ceno','centi','cephal','cephalo','cerebro','cervic','cervici','cervico','chiro','chlor','chloro','chol','chole','cholo','chondr','chondri','chondro','choreo','choro','chrom','chromato','chromo','chron','chrono','chrys','chryso','circu','circum','cirr','cirri','cirro','cis','cleisto','co','cog','col','com','con','contra','cor','cosmo','counter','cranio','cruci','cry','cryo','crypt','crypto','cupro','cyst','cysti','cysto','cyt','cyto','dactyl','dactylo','de','dec','deca','deci','dek','deka','demi','dent','denti','dento','dentro','derm','dermadermo','deut','deutero','deuto','dextr','dextro','di','dia','dif','digit','digiti','dipl','diplo','dis','dodec','dodeca','dors','dorsi','dorso','dyna','dynamo','dys','e','ec','echin','echino','ect','ecto','ef','el','em','en','encephal','encephalo','end','endo','ennea','ent','enter','entero','ento','entomo','eo','ep','epi','equi','erg','ergo','erythr','erythro','ethno','eu','ex','exo','extra','febri','ferri','ferro','fibr','fibro','fissi','fluvio','for','fore','gain','galact','galacto','gam','gamo','gastr','gastri','gastro','ge','gem','gemmi','geo','geront','geronto','gloss','glosso','gluc','gluco','glyc','glyph','glypto','gon','gono','grapho','gymn','gymno','gynaec','gynaeco','gynec','gyneco','haem','haemato','haemo','hagi','hagio','hal','halo','hapl','haplo','hect','hecto','heli','helic','helico','helio','hem','hema','hemi','hemo','hepat','hepato','hept','hepta','heter','hetero','hex','hexa','hist','histo','hodo','hol','holo','hom','homeo','homo','hydr','hydro','hyet','hyeto','hygr','hygro','hyl','hylo','hymeno','hyp','hyper','hypn','hypno','hypo','hypso','hyster','hystero','iatro','ichthy','ichthyo','ig','igni','il','ile','ileo','ilio','im','in','infra','inter','intra','intro','ir','is','iso','juxta','kerat','kerato','kinesi','kineto','labio','lact','lacti','lacto','laryng','laryngo','lepto','leucleuco','leuk','leuko','lign','ligni','ligno','litho','log','logo','luni','lyo','lysi','macr','macro','magni','mal','malac','malaco','male','meg','mega','megalo','melan','melano','mero','mes','meso','met','meta','metr','metro','micr','micro','mid','mini','mis','miso','mon','mono','morph','morpho','mult','multi','my','myc','myco','myel','myelo','myo','n','naso','nati','ne','necr','necro','neo','nepho','nephr','nephro','neur','neuro','nocti','non','noso','not','noto','nycto','o','ob','oc','oct','octa','octo','ocul','oculo','odont','odonto','of','oleo','olig','oligo','ombro','omni','oneiro','ont','onto','oo','op','ophthalm','ophthalmo','ornith','ornitho','oro','orth','ortho','ossi','oste','osteo','oto','out','ov','over','ovi','ovo','oxy','pachy','palae','palaeo','pale','paleo','pan','panto','par','para','pari','path','patho','ped','pedo','pel','pent','penta','pente','per','peri','petr','petri','petro','phago','phleb','phlebo','phon','phono','phot','photo','phren','phreno','phyll','phyllo','phylo','picr','picro','piezo','pisci','plan','plano','pleur','pleuro','pluto','pluvio','pneum','pneumat','pneumato','pneumo','poly','por','post','prae','pre','preter','prim','primi','pro','pros','prot','proto','pseud','pseudo','psycho','ptero','pulmo','pur','pyo','pyr','pyro','quadr','quadri','quadru','quinque','re','recti','reni','reno','retro','rheo','rhin','rhino','rhiz','rhizo','sacchar','sacchari','sacchro','sacr','sacro','sangui','sapr','sapro','sarc','sarco','scelero','schisto','schizo','se','seba','sebo','selen','seleno','semi','septi','sero','sex','sexi','shiz','sider','sidero','sine','somat','somato','somn','sperm','sperma','spermat','spermato','spermi','spermo','spiro','stato','stauro','stell','sten','steno','stere','stereo','stom','stomo','styl','styli','stylo','sub','subter','suc','suf','sug','sum','sup','super','supra','sur','sus','sy','syl','sym','syn','tachy','taut','tauto','tel','tele','teleo','telo','terra','the','theo','therm','thermo','thromb','thrombo','topo','tox','toxi','toxo','tra','trache','tracheo','trans','tri','tris','ultra','un','undec','under','uni','up','uter','utero','vari','vario','vas','vaso','ventr','ventro','vice','with','xen','xeno','zo','zoo','zyg','zygo','zym','zymo']
    english_words = englishWords()
    for pre in prefs:
        if  word.startswith(pre):
            withoutPref = word[len(pre):]
            if is_english_word(withoutPref,english_words):
                return(withoutPref)
    return word  


>>> removePref('reload')
'load'

>>> removePref('unhappy')
'happy'

>>>removePref('reactivate')
'activate'

>>> removePref('impertinent')
'pertinent'

>>> removePref('aerophobia')
'phobia'

Python nltk干法器从不删除前缀

2 个答案: