How to lemmatize a corpus with a specific dictionary in R?

Time: 2019-03-23 19:12:22

Tags: r text-mining lemmatization

I am trying to lemmatize a corpus using the function lemmatize_strings() as an argument to tm_map() from the tm package.

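But I want to use my own dictionary ("lexico": the first column holds the word form in lowercase, and the second column holds the corresponding replacement lemma).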

I tried to use:

lemmatize_strings()

but it did not work. When I use:

tm_map()

I have no problem!

How can I pass my dictionary "lexico" to the function tm_map()?

Sorry for this question; at 48, this is my first attempt at text mining.

To make this easier to follow: my corpus consists of 2000 documents. Here is an excerpt of the first document:

corpus[[1]][[1]]

[9] "..."

[10] "Nos últimos dias da passada legislatura, a maioria de direita aprovou duas leis que significam enormes recuos nos direitos das cidadãs do país. Fizeram tábua rasa do pronunciamento das cidadãs e cidadãos do país em referendo, optando por humilhar e tentar culpabilizar as mulheres que abortam por sua livre escolha. Estas duas leis são a Lei n.º 134/2015 e a Lei n.º 136/2015, de setembro. A primeira prevê o pagamento de taxas moderadoras na interrupção de gravidez quando for realizada, por opção da mulher, nas primeiras 10 semanas de gravidez. A segunda representa a primeira alteração à Lei n.º 16/2007, de 17 de abril, sobre exclusão de ilicitude nos casos de interrupção voluntária da gravidez." 

The dictionary file (lexico) has the following structure:

lexico[1:10,]
           termo         lema pos.tag
1             aa            a NCMP000
2           aais          aal NCMP000
3            aal          aal NCMS000
4      aaleniano    aaleniano NCMS000
5     aalenianos    aaleniano NCMP000
6     ab-rogação   ab-rogação NCFS000
7    ab-rogações   ab-rogação NCFP000
8   ab-rogamento ab-rogamento NCMS000
9  ab-rogamentos ab-rogamento NCMP000
10   ab-rogáveis   ab-rogável  AQ0CP0

When I call lemmatize_strings(corpus[[1]], dictionary = lexico), it works correctly and returns document no. 1 of the corpus lemmatized with the lemmas from my dictionary.

My problem is with this call:

corpus<-tm_map(corpus, lemmatize_strings)

This just destroys all the documents in my corpus.

Thank you in advance!

1 answer:

Answer 0 (score: 0)

You could use the quanteda package for this, for example:

library("quanteda")
text <- "This is a test sentence. We can lemmatize it using quanteda."
dict <- data.frame(
  word = c("is", "using"),
  lemma = c("be", "use"),
  stringsAsFactors = FALSE
)

toks <- tokens(text, remove_punct = TRUE)
toks_lemma <- tokens_replace(toks,
                             pattern = dict$word,
                             replacement = dict$lemma,
                             case_insensitive = TRUE, 
                             valuetype = "fixed")
toks_lemma
# tokens from 1 document.
# text1 :
#  [1] "This"      "be"        "a"         "test"      "sentence"  "We"        "can"       "lemmatize"
#  [9] "it"        "use"       "quanteda" 

The function is very fast and, despite its generic name, it was created mainly for lemmatization.
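For readers who want to see what this kind of fixed, one-to-one dictionary replacement amounts to, here is a minimal base R sketch. It is not how quanteda or textstem implement it, only an illustration of the lookup. The helper name `lemmatize_with_dict` and the small `lexico_demo` table are hypothetical; the column names `termo` and `lema` are assumed from the `lexico` excerpt in the question. Tokens are split on whitespace only, so punctuation attached to a word prevents a match.

```r
# Hypothetical stand-in for the asker's `lexico` (columns assumed: termo, lema)
lexico_demo <- data.frame(
  termo = c("leis", "direitos", "semanas", "mulheres"),
  lema  = c("lei", "direito", "semana", "mulher"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each whitespace-separated token that appears
# in dict$termo (compared in lowercase) with the matching dict$lema.
lemmatize_with_dict <- function(x, dict) {
  termo <- as.character(dict$termo)
  lema  <- as.character(dict$lema)
  vapply(x, function(doc) {
    tokens <- strsplit(doc, "\\s+")[[1]]
    idx <- match(tolower(tokens), termo)   # NA where no dictionary entry exists
    hit <- !is.na(idx)
    tokens[hit] <- lema[idx[hit]]          # tokens without an entry stay as-is
    paste(tokens, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

docs <- c("duas leis sobre direitos", "nas primeiras 10 semanas")
result <- lemmatize_with_dict(docs, lexico_demo)
print(result)

# Untested assumption: with the tm package, extra arguments given to tm_map()
# are passed on to the function, so a call along these lines may apply the
# helper to every document while keeping the corpus structure:
# corpus <- tm_map(corpus, content_transformer(lemmatize_with_dict), dict = lexico)
```

Wrapping the function in `content_transformer()` is the usual way to keep tm documents intact when the function works on plain character vectors, which may be relevant to the "destroys all the documents" symptom described in the question.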