How can I keep non-alphanumeric symbols when tokenizing words in R?

Time: 2017-10-13 12:29:39

Tags: r nlp tokenize

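I am using the tokenizers package in R to tokenize text, but non-alphanumeric symbols such as "@" or "&" are lost, and I need to keep them. This is the function I am using: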

library(tokenizers)
tokenize_ngrams("My number & email address user@website.com", lowercase = FALSE, n = 3, n_min = 1, stopwords = character(), ngram_delim = " ", simplify = FALSE)

I know that tokenize_character_shingles has a strip_non_alphanum argument that allows punctuation to be kept, but that tokenization is applied to characters, not words.

Does anyone know how to handle this?

1 answer:

Answer 0: (score: 3)

If you can use a different package, ngram has two useful functions that keep those non-alphanumeric symbols:

> library(ngram)
> print(ngram("My number & email address user@website.com",n = 2), output = 'full')
number & | 1 
email {1} | 

My number | 1 
& {1} | 

address user@website.com | 1 
NULL {1} | 

& email | 1 
address {1} | 

email address | 1 
user@website.com {1} | 

> print(ngram_asweka("My number & email address user@website.com",1,3), output = 'full')
 [1] "My number &"                    "number & email"                
 [3] "& email address"                "email address user@website.com"
 [5] "My number"                      "number &"                      
 [7] "& email"                        "email address"                 
 [9] "address user@website.com"       "My"                            
[11] "number"                         "&"                             
[13] "email"                          "address"                       
[15] "user@website.com"              
> 
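
If adding a dependency is not an option, the whitespace-delimited behaviour seen in the output above can be approximated in base R. The following is only a minimal sketch and is not part of tokenizers, ngram, or quanteda: the helper name ngrams_ws and its arguments are hypothetical, and it assumes that words are separated by whitespace only, so symbols such as "&" and "@" are kept exactly as typed.

```r
# Hypothetical helper (not from tokenizers, ngram, or quanteda):
# build word n-grams by splitting on whitespace only, so that
# non-alphanumeric symbols are never stripped.
ngrams_ws <- function(text, n_min = 1, n = 3, delim = " ") {
  # Split on runs of whitespace; punctuation stays inside the tokens
  words <- strsplit(trimws(text), "\\s+")[[1]]
  out <- character()
  for (k in seq(n_min, n)) {
    if (length(words) < k) next  # not enough words for this n-gram size
    starts <- seq_len(length(words) - k + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + k - 1)], collapse = delim),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

res <- ngrams_ws("My number & email address user@website.com")
print(res)
```

Because the split is purely on whitespace, punctuation attached to a word (for example a trailing comma) would also stay attached; whether that is desirable depends on the text being processed.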

Another nice package, quanteda, offers more flexibility through its remove_punct argument.

> library(quanteda)
> text <- "My number & email address user@website.com"
> tokenize(text, ngrams = 1:3)
tokenizedTexts from 1 document.
Component 1 :
 [1] "My"                             "number"                        
 [3] "&"                              "email"                         
 [5] "address"                        "user@website.com"              
 [7] "My_number"                      "number_&"                      
 [9] "&_email"                        "email_address"                 
[11] "address_user@website.com"       "My_number_&"                   
[13] "number_&_email"                 "&_email_address"               
[15] "email_address_user@website.com"

>