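I am using the tokenizers package in R to tokenize text, but non-alphanumeric symbols such as "@" or "&" get lost, and I need to keep them. This is the function I am using: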
tokenize_ngrams("My number & email address user@website.com", lowercase = FALSE, n = 3, n_min = 1,stopwords = character(), ngram_delim = " ", simplify = FALSE)
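I know that tokenize_character_shingles has a strip_non_alphanum parameter that allows punctuation to be kept, but its tokenization is applied to characters, not words.
Does anyone know how to handle this?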
Answer 0 (score: 3)
If you can use a different package, ngram has two useful functions that keep those non-alphanumeric characters:
> library(ngram)
> print(ngram("My number & email address user@website.com",n = 2), output = 'full')
number & | 1
email {1} |
My number | 1
& {1} |
address user@website.com | 1
NULL {1} |
& email | 1
address {1} |
email address | 1
user@website.com {1} |
> print(ngram_asweka("My number & email address user@website.com",1,3), output = 'full')
[1] "My number &" "number & email"
[3] "& email address" "email address user@website.com"
[5] "My number" "number &"
[7] "& email" "email address"
[9] "address user@website.com" "My"
[11] "number" "&"
[13] "email" "address"
[15] "user@website.com"
>
Another nice package, quanteda, offers more flexibility through its remove_punct parameter.
> library(quanteda)
> text <- "My number & email address user@website.com"
> tokenize(text, ngrams = 1:3)
tokenizedTexts from 1 document.
Component 1 :
[1] "My" "number"
[3] "&" "email"
[5] "address" "user@website.com"
[7] "My_number" "number_&"
[9] "&_email" "email_address"
[11] "address_user@website.com" "My_number_&"
[13] "number_&_email" "&_email_address"
[15] "email_address_user@website.com"
>
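If adding another package is not an option, the same idea (split on whitespace only, then build word n-grams) can be sketched in base R. This is only a minimal sketch: `word_ngrams` and its arguments `n_min`, `n_max`, and `delim` are hypothetical names made up for illustration and are not part of tokenizers, ngram, or quanteda. It also assumes that whitespace is the only token boundary you care about, so punctuation attached to a word (for example a trailing comma) stays attached.

```r
# Hypothetical helper, not from any package: builds word n-grams while
# keeping symbols such as "&" and "@" because it splits on whitespace only.
word_ngrams <- function(x, n_min = 1, n_max = 3, delim = " ") {
  stopifnot(length(x) == 1, n_min >= 1, n_min <= n_max)

  # Split on runs of whitespace; drop any empty strings left over.
  words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]

  out <- character()
  for (n in n_min:n_max) {
    # Stop once the requested n-gram size exceeds the number of words.
    if (n > length(words)) break
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = delim),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

grams <- word_ngrams("My number & email address user@website.com")
print(grams)
```

For the example sentence this gives 15 n-grams (6 unigrams, 5 bigrams, 4 trigrams), with "&" and "user@website.com" kept as whole tokens. Passing `delim = "_"` would join the words with underscores, similar to the quanteda output above.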