I have a corpus consisting of two text files that I imported:
temp = list.files(pattern = ".txt")
mydata = lapply(temp, read.delim, sep ="\t", quote = "")
mydata
The output has class list, so I converted it to character:
class(mydata)
[1] "list"
mydata <- as.character(mydata)
The texts now have class character:
class(mydata)
[1] "character"
But they still seem to be single strings, because the output first shows:
[[1]]ï..We.give.the.observer.as.much.time.as.he.wants.to.make.his.response..we.simply.increase.the.number.of.alternative.stimuli.among.which.he.must.
(The line above is just an example from one of the texts.) It then prints the actual text, with each sentence on a separate line, for example:
ï..this.is.just.a.bunch.of.crab.to.analyse.
1 I need to understand how this R package works.
2 lexical diversity needs to be analysed for two texts for now.
3 In this document I am typing each sentence on a separate line.
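I need to convert these texts into a character vector so that, in the next step of the analysis, I can convert them to ASCII with the stringi package in R, for example: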
stri_enc_toascii(mydata)
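- This function only converts character vectors to ASCII encoding. So the question is:
- How do I convert a corpus of strings into a character vector?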
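P.S.: I have reviewed the other questions on StackOverflow to avoid posting a duplicate. Thank you for your help!
Thank you all for your help! I simply used as.vector to convert the strings into a character vector: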
as.vector(mydata)
is.vector(mydata)
TRUE
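But the main problem remains: I want a character vector as input for the stringi package, and I want stri_enc_toascii(mydata) to convert mydata to ASCII encoding (see here), but the encoding is still reported as unknown. Is there a direct way to convert the "unknown" encoding to "ascii"?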
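A minimal sketch of why the encoding can still be reported as unknown after a successful conversion, assuming it was checked with base R's Encoding(). The sample strings below are made up for illustration and are not the asker's data. Base R never attaches an encoding mark to pure-ASCII strings, so "unknown" is the expected result there; stri_enc_isascii() and stri_enc_mark() are probably better checks.

```r
library(stringi)

# Hypothetical sample data: the second element contains a non-ASCII character
x <- c("plain ascii text", "caf\u00e9")

Encoding(x)                # "unknown" "UTF-8": pure-ASCII strings carry no mark

# Non-ASCII code points are replaced with the SUBSTITUTE character (0x1A)
y <- stri_enc_toascii(x)

Encoding(y)                # "unknown" "unknown": expected for ASCII-only strings
stri_enc_isascii(y)        # TRUE TRUE
stri_enc_mark(y)           # "ASCII" "ASCII"
```

In other words, an "unknown" mark from Encoding() does not by itself mean the conversion failed.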
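Answer 0 (score: 0)
The question is not very clear, but it sounds like you want to flatten a vector of strings that has also been converted to ASCII: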
library(stringi)
string1 <- "Here's a random phrase." # English, ASCII
string2 <- ".هنا عبارة عشوائية هناائية" # Arabic, not ASCII
string3 <- "여기에 임의의 문구가 있습니다." # Korean, not ASCII
strings <- c(string1, string2, string3) # as a vector of strings of length 3
ascii_strings <- stri_enc_toascii(strings) # convert to ASCII
stri_flatten(ascii_strings) # as a flat, single element string
# other options....
stri_c(ascii_strings, collapse = " ") # as a flat, single element string
Reduce(paste, ascii_strings) # base::Reduce() / purrr::reduce() with paste() will do the same
stringr::str_c(ascii_strings, collapse = " ") # stringr::str_c() is built on stringi::stri_c(); collapse is needed to get a single string
stringr::str_flatten(ascii_strings) # stringr::str_flatten() wraps stringi::stri_flatten()
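If the goal is a character vector with one element per line of the files, it may be simpler to avoid read.delim() altogether. read.delim() returns a data frame and, with its default header = TRUE, turns the first line into a column name, which would explain the dotted first line in the question's output. The sketch below uses readLines() instead. The temporary directory, the file name text1.txt, and the sample sentences are hypothetical stand-ins for the asker's files, and UTF-8 input is an assumption.

```r
library(stringi)

# Hypothetical demo file standing in for the asker's .txt files
demo_dir <- tempfile("corpus_")
dir.create(demo_dir)
con <- file(file.path(demo_dir, "text1.txt"), open = "w", encoding = "UTF-8")
writeLines(c("I need to understand how this R package works.",
             "na\u00efve caf\u00e9 text"), con)
close(con)

# "\\.txt$" matches only names ending in .txt (the pattern is a regular expression)
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# readLines() returns one character element per line; unlist() flattens the
# per-file list into a single character vector (assumes the files are UTF-8)
lines <- unlist(lapply(files, readLines, encoding = "UTF-8"))

# Replace every non-ASCII code point with the SUBSTITUTE character (0x1A)
ascii_lines <- stri_enc_toascii(lines)

is.character(ascii_lines)            # TRUE
all(stri_enc_isascii(ascii_lines))   # TRUE
```

If approximating accented letters is preferable to replacing them, stri_trans_general(lines, "Latin-ASCII") is one alternative worth trying.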