Custom dictionary in Quanteda

Time: 2019-01-19 19:59:03

Tags: text encoding quanteda

I need to run a LIWC (Linguistic Inquiry and Word Count) analysis, and I am using quanteda / quanteda.dictionaries. I need to "load" custom dictionaries: I saved the word lists as separate .txt files and "load" them with readLines (example with just one dictionary):

autonomy = readLines("Dictionary/autonomy.txt", encoding = "UTF-8")

EODic<-quanteda::dictionary(list(autonomy=autonomy),encoding = "auto")

This is the text I am trying it on:

txt <- c("12th Battalion Productions is producing a fully holographic feature length production. Presenting a 3D audio-visual projection without a single cast member present, to give the illusion of live stage performance.")

When I then run the analysis on this text, I get an error.

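Clearly, the problem lies with my .txt file. I have quite a lot of dictionaries, so I would rather load them as files.

How can I fix this error? Specifying the encoding in readLines does not seem to help.

Here is the file: https://drive.google.com/file/d/12plgfJdMawmqTkcLWxD1BfWdaeHuPTXV/view?usp=sharing

Update: the easiest way to fix this on a Mac is to open the .txt file in Word rather than TextEdit. Word offers different encoding options than the default TextEdit!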

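1 answer:

Answer 0: (score: 1)

OK, the problem is not the encoding, since everything in the file you linked can be encoded entirely in the lower 128 characters of ASCII. The problem is the empty entries caused by blank lines. There is also some leading whitespace that needs to be removed. Both are easy to fix with some subsetting and a few stringi cleaning operations.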

library("quanteda")
## Package version: 1.3.14

autonomy <- readLines("~/Downloads/risktaking.txt", encoding = "UTF-8")
head(autonomy, 15)
##  [1] "adventuresome"  " adventurous"   " audacious"     " bet"          
##  [5] " bold"          " bold-spirited" " brash"         " brave"        
##  [9] " chance"        " chancy"        " courageous"    " danger"       
## [13] ""               "dangerous"      " dare"

# strip leading or trailing whitespace
autonomy <- stringi::stri_trim_both(autonomy)
# get rid of empties
autonomy <- autonomy[autonomy != ""]

Now you can create the dictionary and apply the quanteda.dictionaries::liwcalike() function.

# now define the quanteda dictionary
EODic <- dictionary(list(autonomy = autonomy))

txt <- c("12th Battalion Productions is producing a fully holographic feature length production. Presenting a 3D audio-visual projection without a single cast member present, to give the illusion of live stage performance.")

library("quanteda.dictionaries")
liwcalike(txt, dictionary = EODic)
##   docname Segment WC  WPS Sixltr Dic autonomy AllPunc Period Comma Colon
## 1   text1       1 35 15.5  34.29   0        0   11.43   5.71  2.86     0
##   SemiC QMark Exclam Dash Quote Apostro Parenth OtherP
## 1     0     0      0 2.86     0       0       0   8.57
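
Since the question mentions having many dictionaries stored as separate files, the same cleaning steps could be wrapped in a small helper and applied to a whole folder. The following is only a sketch under stated assumptions: `read_wordlist()` and `read_wordlists()` are hypothetical helper names (not part of quanteda), each .txt file is assumed to hold one dictionary entry per line, and base R's `trimws()` is used in place of `stringi::stri_trim_both()` so that the example has no extra dependencies. A temporary folder stands in for the `Dictionary/` directory.

```r
# Hypothetical helper (not part of quanteda): read one word-list file,
# trim surrounding whitespace, and drop empty entries.
read_wordlist <- function(path) {
  words <- readLines(path, encoding = "UTF-8", warn = FALSE)
  words <- trimws(words)   # base-R stand-in for stringi::stri_trim_both()
  words[nzchar(words)]     # drop entries left empty by blank lines
}

# Hypothetical helper: read every .txt file in a folder into a named list,
# using each file name (without extension) as the dictionary key.
read_wordlists <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  keys <- tools::file_path_sans_ext(basename(files))
  stats::setNames(lapply(files, read_wordlist), keys)
}

# Demonstration with a temporary folder standing in for "Dictionary/"
demo_dir <- file.path(tempdir(), "Dictionary")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("adventuresome", " adventurous", "", "dangerous", " dare "),
           file.path(demo_dir, "risktaking.txt"))

word_lists <- read_wordlists(demo_dir)
str(word_lists)

# Build the quanteda dictionary only if the package is installed
if (requireNamespace("quanteda", quietly = TRUE)) {
  EODic <- quanteda::dictionary(word_lists)
}
```

The resulting `EODic` can then be passed to `liwcalike()` in the same way as the single-key dictionary above, with one key per file.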