Create a corpus from a list of file paths in R

Asked: 2016-03-15 21:45:06

Tags: r text tm corpus

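I have 1030 individual .txt files in one directory, representing all the participants in a study.

I have successfully created a corpus from all the files in the directory using the tm package in R.

Now I am trying to create corpora for numerous subsets of these files; for example, one corpus for all the female authors and one for all the male authors.

I was hoping to pass a subset of the file path list to the Corpus function, but that has not worked.

Any help is appreciated. Here is an example to build from: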

pathname <- c("C:/Desktop/Samples")

study.files <- list.files(path = pathname, pattern = NULL, all.files = T, full.names = T, recursive = T, ignore.case = T, include.dirs = T) 

### This gives me a character vector that is equivalent to:

study.files <- c("C:/Desktop/Samples/author1.txt","C:/Desktop/Samples/author2.txt","C:/Desktop/Samples/author3.txt","C:/Desktop/Samples/author4.txt","C:/Desktop/Samples/author5.txt")

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2,4,5)

### This creates new character vectors containing the file paths
women.files <- study.files[women]
men.files <- study.files[men]

### Here are the things I've tried to create a corpus from the subsetted list. None of these work.

women_corpus <- Corpus(women.files)
women_corpus <- Corpus(DirSource(women.files))
women_corpus <- Corpus(DirSource(unlist(women.files)))

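The subsets I need to create are fairly complex, so I can't easily create new folders containing only the text files of interest for each corpus.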

2 Answers:

Answer 0 (score: 1)

This works the way you wanted.

pathname <- c("C:/data/test")

study.files <- list.files(path = pathname, pattern = NULL, all.files = T, full.names = T, recursive = T, ignore.case = T, include.dirs = F) 

### This gives me a character vector that is equivalent to:

study.files <- c("C:/data/test/test1/test1.txt",
                 "C:/data/test/test2/test2.txt",
                 "C:/data/test/test3/test3.txt")

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2)

### This creates new character vectors containing the file paths
women.files <- study.files[women]
men.files <- study.files[men]

### Read each subsetted file and build the corpus from its contents
library(tm)

nedir <- lapply(women.files, function(filename) read.table(filename, sep = "\t", stringsAsFactors = FALSE))
# keep the first column of each file as that document's text
hepsi <- lapply(nedir, function(x) x$V1)
women_corpus <- Corpus(VectorSource(hepsi))
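
As a minimal sketch of the same idea: if the files are free text rather than tab-separated tables, `readLines` sidesteps the way `read.table` interprets quotes and separators. The temporary sample files and the `read_document` helper below are made up for illustration and stand in for the real `study.files`; `VCorpus` and `VectorSource` come from tm.

```r
library(tm)

# Hypothetical sample files standing in for the real study.files
sample_dir <- tempfile("samples")
dir.create(sample_dir)
study.files <- file.path(sample_dir, paste0("author", 1:5, ".txt"))
for (i in seq_along(study.files)) {
  writeLines(c(paste("First line by author", i), "Second line."), study.files[i])
}

women <- c(1, 3)
women.files <- study.files[women]

# Hypothetical helper: read one file and collapse its lines into a single string
read_document <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# One character string per file, so each file becomes one document
women_texts <- vapply(women.files, read_document, character(1), USE.NAMES = FALSE)
women_corpus <- VCorpus(VectorSource(women_texts))

# Number of documents in the corpus
length(women_corpus)
```

Collapsing each file into one string keeps a one-to-one mapping between files and documents, which a list of per-line vectors does not make explicit.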

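Answer 1 (score: 0)

I had a similar problem: I was clustering documents based on their cosine similarity and wanted to analyse each cluster separately, without organising the documents into separate folders.

The documentation for DirSource describes an option to pass a regular expression pattern ("only file names which match the regular expression will be returned"), so I used the cluster information to group the documents and construct a regex for each cluster.

Using the example above, you could use something like: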

library(tidyverse)
library(tm)

study.files <- c(
  "C:/Desktop/Samples/author1.txt"
  ,"C:/Desktop/Samples/author2.txt"
  ,"C:/Desktop/Samples/author3.txt"
  ,"C:/Desktop/Samples/author4.txt"
  ,"C:/Desktop/Samples/author5.txt"
)

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2,4,5)

# putting this into a data.frame
doc_df <- data.frame(document = study.files) %>% 
  # categorise each of the documents using the numeric vectors 
  # defined above, as per original example
  mutate(
    index = row_number()
    , gender = if_else(index %in% women, 'woman', 'man')
    # separate the file name from the full path
    , filename = basename(as.character(document))
    ) %>% 
  group_by(gender) %>%
  # build the regex select pattern
  mutate(select_pattern = str_replace_all(paste0(filename, collapse = '|'), '[.]', "[.]")) %>%
  summarise(select_pattern = first(select_pattern))
  
men_df <- doc_df %>% filter(gender == 'man')
woman_df <- doc_df %>% filter(gender == 'woman')

# you can then use this to load a subset of documents from a single directory using regex
men_corpus <- Corpus(DirSource("C:/Desktop/Samples/", pattern = men_df$select_pattern[1]))
woman_corpus <- Corpus(DirSource("C:/Desktop/Samples/", pattern = woman_df$select_pattern[1]))
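
One caveat with the pattern built above is that it is unanchored, so a file such as `coauthor2.txt` in the same directory would also match `author2[.]txt`. The following is a sketch in base R of an anchored variant; `build_pattern` is a hypothetical helper, and it assumes the file names contain no regex metacharacters other than dots.

```r
study.files <- c(
  "C:/Desktop/Samples/author1.txt",
  "C:/Desktop/Samples/author2.txt",
  "C:/Desktop/Samples/author3.txt",
  "C:/Desktop/Samples/author4.txt",
  "C:/Desktop/Samples/author5.txt"
)
men <- c(2, 4, 5)

# Hypothetical helper: build one anchored regex from a vector of file paths.
# Assumes dots are the only regex metacharacters in the file names.
build_pattern <- function(paths) {
  escaped <- gsub(".", "[.]", basename(paths), fixed = TRUE)
  paste0("^(", paste(escaped, collapse = "|"), ")$")
}

men_pattern <- build_pattern(study.files[men])

# Check the pattern against every file name plus a near miss
candidates <- c(basename(study.files), "coauthor2.txt")
matched <- candidates[grepl(men_pattern, candidates)]
print(matched)
```

The resulting string could then be passed as the `pattern` argument of `DirSource`, in the same way as `men_df$select_pattern[1]` above.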