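The R code is below. I'm not sure why I get an "object 'folder' not found" error. I first use the untar() function to extract the tar file. Then I create a training folder that points to the 20news-bydate-train data, use a function to read each folder, and build a data frame that holds the newsgroup name, the message ID, and the accompanying text.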
library(dplyr)
library(tidyr)
library(purrr)
url <- "http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz"
download.file(url, destfile = "20news-bydate.tar.gz")
untar("20news-bydate.tar.gz")
training_folder <- "20news-bydate-train"
# Create a function to read all files from a folder into a data frame
read_folder <- function(infolder) {
  data_frame(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}
# Use unnest() and map() to apply read_folder to each subfolder
(raw_text <- data_frame(folder = dir(training_folder, full.names = TRUE)) %>%
  unnest(map(folder, read_folder)) %>%
  transmute(newsgroup = basename(folder), id, text))
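Answer 0 (score: 0)
Note: the read_lines function used inside read_folder requires library(readr), which is missing from the question. I don't know the exact answer to "why am I getting this error"; what follows is an attempt to work around it.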
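As a quick check that read_lines is available once readr is loaded, here is a minimal sketch. The temporary file and its contents are made up for illustration and are not part of the 20 Newsgroups data.

```r
library(readr)

# Write a small temporary file (hypothetical content) to read back
tmp <- tempfile(fileext = ".txt")
writeLines(c("From: someone@example.com", "Subject: test message"), tmp)

# read_lines() returns one character string per line of the file
lines <- read_lines(tmp)
print(lines)
```

Without library(readr), the call can also be written as readr::read_lines(tmp), provided the package is installed.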
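Most likely problem:
unnest() works on a list column that already exists in the data frame, so that column has to be created with mutate() first. The code in the question probably relies on behavior that existed before it was deprecated, where an expression such as map(folder, read_folder) could be passed to unnest() directly. Adding this small step ensures the data is processed correctly.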
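The mutate-then-unnest pattern can be tried on toy data without downloading anything. This is only a sketch: make_rows is a hypothetical stand-in for read_folder, and the folder paths are invented.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper standing in for read_folder():
# returns a two-row tibble for each input path
make_rows <- function(name) {
  tibble(id = c("1", "2"),
         text = paste(basename(name), c("first line", "second line")))
}

# Invented folder paths, one row per "newsgroup"
toy <- tibble(folder = c("path/to/groupA", "path/to/groupB"))

result <- toy %>%
  mutate(temp = map(folder, make_rows)) %>%  # create the list column explicitly
  unnest(temp) %>%                           # then unnest it by name
  transmute(newsgroup = basename(folder), id, text)

print(result)
```

Because the list column temp exists before unnest() is called, the column name folder is only ever used inside mutate() and transmute(), where it is looked up in the data frame.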
Possible solution:
library(dplyr)
library(tidyr)
library(purrr)
library(readr)
url <- "http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz"
download.file(url, destfile = "20news-bydate.tar.gz")
untar("20news-bydate.tar.gz")
training_folder <- "20news-bydate-train"
# Create a function to read all files from a folder into a data frame
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}
# Build the list column with mutate() first, then unnest it by name
raw_text <- tibble(folder = dir(training_folder, full.names = TRUE)) %>%
  mutate(temp = map(folder, read_folder)) %>%
  unnest(temp) %>%
  transmute(newsgroup = basename(folder), id, text)
Convert to a plain data frame:
raw_text_df <- as.data.frame(raw_text)
The output looks like this:
> print(head(raw_text_df))
    newsgroup    id text
1 alt.atheism 49960 From: mathew <mathew@mantis.co.uk>
2 alt.atheism 49960 Subject: Alt.Atheism FAQ: Atheist Resources
3 alt.atheism 49960 Summary: Books, addresses, music -- anything related to atheism
4 alt.atheism 49960 Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
5 alt.atheism 49960 Expires: Thu, 29 Apr 1993 11:57:19 GMT
6 alt.atheism 49960 Distribution: world
Hope this helps.