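For my master's thesis I am analyzing the courses of a university. I have 1134 courses (as rows) with 3 variables (as columns). Since I have very little experience with R, I am struggling to write the code for it. Here is some more information; I have also attached a sample of the database as an image.
Column 1 is the course name, column 2 is the course description, and column 3 is the learning outcomes.
I want to clean the data and remove stop words, punctuation and other irrelevant characters. I do this with the following code: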
rm(list=ls());
library(readxl);
library(MASS);
library(nnet);
library(NLP);
library(tm);
database <- read_excel("/Volumes/GoogleDrive/My Drive/TU e Innovation Management /Thesis/testdatabasematrix.xlsx");
#name columns
colnames(database)[1] <- "Name";
colnames(database)[2] <- "Description";
colnames(database)[3] <- "LearningOutcomes";
#replace punctuation
database2 <- gsub(pattern = "\\W", replace = " ", database)
#replace digits
database2 <- gsub(pattern="\\d", " ", database2)
#everything to lower
database2 <- tolower(database2)
#until here everything fine
database2 <- removeWords(database2, stopwords());
#When I try to save the database in a data frame, the output is merely 3 observations of 1 variable instead of 1141 obs. of 3 variables
database2 <- data.frame(database2)
I hope you can help me :). If you need more information, please say so and I will of course provide it.
Best, Christiaan
Answer 0 (score: 0)
It seems that removeWords needs a character vector as its first argument. So perhaps
for (x in colnames(database2)) database2[[x]] <- removeWords(database2[[x]], stopwords())
does the trick.
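To illustrate the point about character vectors, here is a minimal base R sketch. The sample data, the tiny `stop_list` and the helper `clean_column` are made up for illustration; in the real script `removeWords(x, stopwords())` from tm could take the place of the stop word `gsub()`. It shows that calling `gsub()` on a whole data frame collapses it to one string per column, while cleaning column by column keeps the shape.

```r
# Hypothetical stand-in for the real spreadsheet (2 courses, 3 columns)
database <- data.frame(
  Name = c("Intro to R 101", "Data Mining!"),
  Description = c("This course is about R.", "Mining of the data, 2nd ed."),
  LearningOutcomes = c("You learn R.", "You can mine data."),
  stringsAsFactors = FALSE
)

# gsub() coerces a data frame with as.character(), giving one string per column
collapsed <- gsub("\\W", " ", database)
length(collapsed)  # 3, which is why data.frame() later shows 3 observations

# Hypothetical, very small stop word list (tm::stopwords() is much longer)
stop_list <- c("the", "is", "of", "to", "you", "this", "about", "can")
stop_pattern <- paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b")

# Clean a single character vector: punctuation, digits, case, stop words
clean_column <- function(x) {
  x <- gsub("\\W", " ", x)
  x <- gsub("\\d", " ", x)
  x <- tolower(x)
  gsub(stop_pattern, "", x, perl = TRUE)
}

# Apply per column; assigning into database2[] keeps the data frame structure
database2 <- database
database2[] <- lapply(database, clean_column)
dim(database2)  # 2 rows, 3 columns
```

The `database2[] <- lapply(...)` idiom is what preserves the rows and columns; the cleaning steps themselves are the same as in the question.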
Answer 1 (score: 0)
You could also consider using the tidytext and dplyr packages, which are very nice:
# some data similar to yours
database <- data.frame(Name = c('Aalto Fellows II', 'Aalto introduction to Services'),
                       Description = c('This course is a lot of words I do not know.','Service economy, whatever it does mean.'),
                       LearningOutcomes = c('Aalto Fellows, which are the smartest, learn.','Knowing what does Service economy means.'),
                       stringsAsFactors = FALSE)
# cool packages
library(tidytext)
library(dplyr)
# here the text transformations for titles
title <- tibble(line = 1:nrow(database), text = database$Name) %>%        # as tibble
  unnest_tokens(word, text) %>%                                           # remove punctuation, lowercase, put words in column
  anti_join(stop_words, by = c("word" = "word")) %>%                      # remove stopwords
  group_by(line) %>% summarise(title = paste(word, collapse = ' '))       # now all in a row!
# here the text transformations for descriptions
description <- tibble(line = 1:nrow(database), text = database$Description) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = c("word" = "word")) %>%
  group_by(line) %>% summarise(title = paste(word, collapse = ' '))
# here the text transformations for learning outcomes
learningoutcomes <- tibble(line = 1:nrow(database), text = database$LearningOutcomes) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = c("word" = "word")) %>%
  group_by(line) %>% summarise(title = paste(word, collapse = ' '))
# now the full dataset
database2 <- title %>%
  left_join(description, by = 'line') %>%
  left_join(learningoutcomes, by = 'line')
colnames(database2) <- c("line","Name","Description","LearningOutcomes")
database2
# A tibble: 2 x 4
   line Name                        Description     LearningOutcomes
  <int> <chr>                       <chr>           <chr>
1     1 aalto fellows ii            lot words       aalto fellows smartest learn
2     2 aalto introduction services service economy knowing service economy means
You can convert it to a data.frame with data.frame().
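Since the same pipeline is repeated for each of the three columns, it could be wrapped in a small helper. The following is only a sketch: the function name `clean_text` is hypothetical, and it assumes dplyr 1.0 or later (for `across()`) together with tidytext. It also returns an empty string for rows that consist only of stop words, so no rows are lost.

```r
library(dplyr)
library(tidytext)

# Same sample data as above
database <- data.frame(
  Name = c('Aalto Fellows II', 'Aalto introduction to Services'),
  Description = c('This course is a lot of words I do not know.',
                  'Service economy, whatever it does mean.'),
  LearningOutcomes = c('Aalto Fellows, which are the smartest, learn.',
                       'Knowing what does Service economy means.'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: tokenize, drop stop words, paste back per row
clean_text <- function(text) {
  cleaned <- tibble(line = seq_along(text), text = text) %>%
    unnest_tokens(word, text) %>%            # lowercases and strips punctuation
    anti_join(stop_words, by = "word") %>%   # remove stop words
    group_by(line) %>%
    summarise(text = paste(word, collapse = " "))
  out <- rep("", length(text))               # rows with only stop words stay empty
  out[cleaned$line] <- cleaned$text
  out
}

# Apply the helper to every column while keeping the data frame shape
database2 <- database %>% mutate(across(everything(), clean_text))
database2
```

Because `clean_text` returns a vector of the same length as its input, it can be used inside `mutate()` and the result still has one row per course.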
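Answer 2 (score: 0)
Since you have not provided a sample dataset, I created one. This solution also uses the tidytext and dplyr packages, but differs slightly from the one above. Here you work directly on the data frame and keep it as a data frame as you go.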
# Packages used below
library(dplyr)
library(tidytext)
# Create dataframe
database <- data.frame("Name" = c("Hello how, are , you.", "I am fine thank you!"),
                       "Description" = c("installation of package ‘tibble’",
                                         "package ‘readxl’ successfully unpacked "),
                       "LearningOutcome" = c("Volumes/GoogleDrive/My Drive",
                                             "TU e Innovation Management"),
                       stringsAsFactors = FALSE)
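# Take off stop words: drops rows in which an entire field is exactly a stop word
# (note: this does not remove stop words inside longer strings)
data(stop_words)
database_stopwords <- database %>%
  filter(!Name %in% stop_words$word) %>%
  filter(!Description %in% stop_words$word) %>%
  filter(!LearningOutcome %in% stop_words$word)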
# Convert to lower
database_lower <- database_stopwords %>% mutate(
  Name = tolower(Name),
  Description = tolower(Description),
  LearningOutcome = tolower(LearningOutcome)
)
# Take off punctuation and others
database_lower$Name <- gsub("\\W|\\d|http\\w?", " ", database_lower$Name, perl = TRUE)
database_lower$Description <- gsub("\\W|\\d|http\\w?", " ", database_lower$Description, perl = TRUE)
database_lower$LearningOutcome <- gsub("\\W|\\d|http\\w?", " ", database_lower$LearningOutcome, perl = TRUE)
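Replacing every unwanted character with a space leaves runs of blanks behind. As a possible follow-up step, the sketch below applies the same pattern to two made-up strings (the URL is purely hypothetical) and then squeezes the whitespace with base R.

```r
# Hypothetical sample strings; the second contains a URL and a digit
x <- c("Hello how, are , you.", "visit http://example.org 2 day")

# Same pattern as above: non-word characters, digits, and "http"/"https"
cleaned <- gsub("\\W|\\d|http\\w?", " ", tolower(x), perl = TRUE)

# Collapse runs of whitespace and trim both ends
tidy <- trimws(gsub("\\s+", " ", cleaned))
tidy
# "hello how are you"  "visit example org day"
```

The same two calls could be applied to each column of database_lower once the punctuation step has run.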