Removing ngrams with leading and trailing stopwords

Date: 2017-10-11 10:10:26

Tags: r text-mining tm quanteda

I want to identify the top n-grams in a collection of academic papers, including n-grams with nested stopwords, but excluding n-grams with leading or trailing stopwords.

I have about 100 pdf files. I converted them to plain-text files via an Adobe batch command and collected them in a single directory. From there I work in R. (The code is a patchwork, as I am just getting started with text mining.)

My code:

library(tm)
# Make path for sub-dir which contains corpus files 
path <- file.path(getwd(), "txt")
# Load corpus files
docs <- Corpus(DirSource(path), readerControl=list(reader=readPlain, language="en"))

# Cleaning (wrap base functions in content_transformer() so tm keeps the document structure)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)

# Merge corpus (Corpus class to character vector)
txt <- c(docs, recursive=T)

# Find trigrams (but I might look for other ngrams as well)
library(quanteda)
myDfm <- dfm(txt, ngrams = 3)
# Keep only features occurring at least 5 times
myDfm <- dfm_trim(myDfm, min_count = 5)
# Display top features
topfeatures(myDfm)
#                   as_well_as             of_the_ecosystem                  in_order_to 
#                          603                          543                          458 
#         a_business_ecosystem       the_business_ecosystem strategic_management_journal 
#                          431                          431                          359 
#             in_the_ecosystem        academy_of_management                  the_role_of 
#                          336                          311                          289 
#                the_number_of 
#                          276

For example, among the top ngrams shown above, I would want to keep "academy_of_management", but not "as_well_as", nor "the_role_of". I would like the code to work for any n-gram (ideally also for n below 3, although I realize that in that case it is simpler to remove the stopwords first).

2 answers:

Answer 0 (score: 2)

Using the corpus R package, with The Wizard of Oz as an example (Project Gutenberg ID #55):


library(corpus)
library(Matrix) # needed for sparse matrix operations

# download the corpus
corpus <- gutenberg_corpus(55)

# set the preprocessing options
text_filter(corpus) <- text_filter(drop_punct = TRUE, drop_number = TRUE)

# compute trigram statistics for terms appearing at least 5 times;
# specify `types = TRUE` to report component types as well
stats <- term_stats(corpus, ngrams = 3, min_count = 5, types = TRUE)

# discard trigrams starting or ending with a stopword
stats2 <- subset(stats, !type1 %in% stopwords_en & !type3 %in% stopwords_en)

# print first five results:
print(stats2, 5)
##    term               type1 type2 type3     count support
## 4  said the scarecrow said  the   scarecrow    36       1
## 7  back to kansas     back  to    kansas       28       1
## 16 said the lion      said  the   lion         19       1
## 17 said the tin       said  the   tin          19       1
## 48 road of yellow     road  of    yellow       12       1
## ⋮  (35 rows total)

# form a document-by-term count matrix for these terms
x <- term_matrix(corpus, select = stats2$term)

In your case, you can convert from your tm Corpus object instead of downloading from Project Gutenberg.
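A minimal sketch of that conversion, assuming the corpus package's as_corpus_frame() has a method for tm Corpus objects (check the version you have installed):

library(corpus)
# assumption: as_corpus_frame() accepts a tm Corpus; if your version lacks
# the method, build a data frame with a "text" column from the documents instead
corpus <- as_corpus_frame(docs)
text_filter(corpus) <- text_filter(drop_punct = TRUE, drop_number = TRUE)
stats <- term_stats(corpus, ngrams = 3, min_count = 5, types = TRUE)
stats2 <- subset(stats, !type1 %in% stopwords_en & !type3 %in% stopwords_en)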

Answer 1 (score: 1)

Here is the approach in quanteda: use dfm_remove(), where the patterns to remove are the stopwords followed by the concatenator at the start of the expression, and preceded by the concatenator at the end. (Note that for reproducibility, I have used a built-in text object here.)

library("quanteda")

# for your own data, replace this line with your own txt
txt <- data_char_ukimmig2010

(myDfm <- dfm(txt, remove_numbers = TRUE, remove_punct = TRUE, ngrams = 3))
## Document-feature matrix of: 9 documents, 5,518 features (88.5% sparse).

(myDfm2 <- dfm_remove(myDfm, 
                     pattern = c(paste0("^", stopwords("english"), "_"), 
                                 paste0("_", stopwords("english"), "$")), 
                     valuetype = "regex"))
## Document-feature matrix of: 9 documents, 1,763 features (88.6% sparse).
head(featnames(myDfm2))
## [1] "immigration_an_unparalleled" "bnp_can_solve"               "solve_at_current"           
## [4] "immigration_and_birth"       "birth_rates_indigenous"      "rates_indigenous_british" 
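
Because each pattern is anchored (^ plus the connector for leading stopwords, the connector plus $ for trailing ones), the same removal applies to any n-gram length. A sketch for bigrams, assuming the same quanteda version as above:

myDfm2g <- dfm(txt, remove_numbers = TRUE, remove_punct = TRUE, ngrams = 2)
# drop bigrams that start or end with a stopword
myDfm2g <- dfm_remove(myDfm2g,
                      pattern = c(paste0("^", stopwords("english"), "_"),
                                  paste0("_", stopwords("english"), "$")),
                      valuetype = "regex")

For bigrams this drops every feature containing a stopword, which is why simply removing stopwords before forming the n-grams is equivalent in that case.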

Bonus answer:

You can read your pdf files with the readtext package, and it will work fine with quanteda using the code above.

library("readtext")
txt <- readtext("yourpdfolder/*.pdf") %>% corpus()