Question

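I am trying to build a corpus from Java source code.
I am following the preprocessing steps in this paper: http://cs.queensu.ca/~sthomas/data/Thomas_2011_MSR.pdf

According to Section [2.1], the following should be removed:
- characters related to the syntax of the programming language [already done by removePunctuation]
- programming language keywords [already done by tm_map(dsc, removeWords, javaKeywords)]
- common English stop words [already done by tm_map(dsc, removeWords, stopwords("english"))]

The words should also be stemmed [already done by tm_map(dsc, stemDocument)].

The remaining part is to split identifiers and method names into multiple parts based on common naming conventions.

For example, 'firstName' should be split into 'first' and 'name'.

Another example: 'calculateAge' should be split into 'calculate' and 'age'.
Can anyone help me?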

    library(tm)
    # pattern is a regular expression, so escape the dot and anchor the extension
    dd <- DirSource(pattern = "\\.java$", recursive = TRUE)
    javaKeywords <- c("abstract","continue","for","new","switch","assert","the","default","package","synchronized","boolean","do","if","private","this","break","double","implements","protected","throw","byte","else","the","null","NULL","TRUE","FALSE","true","false","import","public","throws","case","enum", "instanceof","return","transient","catch","extends","int","short","try","char","final","interface","static","void","class","finally","long","volatile","const","float","native","super","while")
    dsc <- Corpus(dd)
    dsc <- tm_map(dsc, stripWhitespace)
    dsc <- tm_map(dsc, removePunctuation)
    dsc <- tm_map(dsc, removeNumbers)
    dsc <- tm_map(dsc, removeWords, stopwords("english"))
    dsc <- tm_map(dsc, removeWords, javaKeywords)
    dsc <- tm_map(dsc, stemDocument)
    dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = FALSE))

Answer 1

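You can create a custom function that splits words at capital letters (vectorized here):

    # gsub() splits at every lower-case/upper-case boundary; sub() with a
    # greedy (.*) would split only at the last capital letter.
    splitCapital <- function(x)
      unlist(strsplit(tolower(gsub('([a-z0-9])([A-Z])', '\\1 \\2', x, perl = TRUE)), ' '))

Examples:

    splitCapital('firstName')
    [1] "first" "name"

    splitCapital(c('firstName','calculateAge'))
    [1] "first"     "name"      "calculate" "age"

Then you can iterate over your corpus:

    corpus.split <- lapply(dsc, splitCapital)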
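As a rough sketch of how a splitter like this could be wired into the question's tm pipeline: the helper name `split_camel` and the sample strings are made up for illustration, and the tm part only runs if the package is installed. The point is the ordering. The split has to happen before anything lowercases the text, otherwise the capital letters it relies on are already gone.

```r
# Hypothetical helper: insert a space at every lower-case/digit -> upper-case
# boundary. perl = TRUE uses PCRE, where [a-z] and [A-Z] are plain ASCII ranges.
split_camel <- function(x) {
  gsub("([a-z0-9])([A-Z])", "\\1 \\2", x, perl = TRUE)
}

split_camel(c("firstName", "calculateAge", "getFirstName"))
# returns c("first Name", "calculate Age", "get First Name")

# Sketch of the pipeline order, assuming tm is available (sample documents
# are invented): split first, then lowercase, then strip punctuation.
if (requireNamespace("tm", quietly = TRUE)) {
  corp <- tm::VCorpus(tm::VectorSource(c("int firstName;", "calculateAge()")))
  corp <- tm::tm_map(corp, tm::content_transformer(split_camel))
  corp <- tm::tm_map(corp, tm::content_transformer(tolower))
  corp <- tm::tm_map(corp, tm::removePunctuation)
  print(as.character(corp[[1]]))
}
```

Wrapping the function in `content_transformer()` keeps the result a corpus, so the remaining `tm_map` steps and `DocumentTermMatrix` can be applied afterwards as in the question.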

Answer 2

I wrote a tool in Perl that does all sorts of source code preprocessing, including identifier splitting:

https://github.com/stepthom/lscp

The relevant code snippet is:

    =head2 tokenize
     Title    : tokenize
     Usage    : tokenize($wordsIn)
     Function : Splits words based on camelCase, under_scores, and dot.notation.
              : Leaves other words alone.
     Returns  : $wordsOut => string, the tokenized words
     Args     : named arguments:
              : $wordsIn => string, the white-space delimited words to process
    =cut
    sub tokenize{
        my $wordsIn  = shift;
        my $wordsOut = "";

        for my $w (split /\s+/, $wordsIn) {
            # Split up camel case: aaA ==> aa A
            $w =~ s/([a-z]+)([A-Z])/$1 $2/g;

            # Split up camel case: AAa ==> A Aa
            # Split up camel case: AAAAa ==> AAA Aa
            $w =~ s/([A-Z]{1,100})([A-Z])([a-z]+)/$1 $2$3/g;

            # Split up underscores
            $w =~ s/_/ /g;

            # Split up dots
            $w =~ s/([a-zA-Z0-9])\.+([a-zA-Z0-9])/$1 $2/g;

            $wordsOut = "$wordsOut $w";
        }

        return removeDuplicateSpaces($wordsOut);
    }

The above hacks are based on my own experience preprocessing source code for text analysis. Feel free to steal and modify.
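For anyone who would rather stay in R, the four substitutions above translate almost line for line to `gsub()` with `perl = TRUE`. This is only a sketch: `tokenize_identifiers` is a name invented here, and because `removeDuplicateSpaces` is not shown in the snippet, the whitespace cleanup at the end is an assumption about what it does.

```r
# Hypothetical R port of the Perl tokenize() rules shown above.
tokenize_identifiers <- function(words) {
  # camelCase: aaA ==> aa A
  w <- gsub("([a-z]+)([A-Z])", "\\1 \\2", words, perl = TRUE)
  # acronym followed by a word: AAa ==> A Aa, AAAAa ==> AAA Aa
  w <- gsub("([A-Z]{1,100})([A-Z])([a-z]+)", "\\1 \\2\\3", w, perl = TRUE)
  # under_scores
  w <- gsub("_", " ", w, fixed = TRUE)
  # dot.notation
  w <- gsub("([a-zA-Z0-9])\\.+([a-zA-Z0-9])", "\\1 \\2", w, perl = TRUE)
  # Assumed stand-in for removeDuplicateSpaces: collapse and trim whitespace
  trimws(gsub("\\s+", " ", w, perl = TRUE))
}

tokenize_identifiers(c("firstName", "XMLHttpRequest", "max_value", "java.util.List"))
# returns c("first Name", "XML Http Request", "max value", "java util List")
```

Like the Perl original, this keeps the original case, so lowercasing is left to a later step in the pipeline.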

Answer 3

I realize this is an old question and the OP has either solved their problem or moved on, but in case someone else comes across this question and is seeking an identifier splitting package, I would like to offer Spiral ("SPlitters for IdentifieRs: A Library"). It is written in Python but comes with a command-line utility that can read a file of identifiers (one per line) and split each one.

Splitting identifiers is deceptively difficult. It's actually a research-grade problem for which no perfect solution exists today. Even in cases where the input consists of identifiers that follow some convention such as camel case, ambiguities can arise—and of course, things are much harder when source code does not follow a consistent convention.

Spiral implements numerous identifier splitting algorithms, including a novel algorithm called Ronin. It uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin can split identifiers that do not use camel case or other naming conventions, including cases such as splitting J2SEProjectTypeProfiler into [J2SE, Project, Type, Profiler], which requires the reader to recognize J2SE as a unit. Here are some more examples of what Ronin can split:

    # spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
    mStartCData: ['m', 'Start', 'C', 'Data']
    nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
    getUtf8Octets: ['get', 'Utf8', 'Octets']
    GPSmodule: ['GPS', 'module']
    savefileas: ['save', 'file', 'as']
    nbrOfbugs: ['nbr', 'Of', 'bugs']

If you want simple strict camel-case or other simpler splitters, Spiral offers several of those too. Please see the GitHub page for more information.
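To see why a plain camel-case rule is not enough, here is a small R sketch that runs a naive regex splitter over the same identifiers as in the Ronin example. The name `naive_split` is invented for illustration, and this is not how Spiral or Ronin work internally; it is just the lower-to-upper boundary rule on its own.

```r
# Hypothetical naive splitter: break only where a lower-case letter or digit
# is followed by an upper-case letter.
naive_split <- function(x) {
  strsplit(gsub("([a-z0-9])([A-Z])", "\\1 \\2", x, perl = TRUE), " ", fixed = TRUE)
}

ids <- c("mStartCData", "nonnegativedecimaltype", "getUtf8Octets",
         "GPSmodule", "savefileas", "nbrOfbugs")
result <- setNames(naive_split(ids), ids)

result$mStartCData             # "m" "Start" "CData"   (C and Data stay fused)
result$getUtf8Octets           # "get" "Utf8" "Octets"
result$GPSmodule               # "GPSmodule"           (no boundary detected)
result$nonnegativedecimaltype  # "nonnegativedecimaltype"
result$nbrOfbugs               # "nbr" "Ofbugs"
```

The regex handles the well-formed camel-case identifier, but anything without case cues is left untouched, which is where dictionary and frequency information becomes necessary.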
