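Everything works fine, except when I try to use regular expressions to remove the unwanted LaTeX commands. For some reason none of the variations I have tried work.
Here is my code snippet: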
input = new BufferedReader(new FileReader(args[0]));
output = new PrintWriter(new FileWriter(args[1]));
Set<String> wordsSet = new TreeSet<String>();
String currentWord;
String wholeText = "";
while ((currentWord = input.readLine()) != null)
    wholeText += currentWord + "\n";
wholeText = wholeText.replaceAll(" |'|\\.|:|/|`|%|-|\\d", "\n");
//output.print(wholeText);
String[] asda = wholeText.split("\n");
String[] un = {"\\documentclass", "\\usepackage", "\\input", "\\begin",
               "\\end", "\\vspace", "\\ref", "\\includegraphics",
               "\\label"};
System.out.println(asda.length);
for (String a : asda)
{
    for (String unw : un)
        if (a.startsWith(unw))
            continue;
    if (a.contains("-"))
        continue;
    if (a.contains("/"))
        continue;
    if (a.matches(".*\\d.*"))
        continue;
    a = a.replaceAll("[.,?!'`()=:-<>{} <]", "");
    if (a == "\n" || a.startsWith("\\") || a.length() == 1
            || (a.length() > 0 && !Character.isLetter(a.charAt(0))))
        if (a.startsWith("\\cite"))
            a = a.replace("\\cite", "");
        else if (a.startsWith("\\textbf"))
            a = a.replace("\\textbf", "");
        else if (a.startsWith("\\author"))
            a = a.replace("\\author", "");
        else if (a.startsWith("\\emph"))
            a = a.replace("\\emph", "");
        else if (a.startsWith("\\texttt"))
            a = a.replace("\\texttt", "");
        else if (a.startsWith("\\section"))
            a = a.replace("\\section", "");
        else if (a.startsWith("\\url"))
            a = a.replace("\\url", "");
        else
            continue;
    if (a.length() < 1)
        continue;
    if (a.length() > 1 && Character.isUpperCase(a.charAt(0))
            && Character.isLowerCase(a.charAt(1)))
        a = a.toLowerCase();
    wordsSet.add(a);
}
//Collections.sort(wordsSet);
Iterator i = wordsSet.iterator();
while (i.hasNext())
    output.println(i.next());
System.out.println(wordsSet.size());
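Two details in the loop may be worth checking independently of the regex problem. The `continue` inside `for (String unw : un)` only advances that inner loop, so it does not skip the current token, and `a == "\n"` compares object references rather than string contents. The following is a minimal sketch of one way around both; `TokenChecks`, `startsWithAny` and `isNewline` are hypothetical names introduced only for this illustration.

```java
final class TokenChecks {
    // Hypothetical helper: true when the token starts with any of the prefixes.
    // Returning a boolean lets the caller decide to skip the token in the
    // OUTER loop, which a plain `continue` in a nested loop cannot do.
    static boolean startsWithAny(String token, String[] prefixes) {
        for (String prefix : prefixes)
            if (token.startsWith(prefix))
                return true;
        return false;
    }

    // Hypothetical helper: compares contents with equals() instead of ==.
    static boolean isNewline(String token) {
        return "\n".equals(token);
    }

    public static void main(String[] args) {
        String[] un = {"\\documentclass", "\\usepackage"};
        String[] tokens = {"\\usepackage{a", "mancs}", "hello"};
        for (String a : tokens) {
            if (startsWithAny(a, un))
                continue; // skips the token itself
            System.out.println(a);
        }
    }
}
```

In the original loop this would amount to replacing the nested `for` with `if (startsWithAny(a, un)) continue;`.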
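First I read the entire contents of the LaTeX file into a single String, and then perform some replacements with the replaceAll method of the String class. However, when I try to include the LaTeX commands I don't want in those replacements, for some reason it doesn't work; it seems to have no effect at all. Some of the regular expressions I tried were "\\documentclass\\[.*\\]\\{.*\\}", "\\documentclass\\[+.*+\\]\\{+.*+\\}", "\\docume.*\\}", and many more failed attempts. I don't know what isn't working; in theory it should work fine. Any help would be appreciated.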
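One likely reason these patterns have no effect: in a Java string literal, "\\d" reaches the regex engine as `\d`, which means "any digit", so "\\documentclass" looks for a digit followed by `ocumentclass` rather than a literal backslash. Matching a literal backslash takes four backslashes in the source. In the second attempt, `.*+` is also a possessive quantifier, which consumes the rest of the line and never gives the closing `]` back. The sketch below assumes the commands are removed from the raw text before any other replacement runs; the class name `LatexCommandStripper` and the method `strip` are hypothetical, and the pattern only handles one optional `[...]` group and one optional `{...}` group.

```java
import java.util.regex.Pattern;

final class LatexCommandStripper {
    // "\\\\" in Java source is the regex \\ , i.e. one literal backslash.
    // The lookahead (?![A-Za-z]) keeps \end from matching inside a longer
    // command name such as \endnote.
    private static final Pattern UNWANTED = Pattern.compile(
        "\\\\(documentclass|usepackage|input|begin|end|vspace|ref"
        + "|includegraphics|label)(?![A-Za-z])"
        + "(\\[[^\\]]*\\])?"   // optional [options]
        + "(\\{[^}]*\\})?");   // optional {argument}

    // Removes every listed command together with its options and argument.
    static String strip(String text) {
        return UNWANTED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "\\documentclass[12pt,a4paper]{article}\n"
                      + "\\usepackage{a4-mancs}\n"
                      + "Some real words here\n";
        System.out.print(strip(sample));
    }
}
```

Calling `wholeText = LatexCommandStripper.strip(wholeText);` right after the file is read, and before the replaceAll that turns digits and punctuation into newlines, should keep fragments such as `pta` out of the word set.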
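Additional information:
The output should be all the words in the LaTeX file, sorted alphabetically.
When the input contains \documentclass[12pt,a4paper]{article}, the output contains paper]article and pta. When it contains \usepackage{a4-mancs}, I get mancs.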
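These fragments appear to come from the first replaceAll, which turns every digit, hyphen and several punctuation characters into a newline before the loop ever sees the command. The sketch below applies that same pattern to the two lines; `SplitDemo` and `tokens` are hypothetical names used only here.

```java
import java.util.Arrays;
import java.util.List;

final class SplitDemo {
    // Same first-pass pattern as in the question: each match becomes "\n",
    // then the text is split on "\n".
    static List<String> tokens(String line) {
        String replaced = line.replaceAll(" |'|\\.|:|/|`|%|-|\\d", "\n");
        return Arrays.asList(replaced.split("\n"));
    }

    public static void main(String[] args) {
        // Print one token per line, in quotes, so empty tokens are visible.
        for (String t : tokens("\\documentclass[12pt,a4paper]{article}"))
            System.out.println("\"" + t + "\"");
        for (String t : tokens("\\usepackage{a4-mancs}"))
            System.out.println("\"" + t + "\"");
    }
}
```

The first line splits into `\documentclass[`, an empty token, `pt,a` and `paper]{article}`. Only the first of these starts with a backslash, so only it is dropped; the later replaceAll then removes `,`, `{` and `}` from the others, leaving `pta` and `paper]article`. The second line splits into `\usepackage{a`, an empty token and `mancs}`, which is where `mancs` comes from.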