I'm working with input in the following format, and I have to parse it as fast as possible:
5 (
 5 (
  3 (
  )
 )
 3 (
  3 (
  )
  3 (
  )
 )
 5 (
  2 (
  )
  4 (
  )
 )
)
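This is a tree structure of "employees"; the numbers are used for a later task (language index).
Each employee can have any number of subordinates and one superior (the root node is the "Boss").
Here is my parser (originally I used a Scanner, which was simple but about twice as slow):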
// Invocation
// Employee boss = collectEmployee(null, 0, reader);
private Employee collectEmployee(final Employee parent, int indent, final Reader r) throws IOException
{
    final StringBuilder sb = new StringBuilder();
    boolean nums = false;
    while (true) {
        char c = (char) r.read();
        if (c == 10 || c == 13) continue; // newline
        if (c == ' ') {
            if (nums) break;
        } else {
            nums = true;
            sb.append(c);
        }
    }
    final int lang = Integer.parseInt(sb.toString());
    final Employee self = new Employee(lang, parent);
    r.skip(1); // opening paren
    int spaces = 0;
    while (true) {
        r.mark(1);
        int i = r.read();
        char c = (char) i;
        if (c == 10 || c == 13) continue; // newline
        if (c == ' ') {
            spaces++;
        } else {
            if (spaces == indent) {
                break; // End of this employee
            } else {
                spaces = 0; // new line.
                r.reset();
                self.add(collectEmployee(self, indent + 1, r));
            }
        }
    }
    return self; // the root employee for this subtree
}
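I need to shave a few more cycles off this code so that it passes the strict requirements. I have profiled it, and this part really is what slows the application down. Input files can be up to 30 MiB, so any small improvement makes a big difference.
Any ideas are appreciated. Thanks.
(For completeness, here is the Scanner implementation; it should give you an idea of how I parse the input.)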
private Employee collectEmployee(final Employee parent, final Scanner sc)
{
    final int lang = Integer.parseInt(sc.next());
    sc.nextLine(); // trash the opening parenthesis
    final Employee self = new Employee(lang, parent);
    while (sc.hasNextInt()) {
        Employee sub = collectEmployee(self, sc);
        self.add(sub);
    }
    sc.nextLine(); // trash the closing parenthesis
    return self;
}
Answer 0 (score: 2)
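You are pushing a lot of data through StringBuilder. It may help to keep an int value that is updated whenever a decimal digit ('0' to '9') is encountered and stored/reset whenever a non-digit is encountered. That way you can also get rid of Integer.parseInt.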
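The running-int idea can be sketched roughly as follows. This is only an illustration, not the asker's code: the class name IntScanSketch and the method readInt are hypothetical, and the sketch assumes small non-negative numbers (there is no overflow check).

```java
import java.io.IOException;
import java.io.Reader;

final class IntScanSketch {
    /**
     * Skips non-digit characters, then accumulates a run of ASCII digits
     * into an int. Returns -1 if the stream ends before any digit is seen.
     * Note that the character right after the digits is consumed as well.
     */
    static int readInt(Reader r) throws IOException {
        int c = r.read();
        // Skip everything that is not a digit (spaces, newlines, parentheses).
        while (c != -1 && (c < '0' || c > '9')) {
            c = r.read();
        }
        if (c == -1) {
            return -1;
        }
        int value = 0;
        // Update the running value digit by digit; no StringBuilder, no parseInt.
        while (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            c = r.read();
        }
        return value;
    }
}
```

Because readInt consumes one character past the number, a real parser would have to either track that character itself or work on a buffer with an explicit position.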
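You seem to be using/checking the indentation to determine the hierarchy, but your input format contains parentheses, which makes it an S-expression-based syntax. So your parser does much more work than it needs to: you can ignore the whitespace and handle the parentheses with a stack of Employees.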
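A minimal sketch of that stack-based approach might look like this. The names StackParseSketch, Node and parse are hypothetical (Node merely stands in for the question's Employee class), and the sketch assumes well-formed input with a single root.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class StackParseSketch {
    /** Hypothetical stand-in for the question's Employee class. */
    static final class Node {
        final int lang;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        Node(int lang, Node parent) {
            this.lang = lang;
            this.parent = parent;
        }
    }

    /** Parses "number ( children... )" text, ignoring all whitespace. */
    static Node parse(CharSequence in) {
        Deque<Node> stack = new ArrayDeque<>();
        Node root = null;
        int value = 0;
        boolean inNumber = false;
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= '0' && c <= '9') {
                value = value * 10 + (c - '0');
                inNumber = true;
            } else if (c == '(') {
                if (!inNumber) {
                    throw new IllegalArgumentException("'(' without a number at index " + i);
                }
                // The number just read opens a new node under the current top of the stack.
                Node parent = stack.peek();
                Node node = new Node(value, parent);
                if (parent == null) {
                    root = node;
                } else {
                    parent.children.add(node);
                }
                stack.push(node);
                value = 0;
                inNumber = false;
            } else if (c == ')') {
                if (stack.isEmpty()) {
                    throw new IllegalArgumentException("unmatched ')' at index " + i);
                }
                stack.pop(); // this node is complete
            }
            // Any other character (spaces, newlines) is simply ignored.
        }
        if (!stack.isEmpty()) {
            throw new IllegalArgumentException("unbalanced parentheses");
        }
        return root;
    }
}
```

The indentation never has to be counted, and there is no recursion, so very deep trees do not risk a stack overflow.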
I would consider benchmarking with JMH and running it with perf-asm (if available) to see where the code spends its time. It really is an invaluable tool.
Answer 1 (score: 2)
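Well, the basics are the reading and parsing, and what you do with the data.
Reading and parsing by recursive descent should be completely IO-bound. The parsing itself should take only a small fraction of the time needed to read the characters.
What you do with the data depends on how you design your data structure. If you are not careful, you can end up spending more time than you would like on memory management.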
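As one illustration of how data-structure design affects allocation cost, here is a minimal Java sketch that stores the tree in two growing int arrays instead of one object plus one child list per employee. This layout is an assumption made for illustration, not something the answer prescribes; the names FlatTreeSketch and add are hypothetical.

```java
import java.util.Arrays;

final class FlatTreeSketch {
    int[] lang = new int[16];   // language index of node i
    int[] parent = new int[16]; // index of node i's parent, -1 for the root
    int size = 0;

    /** Appends a node and returns its index; parentIndex is -1 for the root. */
    int add(int language, int parentIndex) {
        if (size == lang.length) {
            // Doubling keeps the number of reallocations logarithmic in the node count.
            lang = Arrays.copyOf(lang, size * 2);
            parent = Arrays.copyOf(parent, size * 2);
        }
        lang[size] = language;
        parent[size] = parentIndex;
        return size++;
    }
}
```

Whether this actually helps depends on what the later task needs; if it has to walk from a parent to its children, an additional child index would be required.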
Anyway, here is a bone-simple parser in C++. You can convert it to whatever language you like.
#include <cctype>
#include <stdexcept>

// WHITE, DIGIT, LP and RP were left undefined in the original; these definitions are assumptions.
const char LP = '(', RP = ')';
inline bool WHITE(char c){ return std::isspace(static_cast<unsigned char>(c)) != 0; }
inline bool DIGIT(char c){ return std::isdigit(static_cast<unsigned char>(c)) != 0; }

void scanWhite(const char* &pc){ while (WHITE(*pc)) pc++; }
bool seeChar(const char* &pc, char c){
    scanWhite(pc);
    if (*pc != c) return false;
    pc++;
    return true;
}
bool seeNum(const char* &pc, int &n){
    scanWhite(pc);
    if (!DIGIT(*pc)) return false;
    n = 0; while (DIGIT(*pc)) n = n * 10 + (*pc++ - '0');
    return true;
}
// this sucks up strings of the form: either nothing or number ( ... )
bool readNumFollowedByList(const char* &pc){
    int n = 0;
    if (!seeNum(pc, n)) return false;
    // what you do with this number and what follows is up to you
    // if you hit the error, print a message and throw to the top level
    if (!seeChar(pc, LP)) throw std::runtime_error("number not followed by left paren");
    // read any number of number ( ... )
    while (readNumFollowedByList(pc)); // <<-- note the recursion
    if (!seeChar(pc, RP)) throw std::runtime_error("missing right paren");
    return true;
}
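Since the question is in Java, a rough port of the same recursive-descent idea might look like the sketch below. It is only an illustration under stated assumptions: the class RecursiveDescentSketch and the helper countNodes are hypothetical, the whole input is assumed to be in memory as a String, and the sketch merely counts the "number ( ... )" groups instead of building employees.

```java
final class RecursiveDescentSketch {
    private final String src;
    private int pos = 0;
    private int count = 0; // number of "number ( ... )" groups seen so far

    RecursiveDescentSketch(String src) {
        this.src = src;
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private void scanWhite() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private boolean seeChar(char c) {
        scanWhite();
        if (pos >= src.length() || src.charAt(pos) != c) return false;
        pos++;
        return true;
    }

    /** Returns the parsed number, or -1 if the next token is not a number. */
    private int seeNum() {
        scanWhite();
        if (pos >= src.length() || !isDigit(src.charAt(pos))) return -1;
        int n = 0;
        while (pos < src.length() && isDigit(src.charAt(pos))) {
            n = n * 10 + (src.charAt(pos++) - '0');
        }
        return n;
    }

    /** Consumes either nothing (returns false) or one "number ( ... )" group. */
    boolean readNumFollowedByList() {
        int n = seeNum();
        if (n < 0) return false;
        count++; // a real parser would create the employee for n here
        if (!seeChar('(')) {
            throw new IllegalStateException("number not followed by '(' at index " + pos);
        }
        while (readNumFollowedByList()) {
            // each iteration consumes one child group (note the recursion)
        }
        if (!seeChar(')')) {
            throw new IllegalStateException("missing ')' at index " + pos);
        }
        return true;
    }

    static int countNodes(String src) {
        RecursiveDescentSketch p = new RecursiveDescentSketch(src);
        p.readNumFollowedByList();
        return p.count;
    }
}
```

One caveat of the recursive form in Java: the recursion depth equals the depth of the tree, so a very deep hierarchy could overflow the call stack.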
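Answer 2 (score: 0)
A proper implementation should really use a state machine and a Builder. I am not sure how much more or less efficient this is, but it certainly lends itself to later enhancement and is really quite simple.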
static class Employee {
    final int language;
    final Employee parent;
    final List<Employee> children = new ArrayList<>();
    public Employee(int language, Employee parent) {
        this.language = language;
        this.parent = parent;
    }
    @Override
    public String toString() {
        StringBuilder s = new StringBuilder();
        s.append(language);
        if (!children.isEmpty()) {
            for (Employee child : children) {
                s.append("(").append(child.toString()).append(")");
            }
        } else {
            s.append("()");
        }
        return s.toString();
    }
    static class Builder {
        // Make a boss to wrap the data.
        Employee current = new Employee(0, null);
        // The number that is growing into the `language` field.
        StringBuilder number = new StringBuilder();
        // Bracket counter - not sure if this is necessary.
        int brackets = 0;
        // Current state.
        State state = State.Idle;
        enum State {
            Idle {
                @Override
                State next(Builder builder, char ch) {
                    // Any digits kick me into Number state.
                    if (Character.isDigit(ch)) {
                        return Number.next(builder, ch);
                    }
                    // Watch for brackets.
                    if ("()".indexOf(ch) != -1) {
                        return Bracket.next(builder, ch);
                    }
                    // No change - stay as I am.
                    return this;
                }
            },
            Number {
                @Override
                State next(Builder builder, char ch) {
                    // Any non-digits treated like an idle.
                    if (Character.isDigit(ch)) {
                        // Store it.
                        builder.number.append(ch);
                    } else {
                        // Now we have his number - make the new employee.
                        builder.current = new Employee(Integer.parseInt(builder.number.toString()), builder.current);
                        // Clear the number for next time around.
                        builder.number.setLength(0);
                        // Remember - could be an '('.
                        return Idle.next(builder, ch);
                    }
                    // No change - stay as I am.
                    return this;
                }
            },
            Bracket {
                @Override
                State next(Builder builder, char ch) {
                    // Open or close.
                    if (ch == '(') {
                        builder.brackets += 1;
                    } else {
                        builder.brackets -= 1;
                        // Keep that child.
                        Employee child = builder.current;
                        // Up to parent.
                        builder.current = builder.current.parent;
                        // Add the child.
                        builder.current.children.add(child);
                    }
                    // Always back to Idle after a bracket.
                    return Idle;
                }
            };
            abstract State next(Builder builder, char ch);
        }
        Builder data(String data) {
            for (int i = 0; i < data.length(); i++) {
                state = state.next(this, data.charAt(i));
            }
            return this;
        }
        Employee build() {
            // Current should hold the boss.
            return current;
        }
    }
}
static String testData = "5 (\n"
        + " 5 (\n"
        + " 3 (\n"
        + " )\n"
        + " )\n"
        + " 3 (\n"
        + " 3 (\n"
        + " )\n"
        + " 3 (\n"
        + " )\n"
        + " )\n"
        + " 5 (\n"
        + " 2 (\n"
        + " )\n"
        + " 4 (\n"
        + " )\n"
        + " )\n"
        + ")";
// Note: Files.listFiles, Files.readLines and ProcessTimer as used below are not JDK APIs;
// they appear to be the answerer's own helper classes.
public void test() throws IOException {
    Employee e = new Employee.Builder().data(testData).build();
    System.out.println(e.toString());
    File[] ins = Files.listFiles(new File("C:\\Temp\\datapub"),
            new FileFilter() {
                @Override
                public boolean accept(File file) {
                    return file.getName().endsWith(".in");
                }
            });
    for (File f : ins) {
        Employee.Builder builder = new Employee.Builder();
        String[] lines = Files.readLines(f);
        ProcessTimer timer = new ProcessTimer();
        for (String line : lines) {
            builder.data(line);
        }
        System.out.println("Read file " + f + " took " + timer);
    }
}
This prints:
0(5(5(3()))(3(3())(3()))(5(2())(4())))
Note that the leading 0 element is the boss you mentioned.