Can these types of regular expressions be optimized?

Time: 2017-06-02 17:37:30

Tags: c# regex

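I'm working on an application where we need to extract key information from text. The problem is that the text comes from OCRed documents, so it can contain OCR misrecognitions, noise, garbage characters and so on. On top of that, the text on the documents can come in a million different formats, depending on the source.

So we use a large number of regular expressions to extract the text. We've noticed that in high-volume production this hammers the CPU on the server. I tried precompiling the regular expressions and caching them, with no improvement. The profiler shows that 65% of the run time is spent in calls to Regex.Match().

Reading up on regular expressions, I see that catastrophic backtracking is a known performance problem.

Let's say I have an expression like this (this is only to illustrate the general form of our regular expressions; others can contain more keywords and formats):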

(.*) KEYWORD1 AND (.* KEYWORD2)

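When I step through it with Regex Coach, I can see that it does a lot of backtracking to match a string.

Can this type of regular expression be improved conceptually? We only run it against a subset of the whole document (a smaller blob of text), but the preprocessing that pulls out the blob is inherently imperfect.

So yes, pretty much anything can appear before " KEYWORD1", anything can appear before " KEYWORD2", and so on.

We can't restrict the match to A-Z instead of .*, because in the OCR world letters are sometimes misread as digits (e.g. Illene = I11ene), or we can get garbage characters in there due to OCR misrecognition.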

1 answer:

Answer 0: (score: 3)

Yes, these types can be optimized easily.

You optimize them by replacing the regex with targeted code. That is to say, two substring searches. If the position of " KEYWORD1 AND " is less than the position of " KEYWORD2", then you have a match.

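For extra speed you could use an optimized substring search, but that is almost certainly not needed. Just eliminating the regex will give a massive speed boost.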
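The two substring searches described above could look roughly like the following in C#. This is a minimal sketch, not the answerer's code: the class name `KeywordMatcher` and the method `TryMatch` are hypothetical, and the text is assumed to be a single line. It takes the first occurrence of each literal, so it agrees with the regex on whether there is a match, but the captured pieces can differ from the greedy `.*` captures when a keyword appears more than once.

```csharp
using System;

public static class KeywordMatcher
{
    // Hypothetical helper: stands in for (.*) KEYWORD1 AND (.* KEYWORD2)
    // by doing two ordinal (culture-insensitive) substring searches.
    public static bool TryMatch(string text, out string before, out string after)
    {
        const string first = " KEYWORD1 AND ";
        const string second = " KEYWORD2";
        before = string.Empty;
        after = string.Empty;

        // Locate the first literal.
        int p1 = text.IndexOf(first, StringComparison.Ordinal);
        if (p1 < 0) return false;

        // Locate the second literal, searching only after the end of the first.
        int start2 = p1 + first.Length;
        int p2 = text.IndexOf(second, start2, StringComparison.Ordinal);
        if (p2 < 0) return false;

        // Roughly group 1 and group 2 of the original expression.
        before = text.Substring(0, p1);
        after = text.Substring(start2, p2 + second.Length - start2);
        return true;
    }
}
```

Each call scans the text at most twice from left to right and never backtracks, which is where the saving over the regex comes from.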

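[Edit] OK, so there are 400 of them, and some are slightly more complicated. The pattern stays the same: a large number of substrings with small variations, which can be located efficiently. If you know that "PART OF" occurs in the input, then checking whether it occurs as "AS PART OF" can be done in about a nanosecond.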

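Now, 400 regexes isn't a lot. If you had 40,000, it would be worthwhile to automate the check for common substrings. For now you can probably take each regex in turn and try to match its strings against those of the other 399 to get a first cut. If "PART OF" doesn't occur, you don't need to check at all whether .*AS PART OF.* would match.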
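One way to apply that first cut is to pair each regex with a literal that any match must contain, and to skip the regex whenever the literal is absent. The sketch below assumes the literal is picked by hand for each pattern; the class name `PrefilteredRegex` is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public sealed class PrefilteredRegex
{
    private readonly string requiredLiteral;
    private readonly Regex regex;

    // requiredLiteral must be a substring of every possible match of pattern
    // (an assumption the caller has to guarantee, e.g. "PART OF" for ".*AS PART OF.*").
    public PrefilteredRegex(string requiredLiteral, string pattern)
    {
        this.requiredLiteral = requiredLiteral;
        this.regex = new Regex(pattern, RegexOptions.CultureInvariant);
    }

    public bool IsMatch(string input)
    {
        // Cheap ordinal substring test first; the regex only runs if it passes.
        if (input.IndexOf(requiredLiteral, StringComparison.Ordinal) < 0)
            return false;
        return regex.IsMatch(input);
    }
}
```

When several patterns share the same literal, the result of the substring test could be computed once per input and reused for all of them.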

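For the same reason, you don't need other optimizations either. With 40,000 regexes to match, I would calculate the frequency of each letter pair. That is, the input "FOO AS PART OF BAR" contains the letter pairs FO, OO, AS, PA, AR (twice), RT, OF, BA. This cannot match .*FOR EXAMPLE.*, because a letter pair such as EX is missing.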
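The letter-pair idea can be sketched as follows. The names `LetterPairFilter`, `PairCounts` and `CouldContain` are hypothetical, and the sketch assumes that only pairs of adjacent letters are counted, so pairs spanning a space are ignored, as in the example above. The check is a necessary condition only: a literal that passes may still be absent from the input, but a literal that fails cannot occur in it.

```csharp
using System;
using System.Collections.Generic;

public static class LetterPairFilter
{
    // Counts every pair of adjacent letters in s.
    public static Dictionary<string, int> PairCounts(string s)
    {
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);
        for (int i = 0; i + 1 < s.Length; i++)
        {
            if (!char.IsLetter(s[i]) || !char.IsLetter(s[i + 1])) continue;
            string pair = s.Substring(i, 2);
            int n;
            counts.TryGetValue(pair, out n);   // n stays 0 if the pair is new
            counts[pair] = n + 1;
        }
        return counts;
    }

    // Returns false when the input provably cannot contain the literal,
    // because some letter pair of the literal is missing or too rare.
    public static bool CouldContain(Dictionary<string, int> inputPairs, string literal)
    {
        foreach (KeyValuePair<string, int> needed in PairCounts(literal))
        {
            int available;
            if (!inputPairs.TryGetValue(needed.Key, out available) || available < needed.Value)
                return false;
        }
        return true;
    }
}
```

The pair counts of the input are computed once and can then be tested against the literal of every regex before any of them is run.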