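I'm working on an application where we need to extract key information from text. The problem is that the text comes from OCRed documents, so it can contain OCR misreads, noise, garbage characters and so on. On top of that, the text in a document can be in a million different formats depending on the source.
So we use a large number of regular expressions to extract the text. We've noticed that in high-volume production this kills the CPU on the server. I tried precompiling the regexes and caching them, with no improvement. The profiler shows that 65% of the run time is spent in calls to Regex.Match().
Reading up on regular expressions, I see that catastrophic backtracking is a known performance problem.
Let's say I have an expression like this (it only illustrates the general shape of our regexes; others can contain more keywords and formats):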
(.*) KEYWORD1 AND (.* KEYWORD2)
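When I step through it with Regex Coach, I see that it does a lot of backtracking to match the string.
Can this type of regex be improved conceptually? We only run it against a subset of the whole document (a smaller blob of text), but the preprocessing that pulls out the blob is inherently imperfect.
So yes, pretty much anything can appear before " KEYWORD1", anything can appear before " KEYWORD2", and so on.
We can't restrict the match to A-Z instead of .*, because in the OCR world letters are sometimes misread as digits (e.g. Illene = I11ene), or we get garbage characters from OCR errors.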
Answer 0 (score: 3)
Yes, these kinds of patterns can be optimized easily.
Optimize them by replacing the regex with targeted code, that is, two substring searches. If the position of " KEYWORD1 AND " is less than the position of " KEYWORD2", then you have a match.
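For extra speed you could use an optimized substring search, but that is almost certainly unnecessary. Just eliminating the regex will give a large speedup.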
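A rough sketch of the two-substring-search idea follows. It is written in Python for brevity; the question appears to use .NET's Regex.Match(), and the same logic should carry over. The names `extract`, `SEP1`, `SEP2` and `REFERENCE` are made up for illustration, and the sketch assumes the keywords are fixed literals. It searches from the right so that it reproduces the greedy groups of the example regex:

```python
import re

# Illustrative constants taken from the example pattern in the question.
SEP1 = " KEYWORD1 AND "
SEP2 = " KEYWORD2"

# The original regex, kept only to cross-check the substring version.
REFERENCE = re.compile(r"(.*) KEYWORD1 AND (.* KEYWORD2)", re.DOTALL)


def extract(text):
    """Return (group1, group2) as the regex would, or None if there is no match."""
    j = text.rfind(SEP2)          # start of the last " KEYWORD2"
    if j < 0:
        return None
    i = text.rfind(SEP1, 0, j)    # last " KEYWORD1 AND " that ends at or before j
    if i < 0:
        return None
    return text[:i], text[i + len(SEP1):j + len(SEP2)]
```

Each call does two linear scans and no backtracking, so its cost stays proportional to the length of the input.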
[Edit]
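OK, so there are 400 of them, and some are slightly more complex. The pattern stays the same: a large number of substrings with small variations, which can be located efficiently. If you know that "PART OF" occurs in the input, then checking whether "AS PART OF" occurs can be done in about a nanosecond. And if "PART OF" doesn't occur, you don't need to check at all whether .*AS PART OF.* will match.
Now, 400 regexes isn't that many. If you had 40,000, it would be worthwhile to automate the check for common substrings. For now, you could probably run each regex in turn against the other 399 regex strings to get a first cut.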
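The shared-substring check described above could look roughly like this. The rule table is hypothetical (the ".*BEING PART OF.*" pattern and the `GROUPS` and `candidate_matches` names are invented for the example), and the grouping is assumed to be done by hand or by a separate script:

```python
import re

# Hypothetical rule table: each key is a literal that every pattern
# in its group needs in order to match.
GROUPS = {
    "PART OF": [
        re.compile(r".*AS PART OF.*"),
        re.compile(r".*BEING PART OF.*"),
    ],
    "FOR EXAMPLE": [
        re.compile(r".*FOR EXAMPLE.*"),
    ],
}


def candidate_matches(text):
    """Return the pattern strings that match, skipping groups whose literal is absent."""
    hits = []
    for literal, patterns in GROUPS.items():
        if literal not in text:
            continue  # none of this group's regexes can match
        hits.extend(p.pattern for p in patterns if p.search(text))
    return hits
```

One cheap substring test rules out a whole group at once, so the regex engine only runs on inputs that have a chance of matching.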
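For the same reason, you don't need other optimizations either. With 40,000 regexes to match, I would count the frequency of each letter pair. For example, the input "FOO AS PART OF BAR" contains the letter pairs FO, OO, PA, AR (twice), RT, OF, BA. It cannot match .*FOR EXAMPLE.* because a letter pair that pattern needs (EX, for instance) is missing from the input.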
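A minimal sketch of the letter-pair filter, under a few assumptions: it keeps a set of pairs rather than the frequency counts the answer mentions, it only looks at pairs inside whitespace-separated words, and the function names are invented for the example. A passing check only means the literal might occur, so the real match still has to run afterwards.

```python
def letter_pairs(s):
    """Return the set of adjacent letter pairs inside each whitespace-separated word."""
    pairs = set()
    for word in s.split():
        pairs.update(word[k:k + 2] for k in range(len(word) - 1))
    return pairs


def may_contain(text_pairs, literal):
    """False means the literal cannot occur; True means it still has to be checked."""
    return letter_pairs(literal) <= text_pairs


text_pairs = letter_pairs("FOO AS PART OF BAR")
print(may_contain(text_pairs, "AS PART OF"))   # pairs all present: run the real check
print(may_contain(text_pairs, "FOR EXAMPLE"))  # EX is missing: skip this pattern
```

The pair set is computed once per input and reused for every pattern, which is where the saving comes from when there are tens of thousands of patterns.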