I've been using study, a Perl feature that examines a string so that subsequent regular expression matches against it run faster:
while( <> ) {
    study;
    $count++ if /PATTERN/;
    $count++ if /OTHER/;
    $count++ if /PATTERN2/;
}
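There isn't much written about which situations actually benefit from it. You can glean a few hints from the docs.
I'm looking for concrete cases where I can not only demonstrate a big advantage, but which I can also tweak slightly to make that advantage disappear. One of the warnings in the docs is that you should benchmark individual cases. I want to find edge cases where a small difference in the string (or the pattern) makes a big difference in performance.
If you haven't used study, please don't answer. I'd rather have well-formed, correct answers than quick guesses. There's no urgency here, and this isn't blocking any work.
And, as a bonus, I've been working with a benchmarking tool that compares two NYTProf runs, which I'd rather use than the usual benchmarking tools. If I come up with a way to automate that, I'll share it too.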
Answer 0 (score: 7)
#!/usr/bin/perl
#
# Exercise 7.8
#
# This is a more difficult exercise. The study function in Perl may speed up searches
# for motifs in DNA or protein. Read the Perl documentation on this function. Its use
# is simple: given some sequence data in a variable $sequence, type:
#
# study $sequence;
#
# before doing the searches. Do you think study will speed up searches in DNA or
# protein, based on what you've read about it in the documentation?
#
# For lots of extra credit! Now read the Perl documentation on the standard module
# Benchmark. (Type perldoc Benchmark, or visit the Perl home page at
# http://www.perl.com.) See if your guess is right by writing a program that
# benchmarks motif searches of DNA and of protein, with and without study.
#
# Answer to Exercise 7.8
use strict;
use warnings;
use Benchmark;
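# 'our', not 'my': Benchmark evals the string snippets below inside its own
# scope, where a lexical declared in this file would not be visible.
our $dna = join ('', qw(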
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta
));
# Package variable for the same reason: the timed string snippets must see it.
our $protein = join('', qw(
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
));
my $count = 1000;
print "DNA pattern matches without 'study' function:\n";
timethis($count,
    ' for(my $i=1 ; $i < 10000; ++$i) {
        $dna =~ /aggtc/;
        $dna =~ /aatggccgt/;
        $dna =~ /gatcgatcagctagcat/;
        $dna =~ /gtatgaac/;
        $dna =~ /[ac][cg][gt][ta]/;
        $dna =~ /ccccccccc/;
    } '
);
print "\nDNA pattern matches with 'study' function:\n";
timethis($count,
    ' study $dna;
    for(my $i=1 ; $i < 10000; ++$i) {
        $dna =~ /aggtc/;
        $dna =~ /aatggccgt/;
        $dna =~ /gatcgatcagctagcat/;
        $dna =~ /gtatgaac/;
        $dna =~ /[ac][cg][gt][ta]/;
        $dna =~ /ccccccccc/;
    } '
);
print "\nProtein pattern matches without 'study' function:\n";
timethis($count,
    ' for(my $i=1 ; $i < 10000; ++$i) {
        $protein =~ /PH.EI/;
        $protein =~ /KFTEQGESMRLY/;
        $protein =~ /[YAL][NVP][ISV][KQE]/;
        $protein =~ /DKKQIR/;
        $protein =~ /[MD][VT][HQ][ER]/;
        $protein =~ /NVPISVKQEITFTDVSEQL/;
    } '
);
print "\nProtein pattern matches with 'study' function:\n";
timethis($count,
    ' study $protein;
    for(my $i=1 ; $i < 10000; ++$i) {
        $protein =~ /PH.EI/;
        $protein =~ /KFTEQGESMRLY/;
        $protein =~ /[YAL][NVP][ISV][KQE]/;
        $protein =~ /DKKQIR/;
        $protein =~ /[MD][VT][HQ][ER]/;
        $protein =~ /NVPISVKQEITFTDVSEQL/;
    } '
);
Note that even in the most favorable case (the protein matches), the reported gain is only about 2%:
# $ perl exer07.08
# On my computer, this is the output I get: your results probably vary.
# DNA pattern matches without 'study' function:
# timethis 1000: 29 wallclock secs (29.25 usr + 0.00 sys = 29.25 CPU) @ 34.19/s (n=1000)
#
# DNA pattern matches with 'study' function:
# timethis 1000: 30 wallclock secs (29.21 usr + 0.15 sys = 29.36 CPU) @ 34.06/s (n=1000)
#
# Protein pattern matches without 'study' function:
# timethis 1000: 32 wallclock secs (29.47 usr + 0.04 sys = 29.51 CPU) @ 33.89/s (n=1000)
#
# Protein pattern matches with 'study' function:
# timethis 1000: 30 wallclock secs (28.97 usr + 0.02 sys = 28.99 CPU) @ 34.49/s (n=1000)
#
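One thing to watch when adapting a benchmark like this: Benchmark compiles string snippets with eval inside its own code, so a snippet can only see package variables, not `my` variables declared in the calling script. Passing code references avoids the question, because a closure captures the lexical directly. The following is only a minimal sketch under assumptions: the repeated fragment, the `@motifs` list, and the `count_hits` helper are hypothetical stand-ins and not part of the exercise.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-in data: one DNA fragment repeated to get a long string.
my $dna = 'agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg' x 200;

# Hypothetical motif list, loosely modeled on the patterns in the exercise.
my @motifs = (qr/aggtc/, qr/gtatgaac/, qr/[ac][cg][gt][ta]/, qr/ccccccccc/);

# Count how many motifs match. $_[0] aliases the caller's scalar, so the
# string is not copied and the match runs against the original variable.
sub count_hits {
    my $hits = 0;
    for my $re (@motifs) {
        $hits++ if $_[0] =~ $re;
    }
    return $hits;
}

# Code references close over the lexical $dna, so both cases really
# search the long string rather than an unrelated package variable.
cmpthese(-1, {
    without_study => sub { count_hits($dna) },
    with_study    => sub { study $dna; count_hits($dna) },
});
```

A negative count asks Benchmark to run each case for at least that many CPU seconds, which tends to give steadier numbers than a fixed iteration count. Whether the two rows differ at all will depend on the Perl version in use.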
Answer 1 (score: 4)
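I'll leave these notes as an answer for now and develop them into a real answer later:
In PP(pp_study) in pp.c, there are these odd lines (minus the comments):
if (len == 0 || len > I32_MAX || !SvPOK(sv) || SvUTF8(sv) || SvVALID(sv)) {
    RETPUSHNO;
}
It looks like scalars with the UTF8 flag set are never studied at all.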
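One way to probe this from Perl space is to look at the value study returns, since the early exit quoted above pushes a false value. This is a rough sketch, assuming a perl whose pp_study contains that check; the `studied` helper and the sample strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report whether study() returned a true value for the given scalar.
# $_[0] aliases the caller's variable, so the original scalar is studied.
sub studied {
    return study($_[0]) ? 1 : 0;
}

my $bytes = 'agatggcggcgctgagg';   # plain string, UTF8 flag off
my $wide  = 'agatggcggcgctgagg';
utf8::upgrade($wide);              # same characters, UTF8 flag now on
my $empty = '';                    # zero length, the len == 0 case

printf "%-6s %d\n", 'bytes', studied($bytes);
printf "%-6s %d\n", 'utf8',  studied($wide);
printf "%-6s %d\n", 'empty', studied($empty);
```

If the check behaves the way the C source reads, only the first line should report a true value. A false return only says the scalar was skipped; it says nothing about how much a studied scalar gains.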
Answer 2 (score: 2)
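Not really. If you search for it and most of the results are in the Perl test suite, that means nobody uses it. Also, because of a bug, you can only notice speed benefits on global variables. It actually does bring some speedup when working with English text (sometimes as much as 2x), but you have to make the variable global.
It also sometimes causes infinite loops or false positives (study can add bugs to your program, even though its only purpose is to make it faster), so it is being dropped: nobody wants to maintain a part of the language that nobody cares about anyway.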