StackOverflow for capitalization regex

Time: 2012-10-19 10:27:22

Tags: java regex string

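I have a big problem. One of my validations runs into catastrophic backtracking (http://www.regular-expressions.info/catastrophic.html), but I am having a hard time figuring out why. Maybe someone has an idea? Apart from this, the regex works for all use cases.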

regex: "^((^|[^A-Za-z]+)[A-Z][A-Za-z]*)*[^A-Za-z]*$"

Problem input:

"Disposable
BHT,
Tocopheryl AcetateHydrating Shave Gel
Aqua,
Glycerin,
Palmitic Acid,
Triethanolamine,
Isopentane,
Glyceryl Oleate,
Stearic Acid,
Isobutane,
Sorbitol,
Parfum,
Hydroxyethylcellulose,
Myristic Acid,
PEG-90M,
Butyrospermum Parkii Butter Extract,
Lauric Acid,
PTFE,
PEG-23M,
Propylene Glycol,
Glyceryl Acrylate/Acrylic Acid Copolymer,
PVM/MA Copolymer,
Silica,
Methylparaben,
Propylparaben,
BHT,
Limonene,
Benzyl Salicylate,
Linalool,
CI 42053,
CI 42090
Series Thermal Face Scrub
PEG-4,
Magnesium Sulfate,
PEG/PPG-300/55 Copolymer,
Polyethylene,
Polypropylene,
Laureth-23,
Stearyl Alcohol,
Dioleoylethyl Hydroxyethylmonium Methosulfate,
Cetyl Alcohol,
Behentrimonium Chloride,
Distearyldimonium Chloride,
Hydroxypropylcellulose,
Parfum,
Methylparaben,
Propylparaben,
Niacinamide,
Alcohol Denat,
Hexylene Glycol,
Benzyl Salicylate,
AquaClassic Clean Shampoo
Aqua,
Sodium Lauryl Sulfate,
Sodium Laureth Sulfate,
Glycol Distearate,
Zinc Carbonate,
Sodium Chloride,
Sodium Xylenesulfonate,
Zinc Pyrithione,
Cocamidopropyl Betaine,
Dimethicone,
Sodium Benzoate,
Guar Hydroxypropyltrimonium Chloride,
Hydrochloric Acid,
Hexyl Cinnamal,
Linalool,
Butylphenyl Methylpropional,
Magnesium Carbonate Hydroxide,
Ammonium Laureth Sulfate,
Magnesium Nitrate,
Sodium Polynaphthalenesulfonate,
Methylchloroisothiazolinone,
Magnesium Chloride,
CI 42090,
Citric Acid,
Methylisothiazolinone,
Tetrasodium EDTA,
CI 17200,
DMDM Hydantoin    Perspirant Deodorant Spray Sport Protect 48H
Butane,
Isobutane,
Cyclopentasiloxane,
Aluminum Chlorohydrate,
Cyclodextrin,
Disteardimonium Hectorite,
Dimethicone,
Aqua,
Triethyl Citrate,
Alpha-Isomethyl Ionone,
Butylphenyl Methylpropional,
Citral,
Citronellol,
Coumarin,
Geraniol,
Limonene,
Linalool
Pillite Series Instant Hydration Moisturiser +SPF 15
Aqua,
Glycerin,
Ethylhexyl Salicylate,
Niacinamide,
Butyl Methoxydibenzoylmethane,
Dimethicone,
Polyethylene,
Octocrylene,
Isopropyl Palmitate,
Phenylbenzimidazole Sulfonic Acid,
Sorbitan Stearate,
Triethanolamine,
Cetyl Alcohol,
Sodium Acrylates Copolymer,
Aluminum Starch Octenylsuccinate,
Stearyl Alcohol,
Caprylic/Capric Triglyceride,
Panthenol,
Benzyl Alcohol,
Dimethiconol,
Fragrance,
Ethylparaben,
Cetearyl Glucoside,
Cetearyl Alcohol,
PEG 100 Stearate,
Propylparaben,
Disodium EDTA,
C12-13 Pareth-3,
Palmitic Acid,
Stearic Acid,
Benzyl Salicylate,
Laureth-7,
Linalool,
Butylphenyl Methylpropional,
Myristic Acid,
Coumarin,
Heptadecanoic Acid,
Benzyl Benzoate"

Thanks!

1 answer:

Answer 0 (score: 3)

The problem is that you have a clause of the form

(something*)*

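This works fine when the regex matches, but if one of your lines is malformed, it fails catastrophically. The cause is backtracking: before it can report a failure, the regex engine tries every possible way of distributing the text between the inner and the outer quantifier.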

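Take your longest line:

Pillite Series Instant Hydration Moisturiser +SPF 15

If this line does not match your regex, the regex engine needs 2,251,799,813,685,248 (2^51) attempts before it realizes that the line does not match.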
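The answer does not say how the count was obtained. One plausible reading, offered here only as an assumption, is the worst case for a generic (something*)* clause: a run of n characters can be cut into consecutive non-empty pieces in 2^(n-1) ways, and the "Pillite Series ..." line in the problem input is 52 characters long. The hypothetical `SplitCount` class below counts those cuts directly. It is a sketch of the arithmetic, not a measurement of what Java's regex engine does with the actual pattern.

```java
import java.math.BigInteger;

class SplitCount {
    // Counts the ways to cut a run of n characters into consecutive,
    // non-empty pieces. ways[len] sums over every possible first piece.
    static BigInteger countSplits(int n) {
        BigInteger[] ways = new BigInteger[n + 1];
        ways[0] = BigInteger.ONE; // the empty remainder can be cut in one way
        for (int len = 1; len <= n; len++) {
            BigInteger sum = BigInteger.ZERO;
            for (int first = 1; first <= len; first++) {
                sum = sum.add(ways[len - first]);
            }
            ways[len] = sum;
        }
        return ways[n];
    }

    public static void main(String[] args) {
        // 52 is the length of the longest product line in the problem input.
        System.out.println(countSplits(52)); // 2251799813685248
    }
}
```

The count doubles with every extra character, which is why a single non-matching line is enough to stall the validation.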

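The fix is on the page you linked. Since you are looking for an alternating sequence of words and non-words, backtracking is of no use to you (a word cannot be split into a word/non-word/word sequence). You can prevent backtracking by using possessive quantifiers: once the regex has matched a word or a non-word, it never gives that match back.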
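To see what "never gives that match back" means in practice, here is a minimal sketch with toy patterns that are not part of the question. The class name `PossessiveDemo` is hypothetical. The greedy `a*` hands back one character so the trailing `a` can match, while the possessive `a*+` keeps everything it consumed and the match fails.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: a* first takes all the a's, then backtracks by one
    // character so that the final 'a' in the pattern can match.
    static boolean greedy(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive: a*+ takes all the a's and refuses to give any back,
    // so nothing is left for the final 'a' in the pattern.
    static boolean possessive(String s) {
        return Pattern.matches("a*+a", s);
    }

    public static void main(String[] args) {
        System.out.println(greedy("aaa"));     // true
        System.out.println(possessive("aaa")); // false
    }
}
```

In the capitalization regex this strictness costs nothing, because a run of letters followed by a non-letter has only one sensible way to be consumed.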

To make a quantifier possessive, just add a + after it. Doing this for every quantifier,

(something*)* becomes (something*+)*+
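Applied to the regex from the question, that rule could look like the sketch below. The class and method names are hypothetical, and the rewritten pattern is my own application of the rule rather than something the answer spells out. It has only been reasoned through on short inputs like the ones shown, so test it against your real data before relying on it.

```java
import java.util.regex.Pattern;

class CapitalizedWords {
    // The question's regex with a '+' appended to every quantifier:
    //   ^((^|[^A-Za-z]+)[A-Z][A-Za-z]*)*[^A-Za-z]*$
    // becomes
    //   ^((^|[^A-Za-z]++)[A-Z][A-Za-z]*+)*+[^A-Za-z]*+$
    static final Pattern POSSESSIVE = Pattern.compile(
            "^((^|[^A-Za-z]++)[A-Z][A-Za-z]*+)*+[^A-Za-z]*+$");

    // True when every run of letters in the text starts with an
    // uppercase letter; digits, punctuation and newlines are separators.
    static boolean allWordsCapitalized(String text) {
        return POSSESSIVE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(allWordsCapitalized("Aqua,\nGlycerin,\nPalmitic Acid")); // true
        System.out.println(allWordsCapitalized("PEG-90M, C12-13 Pareth-3"));        // true
        System.out.println(allWordsCapitalized("Aqua, glycerin"));                  // false
    }
}
```

Because no quantifier can give characters back, a non-matching input is rejected after a single pass over the text instead of after an exponential search.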