Regex for receipt items

Time: 2016-02-03 22:14:52

Tags: regex

I have a simple receipt, and I need to be able to read the items purchased on it. A sample receipt is below.

               Tim Hortons
              Alwasy Fresh

1   Brek Wrap Combo /A          ($0.76)
1   Bacon-wrap                  $3.79
1   Grilled                     $0.00
1   5 Pieces Bacon-wrap         $0.00
1   Orange                      $1.40
1   Deposit                     $0.10
Subtotal:                       $55.84
GST:                $0.29
Debit:                          $55.84
Take out

         Thanks for stopping by!!
           Tell us how we did

I came up with the following regex to find the items.

\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}

It works for the most part, but it also produces some incorrect matches, such as

4
GST:                $0.29

Can anyone come up with a better pattern? The original post linked to a live demo of the regex in action.

3 Answers:

Answer 0 (score: 1)

Here is my attempt:

^(\d+)\s+(.*)\s+\(?(\$.+)\)?$

Remember to turn on the multiline option. Components:

^         - beginning of line
(\d+)     - capture the quantity at the beginning of each line item
\s+       - one or more whitespace characters
(.*)      - capture the item description
\s+       - one or more whitespace characters
\(?       - optional open bracket `(` character
(\$.+)    - capture the dollar sign and everything after it
\)?       - optional close bracket `)` character
$         - end of line
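
As a quick check, the pattern can be run in JavaScript with the m flag. This is only a sketch: the receipt string is a shortened copy of the sample, and the helper name parseItems is made up for illustration. One thing it shows is that the greedy (.*) keeps the trailing padding and the greedy (\$.+) also swallows the closing parenthesis on the first line, so the sketch trims both afterwards.

```javascript
// Shortened copy of the sample receipt (assumption: same column layout).
const receipt = [
  "               Tim Hortons",
  "1   Brek Wrap Combo /A          ($0.76)",
  "1   Bacon-wrap                  $3.79",
  "Subtotal:                       $55.84",
  "GST:                $0.29",
].join("\n");

// The pattern from this answer; "m" makes ^ and $ match at line boundaries,
// "g" lets matchAll walk every line item.
const itemRe = /^(\d+)\s+(.*)\s+\(?(\$.+)\)?$/gm;

// parseItems is a hypothetical helper name, not part of the answer.
function parseItems(text) {
  return [...text.matchAll(itemRe)].map((m) => ({
    quantity: Number(m[1]),
    description: m[2].trim(),        // greedy (.*) keeps trailing padding
    price: m[3].replace(/\)$/, ""),  // greedy (\$.+) keeps the closing ")"
  }));
}

const items = parseItems(receipt);
console.log(items);
```

Header, subtotal and tax lines are skipped because they do not start with a digit.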

Answer 1 (score: 0)

You can use

^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)

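See the regex demo.

This regex should be used with the /m modifier so that it matches data on separate lines. In JS, the /g modifier is also needed.

Explanation

  • ^ - start of line
  • (\d+) - Group 1, capturing one or more digits
  • \s+ - one or more whitespace characters
  • (.*?) - Group 2, capturing zero or more characters other than a newline, as few as possible
  • \s+ - one or more whitespace characters
  • \(? - an optional ( (present on the first line item)
  • \$ - a literal $
  • (\d+\.\d+) - Group 3, capturing one or more digits followed by . and one or more digits

JS demo: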

var re = /^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)/gm; 
var str = '               Tim Hortons\n              Alwasy Fresh\n\n1   Brek Wrap Combo /A          ($0.76)\n1   Bacon-wrap                  $3.79\n1   Grilled                     $0.00\n1   5 Pieces Bacon-wrap         $0.00\n1   Orange                      $1.40\n1   Deposit                     $0.10\nSubtotal:                       $55.84\nGST:                $0.29\nDebit:                          $55.84\nTake out\n\n         Thanks for stopping by!!\n           Tell us how we did';

var m; // declare the match variable instead of leaking an implicit global
while ((m = re.exec(str)) !== null) {
    document.body.innerHTML += "Pcs: <b>" + m[1] + "</b>, item: <b>" + m[2] + "</b>, paid: <b>" + m[3] + "</b><br/>";
}

Answer 2 (score: 0)

I see a number of problems with the original regex:

\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}

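First, parentheses both group and capture, but when you quantify a capturing group, only the last iteration is captured. Something like (.)* therefore stores only the last character; you want (.*). Because the quantifier is greedy, that last character is the one just before the whitespace that precedes the dollar sign, which in your data is always a space. Likewise, at the start you quantify a group with (\s){1,10}, which captures only the last whitespace character. You don't need the group there at all, since \s is already a single whitespace character, so you can simply use \s{1,10}.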
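
The difference between quantifying a group and grouping a quantifier is easy to see in a short JavaScript sketch. The sample line is copied from the receipt; the variable names are illustrative only.

```javascript
const line = "1   Bacon-wrap                  $3.79";

// Original form: the group sits inside the quantifier, so each iteration
// overwrites the capture and only the last one survives.
const quantifiedGroup = line.match(/\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}/);

// Suggested form: the quantifier sits inside the group, so the whole run is captured.
const groupedQuantifier = line.match(/\d\s{1,10}(.*)\s{1,}\$\d\.[0-9]{2}/);

console.log(JSON.stringify(quantifiedGroup[1]));   // only the last whitespace character
console.log(JSON.stringify(quantifiedGroup[2]));   // only the last character matched by "."
console.log(JSON.stringify(groupedQuantifier[1])); // the description plus its trailing padding
```

The padding captured by (.*) is what the (.*\S) form in the capturing solution avoids.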

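Here is a piece-by-piece explanation of that regex.

Capturing solution

The following regex captures the quantity ($1), the item description ($2), whether the price is parenthesized ($3), and the price ($4):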

^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$

It is explained and matched against your sample at regex101.

Broken apart and commented (assuming the /x flag is supported):

/          # begin regex
 ^\s*      # start of line, ignore leading spaces if present
 (\d+)     # $1 = quantity
 \s+       # spacing as a delimiter
 (.*\S)    # $2 = item: contains anything, must end in a non-space char
 \s+       # spacing as a delimiter
 (\(?)     # $3 = negation, an optional open parenthesis
 \$        # dollar sign
 ([0-9.]+) # $4 = price
 \)?\s*$   # trailing characters: optional end-paren and space(s)
/x         # end regex; /x is the extended flag that allows this spaced, commented layout

Example perl code, run from the command line:

perl -ne '
  my ($quantity, $item, $neg, $price)
    = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/;
  if ($item) {
    if ($neg) { $price *= -1; }
    print "<$quantity><$item><$price>\n"
  }' RECEIPT_FILE

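(If you want this as a perl script, wrap the code in while(<>) { } and you're done.)

This assigns the variables $quantity, $item, and $price for each line item on the receipt. I am assuming that parenthesized items are to be subtracted (I can't verify this, because the totals are nonsensical), so $neg records the presence of the parenthesis and $price can be negated.

I set the output to use angle brackets (<>) to show what each variable stores.

The output for your given sample receipt is therefore: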

<1><Brek Wrap Combo /A><-0.76>
<1><Bacon-wrap><3.79>
<1><Grilled><0.00>
<1><5 Pieces Bacon-wrap><0.00>
<1><Orange><1.40>
<1><Deposit><0.10>
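
For readers working in JavaScript rather than Perl, roughly the same logic might look like the sketch below. It reuses the regex from this answer unchanged; the function name parseLine and the shape of the returned object are assumptions made for illustration.

```javascript
// Same pattern as the Perl example: $1 quantity, $2 item, $3 "(" if present, $4 price.
const lineRe = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/;

// parseLine is a hypothetical helper; it returns null for non-item lines.
function parseLine(line) {
  const m = lineRe.exec(line);
  if (m === null) return null;
  const [, quantity, item, neg, price] = m;
  return {
    quantity: Number(quantity),
    item,
    // A leading "(" is assumed to mean the amount is subtracted.
    price: neg ? -Number(price) : Number(price),
  };
}

const sample = [
  "1   Brek Wrap Combo /A          ($0.76)",
  "1   5 Pieces Bacon-wrap         $0.00",
  "Subtotal:                       $55.84",
];
const parsed = sample.map(parseLine).filter((x) => x !== null);
console.log(parsed);
```

Because each line is tested separately, no multiline flag is needed here, which mirrors how perl -n feeds the regex one line at a time.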

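Price-only solution

You didn't say exactly what you want to match. If you only care about the price and there are no negative values, you don't need capture groups, provided you have lookbehind or \K: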

grep -Po '^\s*[0-9].*\$\K[0-9.]+' RECEIPT_FILE

grep's -P flag invokes libpcre (which may be unavailable on older or embedded systems), and -o shows only the matched text. \K marks the start of the match. If you want to capture the \$ as well, put the \K directly before it. (See also the regex101 description and matches.)

Output of that grep command:

0.76
3.79
0.00
0.00
1.40
0.10

Price only - using awk

There isn't a great way to make this regex efficient. If you are processing a large amount of content, you will feel the pain. Here is a solution using awk, which should be noticeably faster. (With small input, the difference won't be noticeable.)

awk '$1 / 1 > 0 && $NF ~ /\$/ { gsub(/[()]/, "", $0); print $NF; }' RECEIPT_FILE

Commented version:

awk '
  # if the quantity is indeed a number and the last field has a dollar sign
  $1 / 1 > 0 && $NF ~ /\$/ {
    gsub(/[()]/, "", $NF);   # remove all parentheses from the last field
    print $NF;               # print the contents of the last field
  }' RECEIPT_FILE

Price only - using awk, supporting negative prices

awk '
  # if the quantity is indeed a number and the last field has a dollar sign
  $1 / 1 > 0 && $NF ~ /\$/ {
    neg = 1;
    if ( $NF ~ /\(/ ) {      # the last field has an open parenthesis
      neg = -1;
    }
    gsub(/[()$]/, "", $NF);  # strip parentheses and the dollar sign so the field is numeric
    print $NF * neg;         # print the last field, negated if parenthesized
  }' RECEIPT_FILE
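
JavaScript has no \K, but a lookbehind can play the same role as in the grep command, so a price-only extraction could be sketched as follows. This assumes an engine with lookbehind support (ES2018 or later); the shortened receipt string and the variable names are illustrative, and treating a parenthesized amount as negative is the same assumption made earlier in this answer.

```javascript
const receipt = [
  "1   Brek Wrap Combo /A          ($0.76)",
  "1   Bacon-wrap                  $3.79",
  "1   5 Pieces Bacon-wrap         $0.00",
  "Subtotal:                       $55.84",
  "GST:                $0.29",
].join("\n");

// Lookbehind stands in for \K: require "line starts with a digit ... $"
// before the match without including it in the result.
const priceRe = /(?<=^\s*[0-9].*\$)[0-9.]+/gm;
const prices = receipt.match(priceRe);
console.log(prices);

// Signed variant: the lazy .*? leaves an optional "(" for the capture group,
// and its presence flips the sign of the amount.
const signedRe = /^\s*[0-9].*?(\(?)\$([0-9.]+)/gm;
const signed = [...receipt.matchAll(signedRe)].map(
  (m) => (m[1] ? -1 : 1) * Number(m[2])
);
console.log(signed);
```

The first pattern returns the prices as strings; the second converts them to numbers so they can be summed.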
