Question

I get this text from a pdf invoice:

INVOCE DATE            Nº ITEM          CONTRACT DATA 

10/10/15           EN56000004567WWG      Standard Plan 3

  CONCEPT        AMOUNT       MONTHS   UNITPRIZE     PRIZE

CONCEPT AAA    47,101   MB      1,0    3,394074   159,86   Dollars
CONCEPT BBB    26,122   MB      1,0    3,394074    88,66   Dollars
CONCEPT CCC    37,101   MB      1,0    3,394074   125,92   Dollars

                       TOTAL       374,44 Dollars

This text is actually a table with several lines but only one column: the values are separated only by whitespace, and the number of spaces differs on almost every line.

What I want is to get the amounts "47,101" , "26,122", "37,101" with a specific regex for each one based on their concept, for example: regex1 gets "47,101" looking for "CONCEPT AAA" and so on.

I have managed to get "CONCEPT AAA 47,101" using this R line:

regmatches(invoice,regexpr("\\bCONCEPT AAA\\s*([-,0-9]+)", invoice, perl=TRUE))

but I only want the number "47,101".

ADDITIONAL INFO

To read the PDF I use the readPDF function from the tm package in R, which outputs this table as a character vector.

Because there are many invoices with slight differences in layout, I prefer to extract the data with regexes rather than attempt a better PDF-to-table conversion.

BONUS:

Next, I would like to get the prices for each concept: "159,86", "88,66", "125,92".

Answer 1

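If you are willing to replace the arbitrary whitespace with a single space, then you can use a lookbehind. This can be done inline, but it is a bit messy: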

regmatches(
    x = gsub("\\s+", replacement = " ", x = invoice),
    m = regexpr(
        "(?<=\\bCONCEPT AAA\\s)([-,0-9]+)",
        gsub("\\s+", replacement = " ", x = invoice),
        perl = TRUE
    )
)
# [1] "47,101"

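It seems more straightforward if you pre-process the text first:

# Collapse every run of whitespace to a single space once, then reuse the result
invoice_onespace <- gsub("\\s+", replacement = " ", x = invoice)
regmatches(
    x = invoice_onespace,
    m = regexpr(
        "(?<=\\bCONCEPT AAA\\s)([-,0-9]+)",
        invoice_onespace,
        perl = TRUE
    )
)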
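If you would rather not normalize the whitespace, a capture group is another option. The following is only a sketch, not part of the approach above: it uses regexec instead of a lookbehind, the sample lines are copied from the question, and it assumes every data line keeps the column order shown there (amount, unit, months, unit price, price). It also picks up the price asked for in the bonus.

```r
# Sample lines copied from the question; the real `invoice` would come from readPDF().
invoice <- c(
  "CONCEPT AAA    47,101   MB      1,0    3,394074   159,86   Dollars",
  "CONCEPT BBB    26,122   MB      1,0    3,394074    88,66   Dollars",
  "CONCEPT CCC    37,101   MB      1,0    3,394074   125,92   Dollars"
)

# Group 1 = amount, group 2 = price.
# Assumption: columns appear as amount, unit, months, unit price, price.
pattern <- "\\bCONCEPT AAA\\s+([-,0-9]+)\\s+\\S+\\s+[-,0-9]+\\s+[-,0-9]+\\s+([-,0-9]+)"

# regexec() keeps the capture groups; non-matching lines yield character(0).
hits <- regmatches(invoice, regexec(pattern, invoice, perl = TRUE))
hits <- hits[lengths(hits) > 0]

amount <- hits[[1]][2]  # first capture group
price  <- hits[[1]][3]  # second capture group
```

Each element of hits holds the full match followed by the captured groups, so the amount and the price come out of a single pass over the text.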

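Since the point seems to be that you can use the matched string as a variable, it is hard to put together something like "\\bCONCEPT\\b(\\s[A-Za-z]*)*\\s(?=[0-9])" with paste. A pattern like that one could be used to extract all the concepts, and each extracted concept would then go into the lookbehind assertion. Without more information about the possible values of "concept", I can't make a better guess at the right regex. I'm assuming they are not all 3-capital-letter strings like in your example. If they are, "\\bCONCEPT\\b\\s[A-Z]{3}\\b" should work.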
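As a rough sketch of that idea, the concept labels can be collected first and then pasted into the lookbehind one at a time. The helper name extract_amount is hypothetical, the sample lines are copied from the question, and the code assumes the labels are three capital letters and contain no regex metacharacters.

```r
# Sample lines copied from the question.
invoice <- c(
  "CONCEPT AAA    47,101   MB      1,0    3,394074   159,86   Dollars",
  "CONCEPT BBB    26,122   MB      1,0    3,394074    88,66   Dollars",
  "CONCEPT CCC    37,101   MB      1,0    3,394074   125,92   Dollars"
)

# Hypothetical helper: builds the lookbehind for one concept label and
# returns the number that follows it.
# Assumption: `concept` contains no regex metacharacters.
extract_amount <- function(concept, text) {
  one_space <- gsub("\\s+", " ", text)  # collapse runs of whitespace
  pattern <- paste0("(?<=\\b", concept, "\\s)[-,0-9]+")
  regmatches(one_space, regexpr(pattern, one_space, perl = TRUE))
}

# Collect the labels, assuming each is "CONCEPT" plus three capital letters.
concepts <- regmatches(
  invoice,
  regexpr("\\bCONCEPT\\b\\s[A-Z]{3}\\b", invoice, perl = TRUE)
)

# One amount per concept, returned as a character vector named by concept.
amounts <- vapply(concepts, extract_amount, character(1), text = invoice)
```

vapply with character(1) will raise an error if a concept matches zero lines or more than one line, which may be a useful check when invoices vary in layout.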
