I extracted this text from a PDF invoice:
INVOCE DATE Nº ITEM CONTRACT DATA
10/10/15 EN56000004567WWG Standard Plan 3
CONCEPT AMOUNT MONTHS UNITPRIZE PRIZE
CONCEPT AAA 47,101 MB 1,0 3,394074 159,86 Dollars
CONCEPT BBB 26,122 MB 1,0 3,394074 88,66 Dollars
CONCEPT CCC 37,101 MB 1,0 3,394074 125,92 Dollars
TOTAL 374,44 Dollars
This text is actually a table with several rows but only one column, in which the values are separated only by a varying number of whitespace characters on almost every line.
What I want is to extract the amounts "47,101", "26,122" and "37,101", with a specific regex for each one based on its concept. For example, regex1 gets "47,101" by looking for "CONCEPT AAA", and so on.
I have managed to get "CONCEPT AAA 47,101" using this line of R:
regmatches(invoice,regexpr("\\bCONCEPT AAA\\s*([-,0-9]+)", invoice, perl=TRUE))
but I only want the number "47,101".
ADDITIONAL INFO
To read the PDF I use the readPDF function from the tm package in R, which returns this table as a character vector.
Because there are many invoices with slight differences in layout, I prefer to extract the data with regexes rather than look for a better PDF-to-table conversion.
BONUS:
Next, I would like to get the prices for each concept: "159,86", "88,66", "125,92".
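Answer 0 (score: 0)
If you are willing to replace arbitrary whitespace with a single space, you can use a lookbehind. The replacement is needed because a PCRE lookbehind cannot contain an unbounded quantifier such as \\s*. This can be done inline, but it's a bit messy: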
regmatches(
  x = gsub("\\s+", replacement = " ", x = invoice),
  m = regexpr(
    "(?<=\\bCONCEPT AAA\\s)([-,0-9]+)",
    gsub("\\s+", replacement = " ", x = invoice),
    perl = TRUE
  )
)
# [1] "47,101"
If you pre-process the text first, it reads more straightforwardly:
invoice_onespace = gsub("\\s+", replacement = " ", x = invoice)
regmatches(
  x = invoice_onespace,
  m = regexpr(
    "(?<=\\bCONCEPT AAA\\s)([-,0-9]+)",
    invoice_onespace,
    perl = TRUE
  )
)
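If replacing the whitespace is not desirable, one alternative is PCRE's \\K, which drops everything matched so far from the reported match, so a variable-width \\s+ can stay in the pattern. This is a minimal sketch, assuming invoice is a character vector with one table row per element; the sample lines and their spacing are invented for illustration:

```r
# Sample text, assumed to mimic the readPDF output (the irregular spacing is made up)
invoice <- c(
  "CONCEPT                           AMOUNT      MONTHS UNITPRIZE  PRIZE",
  "CONCEPT AAA     47,101 MB  1,0   3,394074   159,86 Dollars",
  "CONCEPT BBB   26,122 MB 1,0  3,394074    88,66 Dollars",
  "CONCEPT CCC  37,101 MB   1,0 3,394074  125,92 Dollars",
  "TOTAL   374,44 Dollars"
)

# \\K discards the label and the whitespace from the match,
# so only the number that follows is returned
amount_aaa <- regmatches(
  invoice,
  regexpr("\\bCONCEPT AAA\\s+\\K[-,0-9]+", invoice, perl = TRUE)
)
amount_aaa
# [1] "47,101"
```

Elements that do not match are dropped by regmatches, so the result has length one here.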
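Since the point of this seems to be that you can use the matched string as a variable, it shouldn't be hard to put the pattern together with paste. A pattern like "\\bCONCEPT\\b(\\s[A-Za-z]*)*\\s(?=[0-9])" could be used to extract the concepts, as a way to get all of them for use in the lookbehind assertion. Without more information on the possible values of "concept", I can't make a better guess at the right regex; I'm assuming they are not all 3-uppercase-letter strings as in your example. If they are, then "\\bCONCEPT\\b\\s[A-Z]{3}\\b" should work.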
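To make the paste idea concrete, here is a hedged sketch that builds one pattern per concept and also pulls the price asked for in the bonus. The helper name extract_concept is hypothetical, the sample lines are invented, and the sketch assumes that the concepts are three uppercase letters, that each price is the last number before the word "Dollars", and that a concept name contains no regex metacharacters:

```r
# Sample lines, assumed to resemble the readPDF output (the spacing is invented)
invoice <- c(
  "CONCEPT AAA     47,101 MB  1,0   3,394074   159,86 Dollars",
  "CONCEPT BBB   26,122 MB 1,0  3,394074    88,66 Dollars",
  "CONCEPT CCC  37,101 MB   1,0 3,394074  125,92 Dollars",
  "TOTAL   374,44 Dollars"
)

# Hypothetical helper: returns the amount and the price for one concept
extract_concept <- function(lines, concept) {
  pattern <- paste0(
    "\\bCONCEPT ", concept,          # the concept label, e.g. "CONCEPT AAA"
    "\\s+([-,0-9]+)",                # group 1: the amount right after the label
    ".*\\s([-,0-9]+)\\s+Dollars"     # group 2: the last number before "Dollars"
  )
  hits <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  hits <- hits[lengths(hits) > 0]    # drop the lines that did not match
  if (length(hits) == 0) {
    return(c(amount = NA_character_, price = NA_character_))
  }
  c(amount = hits[[1]][2], price = hits[[1]][3])
}

# Discover the labels with the three-uppercase-letter pattern, then strip the prefix
labels <- unlist(regmatches(
  invoice,
  gregexpr("\\bCONCEPT\\b\\s[A-Z]{3}\\b", invoice, perl = TRUE)
))
concepts <- sub("^CONCEPT\\s", "", labels)

# One row per concept, with columns "amount" and "price"
result <- t(sapply(concepts, extract_concept, lines = invoice))
result
```

The values come back as strings with a decimal comma; something like as.numeric(sub(",", ".", x, fixed = TRUE)) would convert them if numbers are needed.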