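I'm new to Python. I'm trying to use a regular expression to extract a dollar amount from a substring. It works in most cases, but I've run into a few problems I can't solve.
The extracted amount is a string, and because of the commas it is not recognized as a number. It also doesn't work for small amounts under $1 (for example 0.89). There is no leading $. Any help would be greatly appreciated.
Here is what I have: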
df['Amount']=df['description'].str.extract('(\d{1,3}?(\,\d{3})*\.\d{2})')
Here is a string that should be parsed:
000000000463 NYC DOF OPA CONCENTRATION ACCT. *00029265 07/01/2013 AP5378 1,107,844.38 Ven000000000463 Vch:00029265
I'm trying to extract the amount 1,107,844.38 into a separate column of the DataFrame. I don't have any strings that should be rejected.
Answer 0 (score: 0)
You can try a regex such as
rx = r"\b(?<!/)(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)\b(?!/)"
df['Amount']=df['description'].str.extract(rx)
See the regex demo.
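The question also mentions that the extracted value is still a string with commas in it. Here is a minimal sketch of one way to turn it into a number, assuming pandas is available. The second description row is a made-up example for the sub-$1 case, and `amount_text` is just an illustrative name.

```python
import pandas as pd

rx = r"\b(?<!/)(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)\b(?!/)"

df = pd.DataFrame({"description": [
    "000000000463 NYC DOF OPA CONCENTRATION ACCT. *00029265 07/01/2013 "
    "AP5378 1,107,844.38 Ven000000000463 Vch:00029265",
    "small payment 0.89 received",  # hypothetical row, not from the question
]})

# expand=False returns a Series (one capture group) instead of a DataFrame
amount_text = df["description"].str.extract(rx, expand=False)

# Drop the thousands separators, then convert the strings to floats
df["Amount"] = pd.to_numeric(amount_text.str.replace(",", "", regex=False))
```

Rows where nothing matches end up as NaN, which `pd.to_numeric` passes through unchanged.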
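Details
\b - a word boundary
(?<!/) - no / immediately to the left of the current position (to avoid matching date values)
\d{1,3} - 1 to 3 digits
(?:,\d{3})* - 0 or more repetitions of , followed by 3 digits
(?:\.\d{2})? - an optional . followed by 2 digits
\b - a word boundary
(?!/) - no / immediately to the right of the current position (to avoid matching date values)

Answer 1 (score: 0)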
Given the example string:
"000000000463 NYC DOF OPA CONCENTRATION ACCT. *00029265 07/01/2013 AP5378 1,107,844.38 Ven000000000463 Vch:00029265"
This is what I came up with:
import re

# subject is the example string shown above
match = re.search(r"(?P<amount>\$?(?:\d+,)*\d+\.\d+)", subject)
if match:
    result = match.group("amount")  # result will be "1,107,844.38"
else:
    result = ""
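This extracts the amount. It also handles small amounts such as 0.38, numbers without thousands separators such as 123456789.38, and amounts with or without a leading dollar sign $.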
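To get a number rather than a string out of the match, the dollar sign and commas have to be removed before converting. A minimal sketch, using the same pattern; the `parse_amount` helper and its inputs are illustrative names, and `Decimal` is my choice here to avoid float rounding on money values:

```python
import re
from decimal import Decimal

AMOUNT_RE = re.compile(r"(?P<amount>\$?(?:\d+,)*\d+\.\d+)")

def parse_amount(text):
    """Return the first amount found in text as a Decimal, or None."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    # Strip the optional dollar sign and the thousands separators
    cleaned = match.group("amount").replace("$", "").replace(",", "")
    return Decimal(cleaned)

small = parse_amount("fee 0.38")
plain = parse_amount("total 123456789.38 due")
dollar = parse_amount("paid $1,107,844.38")
missing = parse_amount("no amount here")
```

Note that this pattern requires a decimal point, so a whole-dollar value such as 500 would not be matched.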
Regex details
(?P<amount>\$?(?:\d+,)*\d+\.\d+) Match the regular expression below and capture its match into the group named "amount"
  \$? Match the character "$" literally
    ? Between zero and one times, as many times as possible, giving back as needed (greedy)
  (?:\d+,)* Match the regular expression below
    * Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    \d+ Match a single digit 0..9
      + Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    , Match the character "," literally
  \d+ Match a single digit 0..9
    + Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  \. Match the character "." literally
  \d+ Match a single digit 0..9
    + Between one and unlimited times, as many times as possible, giving back as needed (greedy)
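The same breakdown can be kept next to the pattern itself by writing it with `re.VERBOSE`, which ignores whitespace and allows comments inside the regex. This is only a sketch of that style, not a different pattern; the variable names are illustrative.

```python
import re

AMOUNT_RE = re.compile(r"""
    (?P<amount>      # capture the whole amount in the named group "amount"
        \$?          # optional literal dollar sign
        (?:\d+,)*    # zero or more runs of digits, each followed by a comma
        \d+          # one or more digits before the decimal point
        \.           # literal decimal point
        \d+          # one or more digits after the decimal point
    )
""", re.VERBOSE)

subject = ("000000000463 NYC DOF OPA CONCENTRATION ACCT. *00029265 07/01/2013 "
           "AP5378 1,107,844.38 Ven000000000463 Vch:00029265")

match = AMOUNT_RE.search(subject)
result = match.group("amount") if match else ""
```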