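I want to parse decimal numbers regardless of their format, which is unknown. The language of the original text is unknown and may vary. In addition, the source string can contain extra text before or after the number, such as a currency or unit.
I'm using the following: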
# NOTE: Do not use, this algorithm is buggy. See below.
import re

def extractnumber(value):

    if (isinstance(value, int)): return value
    if (isinstance(value, float)): return value

    result = re.sub(r'&#\d+', '', value)
    result = re.sub(r'[^0-9\,\.]', '', result)

    if (len(result) == 0): return None

    numPoints = result.count('.')
    numCommas = result.count(',')

    result = result.replace(",", ".")

    if ((numPoints > 0 and numCommas > 0) or (numPoints == 1) or (numCommas == 1)):
        decimalPart = result.split(".")[-1]
        integerPart = "".join ( result.split(".")[0:-1] )
    else:
        integerPart = result.replace(".", "")

    result = int(integerPart) + (float(decimalPart) / pow(10, len(decimalPart) ))

    return result
This kind of works...
>>> extractnumber("2")
2
>>> extractnumber("2.3")
2.3
>>> extractnumber("2,35")
2.35
>>> extractnumber("-2 000,5")
-2000.5
>>> extractnumber("EUR 1.000,74 €")
1000.74
>>> extractnumber("20,5 20,8") # Testing failure...
ValueError: invalid literal for int() with base 10: '205 208'
>>> extractnumber("20.345.32.231,50") # Returns false positive
2034532231.5
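So my approach seems very fragile to me, and it produces many false positives.
Is there a library or smart function that can handle this? Ideally, 20.345.32.231,50 would not pass, but numbers in other formats such as 1.200,50 or 1 200'50 would be extracted, regardless of the amount of surrounding text and characters (including newlines).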
(Updated implementation based on the accepted answer: https://github.com/jjmontesl/cubetl/blob/master/cubetl/text/functions.py#L91)
Answer 0 (score: 4)
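You can do this with a suitably fancy regular expression. Here's my best attempt at one. I use named capturing groups, because with a pattern this complex, numbered groups would be much more confusing to use in backreferences.
First, the regular expression: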
import re

_pattern = r"""(?x)       # enable verbose mode (which ignores whitespace and comments)
    ^                     # start of the input
    [^\d+,.-]*            # prefixed junk (anything but digits, signs, and separators)
    (?P<number>           # capturing group for the whole number
        (?P<sign>[+-])?       # sign group (optional)
        (?P<integer_part>     # capturing group for the integer part
            \d{1,3}               # leading digits in an int with a thousands separator
            (?P<sep>              # capturing group for the thousands separator
                [ ,.]                 # the allowed separator characters
            )
            \d{3}                 # exactly three digits after the separator
            (?:                   # non-capturing group
                (?P=sep)              # the same separator again (a backreference)
                \d{3}                 # exactly three more digits
            )*                    # repeated 0 or more times
        |                     # or
            \d+                   # simple integer (just digits with no separator)
        )?                    # integer part is optional, to allow numbers like ".5"
        (?P<decimal_part>     # capturing group for the decimal part of the number
            (?P<point>            # capturing group for the decimal point
                (?(sep)               # conditional pattern, only tested if sep matched
                    (?!                   # a negative lookahead
                        (?P=sep)              # backreference to the separator
                    )
                )
                [.,]                  # the accepted decimal point characters
            )
            \d+                   # one or more digits after the decimal point
        )?                    # the whole decimal part is optional
    )
    [^\d]*                # suffixed junk
    $                     # end of the input
"""
Here's a function that uses it:
def parse_number(text):
    match = re.match(_pattern, text)
    if match is None or not (match.group("integer_part") or
                             match.group("decimal_part")):    # failed to match
        return None                      # consider raising an exception instead

    num_str = match.group("number")      # get all of the number, without the junk
    sep = match.group("sep")
    if sep:
        num_str = num_str.replace(sep, "")     # remove thousands separators

    if match.group("decimal_part"):
        point = match.group("point")
        if point != ".":
            num_str = num_str.replace(point, ".")  # regularize the decimal point
        return float(num_str)

    return int(num_str)
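To see the named backreference from the pattern in isolation, here is a minimal sketch that checks only one thing: that a grouped integer uses the same thousands separator throughout. The name `grouped` and the reduced pattern are illustrative assumptions, not part of the code above.

```python
import re

# Hypothetical reduced pattern: 1-3 leading digits, then groups of exactly
# three digits, all joined by whichever separator was matched first.
grouped = re.compile(r"^\d{1,3}(?P<sep>[ ,.])\d{3}(?:(?P=sep)\d{3})*$")

print(bool(grouped.match("1,234,567")))  # True: the comma is used consistently
print(bool(grouped.match("1,234.567")))  # False: the second separator differs
print(bool(grouped.match("1 234 567")))  # True: a space works as a separator too
```

Because `(?P=sep)` only matches the exact text captured by `sep`, mixed separators are rejected without listing every combination by hand.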
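Some numeric strings that contain exactly one comma or period followed by exactly three digits (like "1,234" and "1.234") are ambiguous. This code parses both of them as integers with a thousands separator (1234) rather than as floating-point values (1.234), regardless of which separator character is actually used. If you want a different outcome for those numbers (for example, if you'd prefer to get a float from 1.234), you could handle that with a special case.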
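One possible way to add such a special case is a small pre-check that runs before the general parser. In this sketch, `resolve_ambiguous` and its `decimal_point` parameter are hypothetical names introduced for illustration. The function returns None for any input that is not of the ambiguous form, so the caller can fall back to parse_number.

```python
import re

# Ambiguous form: 1-3 digits, a single '.' or ',', exactly three digits,
# and no other digits anywhere in the string.
_AMBIGUOUS = re.compile(r"^[^\d+-]*([+-]?\d{1,3})([.,])(\d{3})\D*$")

def resolve_ambiguous(text, decimal_point="."):
    """Interpret strings like '1.234' according to an assumed decimal point."""
    match = _AMBIGUOUS.match(text)
    if match is None:
        return None  # not the ambiguous form; use the general parser instead
    integer, sep, fraction = match.groups()
    if sep == decimal_point:
        return float(integer + "." + fraction)  # treat the separator as a decimal point
    return int(integer + fraction)              # treat it as a thousands separator

print(resolve_ambiguous("1.234"))                     # 1.234
print(resolve_ambiguous("1,234"))                     # 1234
print(resolve_ambiguous("1,234", decimal_point=","))  # 1.234
print(resolve_ambiguous("12.345,6"))                  # None
```

The choice of `decimal_point` has to come from outside knowledge (for example, a known locale), since the string alone cannot settle it.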
Some test output:
>>> test_cases = ["2", "2.3", "2,35", "-2 000,5", "EUR 1.000,74 €",
...               "20,5 20,8", "20.345.32.231,50", "1.234"]
>>> for s in test_cases:
...     print("{!r:20}: {}".format(s, parse_number(s)))
...
'2'                 : 2
'2.3'               : 2.3
'2,35'              : 2.35
'-2 000,5'          : -2000.5
'EUR 1.000,74 €'    : 1000.74
'20,5 20,8'         : None
'20.345.32.231,50'  : None
'1.234'             : 1234
Answer 1 (score: 2)
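I've changed your code a bit. Together with the valid_number function below, this should do the trick.
The main reason I spent time writing this awful piece of code is to show future readers how awful parsing can get if you don't know how to use regular expressions (like me, for instance).
Hopefully someone who knows regular expressions better than I do can show us how it should be done :)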
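The code makes the following assumptions:
- ".", "," and "'" are accepted as both thousands separators and decimal separators.
- If the string contains just one separator character, it is treated as the decimal separator (so 123,456 is interpreted as 123.456, not 123456).
- The input is split into candidate numbers on spaces (' ').
- In a number with thousands separators, the first group has one to three digits and every following group exactly three (123,456.00 and 1,345.00 are both considered valid, but 2345,11.00 is not).
import re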
from itertools import combinations

def extract_number(value):
    if isinstance(value, (int, float)):
        yield float(value)
    else:
        # Strip the string for leading and trailing whitespace
        value = value.strip()
        if len(value) == 0:
            # End the generator; raising StopIteration here is a RuntimeError since Python 3.7
            return
        for s in value.split(' '):
            s = re.sub(r'&#\d+', '', s)
            s = re.sub(r'[^\-\s0-9\,\.]', ' ', s)
            s = s.replace(' ', '')
            if len(s) == 0:
                continue
            if not valid_number(s):
                continue
            if not sum(s.count(sep) for sep in [',', '.', '\'']):
                yield float(s)
            else:
                s = s.replace('.', '@').replace('\'', '@').replace(',', '@')
                integer, decimal = s.rsplit('@', 1)
                integer = integer.replace('@', '')
                s = '.'.join([integer, decimal])
                yield float(s)
Well, the code here could probably be replaced by a few regex statements.
def valid_number(s):
    def _correct_integer(integer):
        # First number should have length of 1-3
        if not (0 < len(integer[0].replace('-', '')) < 4):
            return False
        # All the rest of the integers should be of length 3
        for num in integer[1:]:
            if len(num) != 3:
                return False
        return True

    seps = ['.', ',', '\'']
    n_seps = [s.count(k) for k in seps]

    # If no separator is present
    if sum(n_seps) == 0:
        return True

    # If all separators are present
    elif all(n_seps):
        return False

    # If two separators are present
    elif any(all(c) for c in combinations(n_seps, 2)):
        # Find thousand separator
        for c in s:
            if c in seps:
                tho_sep = c
                break

        # Find decimal separator:
        for c in reversed(s):
            if c in seps:
                dec_sep = c
                break

        s = s.split(dec_sep)

        # If it is more than one decimal separator
        if len(s) != 2:
            return False

        integer = s[0].split(tho_sep)

        return _correct_integer(integer)

    # If one separator is present, and it is more than one of it
    elif sum(n_seps) > 1:
        for sep in seps:
            if sep in s:
                s = s.split(sep)
                break
        return _correct_integer(s)

    # Otherwise, this is a regular decimal number
    else:
        return True
extract_number('2' ): [2.0]
extract_number('.2' ): [0.2]
extract_number(2 ): [2.0]
extract_number(0.2 ): [0.2]
extract_number('EUR 200' ): [200.0]
extract_number('EUR 200.00 -11.2' ): [200.0, -11.2]
extract_number('EUR 200 EUR 300' ): [200.0, 300.0]
extract_number('$ -1.000,22' ): [-1000.22]
extract_number('EUR 100.2345,3443' ): []
extract_number('111,145,234.345.345'): []
extract_number('20,5 20,8' ): [20.5, 20.8]
extract_number('20.345.32.231,50' ): []
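The remark that valid_number could be replaced by regular expressions might look roughly like the sketch below. It is not a drop-in replacement: `looks_valid` is a hypothetical name, and some edge cases (such as a minus sign in an unusual position) are handled differently from valid_number. It only illustrates how the grouping rules can be stated in a single pattern.

```python
import re

_VALID = re.compile(r"""(?x)
    ^-?
    (?:
        \d{1,3}                      # first group: one to three digits
        (?P<tho>[.,'])\d{3}          # thousands separator and a three-digit group
        (?:(?P=tho)\d{3})*           # more groups with the same separator
        (?:(?!(?P=tho))[.,']\d+)?    # optional decimal part with a different separator
      |
        \d*(?:[.,']\d+)?             # plain digits with at most one separator
    )
    $""")

def looks_valid(s):
    """Return True if s follows the grouping rules sketched above."""
    return _VALID.match(s) is not None

for candidate in ["1.000,22", "123,456.00", "20,5", "2345,11.00",
                  "100.2345,3443", "20.345.32.231,50"]:
    print(candidate, looks_valid(candidate))
```

The first three candidates are accepted and the last three rejected, which agrees with the corresponding results listed above.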