Question

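Setup

I am using Selenium with Python 3.x to scrape product prices from the web.

I have a list of strings, one per product, each containing that product's price.

For prices below €1000, the string looks like '€ 505.93 net' (i.e. 505.93). For prices of €1000 and above, the string looks like '€ 1 505.93 net' (i.e. 1505.93).

Problem

I am not sure how to neatly handle both the whitespace in prices of a thousand and over and the decimal point.

Let product_price = '€ 1 505.93 net'. Then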

[int(s) for s in product_price if s.isdigit()]

gives

[1, 5, 0, 5, 9, 3]

The same approach on product_price = '€ 505.93 net' gives [5, 0, 5, 9, 3].

Question

How can I adjust the code to get 1505.93 and 505.93?

Answer 1

Here is one approach. We can match the following regex pattern, which uses a space as the thousands separator:

€\s*(\d{1,3}(?: \d{3})*(?:\.\d+)?)

The first capture group then contains the matched euro amount.

import re

# Avoid shadowing the built-in input(); use the same pattern as shown above.
text = '€ 1 505.93 net and here is another price € 505.93'
result = re.findall(r'€\s*(\d{1,3}(?: \d{3})*(?:\.\d+)?)', text)
print(result)

['1 505.93', '505.93']

Explanation of the regex:

€                  a Euro sign
\s*                followed by optional whitespace
(                  (capture what follows)
    \d{1,3}        one to three digits
    (?: \d{3})*    followed by zero or more thousands groups
    (?:\.\d+)?     an optional decimal component
)                  (close capture group)
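
To get from the captured strings to the numbers the question asks for, one option is to strip the separator spaces and call float(). This is only a minimal sketch, assuming prices always use a space as the thousands separator and a period as the decimal separator; the helper name extract_prices is made up for illustration:

```python
import re

# Same pattern as above: euro sign, optional whitespace, then the amount.
PRICE_RE = re.compile(r'€\s*(\d{1,3}(?: \d{3})*(?:\.\d+)?)')

def extract_prices(text):
    """Return every euro amount found in text as a float (illustrative helper)."""
    # Drop the thousands-separator spaces before converting.
    return [float(m.replace(' ', '')) for m in PRICE_RE.findall(text)]

print(extract_prices('€ 1 505.93 net'))
print(extract_prices('€ 505.93 net'))
```

Each call returns a list, so take the first element if every string is known to hold exactly one price.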

Answer 2

You need to use a regex for this:

import re
pattern = r'((?:\d\s)?\d+\.\d+)'
print(re.findall(pattern, '€ 1 505.93 and € 505.93'))
# ['1 505.93', '505.93']

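Explanation:

\d stands for a digit

\s stands for a whitespace character

?: marks a non-capturing group

? makes the preceding group optional

So

(?:\d\s)?

matches a digit followed by a whitespace character without capturing them as a separate group, and the whole group is optional

\d+\.\d+ matches a decimal number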
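
One caveat: (?:\d\s)? allows only a single digit before the space, so an amount such as '€ 12 505.93' would come back as '2 505.93'. A possible generalization, offered as a sketch rather than as part of the answer above, allows any number of one-to-three-digit groups each followed by whitespace. The names PATTERN and to_float are illustrative:

```python
import re

# Zero or more "1-3 digits + whitespace" groups, then the decimal number.
PATTERN = re.compile(r'((?:\d{1,3}\s)*\d+\.\d+)')

def to_float(match):
    """Remove all whitespace from a matched amount and convert it to float."""
    return float(re.sub(r'\s', '', match))

prices = [to_float(m) for m in PATTERN.findall('€ 12 505.93 and € 505.93')]
print(prices)
```

This still assumes every price has a decimal part; amounts written without one would not match.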

Answer 3

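A regex looks like the best tool here. Also, your question specifies the desired output as floats rather than strings, so I convert to float after joining the regex matches.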

import re

def bar(string):
    return float(''.join(re.findall(r"[\d.]", string)))

a = '€ 1 505.93 net'
b = '€ 505.93 net'

print(bar(a))
print(bar(b))

Output:

1505.93
505.93

If you also want to handle a comma as the decimal separator for locale compatibility, you can use replace() to swap it for a period:

def bar(string):
    return float(''.join(re.findall(r"[\d.,]", string)).replace(',', '.'))

c = '€ 6 812,51 net'
print(bar(c))

Output:

6812.51
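
Note that this version of bar() assumes a string never contains both a comma and a period: an input like '€ 1,505.93' would become '1.505.93' and float() would raise ValueError. If mixed formats are possible, one hedged approach is to treat only the last separator as the decimal mark. The sketch below works under that assumption, and parse_price is a hypothetical name, not part of the answer above:

```python
import re

def parse_price(string):
    """Parse a price, treating the last '.' or ',' as the decimal mark."""
    # Keep only digits and the two candidate separators.
    cleaned = ''.join(re.findall(r'[\d.,]', string))
    # Position of the last separator, or -1 if there is none.
    idx = max(cleaned.rfind('.'), cleaned.rfind(','))
    if idx == -1:
        return float(cleaned)
    # Everything before the last separator is the whole part;
    # any earlier separators are thousands marks and are dropped.
    whole = re.sub(r'[.,]', '', cleaned[:idx])
    return float(whole + '.' + cleaned[idx + 1:])

print(parse_price('€ 1 505.93 net'))
print(parse_price('€ 6 812,51 net'))
print(parse_price('€ 1,505.93 net'))
```

The trade-off is that a string with a thousands separator but no decimals, such as '€ 1,505', would be read as 1.505, so this only fits data where every price carries a decimal part.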

How to correctly get the price from a string

3 answers: