Question

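I have a data frame named data, and I am trying to clean up one of its columns so that the prices can be converted to plain numeric values. This is how I filter the column to find the incorrect values:

data[data['Incorrect_Price'].astype(str).str.contains('[A-Za-z]')]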

    Incorrect_Price    Occurences   errors
23  99 cents                732       1
50  3 dollars and 49 cents  211       1
72  the price is 625        128       3
86  new price is 4.39       19        2
138 4 bucks                 3         1
199 new price 429           13        1
225 price is 9.99           5         1
240 new price is 499        8         2

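I tried data['Incorrect_Price'][20:51].str.findall(r"(\d+) dollars") and data['Incorrect_Price'][20:51].str.findall(r"(\d+) cents") to find the rows that contain "cents" and "dollars", so that I could extract the dollar and cent amounts. However, I could not work out how to combine the two and run them over every row of the data frame.

I would like the results to look like this: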

    Incorrect_Price         Desired  Occurences  errors
23  99 cents                .99      732         1
50  3 dollars and 49 cents  3.49     211         1
72  the price is 625        625      128         3
86  new price is 4.39       4.39     19          2
138 4 bucks                 4.00     3           1
199 new price 429           429      13          1
225 price is 9.99           9.99     5           1
240 new price is 499        499      8           2

Answer 1

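As long as the strings in Incorrect_Price keep the structure shown in the example (the numbers are not spelled out as words), the task can be solved relatively easily.

With a regular expression, you can use the approach from a similar SO question to extract the numeric part together with an optional "cent"/"cents" or "dollar"/"dollars". The two main differences are that you are looking for pairs of a numeric value and "cent[s]" or "dollar[s]", and that such pairs may occur more than once.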

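import re


def extract_number_currency(text):
    # Find every (value, currency) pair, e.g. ('3', 'dollar') and ('49', 'cent').
    # A raw string keeps the backslashes in \d and \s intact.
    prices = re.findall(r'(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?', text)

    result = 0.0
    for value, currency in prices:
        partial = float(value)
        if currency == 'cent':
            # Cents contribute hundredths of a dollar.
            result += partial / 100
        else:
            result += partial

    return result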


print(extract_number_currency('3 dollars and 49 cent'))

3.49

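Now you need to apply this function to all the incorrect values in the column that holds the prices written out in words. For simplicity, I apply it to every value here (but I am sure you can handle the subset yourself):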

data['Desired'] = data['Incorrect_Price'].apply(extract_number_currency)

Voila!
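The function only recognises amounts followed by "cent" or "dollar", so rows such as "4 bucks" or "the price is 625" come back as 0.0. The following is a minimal sketch of one way to cover them and to touch only the rows that contain letters. It rests on assumptions that are not in the answer: the name extract_price is hypothetical, "buck" is treated as a synonym for "dollar", a bare number is taken to be the price as written, and the sample row '2.50' is invented for illustration.

```python
import re

import pandas as pd

# Same idea as the answer's pattern, with "buck" added (an assumption).
CURRENCY_RE = re.compile(r'(\d*[.]?\d{1,2})\s*(cent|dollar|buck)s?')
# Fallback (an assumption): the first bare number in the text is the price.
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')


def extract_price(text):
    """Hypothetical variant of extract_number_currency with a fallback."""
    pairs = CURRENCY_RE.findall(text)
    if pairs:
        total = 0.0
        for amount, currency in pairs:
            # Cents count as hundredths; dollars and bucks count as units.
            total += float(amount) / 100 if currency == 'cent' else float(amount)
        return total
    match = NUMBER_RE.search(text)
    return float(match.group()) if match else float('nan')


# A few rows from the question plus one clean value added for illustration.
data = pd.DataFrame({'Incorrect_Price': ['99 cents', '3 dollars and 49 cents',
                                         'the price is 625', '4 bucks', '2.50']})

# Only rows whose text contains letters need the text-based extraction.
mask = data['Incorrect_Price'].astype(str).str.contains('[A-Za-z]')
# Clean rows convert directly; the others become NaN and are filled in below.
data['Desired'] = pd.to_numeric(data['Incorrect_Price'], errors='coerce')
data.loc[mask, 'Desired'] = data.loc[mask, 'Incorrect_Price'].apply(extract_price)
print(data)
```

The boolean mask is the same filter the question uses, so already-numeric rows never pass through the regular expressions.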

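Breaking down the regular expression '(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?'

There are two named capture groups, each written as (?P<name_of_the_capture_group> .... )

The first capture group, (?P<value>[\d]*[.]?[\d]{1,2}), captures:

[\d] - a digit

[\d]* - repeated 0 or more times

[.]? - followed by an optional (?) dot

[\d]{1,2} - followed by 1 to 2 digits

\s* - 0 or more whitespace characters between the number and the currency word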
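To make the first group concrete, here is a small sketch that tests it on its own against a few strings. The input strings are hypothetical probes, not values from the question, and re.fullmatch is used so that the whole string has to fit the group.

```python
import re

# First capture group of the pattern, taken on its own.
value_pattern = r'[\d]*[.]?[\d]{1,2}'

# Hypothetical inputs chosen to probe the group.
candidates = ['99', '.99', '4.39', '625', '4.395', '4.']

# fullmatch succeeds only if the entire string fits the pattern.
fits = {s: re.fullmatch(value_pattern, s) is not None for s in candidates}
for s, ok in fits.items():
    print(f'{s!r}: {ok}')
```

The first four strings fit, while '4.395' (three digits after the dot) and '4.' (no digit after the dot) do not. Inside re.findall the pattern is not anchored, so for such a string it can still match a shorter piece of the number.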

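Now the second capture group, which is much simpler: (?P<currency>cent|dollar)

cent|dollar - an alternation: either the string cent or the string dollar is captured

s? - the optional plural "s" of "cents" or "dollars"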
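As a quick check of the whole pattern, this sketch runs re.findall on a few strings taken from the question's table and prints what each one yields. The variable names are only illustrative.

```python
import re

# The answer's pattern, written as a raw string.
pattern = r'(?P<value>[\d]*[.]?[\d]{1,2})\s*(?P<currency>cent|dollar)s?'

# Sample strings taken from the question's table.
samples = ['99 cents', '3 dollars and 49 cents', '4 bucks', 'the price is 625']

# findall returns one (value, currency) tuple per match, in order of appearance.
matches = {text: re.findall(pattern, text) for text in samples}
for text, found in matches.items():
    print(f'{text!r}: {found}')
```

The first two strings produce [('99', 'cent')] and [('3', 'dollar'), ('49', 'cent')]. The last two produce an empty list because they contain neither "cent" nor "dollar", so extract_number_currency returns 0.0 for them and such rows need separate handling.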
