Question

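Say I have the following strings (inputs) in Python:

1) "$ 1,350,000" 2) "1.35 MM $" 3) "$ 1.35 M" 4) 1350000 (now it is a numeric value)

Obviously the number is the same although the string representation is different. How can I achieve string matching, or in other words, classify them as equal strings?

One way would be to model the possible patterns using regular expressions. However, there might be a case that I haven't thought of.

Does anyone see an NLP solution to this problem?

Thanks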

Answer 1

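This is not an NLP problem, just a job for regular expressions, plus some code that ignores token order and a dictionary lookup of known abbreviations (a small ontology), such as "MM".

First, you can ignore the '$' character entirely here (unless you need to disambiguate against other currencies or symbols).
So it all boils down to parsing number formats and mapping 'M'/'MM'/'million' -> 1e6 multiplier, and doing the parsing in an order-independent way (e.g. the multiplier, currency symbol and amount can appear in any relative order, or not at all).

Here's some working code: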

def parse_numeric_string(s):
    """Parse input such as "$ 1.35 M" into a (currency, amount) tuple."""

    # Accept plain numbers as well as strings
    if isinstance(s, (int, float)):
        s = str(s)

    amount = None
    currency = ''
    multiplier = 1.0

    for token in s.split():

        token = token.lower()

        if token in ['$', '€', '£', '¥']:
            currency = token
        # Extract multipliers from their string names/abbrevs
        elif token in ['million', 'm', 'mm']:
            multiplier = 1e6
            # ... or you could use a dict:
            # multiplier = {'million': 1e6, 'm': 1e6...}.get(token, 1.0)
        else:
            # Assume anything else is some string format of number/int/float/scientific
            try:
                amount = float(token.replace(',', ''))
            except ValueError:
                pass  # Process your parse failures...

    if amount is None:
        raise ValueError('no numeric amount found in %r' % s)

    # Return a tuple, or whatever you prefer
    return (currency, amount * multiplier)

print(parse_numeric_string("$ 1,350,000"))  # ('$', 1350000.0)
print(parse_numeric_string("1.35 MM $"))    # ('$', 1350000.0)
print(parse_numeric_string("$ 1.35 M"))     # ('$', 1350000.0)
print(parse_numeric_string(1350000))        # ('', 1350000.0)

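For internationalization, be aware that , and . can swap roles as thousands separator and decimal point, or that ' can be the thousands separator (the answer attributes this to Arabic). There is also a third-party Python package, 'parse', e.g. parse.parse('{fn}', '1,350,000') (it is the opposite of format()).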
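The separator issue can be handled without any locale machinery by passing the separator characters explicitly. The following is only a minimal sketch: the function name `to_float` and its `thousands`/`decimal` parameters are made up for this illustration and are not part of any library.

```python
def to_float(text, thousands=",", decimal="."):
    """Convert a number string to float, given explicit separator characters.

    Hypothetical helper for illustration; the caller must know which
    convention the input uses, since "1,350" alone is ambiguous.
    """
    cleaned = text.strip()
    if thousands:
        # Drop grouping characters first
        cleaned = cleaned.replace(thousands, "")
    if decimal != ".":
        # Then map the decimal mark to the '.' that float() expects
        cleaned = cleaned.replace(decimal, ".")
    return float(cleaned)


print(to_float("1,350,000"))                                  # ',' groups, '.' decimal
print(to_float("1.350.000,50", thousands=".", decimal=","))   # '.' groups, ',' decimal
print(to_float("1'350'000", thousands="'"))                   # apostrophe grouping
```

The design choice here is to make the convention an explicit argument instead of guessing it from the string, because guessing fails on short inputs such as "1,350".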
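Using an ontology or a general NLP library would probably be more trouble than it's worth. For example, you would need to disambiguate 'mm' as in "accounting abbreviation for millions" from "millimeters", and from 'Mm' as in 'megameters, 10^6 meters', which is an almost-never-used but valid metric unit of distance. So for this task, less generality may be better.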
You could also use a dictionary-based approach to map other currency symbols and names, e.g. 'dollars', 'US', 'USD', 'US$', 'EU'...
Here I tokenized on whitespace, but you might want to tokenize on any word/digit/whitespace/punctuation boundary, so that you can parse e.g. USD1.3m
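As a rough sketch of that last idea, a regex tokenizer can split on the boundaries between digits, letters and currency symbols, so that inputs without spaces still parse. Everything named here (`TOKEN_RE`, `MULTIPLIERS`, `CURRENCIES`, `parse_amount`) is hypothetical and exists only for this example; the lookup tables are deliberately tiny, and mapping '¥' to JPY is an assumption since the symbol is ambiguous.

```python
import re

# Hypothetical lookup tables for this sketch; extend them as needed.
MULTIPLIERS = {"k": 1e3, "m": 1e6, "mm": 1e6, "million": 1e6}
CURRENCIES = {"$": "USD", "usd": "USD", "€": "EUR", "eur": "EUR",
              "£": "GBP", "¥": "JPY"}  # assumption: '¥' treated as JPY

# A token is a number (digits with optional ',' groups and a '.' fraction),
# a run of letters, or a single currency symbol.
TOKEN_RE = re.compile(r"\d[\d,]*(?:\.\d+)?|[A-Za-z]+|[$€£¥]")


def parse_amount(text):
    """Return (currency_code_or_None, amount) regardless of token order."""
    currency, amount, multiplier = None, None, 1.0
    for token in TOKEN_RE.findall(str(text)):
        key = token.lower()
        if key in CURRENCIES:
            currency = CURRENCIES[key]
        elif key in MULTIPLIERS:
            multiplier = MULTIPLIERS[key]
        elif token[0].isdigit():
            amount = float(token.replace(",", ""))
    if amount is None:
        raise ValueError(f"no numeric amount found in {text!r}")
    return currency, amount * multiplier


for raw in ["$ 1,350,000", "1.35 MM $", "$ 1.35 M", 1350000, "USD1.3m"]:
    print(raw, "->", parse_amount(raw))
```

Comparing the second element of the returned tuples is then enough to classify the four inputs from the question as equal.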

Answer 2

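Consider creating a routine that matches the string input (which can be in any of the four given formats) against four regex patterns, as follows:

For "$ 1,350,000":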

(?<=\$ )([\d,]+)

For "1.35 MM $":

([\d\.]+)(?= MM \$)

For "$ 1.35 M":

(?<=\$ )([\d\.]+)(?= M)

For 1350000:

([\d]+)

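Then convert these matches to integers so they can be returned and compared with the other matches.

The regex patterns provided match only the digits, commas and decimal points of the string (through the use of lookahead and lookbehind).

Note: depending on which regex matches, the extracted number will need to be handled accordingly (for example, the 1.35 from "$ 1.35 M" needs to be multiplied by 1000000 before being returned).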
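One possible shape for such a routine is sketched below. It reuses the four patterns exactly as given and pairs each with the multiplier its format implies. The names `PATTERNS` and `normalize` are made up for this sketch, and the ordering (most specific pattern first) is an assumption added here: without it, the plain `(?<=\$ )([\d,]+)` pattern would match only the leading "1" of "$ 1.35 M".

```python
import re

# (pattern, multiplier) pairs, tried from most to least specific.
PATTERNS = [
    (re.compile(r"([\d\.]+)(?= MM \$)"), 1_000_000),      # "1.35 MM $"
    (re.compile(r"(?<=\$ )([\d\.]+)(?= M)"), 1_000_000),  # "$ 1.35 M"
    (re.compile(r"(?<=\$ )([\d,]+)"), 1),                 # "$ 1,350,000"
    (re.compile(r"([\d]+)"), 1),                          # 1350000
]


def normalize(value):
    """Return the amount as an int using the first pattern that matches."""
    text = str(value)
    for pattern, multiplier in PATTERNS:
        match = pattern.search(text)
        if match:
            number = float(match.group(1).replace(",", ""))
            # round() guards against float noise before converting to int
            return int(round(number * multiplier))
    raise ValueError(f"unrecognized amount format: {text!r}")


values = [normalize(v) for v in ["$ 1,350,000", "1.35 MM $", "$ 1.35 M", 1350000]]
print(values)
print(len(set(values)) == 1)  # True when all inputs denote the same amount
```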

Answer 3

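Interesting problem. Here is my solution: you could write a small class that looks for potential matches, splits them into an amount and a unit, and then tries to convert them: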

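import re
import locale
from locale import atof

# Use US conventions (',' for thousands, '.' for decimals).
# The locale name is platform-dependent; some systems need e.g. 'en_US.UTF-8'.
locale.setlocale(locale.LC_ALL, 'en_US')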

data = """
Say I have the following strings (inputs) in python:

1) "$ 1,350,000" 2) "1.35 MM $" 3) "$ 1.35 M" 4) 1350000 (now it is a numeric value)

Obviously the number is the same although the string representation is different. How can I achieve a string matching or in other words classify them as equal strings?

One way would be to model -using regular expressions- the possible patterns. However there might be a case that I haven't thought of.

Does someone see a NLP solution to this problem?
Thanks

here might be some other digits: 1.234
"""

class DigitMiner:
    def __init__(self):
        self.numbers = []

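    def convert(self, amount, unit):
        # Scale the amount by the multiplier its unit abbreviation implies
        if unit in ['M', 'MM']:
            amount *= 10**6
        elif unit in ['K']:
            amount *= 10**3
        return amount

    def search(self, string):
        # An amount (digits with , or . inside), optionally followed by a unit.
        # 'K' is included so the K branch in convert() can actually be reached.
        rx = re.compile(r'''
            (?P<amount>\b\d[\d.,]+\b)\s*
            (?P<unit>MM?|K)?''', re.VERBOSE)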

        for match in rx.finditer(string):
            amount = self.convert(atof(match.group('amount')), match.group('unit'))
            if amount not in self.numbers:
                self.numbers.append(amount)


dm = DigitMiner()
dm.search(data)
print(dm.numbers)

This yields:

[1350000.0, 1.234]

Note that locale.atof() converts a string to a float according to the LC_NUMERIC settings.

Python - Matching and parsing strings containing numeric/currency amounts