Question

I want to tokenize all currency symbols using NLTK tokenization with a regular expression.

For example, these are my sentences:

The price of it is $5.00.
The price of it is RM5.00.
The price of it is €5.00.

I used this regular expression:

pattern = r'''(['()""\w]+|\.+|\?+|\,+|\!+|\$?\d+(\.\d+)?%?)'''
tokenize_list = nltk.regexp_tokenize(sentence, pattern)

But as we can see, it only handles $.

I tried using \p{Sc} as explained in "What is regex for currency symbol?", but it still doesn't work for me.

Answer 1

Try padding the numbers with spaces so they are separated from the currency symbols, then tokenize:

>>> import re
>>> from nltk import word_tokenize
>>> sents = """The price of it is $5.00.
... The price of it is RM5.00.
... The price of it is €5.00.""".split('\n')
>>>
>>> for sent in sents:
...     # Find each number, then pad it with spaces so the symbol becomes its own token.
...     numbers_in_sent = re.findall(r"[-+]?\d+\.?\d*", sent)
...     for num in numbers_in_sent:
...         sent = sent.replace(num, ' ' + num + ' ')
...     print(word_tokenize(sent))
...
['The', 'price', 'of', 'it', 'is', '$', '5.00', '.']
['The', 'price', 'of', 'it', 'is', 'RM', '5.00', '.']
['The', 'price', 'of', 'it', 'is', '€', '5.00', '.']
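As for why \p{Sc} had no effect: Python's built-in re module does not support \p{...} Unicode property classes (the third-party regex module does). If a single-pass regular expression is preferred over padding, one possible workaround is to build the currency character class by hand from unicodedata. The sketch below is only an illustration under that assumption; the names CURRENCY_CHARS, TOKEN_RE and tokenize are made up for this example, and it uses plain re.findall rather than NLTK.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode general category is "Sc"
# (currency symbol). This stands in for \p{Sc}, which `re` lacks.
CURRENCY_CHARS = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Sc"
)

# Alternatives are tried left to right:
#   1. a number with an optional decimal part and optional percent sign
#   2. a single currency symbol
#   3. a run of letters (this also catches letter codes such as "RM")
#   4. any other single non-space, non-word character (punctuation)
TOKEN_RE = re.compile(
    r"\d+(?:\.\d+)?%?"
    r"|[" + re.escape(CURRENCY_CHARS) + r"]"
    r"|[^\W\d_]+"
    r"|[^\w\s]"
)


def tokenize(sentence):
    """Return the list of tokens found in `sentence` (hypothetical helper)."""
    return TOKEN_RE.findall(sentence)


for s in ("The price of it is $5.00.",
          "The price of it is RM5.00.",
          "The price of it is €5.00."):
    print(tokenize(s))
```

Note that letter codes such as RM are not Unicode currency symbols, so they come out as ordinary letter tokens, and a word that mixes letters and digits will be split at the boundary. That is a deliberate simplification here so that "RM5.00" separates into "RM" and "5.00".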

How to tokenize all currency symbols using regex in Python?
