Question: How to add additional currency characters in spaCy

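I have documents in which the character \u0080 is used as the euro sign. I want to add this and other characters to the currency symbol list so that currency entities are picked up by the spaCy NER. What is the best way to do this?

In addition, I have cases where money is written as CAD 5,000, and these are not picked up by the NER as MONEY. What is the best way to handle this: training the NER, or adding CAD as a currency symbol?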

Answer 1

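1. The '\u0080' issue

First, the interpretation of the '\u0080' character seems to depend on the platform you are using: it is not printable on Windows 7 machines, but it works on Linux machines.

For completeness, I'll assume that you get the text from an HTML document containing the '&#x80;' escape sequence (which should display as € in a browser), the '\u0080' character, and some other arbitrary symbols that we treat as currencies.

Before passing the text to spaCy, we can call html.unescape, which translates &#x80; to €. That symbol is in turn recognized as a currency by the default configuration.

    import html

    text_html = ("I just found out that CAD 1,000 is about 641.3 &#x80. "
                 "Some people call it 641.3 \u0080. "
                 "Fantastic! But in the U.K. I'd rather pay 344 or \U0001F33B56.")
    text = html.unescape(text_html)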

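Second, if there are symbols that are not recognized as currencies, such as 🎅 and 🌻 (\U0001F385 and \U0001F33B in the code), we can change the lex_attr_getters in the Defaults of the language we use so that they qualify as currencies.

This consists of replacing the lex_attr_getters[IS_CURRENCY] function with a custom function that holds a list of the symbols we treat as currencies.

    import string

    from spacy.lang.en import EnglishDefaults
    from spacy.symbols import IS_CURRENCY

    def is_currency_custom(text):
        # Stripping punctuation
        table = str.maketrans({key: None for key in string.punctuation})
        text = text.translate(table)

        all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
        if text in all_currencies:
            return True
        return is_currency_original(text)

    # Keep a reference to the original is_currency function
    is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
    # Assign a new function for IS_CURRENCY
    EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom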
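As a complementary sketch (not part of the approach above), the raw '\u0080' character can also be normalized before the text reaches spaCy. html.unescape only rewrites character references such as &#x80;, so a literal U+0080 passes through unchanged. The helper below, with the hypothetical name normalize_euro, assumes that U+0080 in your documents always stands for the euro sign, which is what byte 0x80 encodes in Windows-1252.

```python
import html

EURO = "\u20ac"  # the euro sign


def normalize_euro(text: str) -> str:
    """Unescape HTML character references, then map a literal U+0080 to the euro sign.

    Assumption: U+0080 in the input always stands for the euro sign.
    """
    unescaped = html.unescape(text)  # turns &#x80; into the euro sign
    return unescaped.replace("\u0080", EURO)


sample = "641.3 &#x80; or 641.3 \u0080"
# ascii() keeps the output printable on consoles that cannot display the euro sign
print(ascii(normalize_euro(sample)))  # '641.3 \u20ac or 641.3 \u20ac'
```

With this normalization in place, "\u0080" would no longer need to appear in the custom currency list, since the default configuration already recognizes €.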

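2. The CAD issue

For this one, a simple solution is to define a special case. We tell the tokenizer that wherever it encounters CAD, it is a special case and should be handled as we instruct. Among other things, we can set the IS_CURRENCY flag.

    from spacy.symbols import ORTH, TAG, IS_CURRENCY

    # nlp is the loaded pipeline, e.g. nlp = spacy.load('en')
    special_case = [{ORTH: u'CAD', TAG: u'$', IS_CURRENCY: True}]
    nlp.tokenizer.add_special_case(u'CAD', special_case)

Note that this is not perfect, because you may get false positives. Imagine a document from a Canadian company selling CAD drawing services. So this is good, but not great.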

If we want to be more precise, we can create a Matcher object that looks for patterns such as CURRENCY[SPACE]NUMBER or NUMBER[SPACE]CURRENCY and associates the MONEY entity with them.

    from spacy.matcher import Matcher

    matcher = Matcher(nlp.vocab)
    MONEY = nlp.vocab.strings['MONEY']

    # This is the matcher callback that sets the MONEY entity
    def add_money_ent(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        doc.ents += ((MONEY, start, end),)

    matcher.add(
        'MoneyRedefined',
        add_money_ent,
        [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
        [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
    )

Then apply it to your doc object with matcher(doc). The 'OP' key makes the pattern optional by allowing it to match 0 or 1 times.

3. Full code

    # Uses the spaCy 2.x API, e.g. spacy.load('en') and Matcher.add(key, callback, *patterns)
    import html
    import string

    import spacy
    from spacy import displacy
    from spacy.lang.en import EnglishDefaults
    from spacy.matcher import Matcher
    from spacy.symbols import IS_CURRENCY


    def is_currency_custom(text):
        # Stripping punctuation
        table = str.maketrans({key: None for key in string.punctuation})
        text = text.translate(table)

        all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
        if text in all_currencies:
            return True
        return is_currency_original(text)


    # Keep a reference to the original is_currency function
    is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
    # Assign a new function for IS_CURRENCY
    EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom

    nlp = spacy.load('en')

    matcher = Matcher(nlp.vocab)
    MONEY = nlp.vocab.strings['MONEY']


    # This is the matcher callback that sets the MONEY entity
    def add_money_ent(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        doc.ents += ((MONEY, start, end),)


    matcher.add(
        'MoneyRedefined',
        add_money_ent,
        [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
        [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
    )

    text_html = ("I just found out that CAD 1,000 is about 641.3 &#x80. "
                 "Some people call it 641.3 \u0080. "
                 "Fantastic! But in the U.K. I'd rather pay 344 or \U0001F33B56.")
    text = html.unescape(text_html)

    doc = nlp(text)
    matcher(doc)

    displacy.serve(doc, style='ent')

This gives the expected result.
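To see what the CURRENCY[SPACE]NUMBER and NUMBER[SPACE]CURRENCY patterns are meant to accept and reject without loading a spaCy model, here is a rough regular-expression sketch of the same idea. It is an illustration only, not the spaCy Matcher: the CURRENCY and NUMBER expressions and the find_money name are assumptions made for this example, and a regex is far cruder than spaCy's IS_CURRENCY and LIKE_NUM token attributes.

```python
import re

# Assumed for illustration: a tiny set of currency markers and a simple number shape.
CURRENCY = r"(?:CAD|[$€£])"
NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"

# Either CURRENCY, optional space, NUMBER or NUMBER, optional space, CURRENCY.
MONEY_RE = re.compile(
    rf"(?<!\w){CURRENCY}\s?{NUMBER}|{NUMBER}\s?{CURRENCY}(?!\w)"
)


def find_money(text: str) -> list:
    """Return the substrings where a currency marker sits next to a number."""
    return MONEY_RE.findall(text)


print(find_money("I paid CAD 1,000 for it"))       # ['CAD 1,000']
print(find_money("We sell CAD drawing services"))  # []
```

Requiring a number next to the currency code is what keeps "CAD drawing services" from being treated as money, which is the false positive the special-case approach can produce.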
