In sentences that contain hashtags, such as tweets, spaCy's tokenizer splits the hashtag into two tokens:
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]
Output:
[This, is, a, #, sentence, .]
I would like the hashtag to be tokenized as:
[This, is, a, #sentence, .]
Is this possible?
Thanks
Answer 0 (score: 2)
>>> import re
>>> import spacy
>>> nlp = spacy.load('en')
>>> sentence = u'This is my twitter update #MyTopic'
>>> parsed = nlp(sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'#', u'MyTopic']
>>> new_sentence = re.sub(r'#(\w+)', r'ZZZPLACEHOLDERZZZ\1', sentence)
>>> new_sentence
u'This is my twitter update ZZZPLACEHOLDERZZZMyTopic'
>>> parsed = nlp(new_sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'ZZZPLACEHOLDERZZZMyTopic']
>>> [x.replace(u'ZZZPLACEHOLDERZZZ', '#') for x in [token.text for token in parsed]]
[u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']
Update: You can use a regex to find the spans of tokens you want to keep as a single token, and merge them with the span.merge method described here: https://spacy.io/docs/api/span#merge
Merge example:
>>> import spacy
>>> import re
>>> nlp = spacy.load('en')
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')]
>>> indexes = [m.span() for m in re.finditer('#\w+',my_str,flags=re.IGNORECASE)]
>>> indexes
[(15, 25), (26, 36)]
>>> for start,end in indexes:
... parsed.merge(start_idx=start,end_idx=end)
...
#MyHashOne
#MyHashTwo
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')]
>>>
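A side note for readers on newer spaCy versions: Span.merge/Doc.merge were deprecated in v2.1 and removed in v3, so the same character-offset approach is now expressed with the retokenizer context manager. A minimal sketch, assuming the English model is installed as en_core_web_sm:

import re
import spacy

nlp = spacy.load('en_core_web_sm')  # assumes this model is installed
my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
doc = nlp(my_str)

# locate hashtags by character offset, map them to token spans, and merge
with doc.retokenize() as retokenizer:
    for m in re.finditer(r'#\w+', my_str):
        span = doc.char_span(m.start(), m.end())  # None if the offsets don't align with token boundaries
        if span is not None:
            retokenizer.merge(span)

print([(t.text, t.pos_) for t in doc])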
Answer 1 (score: 1)
This adds on to @DhruvPathak's great answer and is a shameless copy from the GitHub thread linked below (with an even better answer from @csvance). Since v2.0, spaCy offers the add_pipe method, which means you can define @DhruvPathak's answer in a function and add that step (conveniently) to your nlp processing pipeline, as shown below.
Quotation starts here:
def hashtag_pipe(doc):
    merged_hashtag = False
    while True:
        for token_index, token in enumerate(doc):
            if token.text == '#':
                if token.head is not None:
                    start_index = token.idx
                    end_index = start_index + len(token.head.text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc
nlp = spacy.load('en')
nlp.add_pipe(hashtag_pipe)
doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'
Quotation ends here; see how to add hashtags to the part of speech tagger #503 for the full thread.
PS It goes without saying when reading the code, but for copy & paste: don't disable the parser :)
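For completeness: on spaCy v3, doc.merge no longer exists and custom pipeline components must be registered by name before add_pipe can find them. The following is a rough sketch of how the same idea could be wired up there; the component name merge_hashtags and the model en_core_web_sm are illustrative placeholders, not part of the quoted answer.

import spacy
from spacy.language import Language
from spacy.util import filter_spans

@Language.component("merge_hashtags")  # the component name is arbitrary
def merge_hashtags(doc):
    # collect each '#' token together with the token that immediately follows it
    spans = [doc[t.i:t.i + 2] for t in doc[:-1] if t.text == '#' and not t.whitespace_]
    with doc.retokenize() as retokenizer:
        for span in filter_spans(spans):  # drop overlapping spans, just in case
            retokenizer.merge(span)
    return doc

nlp = spacy.load('en_core_web_sm')  # assumes this English model is installed
nlp.add_pipe("merge_hashtags", first=True)  # run before the tagger and parser
doc = nlp("twitter #hashtag")
print([t.text for t in doc])  # ['twitter', '#hashtag']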
Answer 2 (score: 0)
I found this on GitHub; it uses spaCy's Matcher:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

doc = nlp('This is a #sentence. Here is another #hashtag. #The #End.')
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
    hashtags.append(doc[start:end])

for span in hashtags:
    span.merge()

print([t.text for t in doc])
Output:
['This', 'is', 'a', '#sentence', '.', 'Here', 'is', 'another', '#hashtag', '.', '#The', '#End', '.']
The hashtags themselves are also available in the hashtags list:
print(hashtags)
Output:
[#sentence, #hashtag, #The, #End]
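Note that Span.merge is deprecated as of spaCy v2.1 and gone in v3. The same Matcher result can be merged through the retokenizer, with filter_spans guarding against overlapping matches. A small sketch, assuming an English pipeline is already loaded as nlp and using the v3 Matcher.add signature:

from spacy.matcher import Matcher
from spacy.util import filter_spans

matcher = Matcher(nlp.vocab)
matcher.add('HASHTAG', [[{'ORTH': '#'}, {'IS_ASCII': True}]])  # v3: patterns passed as a list, no callback

doc = nlp('This is a #sentence. Here is another #hashtag. #The #End.')
spans = [doc[start:end] for _, start, end in matcher(doc)]
with doc.retokenize() as retokenizer:
    for span in filter_spans(spans):  # keep the longest, non-overlapping matches
        retokenizer.merge(span)

print([t.text for t in doc])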
Answer 3 (score: 0)
I spent quite a bit of time on this and figured I'd share my findings: subclassing the tokenizer and adding the hashtag regex to the default URL_PATTERN was the easiest solution for me; additionally, I added a custom extension that matches on hashtags so they can be identified:
import re
import spacy
from spacy.language import Language
from spacy.tokenizer import Tokenizer
from spacy.tokens import Token
nlp = spacy.load('en_core_web_sm')
def create_tokenizer(nlp):
    # contains the regex to match all sorts of urls:
    from spacy.lang.tokenizer_exceptions import URL_PATTERN

    # spacy defaults: when the standard behaviour is required, they
    # need to be included when subclassing the tokenizer
    prefix_re = spacy.util.compile_prefix_regex(Language.Defaults.prefixes)
    infix_re = spacy.util.compile_infix_regex(Language.Defaults.infixes)
    suffix_re = spacy.util.compile_suffix_regex(Language.Defaults.suffixes)

    # extending the default url regex with regex for hashtags with "or" = |
    hashtag_pattern = r'''|^(#[\w_-]+)$'''
    url_and_hashtag = URL_PATTERN + hashtag_pattern
    url_and_hashtag_re = re.compile(url_and_hashtag)

    # set a custom extension to match if token is a hashtag
    hashtag_getter = lambda token: token.text.startswith('#')
    Token.set_extension('is_hashtag', getter=hashtag_getter)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=url_and_hashtag_re.match
                     )

nlp.tokenizer = create_tokenizer(nlp)

doc = nlp("#spreadhappiness #smilemore so_great@good.com https://www.somedomain.com/foo")
for token in doc:
    print(token.text)
    if token._.is_hashtag:
        print("-> matches hashtag")

# returns: "#spreadhappiness -> matches hashtag #smilemore -> matches hashtag so_great@good.com https://www.somedomain.com/foo"
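Once the extension is registered, the hashtags can also be pulled straight out of the document; a short illustrative snippet using the doc from the example above:

hashtags = [token.text for token in doc if token._.is_hashtag]
print(hashtags)  # ['#spreadhappiness', '#smilemore']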
Answer 4 (score: 0)
I also tried several ways to prevent spaCy from splitting hashtags or hyphenated words like "cutting-edge". My experience is that merging tokens afterwards can be problematic, because the POS tagger and the dependency parser have already based their decisions on the wrong tokens. Touching the prefix, infix, or suffix regexes is error-prone and complex, because you don't want your changes to produce side effects.
As mentioned before, the simplest way is indeed to modify the token_match function of the tokenizer. This is a re.match that identifies regular expressions that will not be split. I would rather extend spaCy's defaults than import the special URL pattern.
import re
import spacy
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load('en')
# get default pattern for tokens that don't get split
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
# add your patterns (here: hashtags and in-word hyphens)
re_token_match = f"({re_token_match}|#\w+|\w+-\w+)"
# overwrite token_match function of the tokenizer
nlp.tokenizer.token_match = re.compile(re_token_match).match
text = "@Pete: choose low-carb #food #eatsmart ;-) ??"
doc = nlp(text)
This yields:
['@Pete', ':', 'choose', 'low-carb', '#food', '#eatsmart', ';-)', '?', '?']
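Because the change happens at the tokenizer level, the tagger and parser see the merged tokens from the start, which is exactly the concern raised above. A quick way to check this (the output depends on the model, so it is not shown here):

for token in doc:
    print(token.text, token.pos_, token.dep_)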