Question

I am using spaCy for sentence segmentation on texts that may start with a paragraph number such as 1., A., 1.) or B.).

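For example:

text1 = "1. Dies ist ein Text"
text2 = "A. Dies ist ein Text"
text3 = "1.) Dies ist ein Text"
text4 = "B.) Dies ist ein Text"

In all of these texts, the paragraph number may be followed by \r, \n or \t.

With spaCy's sentence segmentation, the first sentence of each text comes out as:

**** 1.
**** A.
**** 1.)
**** B.)

I am therefore trying to add a rule for how sentences are split, by writing my own function that includes such a rule and passing that function to the nlp pipeline.

Unfortunately, I am having trouble defining this rule correctly.

This is what I have done so far:

import re
import spacy

def custom_sentensizer(doc):
    # A single letter or digit, optionally followed by a period, e.g. "1", "A", "1."
    boundary1 = re.compile(r'^[a-zA-Z0-9][\.]?$')
    # A closing parenthesis, as in "1.)"
    boundary2 = re.compile(r'\)')
    prev = doc[0].text
    length = len(doc)
    for i, token in enumerate(doc):
        # Do not start a new sentence right after a paragraph marker
        if (boundary1.match(prev) and i != (length - 1)) or (boundary2.match(token.text) and prev == "." and i != (length - 1)):
            doc[i+1].sent_start = False
        prev = token.text
    return doc

and passed this function to nlp:

nlp = spacy.load('de_core_news_sm')
nlp.add_pipe(custom_sentensizer, before='parser')
all_sentences = []
for text in texts:  # texts is list of list with each list including one text
    doc = nlp(text)
    sentences = [sent for sent in doc.sents]
    all_sentences.append(sentences)

For the texts above this seems to work, but only when they contain no \t, \r or \n.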

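So I have two questions:

How do I handle \t, \r and \n, given that they are sometimes valid sentence boundaries, i.e. I do not want to define a rule that simply excludes them?
My own function seems very complicated. Is there a simpler way?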
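
One direction I have been considering, as an untested idea rather than anything spaCy provides: strip the leading paragraph marker with a regular expression before the text reaches spaCy, so the segmenter never sees it. The pattern and the helper name strip_enumerator below are my own assumptions and are not part of spaCy; the sketch only covers markers shaped like the four examples above.

```python
import re

# Assumed marker shape: one letter or a run of digits, a period,
# an optional ")", then any whitespace (spaces, \t, \r, \n).
ENUMERATOR = re.compile(r'^\s*([A-Za-z]|\d+)\.\)?\s+')

def strip_enumerator(text):
    """Return (marker, rest); marker is '' when no leading marker is found."""
    match = ENUMERATOR.match(text)
    if match is None:
        return "", text
    # Drop the surrounding whitespace from the marker, keep the rest untouched.
    return match.group(0).strip(), text[match.end():]

marker, rest = strip_enumerator("1.)\r\n Dies ist ein Text")
print(repr(marker), repr(rest))
```

The remaining text could then go through nlp as usual and the marker be re-attached to the first sentence afterwards, although character offsets into the original text would no longer line up.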

Thanks for your help!

Python spaCy custom sentence splitting

0 answers: