Scala stop words

Date: 2013-01-10 17:16:22

Tags: scala text csv stop-words

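I want to remove the following from my tweet data:

Anything containing @ (e.g. @nike)

Anything beginning with ://

In my Scala script I have stop words, but they must match the output exactly. Is there a way to add a wildcard stop word such as @* or ://* that covers every possible variant of the words I want to remove?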

val source = CSVFile("output.csv")

val tokenizer = {
SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
CaseFolder() ~>                        // lowercase everything
MinimumLengthFilter(3)                 // take terms with >=3 characters 
}

val text = {
source ~>                              // read from the source file
Column(1) ~>                           // select column containing text
TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
TermCounter() ~>                       // collect counts (needed below)
TermMinimumDocumentCountFilter(30) ~>   // filter terms in <30 docs
TermStopListFilter(List("a", "and", "I", "but", "what")) ~> // stopword list
TermDynamicStopListFilter(10) ~>       // filter out 10 most common terms
DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms 
}

The tokenizer does not seem to pick up these non-alphabetic characters. It filters # without any problem, however. Thanks for your help!

1 answer:

Answer 0: (score: 1)

I am still missing a lot of details here, since I have never worked with stanford-nlp, but here is what I can tell.

I found some source code, from a forked scalanlp repository, that defines TermStopListFilter as
/**
 * Filters out terms from the given list.
 * 
 * @author dramage
 */
case class TermStopListFilter[ID:Manifest](stops : List[String])
extends Stage[LazyIterable[Item[ID,Iterable[String]]],LazyIterable[Item[ID,Iterable[String]]]] {
  override def apply(parcel : Parcel[LazyIterable[Item[ID,Iterable[String]]]]) : Parcel[LazyIterable[Item[ID,Iterable[String]]]] = {
    val newMeta = {
      if (parcel.meta.contains[TermCounts]) {
        parcel.meta + parcel.meta[TermCounts].filterIndex(term => !stops.contains(term)) + TermStopList(stops)
      } else {
        parcel.meta + this;
      }
    }

    Parcel(parcel.history + this, newMeta,
      parcel.data.map((doc : Item[ID,Iterable[String]]) => (doc.map(_.filter(term => !stops.contains(term))))));
  }

  override def toString =
    "TermStopListFilter("+stops+")";
}

In the code I see

if (parcel.meta.contains[TermCounts]) {
  parcel.meta + 
  parcel.meta[TermCounts].filterIndex(term => !stops.contains(term)) +
  TermStopList(stops)
}

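It seems that the TermCounts object obtained from the meta data filters the terms it contains by checking each term against the elements of stops with contains, which is an exact string comparison.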
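To make that limitation concrete, here is a minimal sketch in plain Scala that uses no scalanlp types. The term list and the pattern (@|://).* are made up for illustration and are not taken from the library.

```scala
// Plain-Scala illustration only; nothing here comes from scalanlp.
object ExactVsPattern {
  val stops = List("a", "and", "I", "but", "what")
  val terms = List("and", "@nike", "://t.co/abc", "shoes")

  // Exact matching, as in stops.contains(term): only "and" is dropped,
  // because "@nike" and "://t.co/abc" are not literally in the stop list.
  val exact: List[String] = terms.filter(term => !stops.contains(term))

  // Hypothetical pattern: any term that starts with "@" or "://".
  val pattern = "(@|://).*".r.pattern

  // Pattern matching: terms that match the pattern are dropped.
  val byPattern: List[String] = terms.filter(term => !pattern.matcher(term).matches())
}
```

Here exact still contains "@nike" and "://t.co/abc", while byPattern keeps only "and" and "shoes".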

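To filter with a more general expression, it should be enough to implement a new version of TermStopListFilter that uses a regular expression, for example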

import scala.util.matching.Regex

/**
 * Filters out terms that match the supplied regular expression.
 */
case class TermStopListFilter[ID:Manifest](regex: String)
extends Stage[LazyIterable[Item[ID,Iterable[String]]],LazyIterable[Item[ID,Iterable[String]]]] {
  override def apply(parcel : Parcel[LazyIterable[Item[ID,Iterable[String]]]]) : Parcel[LazyIterable[Item[ID,Iterable[String]]]] = {

    // compile the regular expression string into a java.util.regex.Pattern
    val pat = regex.r.pattern

    val newMeta = {
      if (parcel.meta.contains[TermCounts]) {
        // keep only the terms that do NOT match the pattern
        parcel.meta + parcel.meta[TermCounts].filterIndex(term => !pat.matcher(term).matches) // something should be added here??
      } else {
        parcel.meta + this; // is this still correct?
      }
    }

    Parcel(parcel.history + this, newMeta,
      parcel.data.map((doc : Item[ID,Iterable[String]]) => (doc.map(_.filter(term => !pat.matcher(term).matches)))));
  }

  override def toString =
    "TermStopListFilter("+regex+")";
}
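
One caveat: a filter like this runs on the terms after tokenization, so it only helps if the tokens still carry the "@" or "://" at that point. If the tokenizer strips those characters first, as the question suggests may be happening, another option is to clean the raw text before it enters the pipeline. The following is a minimal sketch in plain Scala; the object TweetCleaner, the helper stripMentionsAndUrls and the pattern are hypothetical names chosen for illustration, not part of scalanlp or stanford-nlp.

```scala
// Hypothetical pre-processing helper; not part of any library.
object TweetCleaner {
  // Matches any whitespace-delimited chunk that contains "@" or "://",
  // e.g. "@nike" or "http://t.co/abc" (assumed pattern, adjust as needed).
  private val unwanted = """\S*(@|://)\S*""".r

  // Removes the unwanted chunks, then collapses the leftover whitespace.
  def stripMentionsAndUrls(text: String): String =
    unwanted.replaceAllIn(text, "").trim.replaceAll("\\s+", " ")
}
```

For example, TweetCleaner.stripMentionsAndUrls("love my @nike shoes http://t.co/abc now") returns "love my shoes now". How to hook such a function into the CSV pipeline depends on the library and is not shown here.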