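I want to remove the following from my tweet data:
Anything containing @ (for example @nike)
Anything beginning with ://
In my Scala script I have stop words, but they must match the output exactly. Is there a way to add a stop word such as @* or ://* that covers every variation of the words I want removed?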
val source = CSVFile("output.csv")
val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  CaseFolder() ~>                        // lowercase everything
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}
val text = {
  source ~>                              // read from the source file
  Column(1) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(30) ~>  // filter terms in <30 docs
  TermStopListFilter(List("a", "and", "I", "but", "what")) ~> // stopword list
  TermDynamicStopListFilter(10) ~>       // filter out 10 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}
The tokenizer does not seem to pick up these non-alphabetic characters. It filters out # without any problem, though. Thanks for your help!
Answer 0 (score: 1)
I am still missing a lot of details here, since I have never worked with stanford-nlp, but this is what I can say.
I found some source that defines TermStopListFilter as
/**
 * Filters out terms from the given list.
 *
 * @author dramage
 */
case class TermStopListFilter[ID:Manifest](stops : List[String])
extends Stage[LazyIterable[Item[ID,Iterable[String]]],LazyIterable[Item[ID,Iterable[String]]]] {
  override def apply(parcel : Parcel[LazyIterable[Item[ID,Iterable[String]]]]) : Parcel[LazyIterable[Item[ID,Iterable[String]]]] = {
    val newMeta = {
      if (parcel.meta.contains[TermCounts]) {
        parcel.meta + parcel.meta[TermCounts].filterIndex(term => !stops.contains(term)) + TermStopList(stops)
      } else {
        parcel.meta + this;
      }
    }
    Parcel(parcel.history + this, newMeta,
      parcel.data.map((doc : Item[ID,Iterable[String]]) => (doc.map(_.filter(term => !stops.contains(term))))));
  }
  override def toString =
    "TermStopListFilter("+stops+")";
}
In the code I see
if (parcel.meta.contains[TermCounts]) {
  parcel.meta +
    parcel.meta[TermCounts].filterIndex(term => !stops.contains(term)) +
    TermStopList(stops)
}
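It seems that the TermCounts object obtained from the meta data filters the terms it holds by checking each term against the elements of stops with contains, which is why a stop word only removes terms that equal it exactly.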
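To illustrate the exact-match behaviour, here is a minimal standalone sketch that applies the same `!stops.contains(term)` predicate to a plain list. The object name `ExactMatchDemo` and the sample terms are made up for the illustration and are not part of the library or of the original data.

```scala
object ExactMatchDemo {
  // Hypothetical sample terms; not taken from the original data set.
  val terms: List[String] = List("and", "@nike", "http://t.co/abc", "shoes")

  // Stop words, including the fragments "@" and "://".
  val stops: List[String] = List("and", "@", "://")

  // Same predicate as in TermStopListFilter: a term is dropped only
  // when it is equal to one of the stop words.
  val kept: List[String] = terms.filter(term => !stops.contains(term))

  def main(args: Array[String]): Unit =
    println(kept) // List(@nike, http://t.co/abc, shoes)
}
```

Only "and" is removed. "@nike" and the URL survive because neither is equal to "@" or "://".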
To filter with a more general expression, it should be enough to implement a new version of TermStopListFilter that uses a regular expression, for example
import scala.util.matching.Regex
/**
 * Filters out terms that match the supplied regular expression.
 */
case class TermStopListFilter[ID:Manifest](regex: String)
extends Stage[LazyIterable[Item[ID,Iterable[String]]],LazyIterable[Item[ID,Iterable[String]]]] {
  override def apply(parcel : Parcel[LazyIterable[Item[ID,Iterable[String]]]]) : Parcel[LazyIterable[Item[ID,Iterable[String]]]] = {
    // extract the compiled pattern from the regular expression string
    val pat = regex.r.pattern
    val newMeta = {
      if (parcel.meta.contains[TermCounts]) {
        // keep only the terms that do NOT match the whole pattern
        parcel.meta + parcel.meta[TermCounts].filterIndex(term => !pat.matcher(term).matches) // something should be added here??
      } else {
        parcel.meta + this; // is this still correct?
      }
    }
    Parcel(parcel.history + this, newMeta,
      parcel.data.map((doc : Item[ID,Iterable[String]]) => (doc.map(_.filter(term => !pat.matcher(term).matches)))));
  }
  override def toString =
    "TermStopListFilter("+regex+")";
}
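The filtering predicate can be tried on its own, without the toolbox, to check a pattern before wiring it into a stage. The sketch below is only an illustration: the object `RegexStopDemo`, the helper `dropMatching`, the sample terms and the pattern `@.*|.*://.*` are all assumptions of mine. Note that `matches` tests the whole term, so the wildcards from the question have to be spelled out as `.*`.

```scala
object RegexStopDemo {
  // Hypothetical pattern: terms that start with '@', or that contain "://".
  // Matcher.matches requires the entire term to match, hence the ".*" parts.
  val stopPattern: java.util.regex.Pattern = "@.*|.*://.*".r.pattern

  // Keep only the terms that do not match the stop pattern.
  def dropMatching(terms: Iterable[String]): List[String] =
    terms.filter(term => !stopPattern.matcher(term).matches).toList

  def main(args: Array[String]): Unit =
    println(dropMatching(List("@nike", "http://t.co/abc", "running", "shoes")))
    // List(running, shoes)
}
```

Whether terms such as "@nike" still carry the "@" by the time they reach the filter depends on the tokenizer stages that run earlier in the pipeline, which this sketch does not cover.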