I am working with Apache Spark in a Java Maven project. I have a dataset of subreddit comments, as shown below:
+--------+--------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-----+------+------+-------+--------+----------+----------+------------+-----+------------+----------+------------+---+
|archived| author|author_flair_css_class|author_flair_text| body|controversiality|created_utc|distinguished|downs|edited|gilded| id| link_id| name| parent_id|retrieved_on|score|score_hidden| subreddit|subreddit_id|ups|
+--------+--------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-----+------+------+-------+--------+----------+----------+------------+-----+------------+----------+------------+---+
| true| bostich| null| null| test| 0| 1192450635| null| 0| false| 0|c0299an|t3_5yba3|t1_c0299an| t3_5yba3| 1427426409| 1| false|reddit.com| t5_6| 1|
| true|igiveyoumylife| null| null|much smoother.
...| 0| 1192450639| null| 0| false| 0|c0299ao|t3_5yba3|t1_c0299ao| t3_5yba3| 1427426409| 2| false|reddit.com| t5_6| 2|
| true| Arve| null| null|Can we please dep...| 0| 1192450643| null| 0| false| 0|c0299ap|t3_5yba3|t1_c0299ap|t1_c02999p| 1427426409| 0| false|reddit.com| t5_6| 0|
| true| [deleted]| null| null| [deleted]| 0| 1192450646| null| 0| false| 0|c0299aq|t3_5yba3|t1_c0299aq| t3_5yba3| 1427426409| 1| false|reddit.com| t5_6| 1|
| true| gigaquack| null| null|Oh, I see. Fancy ...| 0| 1192450646| null| 0| false| 0|c0299ar|t3_5yba3|t1_c0299ar|t1_c0299ah| 1427426409| 3| false|reddit.com| t5_6| 3|
| true| Percept| null| null| testing ...| 0| 1192450656| null| 0| false| 0|c0299as|t3_5yba3|t1_c0299as| t3_5yba3| 1427426409| 1| false|reddit.com| t5_6| 1|
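I parse the data and display only the body column. I want to clean (filter) the body column by removing [deleted] comments and comments written in non-Latin characters. How can I do this? (Note: the data size is 32 GB.)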
body:[Deleted]
body:How can I do that?
Answer 0 (score: 0)
The following snippet is written in Scala, but you can adapt it to Java. Use the Dataset.filter(..) method as follows:
import org.apache.spark.sql.DataFrame

// Keep rows whose body is not "[Deleted]" and contains at least one ASCII character.
val filteredData: DataFrame = dirtyData.
  filter(dirtyData("body") =!= "[Deleted]" && dirtyData("body").rlike("[\\x00-\\x7F]"))
Explanation
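dirtyData("body") =!= "[Deleted]"
removes all rows where the body column has the value [Deleted]. The comparison is exact and case-sensitive, and the sample output in the question shows the value as lowercase [deleted], so you will probably want to handle upper and lower case as well. See Column =!=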
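dirtyData("body").rlike("[\\x00-\\x7F]")
removes all rows whose body contains no ASCII characters. Because rlike succeeds when the pattern is found anywhere in the string, a body that mixes ASCII and non-ASCII characters is still kept. I have not researched this part much, so you may want to look for a better regex. See Column.rlike(..)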
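Since the question is about a Java project, the two conditions above can be sketched in plain Java to show how they behave on sample strings. This is only an illustration under stated assumptions: the class and method names (BodyFilter, isDeletedMarker, hasAnyAscii, isAsciiOnly, keep) are hypothetical, and the anchored pattern ^[\x00-\x7F]*$ is a stricter alternative to the answer's regex, not something the answer proposed. Note that ASCII-only is narrower than "Latin": accented letters such as é are rejected as well.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; only the "[deleted]" marker and the
// unanchored ASCII regex come from the thread.
class BodyFilter {
    // The answer's pattern: true if at least one ASCII character is present.
    private static final Pattern ANY_ASCII = Pattern.compile("[\\x00-\\x7F]");
    // Stricter assumption: every character must be ASCII (also matches "").
    private static final Pattern ASCII_ONLY = Pattern.compile("^[\\x00-\\x7F]*$");

    // Case-insensitive check for the "[deleted]" placeholder.
    static boolean isDeletedMarker(String body) {
        return body != null && body.trim().equalsIgnoreCase("[deleted]");
    }

    // find() looks for the pattern anywhere in the string.
    static boolean hasAnyAscii(String body) {
        return body != null && ANY_ASCII.matcher(body).find();
    }

    static boolean isAsciiOnly(String body) {
        return body != null && ASCII_ONLY.matcher(body).find();
    }

    // A row is kept when it is ASCII-only and is not the deleted placeholder.
    static boolean keep(String body) {
        return isAsciiOnly(body) && !isDeletedMarker(body);
    }

    public static void main(String[] args) {
        String mixed = "\u041F\u0440\u0438\u0432\u0435\u0442, world"; // Cyrillic word plus ASCII text
        String[] samples = {"[deleted]", "[Deleted]", "much smoother.", mixed};
        for (String s : samples) {
            System.out.println(s + " | anyAscii=" + hasAnyAscii(s) + " | keep=" + keep(s));
        }
    }
}
```

With Spark's Java API, the same idea could presumably be expressed as a single filter condition, for example lower(col("body")).notEqual("[deleted]").and(col("body").rlike("^[\\x00-\\x7F]*$")), using the static imports col and lower from org.apache.spark.sql.functions. Treat that expression as an untested adaptation rather than a verified solution.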
References