I have a Spark DataFrame containing the following data:
+---------------------------------------------------------------------------------------------------------------------------------------------------+
|text |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
|Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE |
|Serasi ade haha @AdeRais "@SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'." |
|Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram ? Music: This I Believe (The Creed) - Hillsong… |
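The DataFrame has a single column, 'text', whose values contain words prefixed with #, for example '#shutUpAndDANCE'.
I am trying to read each word and filter so that I am left with only a list of the words that start with a hash.
Code: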
# Get only the rows whose text contains '#'
hashtagList = sqlContext.sql("SELECT text FROM tweetstable WHERE text LIKE '%#%'")
# show() prints the table itself and returns None, so it is not wrapped in print
hashtagList.show(100, truncate=False)
#Process Rows to get the words
hashtagList = hashtagList.map(lambda p: p.text).map(lambda x: x.split(" ")).collect()
print hashtagList
The output is:
[[u'Know', u'what', u'you', u"don't", u'do', u'at', u'1:30', u'when', u'you', u"can't", u'sleep?', u'Music', u'shopping.', u'Now', u'I', u'want', u'to', u'dance.', u'#shutUpAndDANCE'], [...]]
Is there a way to filter out everything else and keep only the #words in my map stage?
hashtagList = hashtagList.map(lambda p: p.text).map(lambda x: x.split(" "))<ADD SOMETHING HERE TO FETCH ONLY #>.collect()
Answer 0: (score: 1)
Try this.
from __future__ import print_function  # must precede every other import
from pyspark.sql import Row

# Assumes an existing SparkSession named `spark`.
text = "Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE Serasi ade haha @AdeRais @SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'.Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram? Music: This I Believe (The Creed) - Hillsong"
df = spark.createDataFrame([Row(text=text)])
# Flatten each Row into its values, split on whitespace, keep the words starting with '#'.
words = df.rdd.flatMap(list).flatMap(lambda line: line.split()).filter(lambda word: word.startswith("#"))
words.foreach(print)
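Note that `startswith("#")` keeps whatever trails the tag inside the same whitespace-separated token, so the sample string above would yield words such as `#AMBER(@llama_ajol)` and `#flipagram?`. If that matters, a regular expression is one possible alternative. The following is a minimal sketch, assuming a hashtag is `#` followed by one or more word characters; `extract_hashtags` and `HASHTAG_RE` are hypothetical names, not part of PySpark, and the helper is plain Python so it can be checked without a cluster:

```python
import re

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")


def extract_hashtags(text):
    """Return every hashtag found in `text`, without trailing punctuation."""
    return HASHTAG_RE.findall(text)


sample = "and #AMBER(@llama_ajol) at KBS #JR50 #flipagram? Music"
print(extract_hashtags(sample))  # ['#AMBER', '#JR50', '#flipagram']
```

With the single-column `df` above, this could be applied as `df.rdd.flatMap(lambda row: extract_hashtags(row[0]))`.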
Answer 1: (score: 1)
Use:
>>> from pyspark.sql.functions import split, explode, col
>>>
>>> df.select(explode(split("text", "\\s+")).alias("word")) \
... .where(col("word").startswith("#"))
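Both answers flatten the result into a single stream of words. If the per-tweet grouping produced by the `map` stage in the question should be preserved, the filter can instead be applied inside each row's word list. A minimal sketch in plain Python, where `keep_hashtags` is a hypothetical helper name and `rows` is a shortened stand-in for the tweet texts:

```python
def keep_hashtags(words):
    """Keep only the words that start with '#', preserving their order."""
    return [w for w in words if w.startswith("#")]


# Shortened stand-ins for the tweet texts (assumption, for illustration only).
rows = [
    "Now I want to dance. #shutUpAndDANCE",
    "Rachmat #JR50 #flipagram ? Music",
]

# Same shape as the question's map stage: one list of words per row.
per_row = [keep_hashtags(text.split(" ")) for text in rows]
print(per_row)  # [['#shutUpAndDANCE'], ['#JR50', '#flipagram']]
```

In the question's pipeline this would slot in as a further `.map(keep_hashtags)` after the `split` step and before `.collect()`.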