Question

I am writing a wordCount program that reads data from a MySQL database. My data looks like this:

rawText = sqlContext.read.format("jdbc").options(url=jdbcUrl, dbtable = "KeyWordFed").load()
rawText.take(5)

[Row(id=1, text='RT @GretaLWall: #BREAKING: President Trump picks Jerome Powell to be next Chair of the Federal Reserve', created=datetime.datetime(2017, 11, 1, 21, 56, 59), id_str='925844141896011776', retweet_count=0, polarity=0.0, subjectivity=0.0), Row(id=2, .....)]

I only want to take the text part and do some cleaning on it, so I use:

def clean_text(x):
    text = re.search(r"text='(.+)', created=", str(x)).group(1)
    clean_str = text.translate(str.maketrans('','',punc))
    return clean_str

The first line extracts the text part, and the second line strips the punctuation.

one_RDD = rawText.flatMap(lambda x: clean_text(x).split()).map(lambda y: (y,1))
one_RDD.take(30)

I got the result:

[('RT', 1), ('@GretaLWall', 1), ('#BREAKING', 1), ('President', 1), ('Trump', 1), ('picks', 1), ('Jerome', 1), ('Powell', 1), ('to', 1), ('be', 1), ('next', 1), ('Chair', 1), ('of', 1), ('the', 1), ('Federal', 1), ('Reserve', 1), ('#Trump', 1), ('nomina', 1), ('Jerome', 1), ('Powell', 1), ('presidente', 1), ('della', 1), ('Federal', 1), ('Reserve', 1), ('#Trump', 1), ('#nomina', 1), ('#Jerome', 1), ('#Powell', 1), ('#presidente', 1), ('httpstco1ZUIZfgOFj', 1)]

Everything works perfectly up to this point.

But when I try to aggregate all the words:

one_RDD = one_RDD.reduceByKey(lambda a,b: a + b)
one_RDD.take(5)

I got an error. The full error message is too long to post, but basically it says:

File "<ipython-input-113-d273e318b1c5>", line 1, in <lambda>
  File "<ipython-input-85-c8d7f3db6341>", line 2, in clean_text
AttributeError: 'NoneType' object has no attribute 'group'

Other information:

I ran into this same error earlier, at the .map(lambda y: (y,1)) step. At that time I was using lambda x: (x,1). After I changed x to y the error went away, but I do not understand why.

Answer 1

At least one of the rows in your RDD does not contain the regex you are searching for. You can check this with:

# Keep only the rows where the pattern does NOT match
rawText.filter(lambda x: re.search(r"text='(.+)', created=", str(x)) is None).take(5)
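
The failure can be reproduced without Spark. The following is a minimal sketch, assuming two hypothetical sample strings (with_text and without_text) that mimic the str(Row) output from the question. re.search returns None when the pattern does not match, and calling .group on None raises the AttributeError shown in the traceback.

```python
import re

PATTERN = r"text='(.+)', created="

# Hypothetical sample strings standing in for str(row); not real data
with_text = "Row(id=1, text='hello world', created=datetime.datetime(2017, 11, 1))"
without_text = "Row(id=2, created=datetime.datetime(2017, 11, 1))"

# The pattern matches, so group(1) returns the captured text
match = re.search(PATTERN, with_text)
extracted = match.group(1)

# The pattern does not match, so re.search returns None
no_match = re.search(PATTERN, without_text)

# Calling .group on None raises AttributeError
try:
    no_match.group(1)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(extracted)
print(no_match)
print(error_message)
```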

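Note that the error comes from Python, not from Spark. The logic in clean_text does not handle the case where the regex finds no match. A version that handles it: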

import re
from string import punctuation as punc
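def clean_text(x):
    try:
        # re.search returns None when the row has no text field,
        # so .group(1) raises AttributeError for such rows
        text = re.search(r"text='(.+)', created=", str(x)).group(1)
        clean_str = text.translate(str.maketrans('', '', punc))
        return clean_str
    except AttributeError:
        # No match: return an empty string so split() yields no words
        return ""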

rawText=sc.parallelize(["Row(id=1, text='RT @GretaLWall: #BREAKING: President Trump picks Jerome Powell to be next Chair of the Federal Reserve', created=datetime.datetime(2017, 11, 1, 21, 56, 59), id_str='925844141896011776', retweet_count=0, polarity=0.0, subjectivity=0.0)", 
                        "Row(id=1, created=datetime.datetime(2017, 11, 1, 21, 56, 59), id_str='925844141896011776', retweet_count=0, polarity=0.0, subjectivity=0.0)"])
one_RDD = rawText.flatMap(lambda x: clean_text(x).split()).map(lambda y: (y,1))
one_RDD.take(30)

    [('RT', 1),
     ('GretaLWall', 1),
     ('BREAKING', 1),
     ('President', 1),
     ('Trump', 1),
     ('picks', 1),
     ('Jerome', 1),
     ('Powell', 1),
     ('to', 1),
     ('be', 1),
     ('next', 1),
     ('Chair', 1),
     ('of', 1),
     ('the', 1),
     ('Federal', 1),
     ('Reserve', 1)]

I would recommend filtering these rows out instead, because raising exceptions slows the computation down.
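
One way to filter instead of catching exceptions is sketched below in plain Python so it runs without a Spark context. The has_text predicate, the precompiled PATTERN and STRIP_PUNCT names, and the sample rows are all assumptions made for illustration; they are not part of the original code.

```python
import re
from string import punctuation as punc

# Hypothetical helper names; compiled once and reused for every row
PATTERN = re.compile(r"text='(.+)', created=")
STRIP_PUNCT = str.maketrans('', '', punc)

def has_text(row):
    """Predicate: True when the row's string form contains a text field."""
    return PATTERN.search(str(row)) is not None

def clean_text(row):
    """Extract the text field and strip punctuation; assumes has_text(row)."""
    text = PATTERN.search(str(row)).group(1)
    return text.translate(STRIP_PUNCT)

# Hypothetical sample rows: one with a text field, one without
rows = [
    "Row(id=1, text='RT @user: #BREAKING: some news', created=datetime.datetime(2017, 11, 1))",
    "Row(id=2, created=datetime.datetime(2017, 11, 1))",
]

# Plain-Python stand-in for filter -> flatMap -> map on an RDD
words = [w for row in rows if has_text(row) for w in clean_text(row).split()]
pairs = [(w, 1) for w in words]
print(pairs)
```

Assuming rawText is an RDD as in the example above, the same predicate could be applied with rawText.filter(has_text) ahead of the flatMap step, so that clean_text never sees a row without a text field.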

Pyspark reduceByKey error associated with flatmap lambda function
