BeautifulSoup in a Spark UDF: "No module named bs4" on the cluster

Date: 2017-02-06 20:32:34

Tags: python apache-spark beautifulsoup pyspark

I have a Spark DataFrame with a text column, and I am trying to strip the HTML tags from that data using the Python BeautifulSoup library.

When I use BeautifulSoup with Spark installed locally on my Mac laptop, it works fine inside a Spark UDF and cleans the tags:

from bs4 import BeautifulSoup
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def html_parsing(x):
    """Cleans the HTML from the text column of the DataFrame."""
    textcleaned = ''
    souptext = BeautifulSoup(x, 'html.parser')
    # Concatenate the plain text of every <p> tag
    for p in souptext.find_all('p'):
        if p.string:
            textcleaned += p.string
    return textcleaned


parse_html = udf(html_parsing, StringType())

sdf_cleaned = sdf_rss.dropna(subset=['desc']) \
    .withColumn('text_cleaned', parse_html('desc')) \
    .select('id', 'title', 'text_cleaned')

sdf_cleaned.cache().take(3)

[Row(id=u'-33753621', title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', text_cleaned=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before."),

However, when I run the same code against the Spark installation on the cluster, it fails with "No module named bs4". The code is run from an Anaconda Jupyter notebook on the cluster with the pyspark kernel installed.

Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 9, 107-45-c02.sc): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named bs4
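
The traceback shows the failure happens while unpickling the UDF on a worker (the pickle.loads frame), so it is the executors' Python that cannot import bs4, not the driver's. A quick way to check whether the module is importable on the executors at all (a minimal diagnostic sketch, assuming sc is the SparkContext created by the pyspark kernel):

def try_import(_):
    # Runs on an executor: attempt the import there and report the result
    try:
        import bs4
        return 'bs4 found at ' + bs4.__file__
    except ImportError as e:
        return 'import failed: ' + str(e)

print(sc.parallelize([0]).map(try_import).collect())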

I want to emphasize that BeautifulSoup is also installed in the Anaconda on the Spark cluster. I confirmed this by running

conda list 

and the package does show up there.
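
Note that conda list only inspects the environment of the shell it is run in. Since the notebook driver and the executors may be pointing at different interpreters, comparing them directly can expose a mismatch (again a sketch, assuming sc is the live SparkContext):

import sys

# Interpreter used by the driver (the notebook's Anaconda Python)
print(sys.executable)

# Interpreter each executor actually runs UDFs with
print(sc.parallelize([0]).map(lambda _: sys.executable).collect())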

So what could the problem be here?

Any help is much appreciated.

0 Answers:

No answers