I am running into something very strange when running a Spark code snippet on local Spark versus on a cluster with YARN.
Here is the code I tried. It is just a small function that converts a text field into word2vec vectors. My data looks like this:
[Row(id=u'-33753621', title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', text_cleaned=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before."),
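For reference, here is a minimal sketch of how a DataFrame with this schema could be put together for a quick test (the column names follow the sample row above; the literal strings are truncated and purely illustrative, and sqlContext is the SQLContext created further below):
from pyspark.sql import Row

# Illustrative single-row DataFrame with the same columns as the real data.
sample = Row(id=u'-33753621',
             title=u'Royal Bank of Scotland is testing a robot ...',
             text_cleaned=u'If you hate dealing with bank tellers or customer service representatives ...')
df_cleaned = sqlContext.createDataFrame([sample])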
I have the following code snippet:
def word2Vec(df):
    """ This function takes in the data frame of the texts and finds the Word vector
    representation of that
    """
    from pyspark.ml.feature import Tokenizer, Word2Vec

    # Carrying out the Tokenization of the text documents (splitting into words)
    tokenizer = Tokenizer(inputCol="text_cleaned", outputCol="tokenised_text")
    tokensDf = tokenizer.transform(df)

    # Implementing the word2Vec model
    word2Vec = Word2Vec(vectorSize=300, seed=42, inputCol="tokenised_text", outputCol="w2v_vector")
    w2vmodel = word2Vec.fit(tokensDf)
    w2vdf = w2vmodel.transform(tokensDf)
    return w2vdf, w2vmodel

w2vdf, w2vmodel = word2Vec(df_cleaned)
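As a side note, the size of the fitted Word2Vec model is driven by the vocabulary it learns; a minimal sketch of how it could be inspected through the standard Word2VecModel API, using the w2vmodel returned above:
# getVectors() returns a DataFrame of (word, vector) pairs learned by the model.
vocab = w2vmodel.getVectors()
print(vocab.count())  # number of distinct words kept by the model
vocab.show(5)         # a few sample word vectors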
I first tried the code snippet above on Spark standalone, with the following Spark configuration set at the beginning:
Spark Local:
sc.stop()
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setAppName('Learnfit_RSS')\
.set("spark.executor.memory", "4g")\
.set("spark.executor.cores",5)\
.set("spark.executor.instances",10)\
.set("spark.yarn.executor.memoryOverhead",1024)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
This code runs completely fine and gives me the following output. As you can see, the transformers in the method above worked correctly and created the corresponding columns.
[Row(id=u'-33753621', tokenised_text=[u'if', u'you', u'hate', u'dealing', u'with', u'bank', u'tellers', u'or', u'customer', u'service', u'representatives,', u'then', u'the', u'royal', u'bank', u'of', u'scotland', u'might', u'have', u'a', u'solution', u'for', u'you.if', u'this', u'program', u'is', u'successful,', u'it', u'could', u'be', u'a', u'big', u'step', u'forward', u'on', u'the', u'road', u'to', u'automated', u'customer', u'service', u'through', u'the', u'use', u'of', u'ai,', u'notes', u'laurie', u'beaver,', u'research', u'associate', u'for', u'bi', u'intelligence,', u'business', u"insider's", u'premium', u'research', u"service.it's", u'noteworthy', u'that', u'luvo', u'does', u'not', u'operate', u'via', u'a', u'third-party', u'app', u'such', u'as', u'facebook', u'messenger,', u'wechat,', u'or', u'kik,', u'all', u'of', u'which', u'are', u'currently', u'trying', u'to', u'create', u'bots', u'that', u'would', u'assist', u'in', u'customer', u'service', u'within', u'their', u'respective', u'platforms.luvo', u'would', u'be', u'available', u'through', u'the', u'web', u'and', u'through', u'smartphones.', u'it', u'would', u'also', u'use', u'machine', u'learning', u'to', u'learn', u'from', u'its', u'mistakes,', u'which', u'should', u'ultimately', u'help', u'with', u'its', u'response', u'accuracy.down', u'the', u'road,', u'luvo', u'would', u'become', u'a', u'supplement', u'to', u'the', u'human', u'staff.', u'it', u'can', u'currently', u'answer', u'20', u'set', u'questions', u'but', u'as', u'that', u'number', u'grows,', u'it', u'would', u'allow', u'the', u'human', u'employees', u'to', u'more', u'complicated', u'issues.', u'if', u'a', u'problem', u'is', u'beyond', u"luvo's", u'comprehension,', u'then', u'it', u'would', u'refer', u'the', u'customer', u'to', u'a', u'bank', u'employee;', u'however,\xa0a', u'user', u'could', u'choose', u'to', u'speak', u'with', u'a', u'human', u'instead', u'of', u'luvo', u'anyway.ai', u'such', u'as', u'luvo,', u'if', u'successful,', u'could', u'help', u'businesses', u'become', u'more', u'efficient', u'and', u'increase', u'their', u'productivity,', u'while', u'simultaneously', u'improving', u'customer', u'service', u'capacity,', u'which', u'would', u'consequently\xa0save', u'money', u'that', u'would', u'otherwise', u'go', u'toward', u'manpower.and', u'this', u'trend', u'is', u'already', u'starting.', u'google,', u'microsoft,', u'and', u'ibm', u'are', u'investing', u'significantly', u'into', u'ai', u'research.', u'furthermore,', u'the', u'global', u'ai', u'market', u'is', u'estimated', u'to', u'grow', u'from', u'approximately', u'$420', u'million', u'in', u'2014', u'to', u'$5.05', u'billion', u'in', u'2020,', u'according', u'to', u'a', u'forecast', u'by', u'research', u'and', u'markets.\xa0the', u'move', u'toward', u'ai', u'would', u'be', u'just', u'one', u'more', u'way', u'in', u'which', u'the', u'digital', u'age', u'is', u'disrupting', u'retail', u'banking.', u'customers,', u'particularly', u'millennials,', u'are', u'increasingly', u'moving', u'toward', u'digital', u'banking,', u'and', u'as', u'a', u'result,', u"they're", u'walking', u'into', u'their', u"banks'", u'traditional', u'brick-and-mortar', u'branches', u'less', u'often', u'than', u'ever', u'before.'], w2v_vector=DenseVector([-0.0394, -0.0388, 0.0368, -0.0455, 0.0602, -0.0734, 0.0515, -0.0064, -0.068, -0.0438, 0.0671, 0.007, -0.0227, -0.0393, -0.0254, -0.024, 0.0115, 0.0415, -0.0116, -0.0169, 0.0545, -0.0439, 0.0414, 0.0312, -0.028, -0.0085, 0.0234, -0.1321, -0.0364, 0.0921, 0.0208, 0.0156, 0.0071, 0.0186, 
-0.0455, -0.0634, 0.0379, 0.0148, 0.0401, -0.0395, 0.0334, 0.0026, -0.0748, -0.0242, -0.0373, 0.0602, -0.0341, -0.0181, 0.0723, 0.0012, -0.1177, 0.0319, 0.0322, -0.1054, -0.0011, -0.0415, -0.0161, -0.0472, -0.0785, -0.0219, -0.0311, 0.0296, -0.0149, 0.04, 0.0001, 0.0337, 0.0841, -0.0344, -0.0171, 0.0425, -0.0122, 0.0838, 0.034, 0.0054, 0.0171, 0.0209, 0.0286, -0.0227, -0.0147, 0.0532, -0.027, -0.0645, -0.0858, -0.1444, 0.0824, 0.0128, -0.0485, -0.0378, -0.0229, 0.0331, -0.0248, 0.0427, -0.0624, -0.0324, -0.0271, 0.0135, 0.0504, 0.0028, -0.0772, 0.0121, -0.09, 0.031, -0.0771, -0.0703, 0.0947, 0.0997, -0.0084, 0.0774, 0.0281, 0.0405, -0.0475, 0.0217, 0.0591, 0.0241, -0.0287, 0.1064, 0.059, -0.06, 0.0422, 0.0908, 0.0341, 0.028, -0.0334, 0.0065, -0.0289, -0.0851, -0.0208, 0.0598, -0.0218, 0.001, 0.0049, 0.0257, 0.0076, -0.0599, 0.006, -0.0494, -0.0081, 0.0066, 0.0131, -0.0299, 0.0159, -0.0383, 0.0402, -0.0571, 0.0359, 0.0009, 0.0404, -0.0207, 0.0044, -0.0089, 0.0306, -0.0405, -0.0012, 0.0159, -0.005, -0.031, -0.0016, -0.0081, 0.0123, -0.0364, 0.0161, -0.0383, -0.0303, -0.0073, -0.0184, 0.0399, 0.0412, 0.0278, 0.0455, -0.0304, 0.0145, -0.0163, 0.0631, -0.0423, 0.0239, 0.0801, -0.0659, -0.0382, 0.0138, 0.051, 0.0056, -0.1605, 0.0018, 0.0077, -0.0076, 0.0119, 0.0397, -0.0823, -0.0462, 0.0465, 0.0735, 0.0283, -0.0205, -0.012, 0.0662, 0.0429, 0.0089, -0.0562, 0.1624, 0.0192, 0.0098, -0.0483, 0.0248, 0.0005, -0.0619, -0.0115, 0.0424, -0.0875, 0.0383, -0.0463, -0.0044, -0.0218, 0.014, -0.0404, -0.0198, -0.0162, -0.018, -0.0377, -0.0291, -0.0273, -0.0713, -0.0047, 0.0263, 0.0809, -0.0477, 0.0056, -0.0563, -0.061, -0.0185, 0.0223, -0.0718, 0.0163, 0.0061, -0.0716, -0.0081, 0.0079, 0.0156, -0.0124, -0.0223, -0.0092, -0.0621, 0.0033, 0.031, 0.0509, -0.0548, -0.0121, -0.0276, 0.0176, -0.04, 0.0382, -0.0737, 0.0202, -0.0314, -0.0702, 0.0685, -0.0928, 0.0698, -0.0484, 0.0541, -0.0539, 0.0895, 0.0076, -0.0134, -0.0116, 0.0227, -0.0361, -0.0729, -0.0068, -0.0501, 0.0137, -0.0134, 0.0039, -0.0463, 0.0289, -0.0336, -0.0731, -0.0362, -0.0195, 0.0466, -0.0132, 0.0336, 0.0108, 0.0219, -0.0702, -0.0117, -0.0285, 0.0644, -0.0806, 0.002, -0.0603, 0.0365, 0.0333, 0.0197, -0.037, 0.0983, 0.0011, 0.0436, 0.0506, -0.0089, -0.0134]))]
Spark on YARN:
Now I ran the same method above on the cluster with Spark on YARN. This time I used the following configuration settings to get better performance and speed:
sc.stop()
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setAppName('Learnfit_RSS')\
.set("spark.executor.memory", "30g")\
.set("spark.executor.cores",5)\
.set("spark.executor.instances",8)\
.set("spark.yarn.executor.memoryOverhead",1024)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
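As a sanity check, the values placed on the SparkConf above can be printed out; a minimal sketch using only the standard SparkConf.getAll() call:
# Print the executor/YARN settings on the conf object; they have to be in place
# before SparkContext(conf=conf) is called in order to take effect.
for key, value in sorted(conf.getAll()):
    if key.startswith("spark.executor") or key.startswith("spark.yarn"):
        print("%s = %s" % (key, value))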
No matter what I put in the configuration above, it always gives me this Java heap space OutOfMemory error:
2017-02-09 04:44:54,776 INFO org.apache.spark.scheduler.TaskSetManager (Logging.scala:logInfo(58)) - Starting task 33.0 in stage 2.0 (TID 35, 107-02-c02.sc1.altiscale.com, partition 33,PROCESS_LOCAL, 2627307 bytes)
Exception in thread "dispatcher-event-loop-3" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:103)
at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:200)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:460)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:252)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
As you can see, this is not a data-size issue: the same dataset works fine on local Spark with only 4g of executor memory, on a local Mac with just 16GB of RAM. On the cluster, even though the maximum available memory per executor is 40G, it still throws this error.
Please advise what is going wrong here when running this code on the YARN cluster.