Spark: SocketException: Read timed out when processing a large number of files with Spark

Date: 2015-04-09 13:24:21

Tags: amazon-s3 mapreduce apache-spark spark-streaming pyspark

I am running Spark in standalone mode on 2 machines with these configurations:

  1. 500 GB of storage, 4 cores, 7.5 GB RAM
  2. 250 GB of storage, 8 cores, 15 GB RAM

I created a master and a slave on the 8-core machine, giving 7 cores to the worker, and another slave on the 4-core machine with 3 worker cores. The UI shows 13.7 and 6.5 GB of usable RAM for the 8-core and 4-core nodes respectively.

Now, I have to process the sum of user ratings over 15 days. I am trying to do this with PySpark. The data is stored in hourly files inside daily directories in an S3 bucket, and each file should be around 100 MB, for example:

    s3://some_bucket/2015-04/2015-04-09/data_files_hour1

I am reading the files like this:

    a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
    

where files is a string of this format: 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
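
As a quick sanity check on how many partitions the coalesce above actually produces, a minimal sketch using the RDD `a` defined above:

    print(a.getNumPartitions())  # partition count after the coalesce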

Then I do a series of maps and filters and persist the result:

    a.persist(StorageLevel.MEMORY_ONLY_SER)
    

Then I need to do a reduceByKey to get the aggregate score over the days:

    b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
    b.persist(StorageLevel.MEMORY_ONLY_SER)
    

Then I need to make Redis calls to get the actual terms (tags) of the items the user rated, so I call mapPartitions like this:

    final_scores = b.mapPartitions(get_tags)
    
The get_tags function creates a Redis connection on each call, queries Redis, and yields (user, item, rating) tuples (the Redis hash is stored on the 4-core machine).

I have tweaked the SparkConf settings to:

    conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
            .set("spark.executor.memory", "5g")
            .set("spark.akka.timeout", "10000")
            .set("spark.akka.frameSize", "1000")
            .set("spark.task.cpus", "5")
            .set("spark.cores.max", "10")
            .set("spark.serializer",      "org.apache.spark.serializer.KryoSerializer")
            .set("spark.kryoserializer.buffer.max.mb", "10")
            .set("spark.shuffle.consolidateFiles", "True")
            .set("spark.files.fetchTimeout", "500")
            .set("spark.task.maxFailures", "5"))
    

I run the job in client mode with 2g of driver memory, since cluster mode does not appear to be supported here. The process above takes a very long time for 2 days' worth of data (around 2.5 hours) and gives up completely for 14 days.

What needs to be improved here?

1. Is this infrastructure insufficient in terms of RAM and cores? (This is an offline job and could take hours, but it has to finish in about 5 hours.)
2. Should I increase or decrease the number of partitions?
3. Redis could be slowing the system down, but the number of keys is far too large to fetch in a single call.
4. I am not sure where the task is failing, while reading the files or while reducing.
5. Shouldn't I be using Scala instead of Python, given its better Spark API; would that also help with efficiency?
6. This is the exception trace:

      Lost task 4.1 in stage 0.0 (TID 11, <node>): java.net.SocketTimeoutException: Read timed out
          at java.net.SocketInputStream.socketRead0(Native Method)
          at java.net.SocketInputStream.read(SocketInputStream.java:152)
          at java.net.SocketInputStream.read(SocketInputStream.java:122)
          at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
          at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
          at sun.security.ssl.InputRecord.read(InputRecord.java:509)
          at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
          at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
          at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
          at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
          at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
          at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
          at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
          at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
          at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:227)
          at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
          at org.apache.http.util.EntityUtils.consume(EntityUtils.java:88)
          at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
          at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
          at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
          at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
          at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
          at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:126)
          at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
          at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
          at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
          at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
          at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
          at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
          at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
          at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
          at scala.collection.Iterator$class.foreach(Iterator.scala:727)
          at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
          at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
          at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
          at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
          at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
      

I could really use some help here, thanks in advance.

Below is my main code:

    def main(sc):
        f = get_files()
        a = (sc.textFile(f, 15)
             .coalesce(7 * sc.defaultParallelism)
             .map(lambda line: line.split(","))
             .filter(lambda line: len(line) > 0)
             .map(lambda line: (line[18], line[2], line[13], line[15]))
             .map(scoring)
             .map(lambda line: ((line[0], line[1]), line[2]))
             .persist(StorageLevel.MEMORY_ONLY_SER))
        b = a.reduceByKey(lambda x, y: x + y).map(aggregate)
        b.persist(StorageLevel.MEMORY_ONLY_SER)
        c = taggings.mapPartitions(get_tags)  # `taggings` presumably refers to `b` above
        c.saveAsTextFile("f")
        a.unpersist()
        b.unpersist()

The get_tags function is:

    def get_tags(partition):
        rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
        for element in partition:
            user = element[0]
            song = element[1]
            rating = element[2]
            tags = rh.hget(settings['REDIS_HASH'], song)
            if tags:
                tags = json.loads(tags)
            else:
                tags = scrape(song, rh)
            if tags:
                for tag in tags:
                    yield (user, tag, rating)
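
To cut down on per-key round trips to Redis (question 3 above), one option is to batch the lookups inside each partition. The sketch below makes the same assumptions as get_tags (settings, REDIS_HASH, scrape, and the (user, song, rating) element layout are as in the question); get_tags_batched, _flush, and the batch size of 500 are made-up names and values for illustration:

    def get_tags_batched(partition, batch_size=500):
        # Same inputs/outputs as get_tags, but fetches up to `batch_size`
        # songs per Redis round trip via HMGET instead of one HGET each.
        rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
        batch = []
        for element in partition:
            batch.append(element)
            if len(batch) >= batch_size:
                for out in _flush(rh, batch):
                    yield out
                batch = []
        if batch:
            for out in _flush(rh, batch):
                yield out

    def _flush(rh, batch):
        songs = [song for (_user, song, _rating) in batch]
        cached = rh.hmget(settings['REDIS_HASH'], songs)  # one round trip for the whole batch
        for (user, song, rating), raw in zip(batch, cached):
            tags = json.loads(raw) if raw else scrape(song, rh)
            if tags:
                for tag in tags:
                    yield (user, tag, rating)

    # usage: final_scores = b.mapPartitions(get_tags_batched)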
      

The get_files function is as follows:

    def get_files():
        paths = get_path_from_dates(DAYS)
        base_path = 's3n://acc_key:sec_key@bucket/'
        files = list()
        for path in paths:
            fle = base_path + path + '/file_format.*'
            files.append(fle)
        return ','.join(files)
      

And get_path_from_dates(DAYS) is:

    def get_path_from_dates(last):
        # assumes `today` (a datetime.date) and timedelta are defined/imported
        # at module level, e.g. today = datetime.date.today()
        days = list()
        t = 0
        while t <= last:
            d = today - timedelta(days=t)
            path = d.strftime('%Y-%m') + '/' + d.strftime('%Y-%m-%d')
            days.append(path)
            t += 1
        return days
      

2 Answers:

Answer 0 (score: 0)

As a small optimization, I created two separate jobs: one that reads from S3 and computes the additive sums, and a second that reads the transformed data back and does the Redis lookups. The first job has a large number of partitions, since there are about 2,300 files to read. The second has a much smaller number of partitions to avoid the Redis connection latency, and there is only one file to read, which sits on the EC2 cluster itself. This is only a partial fix; still looking for suggestions to improve...
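
A minimal sketch of that two-job split, assuming the same helpers as in the question (get_files, scoring, aggregate, get_tags); the output paths, the parse_line helper, and the partition counts are illustrative, not part of the original code:

    # Job 1: read from S3 with many partitions (~2300 input files), compute the
    # additive sums, and write a single intermediate output.
    def job_sums(sc):
        a = (sc.textFile(get_files(), 15)
             .map(lambda line: line.split(","))
             .filter(lambda line: len(line) > 0)
             .map(lambda line: (line[18], line[2], line[13], line[15]))
             .map(scoring)
             .map(lambda line: ((line[0], line[1]), line[2])))
        b = a.reduceByKey(lambda x, y: x + y).map(aggregate)
        b.coalesce(1).saveAsTextFile("hdfs:///tmp/summed_scores")  # assumed location

    # Job 2: read the (single) intermediate output with only a few partitions
    # and do the Redis lookups, so connection latency is paid far less often.
    def job_tags(sc):
        summed = (sc.textFile("hdfs:///tmp/summed_scores", 4)
                  .map(parse_line))  # parse_line: assumed inverse of how job 1 serialized the tuples
        summed.mapPartitions(get_tags).saveAsTextFile("hdfs:///tmp/final_scores")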

Answer 1 (score: 0)

I had a similar use case: performing a coalesce on an RDD with 300,000+ partitions. The difference is that I was using s3a (SocketTimeoutException from S3AFileSystem.waitAysncCopy). The issue was finally resolved by setting a larger fs.s3a.connection.timeout in Hadoop's core-site.xml. Hopefully this gives you a clue.
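
If editing core-site.xml is awkward, the same Hadoop property can usually be passed through Spark instead, since settings prefixed with spark.hadoop. are forwarded to the Hadoop configuration. A sketch (the 200000 ms value is illustrative, and fs.s3a.connection.timeout applies only to the s3a:// connector, not to s3n:// as used in the question):

    conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
            # forwarded to Hadoop as fs.s3a.connection.timeout (milliseconds)
            .set("spark.hadoop.fs.s3a.connection.timeout", "200000"))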