I am trying to run MRkmeans in RStudio 0.99.484 with Hadoop 2.3.0 (Windows version). With one input file (755 * 1682 real values, about 21 MB) the job completes successfully, but with a larger file (4832 * 3952 real values, about 317 MB) the map-reduce job fails; the full MR progress output and errors are shown below. Can this be fixed by requesting larger sizes through rmr.options(backend.parameters)? If so, I would appreciate an example.
rmr: DEPRECATED: Please use 'rm -r' instead.
rmr: `/Users/SETUPC~1/AppData/Local/Temp/RtmpQ9MVgC/file10f06a465c65': No such file or directory
rmr: DEPRECATED: Please use 'rm -r' instead.
rmr: `/Users/SETUPC~1/AppData/Local/Temp/RtmpQ9MVgC/file10f0634072aa': No such file or directory
15/10/19 21:49:56 WARN zlib.ZlibFactory: Failed to load/initialize native-zlib library
15/10/19 21:49:56 INFO compress.CodecPool: Got brand-new compressor [.deflate]
packageJobJar: [/C:/tmp/hadoop-Koohi/hadoop-unjar740024213403447693/] [] C:\Users\SETUPC~1\AppData\Local\Temp\streamjob2283559356588490466.jar tmpDir=null
15/10/19 21:54:03 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/10/19 21:54:03 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/10/19 21:54:12 INFO mapred.FileInputFormat: Total input paths to process : 1
15/10/19 21:54:13 INFO mapreduce.JobSubmitter: number of splits:2
15/10/19 21:54:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1445275456322_0003
15/10/19 21:54:15 INFO impl.YarnClientImpl: Submitted application application_1445275456322_0003
15/10/19 21:54:15 INFO mapreduce.Job: The url to track the job: http://Hamidreza:8088/proxy/application_1445275456322_0003/
15/10/19 21:54:15 INFO mapreduce.Job: Running job: job_1445275456322_0003
15/10/19 21:54:34 INFO mapreduce.Job: Job job_1445275456322_0003 running in uber mode : false
15/10/19 21:54:34 INFO mapreduce.Job: map 0% reduce 0%
15/10/19 21:55:04 INFO mapreduce.Job: map 1% reduce 0%
15/10/19 21:56:07 INFO mapreduce.Job: map 9% reduce 0%
15/10/19 21:56:31 INFO mapreduce.Job: map 10% reduce 0%
15/10/19 21:56:41 INFO mapreduce.Job: map 11% reduce 0%
15/10/19 21:56:55 INFO mapreduce.Job: map 19% reduce 0%
15/10/19 21:56:58 INFO mapreduce.Job: map 20% reduce 0%
15/10/19 21:57:07 INFO mapreduce.Job: map 21% reduce 0%
15/10/19 21:57:19 INFO mapreduce.Job: map 26% reduce 0%
15/10/19 21:57:25 INFO mapreduce.Job: map 27% reduce 0%
15/10/19 21:57:28 INFO mapreduce.Job: map 31% reduce 0%
15/10/19 21:57:31 INFO mapreduce.Job: map 39% reduce 0%
15/10/19 21:57:34 INFO mapreduce.Job: map 46% reduce 0%
15/10/19 21:57:44 INFO mapreduce.Job: map 47% reduce 0%
15/10/19 21:57:47 INFO mapreduce.Job: map 50% reduce 0%
15/10/19 21:57:49 INFO mapreduce.Job: map 66% reduce 0%
15/10/19 21:57:50 INFO mapreduce.Job: map 67% reduce 0%
15/10/19 21:57:50 INFO mapreduce.Job: Task Id : attempt_1445275456322_0003_m_000000_0, Status : FAILED
Container [pid=container_1445275456322_0003_01_000002,containerID=container_1445275456322_0003_01_000002] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 1.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1445275456322_0003_01_000002 :
|- PID CPU_TIME(MILLIS) VMEM(BYTES) WORKING_SET(BYTES)
|- 176 15 716800 2641920
|- 6680 17515 979025920 955031552
|- 5660 0 512000 1769472
|- 6288 31 1675264 2793472
|- 6976 11296 363868160 241926144
|- 2816 0 1736704 2416640
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
15/10/19 21:57:51 INFO mapreduce.Job: map 17% reduce 0%
15/10/19 21:58:12 INFO mapreduce.Job: map 18% reduce 0%
15/10/19 21:58:13 INFO mapreduce.Job: map 22% reduce 0%
15/10/19 21:58:50 INFO mapreduce.Job: map 26% reduce 0%
15/10/19 21:58:55 INFO mapreduce.Job: map 31% reduce 0%
15/10/19 21:59:10 INFO mapreduce.Job: map 47% reduce 0%
15/10/19 21:59:11 INFO mapreduce.Job: map 51% reduce 0%
15/10/19 21:59:13 INFO mapreduce.Job: map 60% reduce 0%
15/10/19 21:59:17 INFO mapreduce.Job: map 63% reduce 0%
15/10/19 21:59:28 INFO mapreduce.Job: Task Id : attempt_1445275456322_0003_m_000000_1, Status : FAILED
Container [pid=container_1445275456322_0003_01_000004,containerID=container_1445275456322_0003_01_000004] is running beyond physical memory limits. Current usage: 1.2 GB of 1 GB physical memory used; 1.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1445275456322_0003_01_000004 :
|- PID CPU_TIME(MILLIS) VMEM(BYTES) WORKING_SET(BYTES)
|- 5420 0 716800 2641920
|- 1420 62 1671168 2785280
|- 5432 13531 375529472 302137344
|- 4016 15 507904 1765376
|- 4204 17125 971837440 951898112
|- 4208 15 1732608 2404352
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
15/10/19 21:59:29 INFO mapreduce.Job: map 30% reduce 0%
15/10/19 21:59:35 INFO mapreduce.Job: map 33% reduce 0%
15/10/19 21:59:53 INFO mapreduce.Job: map 34% reduce 0%
15/10/19 21:59:56 INFO mapreduce.Job: map 50% reduce 0%
15/10/19 22:00:03 INFO mapreduce.Job: map 72% reduce 0%
15/10/19 22:00:06 INFO mapreduce.Job: map 83% reduce 0%
15/10/19 22:00:16 INFO mapreduce.Job: map 100% reduce 0%
15/10/19 22:00:16 INFO mapreduce.Job: Task Id : attempt_1445275456322_0003_m_000000_2, Status : FAILED
Container [pid=container_1445275456322_0003_01_000005,containerID=container_1445275456322_0003_01_000005] is running beyond physical memory limits. Current usage: 1.2 GB of 1 GB physical memory used; 1.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1445275456322_0003_01_000005 :
|- PID CPU_TIME(MILLIS) VMEM(BYTES) WORKING_SET(BYTES)
|- 5904 15 1732608 2412544
|- 6872 0 712704 2629632
|- 4664 14546 971898880 951922688
|- 3632 78 1667072 2785280
|- 6092 0 512000 1769472
|- 6924 13203 371974144 314916864
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
15/10/19 22:00:17 INFO mapreduce.Job: map 50% reduce 0%
15/10/19 22:00:20 INFO mapreduce.Job: map 50% reduce 17%
15/10/19 22:00:27 INFO mapreduce.Job: map 76% reduce 17%
15/10/19 22:00:30 INFO mapreduce.Job: map 83% reduce 17%
15/10/19 22:00:38 INFO mapreduce.Job: map 100% reduce 17%
15/10/19 22:00:39 INFO mapreduce.Job: map 100% reduce 100%
15/10/19 22:00:41 INFO mapreduce.Job: Job job_1445275456322_0003 failed with state FAILED due to: Task failed task_1445275456322_0003_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
15/10/19 22:00:45 INFO mapreduce.Job: Counters: 40
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=79441152
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=63636256
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=4
Killed map tasks=1
Killed reduce tasks=1
Launched map tasks=6
Launched reduce tasks=1
Other local map tasks=4
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=714657
Total time spent by all reduces in occupied slots (ms)=39170
Total time spent by all map tasks (ms)=714657
Total time spent by all reduce tasks (ms)=39170
Total vcore-seconds taken by all map tasks=714657
Total vcore-seconds taken by all reduce tasks=39170
Total megabyte-seconds taken by all map tasks=731808768
Total megabyte-seconds taken by all reduce tasks=40110080
Map-Reduce Framework
Map input records=78
Map output records=56
Map output bytes=79348969
Map output materialized bytes=79349223
Input split bytes=93
Combine input records=0
Spilled Records=56
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=4670
CPU time spent (ms)=161251
Physical memory (bytes) snapshot=373673984
Virtual memory (bytes) snapshot=395513856
Total committed heap usage (bytes)=306708480
File Input Format Counters
Bytes Read=63636163
15/10/19 22:00:45 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1 In addition: Warning message:
running command '/hadoop-2.3.0/bin/hadoop jar /hadoop-2.3.0/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -D "stream.map.input=typedbytes" -D "stream.map.output=typedbytes" -D "stream.reduce.input=typedbytes" -D "stream.reduce.output=typedbytes" -D "mapreduce.map.java.opts=-Xmx400M" -D "mapreduce.reduce.java.opts=-Xmx400M" -files "/Users/SETUPC~1/AppData/Local/Temp/RtmpQ9MVgC/rmr-local-env10f0780c2119,/Users/SETUPC~1/AppData/Local/Temp/RtmpQ9MVgC/rmr-global-env10f03b794070,/Users/SETUPC~1/AppData/Local/Temp/RtmpQ9MVgC/rmr-streaming-map10f06b4f59ee,/Users/SETUPC~1/AppData/Local/Temp/RtmpQ9MVgC/rmr-streaming-reduce10f054f5e9e" -input "/tmp/file10f08e55037" -output "/tmp/file10f03d086dcc" -mapper "Rscript --vanilla ./rmr-streaming-map10f06b4f59ee" -reducer "Rscript --vanilla ./rmr-streaming-reduce10f054f5e9e" -inputformat "org.apache.hadoop.streaming.AutoInputFormat" -outputformat "o [... truncated]
Answer 0 (score: 0):
If you are referring to the kmeans code in the package's tests directory, it was never really meant for data this wide, and it is not clear that k-means is the right choice when each row has that many columns. With k centers, D dimensions and P points, you are fitting k*D parameters to P points; if D and P are of similar magnitude, I don't think that is a statistically sound procedure. Even if I am wrong about that, the data is partitioned by row, so there is no scalability in the number of columns, and you would need to look into a different algorithm. It is also unclear how large your target data really is: 300 MB is not really mapreduce size. As for the error itself, this kind of memory problem usually happens because each container assigns all of its memory to the Java process, leaving nothing for the R process. See help("hadoop.settings").
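Since the question also asked for example code: the sketch below is only an illustration of how backend.parameters can carry extra Hadoop -D options, either globally or per job. The property names (mapreduce.map.memory.mb, mapreduce.map.java.opts, etc.) are standard Hadoop 2.x/YARN settings, but the concrete values (2048 MB container, 1024 MB JVM heap) are assumptions I am making to leave headroom for the Rscript child process that the log shows being killed; tune them to what your machine can actually spare, and note that whether rmr.options() accepts backend.parameters depends on your rmr2 version (it can always be passed to mapreduce() directly).

    library(rmr2)

    ## Sketch only: raise the YARN container size for mappers/reducers while
    ## keeping the JVM heap modest, so memory is left for the R process.
    ## The 2048/1024 values are illustrative assumptions, not recommendations.
    mem.params <- list(
      hadoop = list(
        D = "mapreduce.map.memory.mb=2048",
        D = "mapreduce.map.java.opts=-Xmx1024m",
        D = "mapreduce.reduce.memory.mb=2048",
        D = "mapreduce.reduce.java.opts=-Xmx1024m"))

    ## Either set once for the session (recent rmr2 versions) ...
    rmr.options(backend.parameters = mem.params)

    ## ... or pass the same list to an individual job (map.fun/reduce.fun
    ## and the input path are placeholders for your own MRkmeans step):
    ## out <- mapreduce(input = "/tmp/bigmatrix",
    ##                  map = map.fun, reduce = reduce.fun,
    ##                  backend.parameters = mem.params)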