How to reduce the size of the logs created by sqoop export

Date: 2017-11-14 00:16:59

Tags: sql-server hadoop sqoop

Is there a way to control the size of the logs created by sqoop export? I am trying to export a series of parquet files from a hadoop cluster to microsoft sql server, and I have found that after a certain point in the mapper jobs, progress becomes very slow or freezes. Looking at the hadoop Resourcemanager, my current theory is that the logs from the sqoop job fill up to a size that causes the process to freeze.

I'm new to hadoop, so any insight would be appreciated. Thanks.

Update

Looking at the syslog output in the resourcemanager web interface for one of the frozen map tasks of the sqoop jar application, the log output looks like this:

2017-11-14 16:26:52,243 DEBUG [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 8758 8840 ]
2017-11-14 16:26:52,243 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: reading next wrapped RPC packet
2017-11-14 16:26:52,243 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 sending #280
2017-11-14 16:26:52,243 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.security.SaslRpcClient: wrapping token of length:751
2017-11-14 16:26:52,246 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: unwrapping token of length:62
2017-11-14 16:26:52,246 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 got value #280
2017-11-14 16:26:52,246 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
2017-11-14 16:26:55,252 DEBUG [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 8758 8840 ]
2017-11-14 16:26:55,252 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: reading next wrapped RPC packet
2017-11-14 16:26:55,252 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 sending #281
2017-11-14 16:26:55,252 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.security.SaslRpcClient: wrapping token of length:751
2017-11-14 16:26:55,254 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: unwrapping token of length:62
2017-11-14 16:26:55,255 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 got value #281
2017-11-14 16:26:55,255 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 3

Additionally, after letting the process run all day, it appears that the sqoop job does eventually complete, but it takes a very long time (around 4 hours for roughly 500 MB of .tsv data).

1 Answer:

Answer 0 (score: 0)

In response to the title of the posted question, the way to control the log output of a sqoop command is either to edit the log4j.properties file in the $HADOOP_HOME/etc/hadoop directory (since sqoop apparently uses it to inherit its logging properties, although from what I can tell this may not actually be the case), or to pass generic arguments with the -D prefix on the sqoop invocation, for example:

sqoop export \
    -Dyarn.app.mapreduce.am.log.level=WARN \
    -Dmapreduce.map.log.level=WARN \
    -Dmapreduce.reduce.log.level=WARN \
    --connect "$connectionstring" \
    --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
    --table $tablename \
    --export-dir /tmp/${tablename^^}_export \
    --num-mappers 24 \
    --direct \
    --batch \
    --input-fields-terminated-by '\t'
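
For the first approach, here is a minimal sketch of what the log4j.properties edit might look like; the exact property names vary between Hadoop versions and distributions, so treat it as an illustration rather than a drop-in config. It raises the default logger threshold from INFO to WARN and quiets the IPC/SASL DEBUG chatter that dominates the syslog excerpt above:

# $HADOOP_HOME/etc/hadoop/log4j.properties (illustrative excerpt)
# Raise the default root logger threshold from INFO to WARN
hadoop.root.logger=WARN,console
log4j.rootLogger=${hadoop.root.logger}

# Quiet the IPC/SASL DEBUG output seen in the frozen map task's syslog
log4j.logger.org.apache.hadoop.ipc=WARN
log4j.logger.org.apache.hadoop.security=WARN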

However, my initial theory from the body of the post, that the logs from the sqoop job filling up were causing the process to freeze, does not appear to hold. With the settings above, the log size for the map tasks dropped to 0 bytes in the resourcemanager ui, but the system still ran at a good rate until a certain percentage and then slowed to a crawl.
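
As a side note on the literal question of log size (rather than log level): assuming a Hadoop 2.x MapReduce setup, the mapreduce.task.userlog.limit.kb property caps the size of each task attempt's log, and it can be passed the same way as the log-level settings above. A hedged, untested sketch:

# Hypothetical variant of the export command that also caps each task
# attempt's log at roughly 10 MB (the value is in KB; 0 means no cap)
sqoop export \
    -Dmapreduce.task.userlog.limit.kb=10240 \
    -Dmapreduce.map.log.level=WARN \
    --connect "$connectionstring" \
    --table $tablename \
    --export-dir /tmp/${tablename^^}_export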