Hadoop尝试在AWS多区域配置中写入Cassandra

时间:2014-05-15 17:45:02

标签: hadoop amazon-web-services amazon-ec2 cassandra

我在AWS中运行多DC Cassandra(开源,而不是DSE)群集,其中一个DC(us-west-2)设置用于分析,另一个(us-east)是事务存储。我使用NetworkTopologyStrategy和EC2 snitch,以及我的Hadoop配置中的LOCAL_ONE的一致性级别。 Hadoop 可以在没有问题的情况下从Cassandra中读取,但尝试写入会产生超时异常

运行nodetool status表示已正确配置DC:

Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Owns   Host ID                               Token                                    Rack
UN  x.x.x.x       1.01 GB     9.9%   9e7f4393-7ac9-4559-b3ff-de48be50016f  -9127921345534057723                     2a
UN  x.x.x.x       1001.16 MB  11.4%  d0760383-c3dd-474c-9261-239b71dba3f1  -9221279003374097975                     2b
UN  x.x.x.x       1.05 GB     11.7%  3f09fbf5-0d85-4283-9009-0ec0e29223c0  -9140104347498952504                     2c
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Owns   Host ID                               Token                                    Rack
UN  x.x.x.x       1.1 GB     11.3%  5bbd2de4-e1d2-4a17-9f40-034f60b35954  -9061054426204373981                     1b
UN  x.x.x.x       1.15 GB    11.5%  e34c590e-6176-45b2-a8f9-18b4a9a80032  -9216519687724118609                     1c
UN  x.x.x.x       1.18 GB    10.9%  fa0b0a1a-f156-40fc-a267-970d1eb9cddb  -9207673937991303291                     1a
UN  x.x.x.x       1.46 GB    10.7%  b18ae406-c9ec-42b7-a365-b0c6e2fe582f  -9206671929961171506                     1a
UN  x.x.x.x       1.13 GB    11.4%  1ac9c1c5-55ad-4048-b1ba-3b9768933ecc  -9146100851344467112                     1c
UN  x.x.x.x       1.53 GB    11.2%  dad665bb-68d9-4811-b421-f33333261867  -9178920986366339267                     1b

使用ColumnFamilyOutputFormat:

进行堆栈跟踪
java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:224)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:215)
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 4 more

...并使用CqlOutputFormat:

java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 4 more

两条痕迹最终都指向AbstractColumnFamilyOutputFormat.createAuthenticatedClient(host, port, conf)

然后我打开了该源并向异常添加了一些细节,因此它将输出它连接的主机名,从而产生了这个跟踪:

java.io.IOException: java.lang.Exception: Unable to connect to host [hostname]
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
Caused by: java.lang.Exception: Unable to connect to host [hostname]
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:139)
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:124)
    ... 1 more
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 4 more

问题是[hostname]是一台不在分析群集中的机器(它位于我们东部)。为什么它不能自动地知道这一点,特别是当读取工作正常时?无论DC如何,它似乎都在尝试环中的所有节点。

对于记录,使用CqlOutputFormatColumnFamilyOutputFormat和使用CqlStorageCassandraStorage的Pig通过写入失败。

2 个答案:

答案 0 :(得分:0)

我会说,尝试将cassandra.yaml中的write_request_timeout_in_ms设置为某个非常高的数字,看看是否有帮助。节点本身可能存在问题,当它仍然显示为正在响应时没有响应。如果它仍然超时,请重新启动您怀疑导致该问题的节点上的服务。

答案 1 :(得分:0)

这个问题归结为两件事:

  1. 对于多区域EC2设置,Cassandra要求将broadcast_address设置为公共IP,将listen_address设置为内部IP。在大多数情况下,你会希望rpc_address是内部IP,但这可能会破坏Cassandra的Hadoop客户端,该客户端根据broadcast_address确定要与之通信的端点。

  2. Cassandra的Hadoop客户端(特别是RingCache)不尊重节点发现的数据中心,并试图发现环中的所有节点 - 包括非本地节点。它尊重实际写入的一致性级别,但在我们的情况下,由于#1,它从未到达那里。

  3. 我提交了一张票并提交了一个补丁来解决这些问题:

    https://issues.apache.org/jira/browse/CASSANDRA-7252