I have a .NET 3.5 C# application that sends 2000-6000 byte packets to a Linux machine running SLES 10. The machines are on the same subnet.
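About 90% of the time everything works fine: the Linux machine processes my request and responds within 5-15 ms. But about 10% of the time there is a delay of roughly 200 ms to 800 ms.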
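Looking at the logs on the Linux machine, the delay seems to be on my end. That is, if my call to socket.Send(...) returns at 1:15:00.000 and I get the response at 1:15:00.210, the logs on the Linux machine show that it received the request at 1:15:00.200 and then processed it within 10 ms. (I'm using System.Diagnostics.Stopwatch for timing on my machine.)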
To debug this, I captured the traffic with Wireshark. Here is the traffic. The 600 ms delay occurs between packets 8 and 9. (137.34.210.108 is my machine, 137.34.210.95 is the Linux machine.)
"1","11:56:27.380318","137.34.210.95","137.34.210.108","TCP","20700 > 17479 [PSH, ACK] Seq=1 Ack=1 Win=32767 Len=76"
"2","11:56:27.380393","HewlettP_29:37:0f","Broadcast","ARP","Who has 137.34.210.95? Tell 137.34.210.108"
"3","11:56:27.380558","HewlettP_29:39:93","HewlettP_29:37:0f","ARP","137.34.210.95 is at 00:1b:78:29:39:93"
"4","11:56:27.380564","137.34.210.108","137.34.210.95","TCP","17479 > 20700 [ACK] Seq=1 Ack=77 Win=65459 [TCP CHECKSUM INCORRECT] Len=0"
"5","12:04:48.096892","HewlettP_29:37:0f","Broadcast","ARP","Who has 137.34.210.95? Tell 137.34.210.108"
"6","12:04:48.097216","HewlettP_29:39:93","HewlettP_29:37:0f","ARP","137.34.210.95 is at 00:1b:78:29:39:93"
"7","12:04:48.097229","137.34.210.108","137.34.210.95","TCP","17480 > 20600 [PSH, ACK] Seq=1 Ack=1 Win=64198 [TCP CHECKSUM INCORRECT] Len=458"
"8","12:04:48.097457","137.34.210.95","137.34.210.108","TCP","20600 > 17480 [ACK] Seq=1 Ack=4294964377 Win=32767 Len=0 SLE=1 SRE=459"
"9","12:04:49.700966","137.34.210.108","137.34.210.95","TCP","17479 > 20700 [ACK] Seq=1 Ack=77 Win=65459 [TCP CHECKSUM INCORRECT] Len=1460"
"10","12:04:49.701190","137.34.210.108","137.34.210.95","TCP","[TCP Retransmission] 17480 > 20600 [ACK] Seq=4294964377 Ack=1 Win=64198 [TCP CHECKSUM INCORRECT] Len=1460"
"11","12:04:49.703970","137.34.210.95","137.34.210.108","TCP","20600 > 17480 [ACK] Seq=1 Ack=4294965837 Win=32767 Len=0 SLE=1 SRE=459"
"12","12:04:49.703993","137.34.210.108","137.34.210.95","TCP","[TCP Retransmission] 17480 > 20600 [ACK] Seq=4294965837 Ack=1 Win=64198 [TCP CHECKSUM INCORRECT] Len=1460"
"13","12:04:49.704002","137.34.210.108","137.34.210.95","TCP","[TCP Retransmission] 17480 > 20600 [PSH, ACK] Seq=1 Ack=1 Win=64198 [TCP CHECKSUM INCORRECT] Len=458"
"14","12:04:49.704211","137.34.210.95","137.34.210.108","TCP","20600 > 17480 [ACK] Seq=1 Ack=459 Win=32767 Len=0"
"15","12:04:49.704215","137.34.210.95","137.34.210.108","TCP","[TCP Dup ACK 14#1] 20600 > 17480 [ACK] Seq=1 Ack=459 Win=32767 Len=0 SLE=1 SRE=459"
"16","12:04:49.705425","137.34.210.95","137.34.210.108","TCP","20700 > 17479 [PSH, ACK] Seq=77 Ack=1461 Win=32767 Len=44"
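Can someone help me explain this? I can see that retransmissions are happening, but I'm not sure why. The switch shows no dropped packets. And even if packets were being dropped, why would the retransmission take 600 ms?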
I thought this (http://support.microsoft.com/kb/328890) might be related to the 200 ms delay, but I tried changing TcpAckFrequency and it didn't help.
Thanks, Mike
Answer 0 (score: 3)
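Let's start by trimming some of the Wireshark output. We can discard the ARP traffic in packets 2, 3, 5 and 6. Looking at the rest, you have two separate flows. Packets 8 and 9 belong to two different connections, so you can't compare them. However, packets 7, 8 and 10 are part of the same connection, so let's examine that one.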
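Packet 7 is 458 bytes of data sent to the Linux box with TCP sequence number 1. However, the ACK the Linux box returns in packet 8 is 4294964377. Wireshark is displaying relative sequence numbers, so that value has wrapped around: 4294964377 is 2^32 - 2919, i.e. 2920 bytes before sequence number 1. In other words, the Linux box is not ACKing packet 7; it is still acknowledging an earlier point in the stream. The SACK block in packet 8 (SLE=1 SRE=459) shows that it did receive packet 7's 458 bytes, but the 2920 bytes in front of them are missing. Your PC then waits for a further ACK and, when it doesn't get one, retransmits the missing data: two 1460-byte segments (packets 10 and 12) covering those 2920 bytes, followed by the 458 bytes from packet 7 again (packet 13). That is why the sequence number in packet 10 matches the ACK in packet 8.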
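Unfortunately, this doesn't tell you why the data was dropped. In packet 8 the Linux box advertises that it still has a full 32 KB receive buffer available for this connection ("Win=32767").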
Answer 1 (score: 0)
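The capture only shows the TCP packets, but on the Linux machine I'd suggest using the 'netstat -s' command to look at the IP statistics. One possible cause of retransmissions is a socket buffer overflow, which this command will show.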
Answer 2 (score: 0)
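I don't remember whether Windows has it, but on UNIX you would enable TCP_NODELAY.
This disables TCP's Nagle algorithm, which makes the system wait a short time in case more data is added to the send buffer.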
int nodelay = 1; /* nonzero = disable Nagle's algorithm */
setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay)); /* s is a TCP socket */