Question

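Since upgrading from Service Fabric SDK 2.0.135 to 2.3.301, we have been seeing Service Fabric actors or services become unreachable even though Service Fabric Explorer shows them as healthy. Once in this state, any call to the actor or service through ActorProxy or ServiceProxy hangs for 5 minutes and then throws a TimeoutException. The actor or service never recovers on its own, even if left alone for an hour. The only fixes are to restart the node hosting the actor or service, redeploy the actor or service (the exact same EXE), reset the entire cluster, or reboot all of the cluster machines.

An SF application usually enters this state right after it is deployed or redeployed.

In the past year of using Service Fabric (starting with SDK v1.3), we never ran into this problem. It started only after we moved to 2.3.301.

It appears to be random and inconsistent. Which of the 13 SF applications in our solution is affected is also random.

Does anyone have any idea how we can resolve this? It looks like a bug in the latest version of Service Fabric, but perhaps we are doing something wrong on our end.

Any help is appreciated.

Below is some additional information that I hope helps in understanding what we are running into.

Many thanks.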

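Steps

I don't really have steps that reproduce the problem consistently. This is what I have observed on some occasions:

I compile and then redeploy my SF project from Visual Studio (Debug -> Start Without Debugging)
Visual Studio reports that the project deployed successfully
Service Fabric Explorer shows all of my services as Healthy, including Data-Binding
The SF project in question has 2 actors that live in a single EXE. Service Fabric Explorer shows each actor running on a different node.
Windows Task Manager shows two running copies of the EXE, which makes sense because two nodes are running it.

Likewise, our QA engineer hit the problem after deploying directly to Azure with PowerShell. (He did not deploy from Visual Studio.)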

Recap

Visual Studio reports a successful deployment
Service Fabric Explorer shows everything as healthy
Task Manager shows two running copies of the EXE

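When I see the failure

I have one SF service calling another SF service using the ServiceProxy or ActorProxy class. We do this throughout our solution, which comprises 13 different applications and about 25 different services and actors. It has worked successfully since we started using Service Fabric SDK v1.3 in November 2015.

Now, after upgrading to 2.3.301, a random actor or service periodically enters a state in which it fails to respond to method calls made through ServiceProxy or ActorProxy. After hanging for 5 minutes, the call throws a System.TimeoutException with the following message:

This can happen if message is dropped when service is busy or its long running operation and taking more time than configured Operation Timeout.

Note that the service is not busy and does not perform any long-running operation. Being an actor, it does no ongoing work at all. It simply exposes public methods that other services can call. It fails from the very first call.

In fact, tracing shows us that not even the first line of the actor's method is ever invoked. It is as if the Service Fabric communication infrastructure is unable to deliver the message.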
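
While the root cause is open, one way to keep a caller from being stuck for the full 5 minutes is to bound the wait on the client side. The following is only a minimal sketch using plain .NET tasks; `ProxyCallGuard` and `WithTimeout` are hypothetical names, not part of the Service Fabric SDK, and the sketch does not make the unreachable actor respond. It only surfaces the failure sooner.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: not part of the Service Fabric SDK.
public static class ProxyCallGuard
{
    // Waits for the call to finish, but gives up after the given limit.
    // The underlying remoting call is NOT cancelled; it keeps running
    // in the background until Service Fabric's own timeout expires.
    public static async Task<T> WithTimeout<T>(Task<T> call, TimeSpan limit)
    {
        Task finished = await Task.WhenAny(call, Task.Delay(limit)).ConfigureAwait(false);
        if (finished != call)
        {
            throw new TimeoutException(
                $"No response within {limit.TotalSeconds:F0} seconds.");
        }

        // The call has completed; awaiting it returns the result or rethrows its exception.
        return await call.ConfigureAwait(false);
    }
}
```

A call site might look like `await ProxyCallGuard.WithTimeout(proxy.RenderAsync(request), TimeSpan.FromSeconds(30))`, where `proxy.RenderAsync` stands in for whatever method the ServiceProxy or ActorProxy interface actually exposes.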

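When it started

For the past 12 months, we never saw this problem.

Now, since upgrading Service Fabric last week, we see it frequently and under a variety of conditions.

We upgraded to Service Fabric SDK 2.3.301.9590 and Service Fabric 5.3.301.9590.

At first, each developer on the team ran into the problem independently, and each of us assumed it was just a transient issue on our own machine. Service Fabric does have its glitches, so we accepted it and moved on. But then we started complaining to one another and realized we were all seeing it. Even our QA engineer sees it in the environment that is about to go to production.

Again, this started only when we upgraded to the latest version of Service Fabric last week.

Previously, we were running Service Fabric SDK 2.0.135.

We upgraded our code base by installing SDK v2.3.301, opening each of our solutions, and letting Visual Studio perform the upgrade.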

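Environment

I am running a fresh install of Windows 10 Enterprise (installed less than two weeks ago) on an i7 with 16 GB of RAM. I did a fresh install of Visual Studio 2015 Update 3 and SF 2.3.301.9590. Everything was installed clean. Nothing was upgraded.

This also happens on all of my coworkers' machines (of varying age, configuration, and "freshness"). It happens to each of us occasionally.

Most critically, this also happens on Service Fabric VMs on Azure. These are machines our QA engineer created a month ago using the standard template for Service Fabric VMs on Azure. They came with 5.3.301.9590 preinstalled. He did not manually install any updates to Service Fabric. Our SF-based applications did not have this problem on Azure (or on our own development machines) until the developers upgraded to the new version.

This is not just my machine, and it is not just the development environment. The only consistent change across all of us is the updated SF version.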

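Cause

We don't know what is causing this.

It usually happens right after a new SF application is deployed. Yes, we do wait the 2 to 3 minutes that SF usually needs to finish after a deployment. We have left it for an hour or more, and it never starts working.

Interestingly, I think I had one SF service that was working fine and then suddenly stopped working, but that was before we realized there was a problem, so I wasn't looking for it. I can't be sure.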

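Workarounds

Once an SF service is in the "unreachable" state, Service Fabric never brings it back out of that state. The application is completely unusable. With varying degrees of success, we have tried the following:

Redeploying the unreachable SF application
Restarting the node that hosts the unreachable SF service or actor (in Service Fabric Explorer, go to the node, click the ellipsis button, and choose the "Restart" option)
Restarting the entire SF cluster (stop, then start)
Rebooting all machines that run SF nodes
Resetting the entire cluster and redeploying everything (a last resort, but it has been necessary a few times)

Interestingly, killing the offending process with Task Manager does not help. If I kill the offending process, Service Fabric restarts it (as expected), but it still fails to respond to messages.

So the problem appears to lie with Service Fabric itself rather than with the EXE.

Of course, none of these are really "solutions", because they make the entire application unreachable until SF can restart and rebalance. Even restarting a few nodes takes a bunch of things offline.

Basically, this is a blocker for us. We cannot possibly put our application into production (or even beta) with Service Fabric behaving like this.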

C# exception when using ServiceProxy or ActorProxy:

JSON rendering of the exception thrown by ActorProxy or ServiceProxy:

"exception": {
    "ClassName": "System.TimeoutException",
    "Message": "This can happen if message is dropped when service is busy or its long running operation and taking more time than configured Operation Timeout.",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "   at Microsoft.ServiceFabric.Services.Communication.Client.ServicePartitionClient`1.<InvokeWithRetryAsync>d__7`1.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.ServiceFabric.Services.Remoting.Client.ServiceRemotingPartitionClient.<InvokeAsync>d__8.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.ServiceFabric.Services.Remoting.Builder.ProxyBase.<InvokeAsync>d__0.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.ServiceFabric.Services.Remoting.Builder.ProxyBase.<ContinueWithResult>d__7`1.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()\r\n   at RenderingCachingEngine.RenderingCachingEngine.<Render>d__10.MoveNext() in C:\\Code\\Ink\\Dev\\Current\\Source\\Rendering Service Fabric\\RenderingCachingEngine\\RenderingCachingEngine.cs:line 381",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\nMoveNext\nMicrosoft.ServiceFabric.Services, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35\nMicrosoft.ServiceFabric.Services.Communication.Client.ServicePartitionClient`1+<InvokeWithRetryAsync>d__7`1\nVoid MoveNext()",
    "HResult": -2146233083,
    "Source": "Microsoft.ServiceFabric.Services",
    "WatsonBuckets": null
  }

Here is the JSON rendering of the Service Fabric info:

  "serviceFabricInfo": {
    "serviceFabricServiceName": "fabric:/Rendering/RenderingCachingEngine",
    "serviceFabricServiceTypeName": "RenderingCachingEngineType",
    "serviceFabricReplicaId": 131225099453058851,
    "serviceFabricPartitionId": "e400087d-8a08-4dab-bcdd-1f5ce82f374f",
    "serviceFabricApplicationName": "fabric:/Rendering",
    "serviceFabricApplicationTypeName": "RenderingType",
    "serviceFabricNodeName": "_Node_4"
  }

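Event Viewer logs on redeploy

Windows Event Viewer does show some noteworthy logs under "Applications and Services Logs -> Microsoft-Service Fabric -> Admin".

The following logs were recorded when I redeployed an updated version of the application (note that DataBinding.exe is the name of the EXE that contains my two SF actors):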

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:38:53 PM
Event ID:      256
Task Category: Common
Level:         Error
Keywords:      Default
User:          NETWORK SERVICE
Computer:      shayward10.ovx.local
Description:
WriteNode failed. HRESULT=-2147467259, Output=CustomOutput
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>256</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>1</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:38:53.678587200Z" />
    <EventRecordID>7620</EventRecordID>
    <Correlation />
    <Execution ProcessID="4440" ThreadID="7360" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-20" />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">XmlLiteWriter</Data>
    <Data Name="text">WriteNode failed. HRESULT=-2147467259, Output=CustomOutput</Data>
  </EventData>
</Event>

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:38:54 PM
Event ID:      23073
Task Category: Hosting
Level:         Warning
Keywords:      Default
User:          SYSTEM
Computer:      shayward10.ovx.local
Description:
ServiceHostProcess: DataBinding.exe for ApplicationId 805915c7-456c-49d3-af95-62cc44650664 terminated unexpectedly with exit code 3221225786 on node id bf865279ba277deb864a976fbf4c200e
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>23073</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>90</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:38:54.820567800Z" />
    <EventRecordID>7621</EventRecordID>
    <Correlation />
    <Execution ProcessID="6944" ThreadID="3812" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="id">bf865279ba277deb864a976fbf4c200e</Data>
    <Data Name="AppId">805915c7-456c-49d3-af95-62cc44650664</Data>
    <Data Name="ReturnCode">3221225786</Data>
    <Data Name="ProcessName">DataBinding.exe</Data>
  </EventData>
</Event>

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:38:56 PM
Event ID:      256
Task Category: Common
Level:         Error
Keywords:      Default
User:          NETWORK SERVICE
Computer:      shayward10.ovx.local
Description:
WriteNode failed. HRESULT=-2147467259, Output=CustomOutput
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>256</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>1</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:38:56.261857600Z" />
    <EventRecordID>7627</EventRecordID>
    <Correlation />
    <Execution ProcessID="4440" ThreadID="8564" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-20" />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">XmlLiteWriter</Data>
    <Data Name="text">WriteNode failed. HRESULT=-2147467259, Output=CustomOutput</Data>
  </EventData>
</Event>
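
The numeric codes in these logs and in the exception JSON are easier to look up in hexadecimal. The snippet below is a small illustrative sketch (the class and method names are made up for this example) that converts them. The resulting values correspond to the standard Windows and .NET codes E_FAIL (0x80004005), STATUS_CONTROL_C_EXIT (0xC000013A), and COR_E_TIMEOUT (0x80131505). What they imply about the root cause is not established here.

```csharp
using System;

// Illustrative helper for reading the codes in the logs above.
public static class StatusCodes
{
    // HRESULTs are logged as signed 32-bit integers; reinterpret as unsigned for hex.
    public static string HResultToHex(int hresult) =>
        "0x" + unchecked((uint)hresult).ToString("X8");

    // Process exit codes are logged as unsigned decimal values.
    public static string ExitCodeToHex(uint exitCode) =>
        "0x" + exitCode.ToString("X8");

    public static void Main()
    {
        Console.WriteLine(HResultToHex(-2147467259)); // "WriteNode failed" HRESULT: 0x80004005
        Console.WriteLine(ExitCodeToHex(3221225786)); // DataBinding.exe exit code: 0xC000013A
        Console.WriteLine(HResultToHex(-2146233083)); // TimeoutException HResult: 0x80131505
    }
}
```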

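Event Viewer logs on timeout

Once the service is in the unreachable state, trying to call it produces the following log for every request (after the 5-minute wait):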

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:44:55 PM
Event ID:      44289
Task Category: FabricTransport
Level:         Warning
Keywords:      Default
User:          NETWORK SERVICE
Computer:      shayward10.ovx.local
Description:
Error While Sending Message : FABRIC_E_TIMEOUT
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>44289</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>173</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:44:55.349048200Z" />
    <EventRecordID>7629</EventRecordID>
    <Correlation />
    <Execution ProcessID="18600" ThreadID="8076" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-20" />
  </System>
 <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">ServiceCommunicationClient</Data>
    <Data Name="text">Error While Sending Message : FABRIC_E_TIMEOUT</Data>
  </EventData>
</Event>

Answer 1

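This problem can occur in two cases.

If processing in your ActorService method takes longer than the default timeout, you need to change the OperationTimeout value. The default is 5 minutes. To change the timeout, add the assembly-level attribute FabricTransportServiceRemotingProviderAttribute to your client assembly.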

https://msdn.microsoft.com/en-us/library/microsoft.servicefabric.services.remoting.fabrictransport.fabrictransportserviceremotingproviderattribute.aspx

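If it is not the first case, you can try the following mitigation for a known bug.
- Specify port 0 for the Actor Service endpoint in the Service Manifest. By default, the actor endpoint is listed in the ServiceManifest, but no port is specified.

This is how the ActorService endpoint looks after you make the change.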

<Endpoint Name="Actor1ActorServiceEndpoint" Port="0" />

We are aware of this issue and are working on a fix.

Answer 2

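In case it helps anyone else: we saw these timeouts on long-running operations (longer than 5 minutes). Following Suchi's tip about FabricTransportServiceRemotingProviderAttribute, we added the lines below to the SF project's AssemblyInfo.cs to raise the timeout to 1 hour.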

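using Microsoft.ServiceFabric.Actors.Remoting.FabricTransport;
using Microsoft.ServiceFabric.Services.Remoting.FabricTransport;
[assembly: FabricTransportServiceRemotingProvider(OperationTimeoutInSeconds = 3600)]
[assembly: FabricTransportActorRemotingProvider(OperationTimeoutInSeconds = 3600)]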

(Also note that if you use Azure Service Bus, the maximum lock duration is 5 minutes, so you must implement some lock-renewal code to support long-running operations.)

After upgrading to SDK 2.3.301, a Service Fabric Actor or Service becomes unreachable
