Posted by: Mudassir Ali | May 31, 2010

Information to Gather for Failover Root Cause

Information Necessary to perform Root Cause Analysis on Unity Failover

The information needed to troubleshoot why a Unity server running version 4.x, 5.x, or 7.x failed over:

Mandatory: Should be included when opening a TAC SR.

Desired: Follow-up information that should be collected once the mandatory information has been collected.

Unity failover normally occurs for a few possible reasons:

  1. Call received on the secondary server
  2. The secondary server loses communication with the primary server (primary does not respond for 30 seconds)
  3. Failover is manually initiated using Failover Monitor
  4. Port Lockup
  5. SQL replication failures
  6. Slow MAPI interaction with Exchange

Mandatory information to get:

  1. GUSI Cab files from both servers. This will tell us why it failed over.
  2. For full root cause analysis, the CCM and SDL traces from all the nodes in the cluster are necessary (if Unity is integrated with CM). Traces should span at least 30 minutes prior to the failure. Refer to this link for CM trace collection –

Set Up Cisco CallManager Traces for Cisco Technical Support
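As a rough aid for working out which period the CM traces must cover, the 30-minute rule above can be sketched in Python. The function name `trace_window` and the example date are illustrative assumptions, not part of any Cisco tool.

```python
from datetime import datetime, timedelta


def trace_window(failure_time, lead_minutes=30):
    """Return (start, end) of the period the traces should cover.

    lead_minutes defaults to 30 because traces should span at least
    30 minutes prior to the failure; pass a larger value to be safe.
    """
    start = failure_time - timedelta(minutes=lead_minutes)
    return start, failure_time


# Hypothetical failover time, used only for illustration.
failover = datetime(2010, 5, 31, 14, 20)
start, end = trace_window(failover)
print("Collect traces from", start, "to", end)
```

Collecting a wider window than the minimum does no harm, so it is reasonable to round the start time down.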

For a more detailed root cause analysis the following traces will be needed:

* Unity Diagnostic Tool Traces: AvCsMgr, AvCsNodeMgr, svchost from BOTH servers

These traces need to be configured in advance; otherwise they may not contain sufficient information.

Configuring the traces

On the Unity server, go into Cisco Unity Tools Depot.

* Expand the section “Diagnostic Tools”

* Double-click on “Unity Diagnostic Tool”. This will open the Unity Diagnostic Tool.

* In the right pane, click on “Reset to Default Traces”. This will launch a wizard.

* Check the box beside “Reset to Default Traces” and then click Finish.

You will return to the Unity Diagnostic Tool

* Click on “Configure Macro Traces”. This will launch a wizard.

* Click Next. This will take you to the “Configure Macro Traces” screen where all of the components are listed.

Select:

* Call Flow Diagnostics

* Conversation State Traces

* Call Control (Miu) Traces

Click on Next, Finish.

* Click on “Configure Micro Traces”. This will launch a wizard.

* Click Next. This will take you to the “Configure Micro Traces” screen where all of the components are listed.

* Do not uncheck any traces that are already selected

Place checkmarks beside the following on the Micro page:

* CDE – all

* Conversations – all

* Doh – all

* Malex – all

* MiuCall – 10,11

* MiuGeneral – 12, 13, 14, 16

* MiuMethods – 13 through 15

* NodeMgr – 10-18

* Skinny – all but keep alive

* Click on Next, Finish.

* Close Unity Diagnostic Tool.
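Because the same traces have to be enabled on both servers, it can help to keep the selections above as a small checklist. The sketch below only records the settings listed in this post and prints them; it does not talk to the Unity Diagnostic Tool, and the names `MACRO_TRACES`, `MICRO_TRACES`, and `format_checklist` are made up for illustration.

```python
# Macro traces to select in the "Configure Macro Traces" wizard.
MACRO_TRACES = [
    "Call Flow Diagnostics",
    "Conversation State Traces",
    "Call Control (Miu) Traces",
]

# Micro traces to select in the "Configure Micro Traces" wizard.
# Values are either a description or the explicit trace levels.
MICRO_TRACES = {
    "CDE": "all",
    "Conversations": "all",
    "Doh": "all",
    "Malex": "all",
    "MiuCall": [10, 11],
    "MiuGeneral": [12, 13, 14, 16],
    "MiuMethods": list(range(13, 16)),  # 13 through 15
    "NodeMgr": list(range(10, 19)),     # 10 through 18
    "Skinny": "all but keep alive",
}


def format_checklist():
    """Return one printable checklist line per trace setting."""
    lines = ["[ ] Macro: " + name for name in MACRO_TRACES]
    for component, levels in MICRO_TRACES.items():
        if isinstance(levels, list):
            levels = ", ".join(str(level) for level in levels)
        lines.append("[ ] Micro: " + component + " - " + levels)
    return lines


for line in format_checklist():
    print(line)
```

Print the checklist once per server and tick the entries off as they are enabled.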

Once the issue recurs you will need to retrieve the trace files. The first thing is to make a note of the date/time that the error occurred.

Retrieving the traces

* On the Unity server, go into Cisco Unity Tools Depot.

* Expand the section “Diagnostic Tools”

* Double-click on “Unity Diagnostic Tool”. This will open the Unity Diagnostic Tool.

* Click on “Gather Log Files”. This will launch a wizard.

* Choose “Select Logs”

* Click on “Browse” and specify a location that will be easy to find.

* Click Next. This will take you to the “Select Logs to Gather” screen.

* Select the following log files that cover the time of the error. The name of each file contains the date/time of the first timestamp in the log.

* AvCsMgr

* svchost

* AvCsNodeMgr

* Click Next. The tool will then place the log files into the folder you specified.

* Retrieve the files from the folder, zip them, and attach the zip file to the service request.
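Since each log file name carries the timestamp of its first entry, picking the files that cover the failure comes down to comparing start times. Here is a minimal Python sketch of that selection. It assumes you have already read the first timestamp of each file for one component (the exact file name format is not covered here), and the file names in the example are placeholders.

```python
from datetime import datetime


def files_covering(logs, window_start, window_end):
    """Pick the log files whose contents overlap the given time window.

    logs is a list of (file_name, first_timestamp) pairs for ONE component.
    A file is assumed to run from its first timestamp until the first
    timestamp of the next file; the newest file is treated as open-ended.
    """
    ordered = sorted(logs, key=lambda item: item[1])
    selected = []
    for index, (name, first_ts) in enumerate(ordered):
        if index + 1 < len(ordered):
            next_ts = ordered[index + 1][1]
        else:
            next_ts = datetime.max
        if first_ts <= window_end and next_ts > window_start:
            selected.append(name)
    return selected


# Placeholder file names and times, for illustration only.
logs = [
    ("log_a", datetime(2010, 5, 31, 8, 0)),
    ("log_b", datetime(2010, 5, 31, 12, 0)),
    ("log_c", datetime(2010, 5, 31, 14, 0)),
    ("log_d", datetime(2010, 5, 31, 16, 0)),
]
picked = files_covering(logs, datetime(2010, 5, 31, 13, 50), datetime(2010, 5, 31, 14, 20))
print(picked)
```

Run the same check for AvCsMgr, svchost, and AvCsNodeMgr on both servers. If in doubt, include the neighbouring files as well.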

Check for known defects that could trigger failovers –

CSCsc62081 – CCM receives SDL OOS and triggers the failover – CM issue.

CSCsi50517 – Failover: SQL replication jobs can fail and provide no warning – Unity SQL.

CSCsc62073 – Locations Out of Bandwidth causes unexpected Unity Failover – CM issue

CSCsb23638 – Unity port may ring-no-answer after not receiving StartMediaTransmit – TSP issue.

CSCse00439 – disconnect right before supervised transfer may lead to failover – TSP issue.

CSCsi65508 – Unity TSP port failback detection fails – CM/TSP.

CSCsh35344 – failed xfer initiate may cause delay in clearing port – TSP issue.

CSCse43664 – Supervised xfer cleanup may result in delay answering next call – TSP issue.

CSCsj13401 – Unity failover results from OpenSSL errors – TSP issue.

CSCsc91972 – Poor Exchange performance can cause Unity to stop answering calls.

Failover events reference –

Failover Guide
