Guide to Maintenance and Troubleshooting of Servers from DELL Inc.®

From help.aarenet.com


Note: The features and/or parameters listed in this article may not be available from your telephone service provider.






Introduction

The VoIP Switch Administrator and/or server service personnel will find information here on DELL server maintenance and troubleshooting:

  • Best practice when a hardware (HW) problem is indicated
  • Server monitoring with "DELL OpenManage Server Administrator (OMSA)" and the "Xymon monitor"
  • Checking and indication of hardware problems
  • Procedure for replacing defective HW parts with DELL
  • Treating server hardware problems
  • Treating RAID and hard-disk (HD) problems


Warning

The instructions given in this document can jeopardize the server functionality!

Depending on the server's role within the VoIP Switch, the telephony service can be endangered! The "VoIP Switch Supplier" cannot accept any responsibility for errors made by the personnel carrying out these instructions.

If in doubt, contact the "VoIP Switch Supplier Support"!






Best Practice When a Hardware (HW) Problem Is Indicated

It is assumed that a server hardware problem has been indicated by some source, e.g.:

  • Monitor log
  • Alerting email
  • SNMP trap
  • System engineer observation
  • etc.


Best Practice
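  1. Access the server's "OpenManage Server Administrator (OMSA)" GUI
     
  2. Check the server's hardware problem (see "Check the Server's Hardware Status")
     
  3. Prepare documentation for a ticket with DELL support (see "Check the Type of Server and Service Tags" and "Get the Server's Log Data")
     
  4. Organize the hardware part replacement if needed (see "Procedure for Replacing Defect HW Parts with DELL")
     
  5. Treat the hardware problem (see "Treating Server Hardware Problems" and "Treating RAID and Hard-Disk Problems")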






Server Monitoring



Manual Server Monitoring With DELL's "Server Administrator (OMSA)"

DELL OpenManage Server Administrator (OMSA) is a software agent that provides a comprehensive, one-to-one systems management solution in two ways: from an integrated, Web browser-based graphical user interface (GUI) and from a command line interface (CLI) through the operating system.


Note

This chapter gives just enough information to be dangerous!

If in doubt, contact the "DELL Support" or the "VoIP Switch Supplier Support".






Access the "OpenManage Server Administrator (OMSA)"

Connect with any Web browser to the server's "OpenManage Server Administrator (OMSA)" GUI:

  1. Enter the following URI:
    https://<IP_ADDRESS>:1311
    Example:
    https://172.100.100.100:1311
  2. Log in with the credentials of the user "root":
    • Username: root
    • Password: the server root password
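
The URI pattern above can be sketched as a small helper, for example to generate links for a list of servers. This is only an illustration: the function name `omsa_url` is hypothetical, and the port 1311 is simply the one given in the URI above.

```python
import ipaddress

OMSA_PORT = 1311  # port taken from the URI pattern in this section


def omsa_url(ip_address: str) -> str:
    """Return the OMSA GUI URL for a server IP address (hypothetical helper)."""
    # ip_address() raises ValueError if the input is not a valid IP address
    ip = ipaddress.ip_address(ip_address)
    # IPv6 literals must be wrapped in brackets inside a URL
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"https://{host}:{OMSA_PORT}"


print(omsa_url("172.100.100.100"))
```

Validating the address first avoids opening a malformed link when the IP was mistyped.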





Check the Type of Server and Service Tags

Access the server's "OpenManage Server Administrator (OMSA)" GUI.

Check the server type:

  • In the OMSA home page menu bar at the top the server type is listed, e.g.: "PowerEdge620"
or
  • Menu "System" → Tab "Properties" → Tab "Summary"


Check the Service Tag:

  • Menu "System" → Tab "Properties" → Tab "Summary"
In frame "Main System Chassis" the Service Tag is displayed, e.g. : 47X....
In frame "Main System Chassis" the "Express Service Code" is displayed, e.g. : 9187....





Check the Server's Hardware Status

Access the server's "OpenManage Server Administrator (OMSA)" GUI.

Check the Server's Hardware Status:

  • Menu "System" → Tab "Properties" → Tab "Health"
  • Click "Main System Chassis"
The status of all server hardware components is displayed and can be checked in detail.





Check the Server's RAID and Hard-Disk (HD) Status

Access the server's "OpenManage Server Administrator (OMSA)" GUI.

Check the RAID Controller Type:

  • Menu "System" → Tab "Properties" → Tab "Health"
  • Click "Storage"
In frame "RAID Controller(s)" the RAID controller type is displayed, e.g. : "PERC 6/i integrated"


Check the RAID Controller Status:

  • Menu "System" → Tab "Properties" → Tab "Health"
  • Click "Storage"
In frame "RAID Controller(s)" the name and status of the RAID are displayed, e.g.: "Virtual Disk 0 RAID-1"





Check the Hard-Disk (HD) Replication Status

Access the server's "OpenManage Server Administrator (OMSA)" GUI.

Check the Hard-Disk (HD) Status:
Navigate through the tree on the left:

  • Menu "Storage" → Menu "PERC ..." → Menu "Connector ..." → Menu "Enclosure ..." → Menu "Physical Disks ..."
Check the disk state: Column "State"

States:

  • Online: The disk is online and working productively in the RAID. The replication is working.
  • Ready: The disk is ready for integration into a RAID. The replication is not active.
  • Rebuilding: The disk is currently being integrated into the RAID. The progress is displayed in %.


If there is an indication of a hard-disk replication problem, see chapter "Treating RAID and Hard-Disk Problems" for further maintenance actions.
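
The three disk states described above can be summarized in a small sketch that decides whether a set of physical disks needs attention. The function name and the return values "ok", "rebuilding" and "check" are hypothetical. The state names are those shown in the OMSA column "State" as listed in this section, and any other state is treated as a reason to check.

```python
# Meaning of the disk states as described in this section
DISK_STATE_MEANING = {
    "Online": "Disk works in the RAID, replication is working",
    "Ready": "Disk is ready for integration, replication is not active",
    "Rebuilding": "Disk is being integrated into the RAID",
}


def replication_summary(states: list[str]) -> str:
    """Summarize the physical disk states of one server (hypothetical helper)."""
    unknown = [s for s in states if s not in DISK_STATE_MEANING]
    # No disks, an unknown state or a disk outside the RAID: look closer
    if not states or unknown or "Ready" in states:
        return "check"
    # All disks are in the RAID, but at least one is still being rebuilt
    if "Rebuilding" in states:
        return "rebuilding"
    return "ok"


print(replication_summary(["Online", "Rebuilding"]))
```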





Get the Server's Log Data

Access the server's "OpenManage Server Administrator (OMSA)" GUI.

Get the OMSA log:

  • Menu "System" → Tab "Logs"
  • Save the "Embedded System Management (ESM) Log" on the server:
Click "Save As" and follow the instructions
  • Copy the saved ESM log file to the support directory of the case





Server Monitoring by Xymon

The VoIP Switch default monitor Xymon is described in "VoIP Switch Monitoring"





Indication of a Server Hardware Defect

Indication "Xymon Event":
The monitor log, email or SNMP trap may contain the following information:

Indication:
<HOST_NAME> "snmptrapd" "failure"



Description:
The server reports a hardware failure, for example:

  • Failed power module
  • Failed main board
  • Failed RAID controller
  • Failed hard-disk
  • Any other hardware problem


Consequences:

Warning

It may be a SEVERE server condition that must be immediately investigated and treated!


→ For the VoIP Switch telephony service:

  • Depends on the VoIP Switch components running on the server

→ For the operations:

  • Depends on the VoIP Switch components running on the server

→ For the user:

  • Depends on the VoIP Switch components running on the server


Solution:
The server must be repaired or exchanged.


Action:

  1. Check the details on the server with the "Server Administrator (OMSA)"
  2. Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"
  3. Repair the server:
    • Fix main board
    • Fix RAID controller
    • Replace defective or worn-out batteries
    • Fix fan
    • Fix RAM modules
    or
    • Handle hardware problems that can be fixed while the server is running (hot), e.g. a defective power module or hard-disk





Indication of a Server Hard-Disk or RAID Controller Problem

Indication "Xymon Event":
The monitor log, email or SNMP trap may contain the following information:

Indication:
<HOST_NAME> "snmptrapd" "degraded"



Description:
The server indicates a problem with the virtual disk:

  • Failed RAID controller
  • Failed hard-disk
  • Failed hard-disk replication


Consequences:

Warning

SEVERE server condition that must be immediately investigated and treated!


→ For the VoIP Switch telephony service:

  • Depends on the VoIP Switch components running on the server

→ For the operations:

  • Depends on the VoIP Switch components running on the server

→ For the user:

  • Depends on the VoIP Switch components running on the server


Solution:
The RAID controller must be repaired or a hard-disk exchanged.


Action:

  1. Check the details on the server with the "Server Administrator (OMSA)"
  2. Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"
  3. Repair the server (see "Default Process for Fixing RAID Problems")
    or
    • Handle hardware problems that can be fixed while the server is running (hot), e.g. a defective hard-disk
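
The two Xymon indications described in this document, "failure" and "degraded", can be told apart with a small sketch like the one below. It assumes that the indication arrives as a single text line in exactly the form shown above (host name, then the quoted words). The real message layout of a Xymon installation may differ. The function name and the host name in the example are hypothetical.

```python
import shlex

# Keyword in the indication -> section of this guide that describes it
SECTION_BY_KEYWORD = {
    "failure": "Indication of a Server Hardware Defect",
    "degraded": "Indication of a Server Hard-Disk or RAID Controller Problem",
}


def classify_indication(line: str):
    """Return (host, section) for a known indication line, else None."""
    try:
        parts = shlex.split(line)  # removes the double quotes around each word
    except ValueError:  # e.g. unbalanced quotes
        return None
    if len(parts) != 3 or parts[1] != "snmptrapd":
        return None
    host, _, keyword = parts
    section = SECTION_BY_KEYWORD.get(keyword)
    return (host, section) if section else None


print(classify_indication('voip-srv1 "snmptrapd" "degraded"'))
```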





Procedure for Replacing Defect HW Parts with DELL

The procedure for exchanging defective hardware (HW) of DELL servers differs from country to country and may also change from time to time.

The following basic procedure for HW exchange seems more or less stable:

  1. Detect the HW problem
  2. Have the DELL server details ready:
    • Server type
    • Service Tag number or the "Express Service Code"
    • Check the warranty period of the server
  3. Report the problem to DELL support
    • DELL will analyze the case and request more information if needed
  4. DELL will organize and send the replacement part
  5. The VoIP Switch Administrator has to organize the replacement of the part
    Usually this has to be done within 1 - 3 working days
  6. The VoIP Switch Administrator has to prepare the defective part for return to DELL
    • Do not dispose of the defective part!
    Either the defective part will be picked up at the location or it has to be sent back to DELL.





Treating Server Hardware Problems

The VoIP Switch Administrator and/or server service personnel will find instructions here for managing HW defects.





Default Process for Fixing Hardware Problems

Indication:

  • Xymon event, by email and/or SNMP trap
  • The provider's system monitoring indicates no access to the server
  • Server Administrator (OMSA): Displays the error condition
  • Server Display: The server front display is yellow and indicates the error condition
  • Server Console: The server doesn't respond to console input


Description:
Any hardware problem.
Most probably:

  • Defective main board
  • Defective RAID controller
  • Defective or worn-out batteries
  • Defective fan
  • Defective power module


Note

The telephony service for the customers is not endangered as long as only one server fails!
It becomes disastrous if both LoadBalancer servers or all ServiceCenter servers stop working.
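
The rule in this note can be written down as a small check. This is a sketch only: the function name and its parameters are hypothetical, the number of two LoadBalancer servers is taken from the note, and the number of ServiceCenter servers depends on the installation.

```python
def telephony_service_at_risk(lb_failed: int, sc_failed: int,
                              sc_total: int, lb_total: int = 2) -> bool:
    """True if all LoadBalancer or all ServiceCenter servers have failed."""
    if sc_total < 1 or lb_total < 1:
        raise ValueError("at least one server of each kind is expected")
    # The service survives while one server of each kind is still working
    return lb_failed >= lb_total or sc_failed >= sc_total


# One failed ServiceCenter server out of three: the service keeps running
print(telephony_service_at_risk(lb_failed=0, sc_failed=1, sc_total=3))
```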



Consequences:

Warning

It may be a SEVERE server condition that must be immediately investigated and treated!


→ For the VoIP Switch telephony service:

  • Depends on the VoIP Switch components running on the server
  • If a ServiceCenter server fails the capability of concurrent connection handling may decline.


→ For the operations:

  • Depends on the VoIP Switch components running on the server


→ For the user:

  • Depends on the VoIP Switch components running on the server


Solution:
The server must be repaired or exchanged.


Action:
Analyze the situation and organize spare parts:

  1. Check the details on the server with the "Server Administrator (OMSA)"
  2. Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"


If the defect prevents the server from functioning properly, prepare the VoIP Switch operation:

  1. Disable Xymon Alarming
  2. Stop provider alarming
  3. Gracefully pre-bar the VoIP Switch component


Repair the server:
If the main board or RAID controller has to be replaced, follow the special instructions in "Fix Defect Main Board or RAID Controller".


If the power module or a hard-disk has to be replaced, see "Fix Defect Power Module" or "Fix Defect Hard Disk".


Warning: For the following actions the server casing has to be opened!


The effects of EMC must be considered and the appropriate precautions must be taken to prevent further hardware damage.


  1. Shut down and power off the server if the part has to be replaced on the main board
  2. Repair the server → Follow the server manufacturer's instructions!


Return the server to its normal working state:

  1. Start the server (if needed):
    → This automatically starts the VoIP Switch components!
  2. Checks:
    1. Check the server status with "Server Administrator (OMSA)"
    2. Check in the ConfigCenter that all VoIP Switch components on the server are ok:
      ConfigCenter GUI → Menu "System" → Menu "Components"
    3. Check that the Xymon monitor doesn't show any error


If the VoIP Switch doesn't return to normal telephony service operation:

  1. Investigate what is wrong and solve it
  2. Contact the "VoIP Switch Supplier Support" for help with setting up the server and recovering the missing VoIP Switch functionality


Enable the alarming again:

  1. Enable Xymon Alarming
  2. Start provider alarming





Fix Defect Main Board or RAID Controller

See section "Default Process for Fixing Hardware Problems" for the general description of the problem.


Actions:

Repair the server:

  1. Shut down and power off the server if the part has to be replaced on the main board
  2. Repair the server hardware → Follow the server manufacturer's instructions
  3. Connect a VGA monitor to the console port of the server


If the RAID controller was repaired, there will still be a RAID problem. Continue at "Default Process for Fixing RAID Problems", Case 2.


If the main board was repaired continue here:

  1. Insert the original hard-disk 1 in bay 0 (do not insert the hard-disk 2 yet)


Return the server to its normal working state:

  1. Power on and start the server
    → This automatically starts the VoIP Switch components!
  2. Checks:
    1. Check the console output on the VGA monitor for any exceptions displayed during the BIOS boot
      → If booting gets stuck during virtual hard disk initialization (RAID controller), check for replication issues (see "Default Process for Fixing RAID Problems").
    2. Check the server status with "Server Administrator (OMSA)"
    3. Check in the ConfigCenter that all VoIP Switch components on the server are ok:
      ConfigCenter GUI → Menu "System" → Menu "Components"
    4. Check that the Xymon monitor doesn't show any error:
      → After some time all supervised objects should turn green, except for the missing hard-disk 2


If the VoIP Switch doesn't return to normal telephony service operation:

  1. Investigate what is wrong and solve it
  2. Contact the "VoIP Switch Supplier Support" for help with setting up the server and recovering the missing VoIP Switch functionality


When the server and the telephony service are working correctly again then:

  1. Insert the original hard-disk 2 in bay 1


Enable the alarming again:

  1. Enable Xymon Alarming
  2. Start provider alarming





Fix Defect Power Module

Indication:

  • Xymon event, by email and/or SNMP trap
  • Server Administrator (OMSA): Displays the error condition
  • Server Display: The server front display is yellow and indicates the error condition


Description:
Defect power module


Consequences:

Note

This error condition must be checked and treated within a reasonable time!


→ For the VoIP Switch telephony service:

  • No immediate consequences
  • The server is running on only one power module


→ For the operations:

  • No immediate consequences


→ For the user:

  • No immediate consequences


Solution:
The power module must be replaced


Actions:

Analyze the situation and organize spare parts:

  1. Check the details on the server with the "Server Administrator (OMSA)"
  2. Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"


If the defect prevents the server from functioning properly, prepare the VoIP Switch operation:

  1. Disable Xymon Alarming
  2. Stop provider alarming


Replace the power module:

  1. Remove the defective power module (hot plug-out possible)
  2. Insert the new power module (hot plug-in possible)
  3. Connect the power cord


Return the server to its normal working state:

  1. Checks:
    1. Check the server status with "Server Administrator (OMSA)"
    2. Check that the Xymon monitor doesn't show any error


If the server doesn't return to normal operation:

  1. Investigate what is wrong and solve it
  2. Contact the "VoIP Switch Supplier Support" for help with recovering the server


Enable the alarming again:

  1. Enable Xymon Alarming
  2. Start provider alarming





Treating RAID and Hard-Disk Problems

All servers of the VoIP Switch run a RAID type 1, which mirrors the contents of the two installed hard-disks. The "RAID controller" manages the replication between the two hard-disks.


Several conditions may interrupt the hard-disk replication and/or degrade the RAID virtual disk:

  • Main board defect
  • RAID controller defect
  • Hard-disk defect


As a consequence, the server either does not run at all or runs on only one hard-disk. The good news is that as long as one hard-disk is running, the server will work as expected.
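
The behavior of the two-disk mirror described above can be sketched as follows. The function name and the return values "mirrored", "degraded" and "down" are hypothetical labels chosen for this illustration, not OMSA status texts.

```python
def raid1_server_state(disk_operative: tuple[bool, bool]) -> str:
    """Describe a two-disk RAID 1 from the operative flags of its disks."""
    operative = sum(disk_operative)  # True counts as 1, False as 0
    if operative == 2:
        return "mirrored"  # both disks work, replication possible
    if operative == 1:
        return "degraded"  # server still works, but without a mirror
    return "down"  # no operative disk left


print(raid1_server_state((True, False)))
```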


Note

These types of defects have to be fixed as fast as possible!






Fix Defect Hard Disk

Indication:

  • Xymon event, by email and/or SNMP trap
  • Server Administrator (OMSA): Displays the error condition
  • Server Display: The server front display is yellow and indicates the error condition


Description:
Defect hard-disk


Consequences:

Note

This error condition must be checked and treated within a reasonable time!


→ For the VoIP Switch telephony service:

  • No immediate consequences
  • The server is running on only one hard-disk


→ For the operations:

  • No immediate consequences


→ For the user:

  • No immediate consequences


Solution:
The hard-disk must be replaced


Actions:

Analyze the situation and organize spare parts:

  1. Check the details on the server with the "Server Administrator (OMSA)"
  2. Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"


If the defect prevents the server from functioning properly, prepare the VoIP Switch operation:

  1. Disable Xymon Alarming
  2. Stop provider alarming


Replace the hard-disk:

  1. Remove the defective hard-disk (hot plug-out possible)
  2. Insert the new hard-disk (hot plug-in possible):
    → If the hard-disk is brand-new, the replication starts immediately
    → If the hard-disk was already used, the replication may not start automatically. In that case follow the instructions at "Default Process for Fixing RAID Problems", Case 1.


Return the server to its normal working state:

  1. Checks:
    1. Check that the hard-disk replication is in progress
    2. Check the server status with "Server Administrator (OMSA)"
    3. Check that the Xymon monitor doesn't show any error


If the server doesn't return to normal operation:

  1. Investigate what is wrong and solve it
  2. Contact the "VoIP Switch Supplier Support" for help with setting up the hard-disk replication


Enable the alarming again:

  1. Enable Xymon Alarming
  2. Start provider alarming




Default Process for Fixing RAID Problems

Indication:

  • Xymon event, by email and/or SNMP trap
  • The provider's system monitoring may indicate no access to the server
  • Server Administrator (OMSA): Displays the error condition
  • Server Display: The server front display is yellow and indicates the error condition
  • Server Console: The server may not respond to console input


Description:
Any hardware problem.
Most probably:

  • Defective RAID controller
  • Defective hard-disk


Consequences:

Warning

It may be a SEVERE server condition that must be immediately investigated and treated!


→ For the VoIP Switch telephony service:

  • Depends on the VoIP Switch components running on the server
  • If a ServiceCenter server fails the capability of concurrent connection handling may decline.


→ For the operations:

  • Depends on the VoIP Switch components running on the server


→ For the user:

  • Depends on the VoIP Switch components running on the server


Solution:
The server must be repaired or exchanged.


Action:

A) Analyze the degraded situation and organize spare parts:

  1. Check the details on the server with the "Server Administrator (OMSA)"
  2. Check the VoIP Switch documentation for the server type and used RAID controller
  3. Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"


B) If the defect prevents the server from functioning properly, prepare the VoIP Switch operation:

  1. Disable Xymon Alarming
  2. Stop provider alarming
  3. Gracefully pre-bar the VoIP Switch component


C) Evaluate the repair case for DELL RAID controller type: PERC5 / PERC 6 / H310 Mini / H320 Mini / H330 Mini:


Case 1: "One Hard-Disk Defect"
Precondition:
  • Main board is ok
  • RAID controller is ok
  • 1 operative hard-disk is ok
  • Server is still operative within the VoIP Switch
  • The replacement hard-disk has the same form factor and capacity


To-Do:
  1. Remove the defective hard-disk (hot plug-out is no problem)
  2. Insert the new hard-disk (hot plug-in is no problem), either:
    • a brand-new hard-disk
    • an already used spare hard-disk
  3. Check the hard-disk replication status
→ If the replication did not start automatically, start the replication manually (see "Manually Restart the Hard-Disk Replication")!


Case 2: "Main Board or RAID Controller Defect"
Precondition:
  • The main board or RAID controller has been repaired as described above
  • 2 operative hard-disks are ok
  • Server is shut down
  • All Ethernet patch cables are disconnected from the server GB ports
  • A VGA monitor, USB keyboard and mouse are connected to the console port of the server


To-Do:
  1. Insert the original hard-disk 1 in bay 0 (do not insert the hard-disk 2 yet)
  2. Power up the server
  3. Check the console output on the VGA monitor:
    During the BIOS startup the following message may be displayed:
    Foreign configuration(n) found on adapter.
    Press any key … or 'F' to import foreign configuration and continue.
  4. If requested, press key F on the keyboard!
    Note:
    If you miss pressing F, restart the BIOS boot by pressing the keys [Ctrl Alt Delete]; otherwise the server boot stops after the BIOS startup.
  5. Check the console output on the VGA monitor:
    A confirmation prompt may be displayed that allows you to stop the procedure:
    All of the disk from your previous configuration are gone. If this is an unexpected message ...
  6. Do not press any key!
    Note:
    If no key is pressed, the RAID controller takes over the hard-disk as part of its new virtual disk.
     
    → Wait until the server has booted!
     
  7. Insert the original hard-disk 2 in bay 1
  8. Check the hard-disk replication status
    Note:
    It is very probable that the replication did not start automatically!
    Then:
    At Menu "Storage" a yellow warning triangle is displayed
    Upon click on "Storage" the status is displayed:
    Virtual Disk 0: degraded
→ If the replication did not start automatically, start the replication manually (see "Manually Restart the Hard-Disk Replication")!
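
The preconditions of the two cases above can be condensed into a small decision helper. This is a simplified sketch: the class and field names are hypothetical, and it looks only at three of the listed preconditions (whether the main board or RAID controller was repaired, the number of operative hard-disks, and whether the server is still operative).

```python
from dataclasses import dataclass


@dataclass
class RepairSituation:
    board_or_controller_repaired: bool  # main board or RAID controller was repaired
    operative_disks: int                # number of operative hard-disks
    server_operative: bool              # server still works within the VoIP Switch


def select_case(s: RepairSituation) -> str:
    """Return "Case 1", "Case 2" or "other" for a repair situation."""
    if (not s.board_or_controller_repaired
            and s.operative_disks == 1 and s.server_operative):
        return "Case 1"  # one hard-disk defect
    if (s.board_or_controller_repaired
            and s.operative_disks == 2 and not s.server_operative):
        return "Case 2"  # main board or RAID controller defect
    return "other"


print(select_case(RepairSituation(False, 1, True)))
```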


For all other cases:


D) Return the server to its normal working state:

  1. If needed, connect all Ethernet patch cables to the correct server GB ports
  2. Checks:
    1. Check the server status with "Server Administrator (OMSA)"
    2. Check in the ConfigCenter that all VoIP Switch components on the server are ok:
      ConfigCenter GUI → Menu "System" → Menu "Components"
    3. Check that the Xymon monitor doesn't show any error


E) If the VoIP Switch doesn't return to normal telephony service operation:

  1. Investigate what is wrong and solve it
  2. Contact the "VoIP Switch Supplier Support" for help with setting up the server and recovering the missing VoIP Switch functionality


F) Enable the alarming again:

  1. Enable Xymon Alarming
  2. Start provider alarming





Manually Restart the Hard-Disk Replication

In this situation the RAID's virtual disk is in the state "degraded" (only one hard-disk is operative, but two are expected). The RAID controller automatically grabs a free "hot spare" hard-disk, associates it with its degraded virtual disk and starts the replication.


Restart the hard-disk replication manually:

  1. Connect with any Web browser to the server's "Server Administrator (OMSA)" GUI:
    • Log in as user "root"
     
  2. The foreign RAID configuration has to be deleted from the inserted 2nd hard-disk:
    → Menu "Storage" → Menu "PERC xxxxx"
    → Select at [ Available Task ]: "Clear Foreign Configuration"
    → Click button [ Execute ]
    → Confirm the security check: click button [ Clear ]
     
  3. The inserted 2nd hard-disk has to be declared as "hot spare":
    → Menu "Storage" → Menu "PERC xxxxx" → "Connector 0" → Menu "Enclosure (Backplane)" → Menu "Physical Disks"
    → Select at [ Available Task ]: "Assign Global Hot Spare"
    → Click button [ Execute ]
     
  4. Check the virtual disk replication state:
    → Column "State"


If the hard-disk replication is not starting then contact the appropriate DELL Support or the "VoIP Switch Supplier Support".
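
The GUI steps above also have counterparts in the OMSA command line interface. The sketch below only builds the command strings and does not execute anything. It assumes the `omconfig` and `omreport` storage syntax as documented in DELL's OMSA CLI guide, which should be verified against the installed OMSA version before use. The function name, the controller ID 0 and the physical disk ID "0:0:1" are hypothetical examples.

```python
def replication_restart_commands(controller: int, pdisk: str) -> list[str]:
    """Build the CLI commands matching steps 2 to 4 (assumed OMSA syntax)."""
    return [
        # Step 2: clear the foreign RAID configuration on the controller
        f"omconfig storage controller action=clearforeignconfig "
        f"controller={controller}",
        # Step 3: declare the inserted hard-disk as global hot spare
        f"omconfig storage pdisk action=assignglobalhotspare "
        f"controller={controller} pdisk={pdisk} assign=yes",
        # Step 4: show the virtual disk state
        f"omreport storage vdisk controller={controller}",
    ]


for command in replication_restart_commands(0, "0:0:1"):
    print(command)
```

Printing the commands first allows a review by a second person before anything is run on a productive server.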






© Aarenet Inc 2018

Version: 3.0     Author:  Aarenet     Date: May 2017