Maintenance and Troubleshooting Guide for DELL Inc ® Servers
Note | The features and/or parameters listed in this article may not be available from your telephone service provider. |
Introduction
Here the VoIP Switch Administrator and server service personnel find information on DELL server maintenance and troubleshooting:
- Best practice when a hardware (HW) problem is indicated
- Server monitoring with "DELL OpenManage Server Administrator (OMSA)" and the "Xymon monitor"
- Checking and identifying hardware problems
- Procedure for replacing defective HW parts with DELL
- Treating server hardware problems
- Treating RAID and hard-disk (HD) problems
Warning |
The instructions given in this document can jeopardize the server functionality! Depending on the server's role within the VoIP Switch, the telephony service can be endangered! The "VoIP Switch Supplier" cannot accept any responsibility for mistakes made by the executing personnel. If in doubt, contact the "VoIP Switch Supplier Support"! |
Contents
- 1 Best Practice When a Hardware HW Problem is Indicated
- 2 Server Monitoring
- 3 Procedure for Replacing Defect HW Parts with DELL
- 4 Treating Server Hardware Problems
- 5 Treating RAID and Hard-Disk Problems
Best Practice When a Hardware HW Problem is Indicated
It is assumed that a server hardware problem has been indicated by some source, e.g.:
- Monitor log
- Alerting email
- SNMP trap
- System engineer observation
- etc.
Best Practice |
|
Server Monitoring
Manual Server Monitoring With DELL's "Server Administrator (OMSA)"
DELL OpenManage Server Administrator (OMSA) is a software agent that provides a comprehensive, one-to-one systems management solution in two ways: from an integrated, Web browser-based graphical user interface (GUI) and from a command line interface (CLI) through the operating system.
Note |
This chapter gives just enough information to be dangerous! If in doubt, contact the "DELL Support" or the "VoIP Switch Supplier Support". |
Access the "OpenManage Server Administrator (OMSA)"
Connect with any Web browser to the server's "OpenManage Server Administrator (OMSA)" GUI:
- Enter the following URI:
- https://<IP_ADDRESS>:1311
- Example:
- https://172.100.100.100:1311
- Enter the login credentials of user "root":
- Username: root
- Password: the server root password
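Before opening the browser it can help to verify that the OMSA web port is reachable at all. The following is a minimal Python sketch that assumes only what is stated above (HTTPS on port 1311); the helper names `omsa_url` and `omsa_port_open` are illustrative and not part of OMSA:

```python
import socket

OMSA_PORT = 1311  # OMSA web GUI port as given in this guide


def omsa_url(ip_address: str, port: int = OMSA_PORT) -> str:
    """Build the OMSA GUI address for a server."""
    return f"https://{ip_address}:{port}"


def omsa_port_open(ip_address: str, port: int = OMSA_PORT,
                   timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the OMSA port succeeds."""
    try:
        with socket.create_connection((ip_address, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat all as "not reachable".
        return False
```

`omsa_url("172.100.100.100")` yields the example address shown above. `omsa_port_open` only tests TCP reachability; it does not log in and does not validate the server certificate.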
Check the Type of Server and Service Tags
Access the server's "OpenManage Server Administrator (OMSA)" GUI.
Check the server type:
- In the OMSA home page menu bar at the top the server type is listed, e.g.: "PowerEdge620"
- or
- Menu "System" → Tab "Properties" → Tab "Summary"
Check the Service Tag:
- Menu "System" → Tab "Properties" → Tab "Summary"
- In frame "Main System Chassis" the Service Tag is displayed, e.g. : 47X....
- In frame "Main System Chassis" the "Express Service Code" is displayed, e.g. : 9187....
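The two identifiers are related: on DELL systems the Express Service Code is commonly the Service Tag read as a base-36 number (digits 0-9, letters A-Z) and written in decimal. The sketch below relies on that assumption; the function name is illustrative, and the result should always be compared with the value OMSA displays:

```python
def service_tag_to_express_code(service_tag: str) -> int:
    """Interpret a Service Tag as a base-36 number and return it in decimal."""
    tag = service_tag.strip().upper()
    if not tag.isascii() or not tag.isalnum():
        raise ValueError(f"unexpected characters in Service Tag: {service_tag!r}")
    return int(tag, 36)
```

The examples above are consistent with this relation: any seven-character tag starting with 47X maps to a value between 9185819904 and 9187499519.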
Check the Server's Hardware Status
Access the server's "OpenManage Server Administrator (OMSA)" GUI.
Check the Server's Hardware Status:
- Menu "System" → Tab "Properties" → Tab "Health"
- Click "Main System Chassis"
- The status of all server hardware components is displayed and can be checked in detail.
Check the Server's RAID and Hard-Disk (HD) Status
Access the server's "OpenManage Server Administrator (OMSA)" GUI.
Check the RAID Controller Type:
- Menu "System" → Tab "Properties" → Tab "Health"
- Click "Storage"
- In frame "RAID Controller(s)" the RAID controller type is displayed, e.g. : "PERC 6/i integrated"
Check the RAID Controller Status:
- Menu "System" → Tab "Properties" → Tab "Health"
- Click "Storage"
- In frame "RAID Controller(s)" the name and status of the RAID are displayed, e.g.: "Virtual Disk 0 RAID-1"
Check the Hard-Disk HD Replication Status
Access the server's "OpenManage Server Administrator (OMSA)" GUI.
Check the Hard-Disk HD Status:
Navigate through the left navigation tree:
- Menu "Storage" → Menu "PERC ..." → Menu "Connector ..." → Menu "Enclosure ..." → Menu "Physical Disks ..."
- Check the disk state: Column "State"
States:
- Online:
- The disk is online and working productively in the RAID. The replication is working.
- Ready:
- The disk is ready for integration into a RAID. The replication is not active.
- Rebuilding:
- The disk is currently being integrated into the RAID. The progress is displayed in %.
If there is an indication of a hard-disk replication problem, see chapter "Treating RAID and Hard-Disk Problems" for further maintenance actions.
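The same state information can be evaluated by script, for example on text captured from the OMSA command line (`omreport storage pdisk controller=0`, where the controller ID 0 is only an example). The sketch below assumes the report consists of `Key : Value` lines and that each disk block starts with an `ID` line; this layout and the function names are assumptions and should be checked against the real output of the installed OMSA version:

```python
from __future__ import annotations

# Meaning of the "State" column, as described in this guide.
STATE_MEANING = {
    "Online": "member of the RAID, replication working",
    "Ready": "not part of a RAID yet, replication not active",
    "Rebuilding": "being integrated into the RAID, replication in progress",
}


def parse_disks(report: str) -> list[dict[str, str]]:
    """Split a 'Key : Value' style report into one dict per disk."""
    disks: list[dict[str, str]] = []
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        key, value = key.strip(), value.strip()
        if key == "ID":
            disks.append({})  # assumption: an ID line opens a new disk block
        if disks:
            disks[-1][key] = value
    return disks


def describe(state: str) -> str:
    """Return the meaning of a disk state according to this guide."""
    return STATE_MEANING.get(state, "state not described in this guide")


def disks_needing_attention(disks: list[dict[str, str]]) -> list[str]:
    """Return the IDs of all disks whose state is not 'Online'."""
    return [d.get("ID", "?") for d in disks if d.get("State") != "Online"]
```

A disk reported by `disks_needing_attention` is not necessarily defective: "Rebuilding" simply means the replication has not finished yet.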
Get the Server's Log Data
Access the server's "OpenManage Server Administrator (OMSA)" GUI.
Get the OMSA log:
- Menu "System" → Tab "Logs"
- Save the "Embedded System Management (ESM) Log" on the server:
- Click "Save As" and follow the instructions
- Copy the saved ESM Log file to the support directory of the case
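The copy step can be scripted so that every case directory receives a consistently named log file. This is a sketch only; the naming scheme, the paths and the function name `archive_esm_log` are assumptions, not conventions prescribed by OMSA or the VoIP Switch:

```python
import shutil
from datetime import date
from pathlib import Path


def archive_esm_log(saved_log: Path, case_dir: Path, host: str) -> Path:
    """Copy a saved ESM log into the case directory under a dated name."""
    if not saved_log.is_file():
        raise FileNotFoundError(saved_log)
    case_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <host>_esmlog_<YYYYMMDD><original suffix>
    target = case_dir / f"{host}_esmlog_{date.today():%Y%m%d}{saved_log.suffix}"
    shutil.copy2(saved_log, target)  # copy2 keeps the file timestamps
    return target
```

The original file stays untouched, so the log can still be attached to a DELL support request from its original location.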
Server Monitoring by Xymon
The VoIP Switch default monitor Xymon is described in "VoIP Switch Monitoring"
Indication of a Server Hardware Defect
Indication "Xymon Event":
The monitor log, email or SNMP trap may contain the following information:
Indication: |
<HOST_NAME> "snmptrapd" "failure" |
Description:
The server indicates a hardware failure, e.g.:
- Failed power module
- Failed main board
- Failed RAID controller
- Failed hard-disk
- Any other hardware problem
Consequences:
Warning |
It may be a SEVERE server condition that must be immediately investigated and treated! |
→ For the VoIP Switch telephony service:
- Depends on the VoIP Switch components running on the server
→ For the operations:
- Depends on the VoIP Switch components running on the server
→ For the user:
- Depends on the VoIP Switch components running on the server
Solution:
The server must be repaired or exchanged.
Action:
- Check the details on the server with the "Server Administrator (OMSA)"
- Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"
- Directly from DELL support
- Contact the "VoIP Switch Supplier Support"
- Repair the server:
- Default process for hardware problems that require shutting down the server, e.g.:
- Fix main board
- Fix RAID controller
- Replace defective or worn-out batteries
- Fix fan
- Fix RAM modules
- or
- Process for hardware problems that can be fixed hot, see "Fix Defect Power Module" and "Fix Defect Hard Disk"
Indication of a Server Hard-Disk or RAID Controller Problem
Indication "Xymon Event":
The monitor log, email or SNMP trap may contain the following information:
Indication: |
<HOST_NAME> "snmptrapd" "degraded" |
Description:
The server indicates a problem with the virtual disk:
- Failed RAID controller
- Failed hard-disk
- Failed hard-disk replication
Consequences:
Warning |
SEVERE server condition that must be immediately investigated and treated! |
→ For the VoIP Switch telephony service:
- Depends on the VoIP Switch components running on the server
→ For the operations:
- Depends on the VoIP Switch components running on the server
→ For the user:
- Depends on the VoIP Switch components running on the server
Solution:
The RAID controller must be repaired or a hard-disk exchanged.
Action:
- Check the details on the server with the "Server Administrator (OMSA)"
- Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"
- Directly from DELL support
- Contact the "VoIP Switch Supplier Support"
- Repair the server:
- Default process for hardware problems that require shutting down the server, see "Fix Defect Main Board or RAID Controller"
- or
- Process for hardware problems that can be fixed hot, see "Fix Defect Hard Disk"
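The two Xymon indications described in this chapter differ only in the keyword "failure" or "degraded". A first triage of incoming event texts can therefore be sketched as follows. The assumption is that the event text contains the words shown in the indications; the exact message layout of a given Xymon installation may differ, and the function name is illustrative:

```python
def classify_event(message: str) -> str:
    """Map a Xymon 'snmptrapd' event text to the matching case in this guide."""
    text = message.lower()
    if "snmptrapd" not in text:
        return "not a hardware trap"
    if "degraded" in text:
        return "hard-disk or RAID controller problem"
    if "failure" in text:
        return "server hardware defect"
    return "unknown hardware trap"
```

In both hardware cases the next step stays the same: check the details with the "Server Administrator (OMSA)" before organizing spare parts.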
Procedure for Replacing Defect HW Parts with DELL
The procedure for exchanging defective hardware (HW) of DELL servers differs from country to country and may change over time.
The following basic procedure for HW exchange has proven more or less stable:
- Detect the HW problem
- Have the DELL server details ready:
- Server Type
- Service Tag number or the "Express Service Code"
- Check the warranty period of the server
- Report the problem to DELL support
- DELL will analyze the case and request more information if needed
- DELL will organize and send the replacement part
- The VoIP Switch Administrator has to organize the replacement of the part
- Usually this has to be done within 1 - 3 working days
- The VoIP Switch Administrator has to prepare the defective part for return to DELL
- Do not dispose of the defective part!
- Either the defective part will be picked up on site or it has to be sent back to DELL.
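The details that have to be at hand before DELL support is contacted can be kept in a small record that reports what is still missing. This is an illustrative sketch; the class and field names are assumptions and do not correspond to any DELL interface:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DellSupportCase:
    """Details to collect before reporting a HW problem to DELL support."""
    server_type: str = ""
    service_tag: str = ""
    express_service_code: str = ""
    warranty_checked: bool = False

    def missing_details(self) -> list[str]:
        """List the items from the procedure above that are not yet available."""
        missing = []
        if not self.server_type:
            missing.append("server type")
        if not (self.service_tag or self.express_service_code):
            missing.append("Service Tag or Express Service Code")
        if not self.warranty_checked:
            missing.append("warranty check")
        return missing
```

An empty list from `missing_details()` means the case is ready to be reported.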
Treating Server Hardware Problems
Here the VoIP Switch Administrator and server service personnel find instructions for handling HW defects.
Default Process for Fixing Hardware Problems
Indication:
- Xymon event by email and/or SNMP trap:
- The provider's system monitoring indicates no access to the server
- Server Administrator (OMSA): Displays the error condition
- Server Display: The server front display is yellow and indicates the error condition
- Server Console: The server doesn't respond to console input
Description:
Any hardware problem.
Most probably:
- Defective main board
- Defective RAID controller
- Defective or worn-out batteries
- Defective fan
- Defective power module
Note |
The telephony service for the customers is not endangered as long as only one server fails!
|
Consequences:
Warning |
It may be a SEVERE server condition that must be immediately investigated and treated! |
→ For the VoIP Switch telephony service:
- Depends on the VoIP Switch components running on the server
- If a ServiceCenter server fails, the capacity for handling concurrent connections may decline.
→ For the operations:
- Depends on the VoIP Switch components running on the server
→ For the user:
- Depends on the VoIP Switch components running on the server
Solution:
The server must be repaired or exchanged.
Action:
Analyze the situation and organize spare parts:
- Check the details on the server with the "Server Administrator (OMSA)"
- Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"
- Directly from DELL support
- Contact the "VoIP Switch Supplier Support"
Handle the VoIP Switch operation if the defect impairs proper server functionality:
- Disable Xymon Alarming
- Stop provider alarming
- Gracefully pre-bar the VoIP Switch component
Repair the server:
If the main board or RAID controller has to be replaced, follow the special instructions in section "Fix Defect Main Board or RAID Controller".
If the power module or hard-disk has to be replaced, see sections "Fix Defect Power Module" and "Fix Defect Hard Disk".
Warning | For the following actions the server casing has to be opened!
|
- Shut down and power off the server if the part has to be replaced on the main board
- Repair the server → Follow the server manufacturer's instructions!
Return the server to its normal working state:
- Start the server (if needed):
- → This automatically starts the VoIP Switch components!
- Checks:
- Check the server status with "Server Administrator (OMSA)"
- Check in the ConfigCenter that all VoIP Switch components on the server are ok:
- ConfigCenter GUI → Menu "System" → Menu "Components"
- Check that the Xymon monitor shows no errors
If the VoIP Switch doesn't get back to normal telephony service operation:
- Investigate what is wrong and solve it
- Contact the "VoIP Switch Supplier Support" for help with setting up the server and recovering the missing VoIP Switch functionality
Enable the alarming again:
- Enable Xymon Alarming
- Start provider alarming
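The order of the steps above matters: alarming is switched off first and switched on again last. When the steps are scripted, this order can be enforced with a context manager, as in the following sketch. All step functions are hypothetical placeholders to be supplied by the operator; neither Xymon nor the VoIP Switch provides them under these names:

```python
from contextlib import contextmanager
from typing import Callable, Iterator

Step = Callable[[], None]


@contextmanager
def alarming_suppressed(disable: Step, enable: Step) -> Iterator[None]:
    """Disable alarming for the duration of the repair, then re-enable it."""
    disable()
    try:
        yield
    finally:
        enable()  # runs even if a repair step raises an exception


def run_repair(disable_alarming: Step, pre_bar: Step, repair: Step,
               verify: Callable[[], bool], enable_alarming: Step) -> bool:
    """Run the repair steps in the order given in this chapter."""
    with alarming_suppressed(disable_alarming, enable_alarming):
        pre_bar()        # gracefully pre-bar the VoIP Switch component
        repair()         # follow the server manufacturer's instructions
        return verify()  # OMSA, ConfigCenter and Xymon checks
```

Re-enabling the alarming in `finally` is a design choice of this sketch, so that alarming is never left switched off by accident. If `verify` returns False, the problem still has to be investigated as described above.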
Fix Defect Main Board or RAID Controller
See section "Default Process for Fixing Hardware Problems" for the general description of the problem.
Actions:
Repair the server:
- Shut down and power off the server if the part has to be replaced on the main board
- Repair the server hardware → Follow the server manufacturer's instructions
- Connect a VGA monitor to the console port of the server
If the RAID controller was repaired, a RAID problem will still remain: continue at "Default Process for Fixing RAID Problems", Case 2.
If the main board was repaired continue here:
- Insert the original hard-disk 1 in bay 0 (do not insert the hard-disk 2 yet)
Return the server to its normal working state:
- Power on and start the server
- → This automatically starts the VoIP Switch components!
- Checks:
- Check the console output on the VGA monitor for any exceptions displayed during the BIOS boot
- → If the boot gets stuck during virtual hard disk initialization (RAID controller), check for replication issues.
- Check the server status with "Server Administrator (OMSA)"
- Check in the ConfigCenter that all VoIP Switch components on the server are ok:
- ConfigCenter GUI → Menu "System" → Menu "Components"
- Check that the Xymon monitor shows no errors:
- → After a certain time all supervised objects should turn green, except the missing hard-disk 2
If the VoIP Switch doesn't get back to normal telephony service operation:
- Investigate what is wrong and solve it
- Contact the "VoIP Switch Supplier Support" for help with setting up the server and recovering the missing VoIP Switch functionality
When the server and the telephony service are working correctly again:
- Insert the original hard-disk 2 in bay 1
- Check with "Server Administrator (OMSA)" whether the RAID controller automatically started the hard-disk replication; if not, restart the replication manually
Enable the alarming again:
- Enable Xymon Alarming
- Start provider alarming
Fix Defect Power Module
Indication:
- Xymon event by email and/or SNMP trap:
- Server Administrator (OMSA): Displays the error condition
- Server Display: The server front display is yellow and indicates the error condition
Description:
Defective power module
Consequences:
Note |
This erroneous condition must be checked and treated within reasonable time! |
→ For the VoIP Switch telephony service:
- No immediate consequences
- The server is running with only one power module
→ For the operations:
- No immediate consequences
→ For the user:
- No immediate consequences
Solution:
The power module must be replaced
Actions:
Analyze the situation and organize spare parts:
- Check the details on the server with the "Server Administrator (OMSA)"
- Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"
- Directly from DELL support
- Contact the "VoIP Switch Supplier Support"
Handle the VoIP Switch operation if the defect impairs proper server functionality:
- Disable Xymon Alarming
- Stop provider alarming
Replace the power module:
- Remove the defective power module (hot plug-out possible)
- Insert the new power module (hot plug-in possible)
- Connect the power cord
Return the server to its normal working state:
- Checks:
- Check the server status with "Server Administrator (OMSA)"
- Check that the Xymon monitor shows no errors
If the server doesn't go back to normal operation:
- Investigate what is wrong and solve it
- Contact the "VoIP Switch Supplier Support" for help with recovering the server
Enable the alarming again:
- Enable Xymon Alarming
- Start provider alarming
Treating RAID and Hard-Disk Problems
All servers of the VoIP Switch run RAID level 1 (RAID 1), which mirrors the contents of the two installed hard-disks. The "RAID controller" manages the replication between the two hard-disks.
Several conditions may interrupt the hard-disk replication and/or degrade the RAID virtual disk:
- Main board defect
- RAID controller defect
- Hard-disk defect
As a consequence the server either does not run at all or runs with only one hard-disk.
As long as one hard-disk is running, the server will work as expected.
Note |
These types of defect have to be solved as fast as possible! |
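The relation between the two member disks and the condition of the mirrored virtual disk can be expressed compactly. In the following sketch the disk states are those shown in the OMSA column "State"; the result labels "ok" and "failed" are chosen for illustration only, while "degraded" is the term used by OMSA and Xymon:

```python
from __future__ import annotations


def virtual_disk_state(disk_states: list[str]) -> str:
    """Derive the RAID 1 virtual disk condition from its member disks."""
    online = sum(1 for state in disk_states if state == "Online")
    if online >= 2:
        return "ok"        # both mirrors in sync
    if online == 1:
        return "degraded"  # server still works, but without redundancy
    return "failed"        # no usable copy of the data
```

A "degraded" result is the reason for the urgency stated above: a second disk failure in this condition takes the server down.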
Fix Defect Hard Disk
Indication:
- Xymon event by email and/or SNMP trap:
- Server Administrator (OMSA): Displays the error condition
- Server Display: The server front display is yellow and indicates the error condition
Description:
Defective hard-disk
Consequences:
Note |
This erroneous condition must be checked and treated within reasonable time! |
→ For the VoIP Switch telephony service:
- No immediate consequences
- The server is running with only one hard-disk
→ For the operations:
- No immediate consequences
→ For the user:
- No immediate consequences
Solution:
The hard-disk must be replaced
Actions:
Analyze the situation and organize spare parts:
- Check the details on the server with the "Server Administrator (OMSA)"
- Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"
- Directly from DELL support
- Contact the "VoIP Switch Supplier Support"
Handle the VoIP Switch operation if the defect impairs proper server functionality:
- Disable Xymon Alarming
- Stop provider alarming
Replace the hard-disk:
- Remove the defective hard-disk (hot plug-out possible)
- Insert the new hard-disk (hot plug-in possible):
- → If the hard-disk is brand-new, the replication starts immediately
- → If the hard-disk was already used, the replication may not start automatically; in that case follow the instructions at "Default Process for Fixing RAID Problems", Case 1.
Return the server to its normal working state:
- Checks:
- Check that the hard-disk replication is in progress
- Check the server status with "Server Administrator (OMSA)"
- Check that the Xymon monitor shows no errors
If the server doesn't go back to normal operation:
- Investigate what is wrong and solve it
- Contact the "VoIP Switch Supplier Support" for help with setting up the hard-disk replication
Enable the alarming again:
- Enable Xymon Alarming
- Start provider alarming
Default Process for Fixing RAID Problems
Indication:
- Xymon event by email and/or SNMP trap:
- The provider's system monitoring may indicate no access to the server
- Server Administrator (OMSA): Displays the error condition
- Server Display: The server front display is yellow and indicates the error condition
- Server Console: The server may not respond to console input
Description:
Any hardware problem.
Most probably:
- Defective RAID controller
- Defective hard-disk
Consequences:
Warning |
It may be a SEVERE server condition that must be immediately investigated and treated! |
→ For the VoIP Switch telephony service:
- Depends on the VoIP Switch components running on the server
- If a ServiceCenter server fails, the capacity for handling concurrent connections may decline.
→ For the operations:
- Depends on the VoIP Switch components running on the server
→ For the user:
- Depends on the VoIP Switch components running on the server
Solution:
The server must be repaired or exchanged.
Action:
A) Analyze the degraded situation and organize spare parts:
- Check the details on the server with the "Server Administrator (OMSA)"
- Check the VoIP Switch documentation for the server type and the RAID controller used
- Organize DELL repair parts according to the maintenance agreement with your "VoIP Switch Supplier"
- Directly from DELL support
- Contact the "VoIP Switch Supplier Support"
B) Handle the VoIP Switch operation if the defect impairs proper server functionality:
- Disable Xymon Alarming
- Stop provider alarming
- Gracefully pre-bar the VoIP Switch component
C) Evaluate the repair case for DELL RAID controller types PERC5 / PERC 6 / H310 Mini / H320 Mini / H330 Mini:
- Case 1: "One Hard-Disk Defect"
- Precondition:
- Main board is ok
- RAID controller is ok
- 1 operative hard-disk is ok
- Server is still operative within the VoIP Switch
- The replacement hard-disk has the same form factor and the same capacity in bytes
- To-Do:
- Remove the defective hard-disk (hot plug-out is no problem)
- Insert the new hard-disk (hot plug-in is no problem), either:
- a brand-new hard-disk
- an already used spare hard-disk
- Check the hard-disk replication status
- → If the replication did not start automatically, start the replication manually!
- Case 2: "Main Board or RAID Controller Defect"
- Precondition:
- The main board or RAID controller has been repaired as described above
- 2 operative hard-disks are ok
- Server is shut down
- Disconnect all Ethernet patch cables from the server GB ports.
- Connect a VGA monitor, USB keyboard and mouse to the console port of the server
- To-Do:
- Insert the original hard-disk 1 in bay 0 (do not insert the hard-disk 2 yet)
- Power up the server
- Check the console output on the VGA monitor:
- During the BIOS startup the following message may be displayed:
- Foreign configuration(n) found on adapter.
- Press any key … or 'F' to import foreign configuration and continue.
- If requested, press key F on the keyboard!
- Note:
- If you miss pressing F, restart the BIOS boot with the keys [Ctrl Alt Delete]; otherwise the server boot stops after the BIOS startup.
- Check the console output on the VGA monitor:
- A security question may be displayed which enables you to stop the procedure:
- All of the disk from your previous configuration are gone. If this is an unexpected message ...
- Do not press any key!
- Note:
- If no key is pressed, the RAID controller takes over the hard-disk as part of its new virtual disk.
- → Wait until the server has booted!
- Insert the original hard-disk 2 in bay 1
- Check the hard-disk replication status
- Note:
- It is very probable that the replication did not start automatically!
- Then:
- At Menu "Storage" a yellow warning triangle is displayed
- Upon click on "Storage" the status is displayed:
- Virtual Disk 0: degraded
- → If the replication did not start automatically, start the replication manually!
- For all other cases:
- Contact the "VoIP Switch Supplier Support" for help with setting up the server and recovering the missing VoIP Switch functionality
D) Return the server to its normal working state:
- If needed, connect all Ethernet patch cables to the correct server GB ports
- Checks:
- Check the server status with "Server Administrator (OMSA)"
- Check in the ConfigCenter that all VoIP Switch components on the server are ok:
- ConfigCenter GUI → Menu "System" → Menu "Components"
- Check that the Xymon monitor shows no errors
E) If the VoIP Switch doesn't get back to normal telephony service operation:
- Investigate what is wrong and solve it
- Contact the "VoIP Switch Supplier Support" for help with setting up the server and recovering the missing VoIP Switch functionality
F) Enable the alarming again:
- Enable Xymon Alarming
- Start provider alarming
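The preconditions of Case 1 and Case 2 can be written down as a small decision function, which makes it harder to start the wrong procedure. The sketch below encodes only the preconditions listed in this section; the class, field and function names are illustrative:

```python
from dataclasses import dataclass


@dataclass
class RaidSituation:
    main_board_ok: bool
    raid_controller_ok: bool
    operative_disks: int
    server_running: bool
    part_was_repaired: bool = False  # main board or RAID controller was repaired


def select_repair_case(s: RaidSituation) -> str:
    """Select the repair case according to the preconditions in this section."""
    if s.main_board_ok and s.raid_controller_ok:
        if s.part_was_repaired and s.operative_disks == 2 and not s.server_running:
            return "Case 2"
        if not s.part_was_repaired and s.operative_disks == 1 and s.server_running:
            return "Case 1"
    # Everything else is outside the two documented cases.
    return "Contact support"
```

Any situation that matches neither set of preconditions deliberately ends at "Contact support", in line with the instruction for all other cases.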
Manually Restart the Hard-Disk Replication
In this situation the RAID's virtual disk is in state "degraded" (only one hard-disk is operative, but two are expected). As soon as a free "hot spare" hard-disk is available, the RAID controller automatically takes it, associates it with its degraded virtual disk and starts the replication.
Restart the hard-disk replication manually:
- Connect with any Web browser to the server's "Server Administrator (OMSA)" GUI:
- Log in as user "root"
- The foreign RAID configuration has to be deleted from the inserted 2nd hard-disk:
- → Menu "Storage" → Menu "PERC xxxxx"
- → Select at [ Available Task ]: "Clear Foreign Configuration"
- → Click button [ Execute ]
- → Confirm the security check: click button [ Clear ]
- The inserted 2nd hard-disk has to be declared as "hot spare":
- → Menu "Storage" → Menu "PERC xxxxx" → "Connector 0" → Menu "Enclosure (Backplane)" → Menu "Physical Disks"
- → Select at [ Available Task ]: "Assign Global Hot Spare"
- → Click button [ Execute ]
- Check the virtual disk replication state:
- → Column "State"
If the hard-disk replication does not start, contact the appropriate DELL Support or the "VoIP Switch Supplier Support".
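The two GUI tasks have counterparts in the OMSA command line interface (`omconfig storage ...`). The sketch below only builds the command lines and does not execute anything. The controller ID 0 and the physical disk ID "0:0:1" in the usage note are hypothetical examples, and the syntax should be verified against the CLI guide of the installed OMSA version before use:

```python
from __future__ import annotations


def clear_foreign_config_cmd(controller: int) -> list[str]:
    """Command line for the task "Clear Foreign Configuration"."""
    return ["omconfig", "storage", "controller",
            "action=clearforeignconfig", f"controller={controller}"]


def assign_global_hot_spare_cmd(controller: int, pdisk: str) -> list[str]:
    """Command line for the task "Assign Global Hot Spare"."""
    return ["omconfig", "storage", "pdisk", "action=assignglobalhotspare",
            f"controller={controller}", f"pdisk={pdisk}", "assign=yes"]
```

For example, `" ".join(assign_global_hot_spare_cmd(0, "0:0:1"))` gives the line to review before running it as root on the server. The real IDs have to be taken from the OMSA storage view.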
© Aarenet Inc 2018
Version: 3.0
Author: Aarenet
Date: May 2017