Thursday, 20 March 2014

Nagios Remote Host Configuration



Nagios, as you may know, is an open source host monitoring solution. The Nagios core is free and open source; if you need professional services and support, you can go for Nagios XI. You can check more details at the
Nagios website .

The basic architecture is like this: we have a Nagios Remote Host and a Nagios Server. The remote host has some services running on it which are to be centrally monitored from the Nagios server, with alerts generated in case of any failure or impending failure. The remote host (also known as the Nagios client) sends its service status to the Nagios server via NRPE (Nagios Remote Plugin Executor). NRPE lets you monitor a variety of services, such as the status of Oracle, HTTP website URL status, free disk space and CPU load, among others.
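
For example, when the Nagios server needs the Oracle status of a remote host, it runs the check_nrpe plugin, which contacts the nrpe daemon listening on the remote host (TCP port 5666); the daemon executes the corresponding local plugin and returns the result. A rough illustration of the server-side call (the hostname and command name below are only examples):

 # /usr/local/nagios/libexec/check_nrpe -H remotehost01 -c check_oracle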

Here I describe the method for setting up a Nagios Remote Host to be monitored from a Nagios Server. I assume you already have a running setup with the Nagios Server properly configured and want to monitor the different services running on your Nagios Remote Hosts.

Let's begin...

Steps to be performed on Nagios server:-

1. Go to the Nagios configuration directory, edit the hosts.cfg file and enter the hostname and IP address of the Nagios remote host. This creates an entry for the particular host and is used to identify the remote host.
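
A minimal host definition would look something like the one below (the host name, alias and IP address are only placeholders, and the linux-server template is assumed to already exist in your configuration):

define host{
        use                             linux-server
        host_name                       ggndb01
        alias                           Gurgaon DB Server 01
        address                         192.168.10.21
        }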

2. Most Nagios administrators classify their remote hosts into various groups based on OS type (e.g. Solaris, Linux, Windows) or datacenter location (e.g. DC-US, DC-EU, DC-APAC). For this purpose, we need to edit the hostgroup.cfg file and enter the hostname of the remote host (declared in the previous step) in the particular hostgroup we want it to belong to (e.g. Solaris-Group, Linux-Group or DC-APAC).
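
A sample hostgroup definition is shown below (the group name and members are illustrative; the member hosts must already be defined in hosts.cfg):

define hostgroup{
        hostgroup_name                  Linux-Group
        alias                           Linux Servers
        members                         ggndb01, chndb02
        }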

3. For the purpose of monitoring the various services, we need to edit the services.cfg file and add the hostname of the remote host to the corresponding service definition. For example, to monitor an Oracle instance on a database server, we need to enter the hostname of the server in the Oracle-Check service definition as shown below:-

define service{
        use                             generic-service
        host_name                       ggndb01, chndb02, hydoradb
        service_description             Oracle-Check
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              10
        normal_check_interval           5
        retry_check_interval            3
        contact_groups                  nagios-admins,nagios-sms
        notification_interval           3600
        notification_period             24x7
        notification_options            w,u,c
        check_command                   check_nrpe!check_oracle
        }



4. After making all required changes in the three files mentioned above, verify the configuration and then reload the Nagios service for the changes to take effect.

   # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
   #/etc/rc.d/init.d/nagios reload

The new configuration will now be active on the web interface. Log in to Nagios and verify that the changes are reflected.

Steps to be performed on Remote Host (for NRPE)


Installation and configuration of NRPE on the Nagios client (also called the remote host) is a slightly tricky and involved procedure. Before proceeding further with the steps, please download the NRPE and Nagios Plugins source code from the Nagios website.

1. Create a nagios user that will communicate with the Nagios server for sending service status details.

 # useradd -c "nagios system user" -d /usr/local/nagios -m nagios
 # chown nagios:nagios /usr/local/nagios/

2. Extract the plugin and nrpe source code.

 # gunzip nagios-plugins-1.3.8.tar.gz
 # tar -xvf nagios-plugins-1.3.8.tar
 # gunzip nrpe-2.12.tar.gz
 # tar -xvf nrpe-2.12.tar

3. Compile the Nagios plugins. These plugins can also be executed locally to check service status.

 # cd nagios-plugins-1.3.8
 # ./configure
 # make
 # make install

4. Check whether the plugins are working fine or not. Remember, this is a local check only; at this point, communication with the Nagios server is not yet established.

 # /usr/local/nagios/libexec/check_disk -w 10 -c 5 -p /var

5. Compile the NRPE

 # cd nrpe-2.12
 # ./configure
 # make
 # make install

6. After compilation of NRPE on the remote host is completed, the nrpe.cfg file will be generated. Modify the nrpe.cfg file as per your needs. Remember, this file needs to be located on the remote host and not on the Nagios server. All the checks which will be performed from the Nagios server need to be entered into the remote host's nrpe.cfg file. Example below:-

command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
    
Enter all the checks to be performed (such as check_disk, check_load, check_http and check_users) into this file. We also need to allow the Nagios server to communicate with the remote host in this file.
To enable remote execution of NRPE, add the IP address of the Nagios server to the allowed_hosts directive in nrpe.cfg :-
  
allowed_hosts=127.0.0.1,10.237.93.68

7. Configure NRPE as a service under inetd so that it can be started or stopped like other Unix services. Add the below line in /etc/services :-

 nrpe 5666/tcp # NRPE

8. Add the following line in the inetd configuration file (/etc/inet/inetd.conf on Solaris, /etc/inetd.conf on Linux):-

   On Solaris
   nrpe stream tcp nowait nagios /usr/sfw/sbin/tcpd /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -i
   On Linux
   nrpe stream tcp nowait nagios /usr/sbin/tcpd /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg --inetd

9. On Linux systems we can start the nrpe service as follows
   # service nrpe start
   On Solaris systems
   # svcadm enable svc:/network/nrpe
   # svcadm restart svc:/network/nrpe
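
10. Once the NRPE daemon is running on the remote host, you can verify the communication from the Nagios server itself (a successful check simply returns the NRPE version string; replace the placeholder with the remote host's name or IP):

 # /usr/local/nagios/libexec/check_nrpe -H <remote-host>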


That's it. You can now start monitoring your remote hosts' services from the Nagios server.


Monday, 17 March 2014

VCS Interview Questions and Answers

1) What is split brain and amnesia prevention in cluster ?
Split brain occurs when all cluster interconnects between nodes fail and the cluster splits into two or more partitions, with each partition assuming the nodes in the other partition are down. Each partition may then try to bring the same service groups online, which can lead to data corruption. VCS prevents this through membership arbitration, i.e. I/O fencing with coordinator disks, so that only one sub-cluster survives. Amnesia refers to a cluster being started from a node that holds a stale (older) copy of the cluster configuration, so that changes made while that node was down are lost.

2) Suppose one of the high-priority heartbeat connections between nodes is lost. What is this condition of the cluster known as ? What action will VCS take in such a scenario ?

When one of the high-priority heartbeat connections between nodes is lost and there is only one remaining heartbeat link, VCS places the node in a special membership category known as jeopardy membership.
In such a case, VCS will autodisable the service groups and the service group state will not change, i.e. offline or online service groups continue to be in that state. VCS prevents any failover from happening in order to prevent data corruption.

3) During patching, if we want to stop a service group from failing over, what action should we take ?
It is a best practice to freeze a service group during server patching activity. When you freeze a group, VCS takes no action on that group or its resources and will not try to bring the service group online on any other node. After the maintenance is over, unfreeze the group and bring the resources online; VCS will refresh its view at that time.
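
For example, a persistent freeze (one that survives a HAD restart) can be done roughly as follows; the service group name mygrp is just a placeholder:

 # haconf -makerw
 # hagrp -freeze mygrp -persistent
 # haconf -dump -makero
   ..... perform the patching activity .....
 # haconf -makerw
 # hagrp -unfreeze mygrp -persistent
 # haconf -dump -makero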

4) How do you check logs of a servicegroup and a resource in VCS ? How will you troubleshoot if a resource has faulted ?
The default VCS log directory is /var/VRTSvcs/log
The main event log of VCS is /var/VRTSvcs/log/engine_A.log. This file is the best place to begin troubleshooting for a failed resource.
Individual agent types have their own log files, e.g. Mount_A.log, Apache_A.log or Weblogic_A.log. These log files contain more detailed info than engine_A.log.

Checking for the word 'clean' can provide clues related to the cause of failure.
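
A few commands that typically help when a resource has faulted (the resource, group and system names below are placeholders):

 # hares -state ora_listener
 # hagrp -state ora_grp
 # grep ora_listener /var/VRTSvcs/log/engine_A.log

Once the underlying issue is fixed, the FAULTED state can be cleared with hares -clear ora_listener -sys node01.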

5) What is the main purpose of llt and had daemons ?
LLT is the transport mechanism of VCS and is responsible for load balancing of cluster communications across the interconnect links and for maintaining heartbeats. HAD is the main VCS daemon and is responsible for taking operator input and performing the relevant actions. HAD also takes all types of corrective actions required.
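
To quickly check that HAD is running and see the overall cluster state, something like the following is commonly used (the exact output format varies with the VCS version):

 # hastatus -sum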

6) What are the differences between LLT and GAB ?
LLT is a layer 2 protocol developed by Veritas. It takes care of the heartbeat connections and acts as the carrier for cluster traffic.
GAB distributes the cluster configuration among nodes and maintains cluster membership. GAB uses LLT as its transport mechanism for distributing cluster configuration changes.
HAD communicates with GAB and maintains / tracks the entire cluster configuration. It uses the main.cf file to build the cluster configuration. HAD also takes all types of corrective actions required.
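
The state of both layers can be checked from the command line on any cluster node, for example:

 # lltstat -nvv
 # gabconfig -a

lltstat -nvv shows the LLT link status for each node, while gabconfig -a shows the GAB port memberships (port a for GAB itself, port h for HAD).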

7) What is the difference between high priority and low priority link in VCS ?
A high-priority link is used for transmitting cluster communication and configuration information (GAB traffic) between nodes as well as for heartbeat communication.
A low-priority link is used only for heartbeats under normal conditions, but in case of failure of the high-priority links it can also take over the task of transmitting cluster communication.
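
In /etc/llttab the two link types are configured with the link and link-lowpri directives, roughly as below (the node name, cluster ID and interface names are placeholders, and the exact device syntax varies by platform and VCS version):

set-node node01
set-cluster 10
link eth1 eth1 - ether - -
link eth2 eth2 - ether - -
link-lowpri eth0 eth0 - ether - -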

8) What are the components in a VCS I/O fencing setup ?
Following are the components required for I/O fencing in VCS:
i)    Coordinator diskgroup with 3 disks
ii)    Data diskgroup
iii)    Dynamic multipathing software (VXDMP)
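
The fencing mode and the membership seen by the fencing driver on a node can be checked with, for example:

 # vxfenadm -d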

9) What is a Jeopardy condition in VCS ? What happens to the ServiceGroup and Resources running on a system which is under Jeopardy condition ?
Jeopardy membership occurs when a node in a cluster has only one heartbeat connection remaining with the rest of the cluster. At this point, VCS cannot reliably distinguish between a node failure and a network failure if the last heartbeat interconnect also fails. Hence, under the jeopardy condition, VCS prevents the service groups on that node from failing over. The applications and service groups running on the node keep running as usual and will not be failed over in case of a node failure. But in case of a resource or group fault, the service group fails over to the available systems in the cluster. This is a safety mechanism to prevent data corruption.

10) During a RACE condition for membership arbitration in case of a node or link failure, how will VCS determine the eligible host for acquiring the lock on the coordinator disks ? Which sub-cluster will win the RACE and based on what logic ?
During a RACE condition, the partitioned nodes form sub-clusters and try to acquire the coordinator disks. Among the nodes in a sub-cluster, the node with the lowest LLT node ID races on behalf of itself and the other nodes in its sub-cluster. If it is successful, it ejects the keys of the other systems (i.e. nodes which are not part of its sub-cluster) from the coordinator disks and sends a WON_RACE message to the nodes in its sub-cluster. The nodes which fail to acquire the disks will panic.