This fork is based on Merlin stable 0.6.7.

http://www.op5.org/community/projects/merlin

CREDITS:

First and foremost, Andreas Ericsson and op5 for this project - it has a
terrific database abstraction layer now and promises even better functionality
in the near future for creating easy distributed Nagios setups.

Our managers at Comcast, for allowing us to share our work with the open 
source community (thank you, this is one of the many reasons Comcast is
a great place to work!):
* Jason Livingood
* Mike Fischer
* Eric Scholz

Team members at Comcast who have contributed to this variant:
* Jack Kidwell
* Ryan Richins
* Shaofeng Yang
* Max Schubert

What is different between the op5 version and ours?
===================================================

Major changes between op5 Merlin and our version, plus our overall Nagios
methodology:

Our architecture:
* Polling tier per project -  projects range from 1,000 to 10,000+ hosts
* Common notification tier -  all pollers send notification requests to
                              notification tier, which has a rules engine
                              to perform notification actions.
* Common data trending tier - We use PNP plus a custom in-house long term
                              data trending solution that stores raw values
                              for very long periods of time.
* Common database tier      - H/A master-master replicated database tier with
                              GSLB name for all pollers to use; can be 
                              expanded to master1...masterN circular 
                              replication for large horizontal scaling
* Common web UI tier        - Centralized operational view, command and
                              control, and administration for NOC, admin,
                              configuration, etc.

In our vision, the operational SAs who maintain and manage projects also have
full responsibility for their pollers - they do releases of Nagios
configurations when they want, they ensure their pollers are up 24x7, they
re-distribute checks as needed, and they watch poller capacity.  We provide
mentoring and training, and bake as much information (key performance
indicators, capacity warnings, etc.) into our software as possible to let
them do just about everything they need to do with regard to monitoring
without our team having to be in the middle of the process.

Methodology
===========
* All pollers within a project write to a common database - schema has been 
  updated to allow for this.
* Centralized UI (running off the poller hosts) provides
   * Display of current host and service states
   * Remote command and control to all remote pollers
   * Re-distribution of checks
* Centralized admin UI
   * Loads a user's objects.pre-cache into staging tables (see the sketch
     after this list)
   * Lets the user select which pollers their checks should run on
   * Creates objects.pre-cache files for each poller along with
     a common retention.dat file
   * Pushes all code to pollers
   * Triggers restart
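
As a rough illustration of what the admin UI consumes, here is a hedged,
stand-alone sketch (not our actual loader) that pulls the host_name values
out of a Nagios objects.pre-cache file; hosts are the unit we later
distribute across pollers.

    /*
     * Illustrative sketch only - not our staging-table loader.  It scans
     * a Nagios objects(.pre-)cache file and prints each host_name, the
     * granularity at which we distribute checks.  Assumes the stock cache
     * layout: "define host {" ... "}" blocks of "key<TAB>value" lines.
     */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char line[1024];
        FILE *fp;
        int in_host = 0;

        if (argc < 2 || !(fp = fopen(argv[1], "r"))) {
            fprintf(stderr, "usage: %s objects.pre-cache\n", argv[0]);
            return 1;
        }

        while (fgets(line, sizeof(line), fp)) {
            char *p = line + strspn(line, " \t");   /* skip indentation */

            if (!strncmp(p, "define host ", 12))    /* host block starts */
                in_host = 1;
            else if (*p == '}')                     /* any block ends */
                in_host = 0;
            else if (in_host && !strncmp(p, "host_name", 9))
                fputs(p + 9 + strspn(p + 9, " \t"), stdout);
        }

        fclose(fp);
        return 0;
    }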

Re-distribution of checks
=========================
* User creates configuration tree in SVN
* User makes a tag of their configuration tree
* User additionally checks a gzipped objects.pre-cache into the tag
* From the administrative UI (we are working on this now):
   * User selects their release tag from SVN
   * Admin UI pulls down the release
   * Admin UI populates staging versions of the host/service etc. tables
     with objects.pre-cache
   * User distributes checks through the UI
   * User triggers the distribution process
* Distribution code
   * Create per-poller objects.pre-cache file
   * Create common retention.dat file
   * Push code to all pollers affected
   * Restart all affected pollers


Database
========
* Each poller updates instance_id with its packed IP address (an unsigned
  long) as the unique poller identifier - this also lets our centralized
  UI know where to send remote commands (see the sketch after this list).
* Added is_online field to program_status so that new pollers can be added
  to the potential poller list for our administrative UI to use for check
  distribution without the poller being active.
* Object import script is *only called when a team wants to do a release* -
  instead of being run on every poller, it is run on the admin web host.
  It populates staging tables that then allow the administrator to 
  distribute hosts and their related services (our distribution granularity
  is at the host level) among pollers that are available for polling.  
* Once distribution is done from an admin perspective, objects.cache files
  are created and pushed to each poller - these then drive data back into
  the database as the remote pollers are restarted.
* Field naming - where possible we aligned field names with retention.dat
  names / objects.cache names
* Views - the database schema includes views we use in our in-process UI;
          all views can safely be removed from the schema without affecting
          Merlin operations (we have a single source of truth for our
          database on our middle tier / web UI project that writes out
          the Merlin schema to keep the two in sync).
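
To make the packed-address convention concrete, here is a minimal, hedged
sketch of how a poller could derive such an identifier (assuming IPv4 and a
simple forward lookup of the local hostname; the names are ours, not
Merlin's actual code):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <netdb.h>

    /* Resolve the local hostname to an IPv4 address and pack it into an
     * unsigned long; 0 is returned on any failure. */
    static unsigned long local_instance_id(void)
    {
        char hostname[256];
        struct addrinfo hints, *res = NULL;
        unsigned long id = 0;

        if (gethostname(hostname, sizeof(hostname)) != 0)
            return 0;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;            /* IPv4 only in this sketch */

        if (getaddrinfo(hostname, NULL, &hints, &res) != 0 || !res)
            return 0;

        /* Network-order 32-bit address -> host-order integer, e.g.
         * 10.1.2.3 becomes 0x0A010203 (167838211). */
        id = (unsigned long)ntohl(
            ((struct sockaddr_in *)res->ai_addr)->sin_addr.s_addr);

        freeaddrinfo(res);
        return id;
    }

    int main(void)
    {
        printf("instance_id = %lu\n", local_instance_id());
        return 0;
    }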

Merlin
======
* Code added to populate instance_id based on the local host's IP address -
  the address is resolved from the hostname returned by the C gethostname()
  call.
* De-duplicated comment and downtime events by only responding to *_ADD
  NEB callback types (see the sketch below).
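
A hedged sketch of the de-duplication idea, using the standard Nagios NEB
comment event types; the handler and the forward_comment_event() helper are
hypothetical names, not Merlin's actual hook code:

    #include "nebstructs.h"      /* nebstruct_comment_data, NEBTYPE_* */
    #include "nebcallbacks.h"

    /* Hypothetical helper that hands the event to the rest of the module. */
    extern int forward_comment_event(nebstruct_comment_data *ds);

    /* Comment callback: react only to NEBTYPE_COMMENT_ADD and drop the
     * LOAD/DELETE variants that would otherwise duplicate rows downstream. */
    int hook_comment_data(int cb_type, void *data)
    {
        nebstruct_comment_data *ds = (nebstruct_comment_data *)data;

        (void)cb_type;

        if (!ds || ds->type != NEBTYPE_COMMENT_ADD)
            return 0;

        return forward_comment_event(ds);
    }

    /* Downtime events are filtered the same way, accepting only
     * NEBTYPE_DOWNTIME_ADD in the downtime callback. */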
