One thing I learnt from managing a huge network is that complex rules are bad! You never know when a particular node gets left out, and you can't easily determine which set of rules applies to a particular node.
When things get big and unmanageable, use rules of thumb. Following are some high-level rules of thumb that I follow to keep things under control. It would be interesting to get input from others as well.
For example, in alerting, conditions should be used only to define exclusions, not inclusions. That is, you don't want to create several alert policies like this:
Alert me in case of high CPU, if it is a Core switch
Alert me in case of high CPU, if it is a Distribution switch
Alert me in case of high CPU, if it is an Access switch
Because you never know when someone else adds a critical device without defining the custom properties properly (or with a typo in the custom property), and the gap is only discovered after a user complains. Instead, you should define conditions that state which ones to leave out:
Alert me in case of high CPU, unless it is an extended switch
That is to say, from an alerting and config backup perspective, we always need to err on the safe side. But why should a device be in NPM if it need not be alerted on?
Likewise, "back up everything" should be the mantra, unless it is really an unmanageable device. But then why should such a device be in NCM? I will come to these points later.
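The difference between inclusion and exclusion conditions can be sketched in a few lines of Python. This is only an illustration, not how NPM evaluates alerts: the node list, the `role` field (standing in for a custom property), and the `CPU_LIMIT` value are all made up for the example.

```python
# Hypothetical inventory; "role" stands in for a custom property.
nodes = [
    {"name": "core-sw-01", "role": "Core", "cpu": 95},
    {"name": "dist-sw-01", "role": "Distribution", "cpu": 40},
    {"name": "acc-sw-07", "role": "Acess", "cpu": 97},     # typo in the property
    {"name": "new-fw-01", "role": "", "cpu": 99},          # property never filled in
    {"name": "ext-sw-03", "role": "Extended", "cpu": 98},  # deliberately excluded
]

CPU_LIMIT = 90  # assumed threshold, in percent


def alerts_by_inclusion(nodes):
    """Alert only on nodes whose role is explicitly listed."""
    included = {"Core", "Distribution", "Access"}
    return [n["name"] for n in nodes
            if n["role"] in included and n["cpu"] > CPU_LIMIT]


def alerts_by_exclusion(nodes):
    """Alert on every node except those explicitly excluded."""
    excluded = {"Extended"}
    return [n["name"] for n in nodes
            if n["role"] not in excluded and n["cpu"] > CPU_LIMIT]


print(alerts_by_inclusion(nodes))  # ['core-sw-01']
print(alerts_by_exclusion(nodes))  # ['core-sw-01', 'acc-sw-07', 'new-fw-01']
```

The inclusion version silently misses the node with the typo and the node with the empty property; the exclusion version catches both.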
But under what scenarios do these simple rules explode into several duplicate rules with just minor variations? There are several possibilities. Say the devices being monitored don't share the same threshold. It is certainly unreasonable to expect a core switch and an access switch to have the same CPU load (the access switch is normally more loaded, for those interested). Or what if different parties want to be notified of events related to different groups of devices?
How do you control rule growth in that case? For thresholds, I used to have different policies for different classes of equipment. Eventually I moved to more complex policies that use custom property variables as threshold conditions. You can find more discussion on this here.
I have yet to come across a solution for defining different alert rules for different recipients. Once again, I think custom properties can help, but I haven't tested it. The idea is to have a field hold the recipients of mail alerts for events related to a particular node (more like the stakeholders of particular devices).
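The idea of one rule reading its threshold from a per-node property can be sketched like this. The property name `cpu_threshold`, the default of 80, and the sample values are assumptions made for illustration; this is plain Python, not NPM alert syntax.

```python
DEFAULT_CPU_THRESHOLD = 80  # assumed fallback, in percent


def cpu_alert(node):
    """Return True if the node's CPU load exceeds its own threshold.

    The threshold comes from a hypothetical per-node property; a node
    with no value set falls back to the default instead of being skipped.
    """
    threshold = node.get("cpu_threshold") or DEFAULT_CPU_THRESHOLD
    return node["cpu"] > threshold


core = {"name": "core-sw-01", "cpu": 70, "cpu_threshold": 60}
access = {"name": "acc-sw-07", "cpu": 85, "cpu_threshold": 90}
unset = {"name": "new-sw-01", "cpu": 85}  # property not filled in

print(cpu_alert(core), cpu_alert(access), cpu_alert(unset))  # True False True
```

Falling back to a default keeps a node with a missing property covered, in line with the exclusion-only principle.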
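As a rough sketch of the recipients-in-a-field idea (untested against NPM, like the idea itself), the lookup could work as below. The property name `alert_recipients`, the semicolon separator, and the fallback address are all hypothetical.

```python
DEFAULT_RECIPIENTS = ["noc@example.com"]  # assumed catch-all mailbox


def recipients_for(node):
    """Parse a semicolon-separated recipients field for a node.

    Falls back to the default list when the field is missing or empty,
    so that an alert is never left without a recipient.
    """
    raw = node.get("alert_recipients") or ""
    parsed = [addr.strip() for addr in raw.split(";") if addr.strip()]
    return parsed or DEFAULT_RECIPIENTS


fw = {"name": "fw-01", "alert_recipients": "sec@example.com; noc@example.com"}
sw = {"name": "acc-sw-07"}  # nobody filled in the field

print(recipients_for(fw))  # ['sec@example.com', 'noc@example.com']
print(recipients_for(sw))  # ['noc@example.com']
```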
There are scenarios where a node needs to be added to NPM (for capacity planning and report generation) but needn't be alerted on. Likewise, there are nodes that should be in NCM (for inventory purposes) but should not have their configs backed up (say, standby firewalls, unmanageable devices, etc.). I am certainly not a fan of creating backup jobs and adding nodes to them manually. I would rather include all devices in my schedule and deal with failures separately. How do you guys go about it? I am visualizing each device as belonging to one of the following four quadrants:
| | Backup | Don't Backup |
|---|---|---|
| Alert | | |
| Don't Alert | | |
Technically, I can create two new custom properties, Alert and Backup, and define conditions like these:
Alert me about this node, unless Custom property Alert is No
Backup this node, unless Custom property Backup is No
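The two conditions above amount to "yes unless explicitly No", which places every node in one of the four quadrants. A minimal Python sketch, assuming the properties arrive as free-text strings (the dictionaries below are made-up examples, not NPM or NCM data):

```python
def is_enabled(node, prop):
    """Treat a property as enabled unless it is explicitly set to 'No'."""
    value = str(node.get(prop, "")).strip().lower()
    return value != "no"


def quadrant(node):
    """Return the (alert, backup) quadrant a node falls into."""
    alert = "Alert" if is_enabled(node, "Alert") else "Don't Alert"
    backup = "Backup" if is_enabled(node, "Backup") else "Don't Backup"
    return (alert, backup)


standby_fw = {"name": "fw-02", "Alert": "Yes", "Backup": "No"}
report_only = {"name": "lab-sw-01", "Alert": "no", "Backup": "Yes"}
new_node = {"name": "core-sw-09"}  # neither property filled in

print(quadrant(standby_fw))   # ('Alert', "Don't Backup")
print(quadrant(report_only))  # ("Don't Alert", 'Backup')
print(quadrant(new_node))     # ('Alert', 'Backup')
```

A node with nothing filled in lands in the Alert/Backup quadrant, so a forgotten property errs on the safe side.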
I don't even want some devices to be reachable over SSH by NCM, for example the standby firewalls. If someone deploys a command to them by mistake, it results in some really fancy incidents. How do you guys disable SSH access to certain devices? I use a wrong port number.
Any ideas or personal experiences are welcome. I wish to learn more from the community.