Managing the diversity of requirements from a Network monitoring perspective and keeping it scalable

One thing I learnt from managing a huge network is that complex rules are bad! You never know when a particular node gets left out, and it becomes hard to determine which set of rules applies to a given node.

When things get big and unmanageable, use rules of thumb. Following are some high-level rules of thumb that I follow to keep things under control. It will be interesting to get inputs from others as well.

For example, in alerting, conditions should be used only to define exclusions, not inclusions. That is, you don't want to create several alert policies like this:

Alert me in case of high CPU, if it is a Core switch

Alert me in case of high CPU, if it is a Distribution switch

Alert me in case of high CPU, if it is an Access switch

Because you never know when someone adds a critical device without defining the custom properties properly (or with a typo in the custom property), only for it to be found out after a user complains. Instead, you should define conditions that state which ones to leave out:

Alert me in case of high CPU, unless it is an extended switch
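To see why the exclusion form fails safe, here is a minimal Python sketch. It is not SolarWinds code: the node records, the `role` property name and the 90% threshold are all made up for illustration.

```python
# Hypothetical node records; "role" stands in for a custom property.
nodes = [
    {"name": "core-sw1", "role": "Core", "cpu": 95},
    {"name": "dist-sw1", "role": "Distrbution", "cpu": 97},  # typo in the property
    {"name": "new-sw1", "role": "", "cpu": 99},              # property never filled in
    {"name": "ext-sw1", "role": "Extended", "cpu": 96},
]

CPU_THRESHOLD = 90  # assumed value, for illustration only


def alerts_by_inclusion(nodes):
    """Alert only when the role matches one of the listed classes."""
    included = {"Core", "Distribution", "Access"}
    return [n["name"] for n in nodes
            if n["cpu"] > CPU_THRESHOLD and n["role"] in included]


def alerts_by_exclusion(nodes):
    """Alert on everything except the explicitly excluded class."""
    excluded = {"Extended"}
    return [n["name"] for n in nodes
            if n["cpu"] > CPU_THRESHOLD and n["role"] not in excluded]


print(alerts_by_inclusion(nodes))
print(alerts_by_exclusion(nodes))
```

With inclusion rules, the node with the typo and the node with the empty property silently drop out. With the exclusion rule they still alert, and only the node that was deliberately marked is skipped.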

That is to say, from an alerting and config backup perspective we always need to err on the safe side. But why should a device be there in NPM if it need not be alerted on?

Likewise, "back up everything" should be the mantra unless it is really an unmanageable device. But then why should such a device be in NCM? I will come to these points later.

But under what scenarios do these simple rules explode into several duplicate rules with just minor variations? There are several possibilities. Say the devices being monitored don't share the same threshold. It is certainly unreasonable to expect a core switch and an access switch to have the same CPU load (the access switch is normally more loaded, for those interested to know). Or what if different parties want to be notified of events related to different groups of devices?

How to control the rule growth in that case? For thresholds, I used to have different policies for different classes of equipment. Finally I moved to a single, more complex policy that uses custom property variables as the threshold conditions. You can find more discussion on this here.
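The idea of one policy that reads its threshold from the node itself can be sketched roughly like this in Python. The `cpu_threshold` property name and the default of 80 are my assumptions for illustration, not anything defined by NPM.

```python
DEFAULT_CPU_THRESHOLD = 80.0  # assumed fallback, for illustration only


def cpu_threshold(node):
    """Return the node's own threshold, or the default if it is missing or invalid."""
    raw = node.get("cpu_threshold")  # hypothetical custom property
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return DEFAULT_CPU_THRESHOLD
    # Treat out-of-range values as a data entry mistake and fall back.
    return value if 0 < value <= 100 else DEFAULT_CPU_THRESHOLD


def should_alert(node):
    """One rule for every device class: compare load with the node's own threshold."""
    return node["cpu"] > cpu_threshold(node)


core = {"name": "core-sw1", "cpu": 70, "cpu_threshold": 60}
access = {"name": "acc-sw1", "cpu": 85, "cpu_threshold": "90"}
unset = {"name": "new-sw1", "cpu": 85}  # property not filled in

for node in (core, access, unset):
    print(node["name"], should_alert(node))
```

The fallback matters: a node whose property was never filled in still gets compared against a sane default instead of dropping out of alerting.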

I am yet to come across a solution for defining different alert rules for different recipients. Once again, I think custom properties can help, but I haven't tested it. The idea is to have a field hold the recipients of mail alerts for events related to a particular node (more like the stakeholders of particular devices).
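As a rough, untested illustration of that idea, the lookup could work like the Python sketch below. The `alert_recipients` property name, the semicolon separator and the default address are all hypothetical.

```python
DEFAULT_RECIPIENTS = ["noc@example.com"]  # hypothetical catch-all address


def recipients_for(node):
    """Read recipients from a per-node field; fall back to the default list."""
    raw = node.get("alert_recipients") or ""  # hypothetical custom property
    parsed = [addr.strip() for addr in raw.split(";") if addr.strip()]
    return parsed or DEFAULT_RECIPIENTS


fw = {"name": "fw1", "alert_recipients": "sec-team@example.com; noc@example.com"}
sw = {"name": "sw1"}  # nobody filled in the stakeholders

print(recipients_for(fw))
print(recipients_for(sw))
```

An empty field falls back to the default address, so a node with no stakeholders defined still reaches someone.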

There are scenarios where a node needs to be added to NPM (for capacity planning and report generation purposes) but need not be alerted on. Likewise, there are nodes that should be in NCM (for inventory purposes) but should not have their config backed up (say, standby firewalls, unmanageable devices, etc.). I am certainly not a fan of creating backup jobs and adding nodes to them manually. I would rather include all devices in my schedule and deal with failures separately. How do you guys go about it? I am visualizing devices belonging to one of the following 4 quadrants:

	Backup	Don't Backup
Alert
Don't Alert

Technically, I can create two new properties, Alert and Backup, and define conditions like these:

Alert me about this node, unless Custom property Alert is No

Backup this node, unless Custom property Backup is No
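A small Python sketch of how these two conditions place a node in one of the four quadrants. It assumes the properties are plain text fields and that anything other than an explicit "No" (including an empty or missing value) means yes.

```python
def is_enabled(node, prop):
    """Anything other than an explicit 'No' counts as enabled (fail safe)."""
    return str(node.get(prop, "")).strip().lower() != "no"


def quadrant(node):
    """Return the (alert, backup) quadrant a node falls into."""
    alert = "Alert" if is_enabled(node, "Alert") else "Don't Alert"
    backup = "Backup" if is_enabled(node, "Backup") else "Don't Backup"
    return alert, backup


# Hypothetical nodes for illustration.
print(quadrant({"name": "core-sw1"}))                         # nothing set
print(quadrant({"name": "lab-sw1", "Alert": "No"}))           # reporting only
print(quadrant({"name": "standby-fw1", "Backup": "No"}))      # inventory only
print(quadrant({"name": "dumb-sw1", "Alert": "No", "Backup": "No"}))
```

A brand new node with neither property filled in lands in the Alert / Backup quadrant, which is the safe default.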

I don't even want some devices to be SSH'ed into by NCM, for example the standby firewalls. If someone deploys a command to them by mistake, it results in some really fancy incidents. How do you guys disable SSH access to certain devices? I use a wrong port number.

Any ideas or personal experiences are welcome to be discussed. I wish to learn more from the community.
