There's a pattern I see over and over: a team knows they should improve monitoring, but day-to-day work keeps it on the back burner - until a second incident forces action.
As I told a city-government client on a recent call: "We call that the second event model. We'll have a conversation, we'll do a quote, and sometimes it just sits until the powers that be get a shock to the system. I once had somebody blow the dust off an eight-year-old quote."
That idea shows up throughout my book "100% Uptime", too. Don't wait for the "second incident" to relearn the same lesson. Instead, use the first one to fix root causes and get proactive visibility in place.
When you do, you trade exhausting firefighting for calm, just-in-time maintenance and fewer after-hours emergencies.
Our client works for a city government that maintains radio/network sites. The team's vision started small, as many good ideas do, but then it expanded.
Client: "We had a lot of scope creep from what started as a battery monitoring project."
Client: "It expanded to include access control and cameras, so we're really looking to do something all-encompassing."
They also wanted serious redundancy ("how can I back that up and then how can I back that up?") and were planning to include RF subsystems alongside power, environmental, and security signals.
Then, reality hit:
Client: "The city recently, rather abruptly, found out we're poor. We had a massive drop in sales tax revenue."
Ambitions didn't disappear, but headcount did:
Client: "We need to have a real vision, especially with fewer people around."
This is exactly where disciplined monitoring pays off. When budgets shrink, situational awareness becomes the multiplier that lets a small team cover a big footprint - because you only roll trucks when the data says you must.
The team wasn't starting from zero. They had some legacy gear:
Client: "We have some older monitoring boxes that give us temperature, door, and other contact closures. But we really want to expand the scope of what we're doing."
They also had a clear, executive-friendly way to organize the problem - four monitoring domains:
Client: "We had four major domains: electrical and power systems, safety and security, environmental, and RF power."
That framing is useful for anyone planning a build-out:
Your goal is to move beyond a few "islands" of data and build a single, coherent picture of site health.
Nothing makes the case for holistic monitoring like a story where everything looked fine - until it wasn't. The team's generators were run-tested monthly. Then a real storm hit:
Client: "No problem. I've got a generator sitting on 1,000 gallons of diesel and I've got 3,600 amp-hours of battery. But the generator didn't start - and it had just been tested the Thursday before!"
The root cause? Not fuel... Not batteries...
Client: "The low-coolant alarm sensor was fouled, so it had plenty of coolant, but the sensor didn't know it. It would have been nice to know that in advance."
Two practical lessons pop out. First, a passing run test is not the same as a guaranteed start during a real event; a single fouled sensor was enough to veto the start. Second, without basic sensors and history, you can't tune run/stop conditions or catch "invisible" failures before they become outages.
The first pillar of high-reliability monitoring is breadth. To get a true picture of site health, bias toward capturing more data rather than less, within a "reasonable" scope.
That means instrumenting this client's four domains with basic monitoring tech: battery voltage and rectifier sensors for power, door contacts and motion sensors for safety and security, temperature and airflow sensors for environmental conditions, and power readings on the RF side.
Just as important as what you collect is how you collect it. Use RTUs as "boots on the ground" to normalize contact closures, analogs, and protocol data (SNMP/Modbus/DNP3/TL1) into one coherent stream headed to your master. That architecture avoids fractured systems and ensures the right details reach the right people quickly.
Why the emphasis on breadth and normalization? Because once you have enough signals, you can see the causal chain (commercial power fail, then a generator start attempt, then a coolant sensor fault, then a failed start).
That's when simple logic becomes powerful. Combining alarms ("AND" commercial-power-fail with low-battery, for example) drives the correct action at the right moment - like starting a generator only when it's truly needed.
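To make that concrete, here's a minimal sketch of the pattern, assuming a central poller written in Python with hypothetical point names and thresholds. This isn't DPS firmware - just an illustration of normalizing mixed inputs into one stream and layering simple AND logic on top:

```python
from dataclasses import dataclass
import time

@dataclass
class Point:
    """One normalized signal, whatever transport it arrived on."""
    site: str
    name: str        # e.g. "commercial_power_fail", "battery_voltage"
    value: float     # discretes become 0.0 / 1.0, analogs keep their units
    source: str      # "contact", "analog", "snmp", "modbus", ...
    timestamp: float

def normalize_contact(site: str, name: str, closed: bool) -> Point:
    """A contact closure becomes a 0/1 point like everything else."""
    return Point(site, name, 1.0 if closed else 0.0, "contact", time.time())

def normalize_analog(site: str, name: str, reading: float) -> Point:
    return Point(site, name, reading, "analog", time.time())

def derived_alarms(latest: dict[str, Point]) -> list[str]:
    """Simple AND logic over the normalized stream."""
    alarms = []
    power = latest.get("commercial_power_fail")
    battery = latest.get("battery_voltage")
    # 47.0 V is an illustrative low-voltage threshold, not a recommendation.
    if power and battery and power.value == 1.0 and battery.value < 47.0:
        alarms.append("Power fail AND low battery - generator should be running")
    return alarms

# Two points from different transports end up in one coherent picture:
latest = {}
for p in (normalize_contact("Site 12", "commercial_power_fail", closed=True),
          normalize_analog("Site 12", "battery_voltage", 46.2)):
    latest[p.name] = p
print(derived_alarms(latest))
```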
For our city client, the "all reasonable sources" mindset directly answers the pain we heard: a smaller team that still has to cover every radio and network site.
Bottom line: If you only collect "some" data, you'll only solve "some" problems. A modest expansion to cover the four domains - using mostly simple sensors - transforms your response from reactive to predictive. That's what keeps radios up, protects equipment, and lets a lean team sleep at night.
Once you're collecting the right signals, the next job is making sure the right people see the right detail at the right time. Our client already had the right instinct:
Client: "We'd like to do escalating notifications. For example, this one's an email, this one's the NMS, this one's a text message, and this one's a phone or radio call. We can do that."
That's the blueprint! Your system should support multiple channels and escalation logic. After a reasonable delay, an unacknowledged alarm should step up from NMS to email to text to a live call - or skip straight to the right person when severity demands it.
Me: "That's what you'll do with these 'Notification' columns in the NetGuardian. You can send everything to those two SNMP managers, everything except a few to your email, and then only a few select alarms to the audio output. You get to choose how you want to interact with it."
Two additional practices make this scalable:
Me: "Once you develop this RTU profile of alarm names and settings, the idea is to download it and then upload it into the next device so you get a head start on your configuration."
Me: "We use defined blocks. We call it our display map. We use the old-school telco 'display and points' concept. For example, the NetGuardian 832A has discrete alarms numered 1 through 32, and the second set is reserved for the larger 864A model."
The client's team was already doing the legwork to make this scalable:
Client: "That's why we had some folks go out and do the inventory - to figure out what's maximum. Then we added 10 or 20% in case there's a site we didn't think of."
That inventory (plus a sensible display map) prevents "alarm sprawl," keeps training simple, and makes escalations predictable instead of chaotic.
Once your remote-site data is flowing and your people are getting alerted appropriately, the next step is to shorten response times with simple automation. Start with the obvious, high-confidence plays.
You've already seen how a generator can fail despite passing a monthly run test. You shouldn't wait for a human to discover a stuck sensor or a bad start battery at 2 a.m. Use Boolean logic and timers to catch those conditions and act on them instantly.
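Here's one hedged example of that idea: a small Python-style watcher (hypothetical input names, with an assumed 90-second grace period) that alarms when commercial power fails and the generator-run input never closes:

```python
import time

GEN_START_GRACE_S = 90   # assumption: how long a healthy generator needs to come up

class GeneratorWatch:
    """Alarm when commercial power fails and the generator-run input never
    closes within the grace period - the '2 a.m. stuck sensor' scenario."""

    def __init__(self) -> None:
        self.power_failed_at: float | None = None

    def update(self, commercial_power_ok: bool, generator_running: bool) -> str | None:
        now = time.time()
        if commercial_power_ok:
            self.power_failed_at = None          # back to normal; clear the timer
            return None
        if self.power_failed_at is None:
            self.power_failed_at = now           # start timing the outage
        if not generator_running and now - self.power_failed_at > GEN_START_GRACE_S:
            return "CRITICAL: generator failed to start after commercial power loss"
        return None
```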
We talked about this at the HVAC layer (which is really just comfort-control logic applied to uptime):
Me: "With the HVAC Controller, you can say to your system: on commercial power, run as much as you want. But on generator power, you can't run more than one. We can also set heating and cooling windows for start and stop triggers. And if someone gets out to the site and turns on Comfort Mode to be more comfortable during maintenance work, it will time out after a set period so they can't forget."
Even better is tying that "comfort" override to real-world presence:
Client: "Could that Comfort Mode be associated with the motion detector, or a Boolean logic of a closed door and lack of motion?"
Me: "We'd probably just add that to the firmware for you as part of your order. That doesn't sound like a very challenging modification."
That's the pattern: codify the things you already do manually. Use simple AND/OR logic and timers to guard against misuse and corner cases. Do it for generators, HVAC, access, and RF transmitters. Automation doesn't replace judgment - it buys you time by handling routine responses in seconds, not minutes. Automation can also take action locally during communication outages when you and your central NMS aren't able to talk to the local RTU.
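As a generic sketch of that AND/OR-plus-timer pattern (not the firmware change we discussed, and the two-hour timeout is an assumption), the comfort-mode and generator rules might be codified like this:

```python
import time

COMFORT_TIMEOUT_S = 2 * 60 * 60   # assumption: comfort mode expires after two hours

def comfort_mode_active(requested_at: float, door_closed: bool, motion_recent: bool) -> bool:
    """Comfort mode stays on only while a tech is plausibly still on site:
    it times out, and it drops early if the door is closed with no motion."""
    expired = time.time() - requested_at > COMFORT_TIMEOUT_S
    site_empty = door_closed and not motion_recent
    return not expired and not site_empty

def hvac_run_permitted(units_already_running: int, on_generator: bool) -> bool:
    """On commercial power, any unit may run; on generator power, only one at a time."""
    if on_generator:
        return units_already_running == 0
    return True
```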
With collection, distribution, and some automation in place, you finally have the breathing room to tune. This is where small improvements compound into fewer outages and lower OpEx.
A small example from our walkthrough:
Me: "Notice the thresholds I created for this analog value: 10% and 20% on the low side, then 75% and 85% on the high side. If the value is below 20% or above 75%, you'd get an alert."
That kind of tiered thresholding helps you avoid nuisance pages while still capturing early warnings. And visibility turns maintenance into a scheduled task, not a scramble:
Me: "This is that temperature+airflow sensor, showing only 32%. This is a bad situation. The filter has gotten clogged because 100% would have been normal. You press calibrate when you install, then it ticks down from there if the airflow slows down. If something goes wrong, the blower fails, or filters clog, you'll see it."
As you see airflow decay across multiple sites, you can time filter changes proactively (and stop wasting trips for premature swaps). Do the same with start-battery trends, rectifier voltage, temperature rise times, and RF power trends.
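As a rough sketch of that trending idea (the readings and the 40% service threshold below are made-up examples), even a linear extrapolation is enough to put a filter change on the calendar:

```python
from datetime import date, timedelta

SERVICE_AT_PCT = 40.0              # assumed airflow level that justifies a truck roll

readings = [                       # (date, airflow % relative to calibration)
    (date(2024, 1, 1), 96.0),
    (date(2024, 2, 1), 88.0),
    (date(2024, 3, 1), 79.0),
]

def days_until_service(samples, threshold=SERVICE_AT_PCT):
    """Linear extrapolation from the first and last samples."""
    (d0, v0), (d1, v1) = samples[0], samples[-1]
    per_day = (v1 - v0) / (d1 - d0).days      # negative while airflow is decaying
    if per_day >= 0:
        return None                           # not decaying; nothing to schedule
    return (threshold - v1) / per_day

days = days_until_service(readings)
if days is not None:
    print("Plan the filter change around", readings[-1][0] + timedelta(days=round(days)))
```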
This is also the stage where you prune noisy alarms that teach you nothing. If something wakes someone up at night, it should either be rare (a true emergency) or genuinely actionable. Everything else belongs as quiet telemetry that informs your weekly review, not your phone on your nightstand.
The client's reality is familiar to every public agency and most enterprises right now:
Client: "I don't even have the money in the budget this year. At this point it's tire-kicking, but I'd definitely share what we have with you to help me define the project."
This doesn't have to be a problem. It's actually an opportunity to sequence your rollout:
Client: "Oh yeah, we're very standardized across our sites. We imagine pairs on a 66 block and have them all planned out."
Me: "They've long been able to send northbound SNMP or email. You can send everything to that SNMP manager, or assign it to multiple - up to eight notification devices. We can process SNMP coming into the NetGuardian now, too."
When you're ready to move, we'll meet you where you are:
Me: "If you send over the spec, we'll do an initial proposal and probably revise it once or twice after talking with you."
And we'll keep the engineering close to the conversation so small changes don't turn into big delays:
Me: "That's the advantage of keeping the engineers on-site at our headquarters. We're more flexible. If we can do something that saves you time, effort, and urgency, that's good."
Our client's generator story is the kind of "near miss" that drives real action. It's also a reminder that good luck during a test is not the same as resilience during a storm. The fix is straightforward: monitor all four domains broadly, normalize everything into one master view, escalate alerts to the right people, automate the obvious responses, and trend the data so maintenance happens on your schedule.
That's how you help a smaller team cover a bigger map with fewer surprises. It's also how you avoid the pattern I described to this client:
Me: "We call that the second event model. We'll do a quote, and sometimes it just sits until the powers that be get a shock to the system. I once had somebody blow the dust off an eight-year-old quote."
Don't wait for a second incident to prove the point. Finish your inventory. Map your points. Centralize your alerts. Automate the obvious. Review and tune. You'll protect radios, people, budgets - and your sleep!
Andrew Erickson
Andrew Erickson is an Application Engineer at DPS Telecom, a manufacturer of semi-custom remote alarm monitoring systems based in Fresno, California. Andrew brings more than 18 years of experience building site monitoring solutions, developing intuitive user interfaces and documentation, and opt...