Generator Monitoring Failure: How One City Improved Remote Site Reliability

By Andrew Erickson

September 19, 2025

There's a pattern I see over and over: a team knows they should improve monitoring, but day-to-day work keeps it on the back burner - until a second incident forces action.

As I told a city-government client on a recent call: "We call that the second event model. We'll have a conversation, we'll do a quote, and sometimes it just sits until the powers that be get a shock to the system. I once had somebody blow the dust off an eight-year-old quote."

That idea shows up throughout my book "100% Uptime", too. Don't wait for the "second incident" to relearn the same lesson. Instead, use the first one to fix root causes and get proactive visibility in place.

When you do, you trade exhausting firefighting for calm, just-in-time maintenance and fewer after-hours emergencies.

Generator Site

The client backdrop: big goals, tighter budgets

Our client works for a city government that maintains radio/network sites. The team's vision started small, as many good ideas do, but then it expanded.

Client: "We had a lot of scope creep from what started as a battery monitoring project."
Client: "It expanded to include access control and cameras, so we're really looking to do something all-encompassing."

They also wanted serious redundancy ("how can I back that up and then how can I back that up?") and were planning to include RF subsystems alongside power, environmental, and security signals.

Then, reality hit:

Client: "The city recently, rather abruptly, found out we're poor. We had a massive drop in sales tax revenue."

Ambitions didn't disappear, but headcount did:

Client: "We need to have a real vision, especially with fewer people around."

This is exactly where disciplined monitoring pays off. When budgets shrink, situational awareness becomes the multiplier that lets a small team cover a big footprint - because you only roll trucks when the data says you must.

What was actually happening on the ground

The team wasn't starting from zero. They had some legacy gear:

Client: "We have some older monitoring boxes that give us temperature, door, and other contact closures. But we really want to expand the scope of what we're doing."

They also had a clear, executive-friendly way to organize the problem - four monitoring domains:

Client: "We had four major domains: electrical and power systems, safety and security, environmental, and RF power."

That framing is useful for anyone planning a build-out:

  • Electrical & Power. DC plants (rectifiers, battery strings), AC transfer switches, generator, commercial power, plus the generator's start battery - which, in the client's words, "seems to be where they don't work."
  • Safety & Security. Doors, motion, and perhaps cameras - focused on simple sensors that provide immediate value before adding video surveillance.
  • Environmental. Temperature and airflow to protect entire racks and avoid HVAC abuse.
  • RF Power. Forward/reflected power, combiner outputs, and dry contacts from RF systems.

Your goal is to move beyond a few "islands" of data and build a single, coherent picture of site health.

A relatable failure: A generator that wouldn't start

Nothing makes the case for holistic monitoring like a story where everything looked fine - until it wasn't. The team's generators were run-tested monthly. Then a real storm hit:

Client: "No problem. I've got a generator sitting on 1,000 gallons of diesel and I've got 3,600 amp-hours of battery. But the generator didn't start - and it had just been tested the Thursday before!"

The root cause? Not fuel... Not batteries...

Client: "The low-coolant alarm sensor was fouled, so it had plenty of coolant, but the sensor didn't know it. It would have been nice to know that in advance."

Two practical lessons pop out:

  1. Test conditions aren't real conditions. A scheduled spin-up doesn't recreate grid failures, long runtime, or the exact sensor states you'll see at 2 a.m.
  2. Minimal signals miss a big piece of the story. If you only know "Generator Running" or "Fail," you'll miss the subtler early warnings (coolant level, failed-start attempts, start-battery health).

Without basic sensors and history, you can't tune run/stop conditions or catch "invisible" failures before they become outages.
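
To make that lesson concrete, here's a minimal sketch in Python of the kind of readiness check those extra signals enable. The point names and thresholds are illustrative assumptions, not any particular RTU's configuration:

    # Hypothetical generator readiness check - point names and thresholds are
    # illustrative assumptions, not a specific product's API.
    def generator_not_ready_reasons(points: dict) -> list:
        """Return reasons the generator may NOT start when it's actually needed."""
        problems = []

        # A slowly declining start battery is visible long before a failed start.
        if points.get("start_battery_volts", 0.0) < 12.2:
            problems.append("Start battery voltage low")

        # Failed-start attempts accumulate quietly between scheduled run tests.
        if points.get("failed_start_count", 0) > 0:
            problems.append("Failed start attempt(s) logged since last test")

        # A coolant sensor that never changes state for months may be fouled.
        if points.get("days_since_coolant_sensor_toggled", 0) > 90:
            problems.append("Low-coolant sensor may be stuck or fouled - inspect")

        return problems

    # Example: everything "passed" last Thursday, but the sensor hasn't moved in months.
    site = {"start_battery_volts": 12.6,
            "failed_start_count": 0,
            "days_since_coolant_sensor_toggled": 240}
    for reason in generator_not_ready_reasons(site):
        print("WARNING:", reason)

Run a check like this continuously and the "tested Thursday, failed Monday" scenario shows up as a warning days in advance instead of an outage during a storm.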

You must collect alarms from all reasonable sources

The first pillar of high-reliability monitoring is breadth. To get a true picture of site health, bias toward capturing more data rather than less, within a "reasonable" scope.

That means instrumenting this client's four domains with some basic monitoring tech:

Power & Electrical:

  • Rectifiers & Battery Plants: Analog voltage into the RTU, plus "major/minor" contacts if available. Trend voltage for early warning of failing strings.
  • Generator: Oil pressure, run time, failed start, engine temperature, excessive speed. Start-battery voltage as a dedicated analog. These prevent exactly the "tested Thursday, failed Monday" scenario.
  • Commercial Power & Transfer Switch: Discrete state plus runtime logic.

Environmental:

  • Temperature & Airflow: A single temperature reading tells you about HVAC failures, heat load, and even fire. Airflow exposes clogged filters and degraded cooling.
  • Leak/Smoke/Gas: Simple sensors that protect entire rooms of equipment from very real explosive and moisture risks.

Safety & Security:

  • Doors & Motion: Basic contact closures and motion sensors cut theft, vandalism, and "mystery outages" caused by unintended access.

RF Power:

  • Forward/Reflected Power & Combiner Outputs: Discrete or analog sensing (plus any vendor dry contacts) for transmit health.

Just as important as what you collect is how you collect it. Use RTUs as "boots on the ground" to normalize contact closures, analogs, and protocol data (SNMP/Modbus/DNP3/TL1) into one coherent stream headed to your master. That architecture avoids fractured systems and ensures the right details reach the right people quickly.
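
As a rough illustration of that normalization idea (field names and severities below are assumptions, not a vendor data model), every input type collapses into one common record before it heads north:

    # Illustrative normalization sketch - field names and severities are assumptions.
    from dataclasses import dataclass
    import time

    @dataclass
    class AlarmRecord:
        site: str          # which remote site reported it
        point: str         # standardized point name from your display map
        severity: str      # "critical", "major", "minor", or "status"
        state: str         # normalized reading or state
        timestamp: float   # when the RTU saw it

    def from_contact_closure(site, point, closed):
        return AlarmRecord(site, point, "major" if closed else "status",
                           "ALARM" if closed else "CLEAR", time.time())

    def from_analog(site, point, reading, units):
        return AlarmRecord(site, point, "status", f"{reading:.1f} {units}", time.time())

    # Discretes, analogs, and protocol data (SNMP/Modbus/DNP3/TL1) all become the
    # same record type, so the alarm master sees one stream instead of islands.
    stream = [
        from_contact_closure("Hilltop", "Generator Fail to Start", True),
        from_analog("Hilltop", "Start Battery Voltage", 12.1, "VDC"),
    ]
    for record in stream:
        print(record)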

Why the emphasis on breadth and normalization? Because once you have enough signals, you can see the causal chain (commercial power fail, then a generator start attempt, then a coolant sensor fault, then a failed start).

That's when simple logic becomes powerful. Combining alarms ("AND" commercial-power-fail with low-battery, for example) drives the correct action at the right moment - like starting a generator only when it's truly needed.
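
Expressed as logic, that derived alarm is just an AND of two normalized points. Here's a hypothetical sketch, with made-up point names and a made-up voltage threshold:

    # Hypothetical derived alarm - point names and the 47 V threshold are illustrative.
    def battery_reserve_critical(points: dict) -> bool:
        """True only when commercial power is out AND the battery plant is sagging."""
        commercial_power_failed = points.get("commercial_power_fail", False)
        battery_low = points.get("plant_voltage_vdc", 54.0) < 47.0
        return commercial_power_failed and battery_low

    # Commercial power failure alone is routine; low voltage alone may be a
    # rectifier problem. The AND of the two is the moment that demands action,
    # such as a generator start or an immediate dispatch.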

For our city client, the "all reasonable sources" mindset directly answers the pain we heard:

  • Monthly generator tests didn't catch a sensor that lies, so they must add coolant/temperature/oil/failed-start and start-battery monitoring.
  • Legacy boxes created islands of temperature/door data, so they should consolidate via RTUs & an alarm master to correlate power, environmental, security, and RF events.
  • Fewer staff must cover more sites, so broader visibility reduces windshield time and lets you visit "according to reality, not your best guess."

Bottom line: If you only collect "some" data, you'll only solve "some" problems. A modest expansion to cover the four domains - using mostly simple sensors - transforms your response from reactive to predictive. That's what keeps radios up, protects equipment, and lets a lean team sleep at night.

Aggregate and distribute detail promptly

Once you're collecting the right signals, the next job is making sure the right people see the right detail at the right time. Our client already had the right instinct:

Client: "We'd like to do escalating notifications. For example, this one's an email, this one's the NMS, this one's a text message, and this one's a phone or radio call. We can do that."

That's the blueprint! Your system should support multiple channels and escalation logic. After a reasonable delay, an unacknowledged alarm should step up from NMS to email to text to a live call - or skip straight to the right person when severity demands it.

Me: "That's what you'll do with these 'Notification' columns in the NetGuardian. You can send everything to those two SNMP managers, everything except a few to your email, and then only a few select alarms to the audio output. You get to choose how you want to interact with it."

Two additional practices make this scalable:

  • Standardize once, reuse everywhere.
    Me: "Once you develop this RTU profile of alarm names and settings, the idea is to download it and then upload it into the next device so you get a head start on your configuration."
  • Name and map your points consistently.
    Me: "We use defined blocks. We call it our display map. We use the old-school telco 'display and points' concept. For example, the NetGuardian 832A has discrete alarms numered 1 through 32, and the second set is reserved for the larger 864A model."

The client's team was already doing the legwork to make this scalable:

Client: "That's why we had some folks go out and do the inventory - to figure out what's maximum. Then we added 10 or 20% in case there's a site we didn't think of."

That inventory (plus a sensible display map) prevents "alarm sprawl," keeps training simple, and makes escalations predictable instead of chaotic.
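
A display map can be as simple as a table you maintain once and reuse on every RTU. Here's an illustrative sketch - the display/point numbering and names are assumptions, not a specific profile format:

    # Illustrative display map - numbering and names are assumptions, not a file format.
    import json

    DISPLAY_MAP = {
        "D1.1": "Commercial Power Fail",
        "D1.2": "Generator Running",
        "D1.3": "Generator Fail to Start",
        "D1.4": "Start Battery Voltage Low",
        "D2.1": "High Temperature",
        "D2.2": "Low Airflow",
        "D3.1": "Front Door Open",
        "D3.2": "Motion Detected",
    }

    # Export the profile once, then import it into the next RTU so every site
    # reports "D1.3 Generator Fail to Start" and nothing is named ad hoc.
    with open("rtu_profile.json", "w") as f:
        json.dump(DISPLAY_MAP, f, indent=2)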

Turn known failure modes into automation

Once your remote-site data is flowing and your people are getting alerted appropriately, the next step is to shorten response times with simple automation. Start with the obvious, high-confidence plays.

You've already seen how a generator can fail despite passing a recent run test. You shouldn't wait for a human to discover a stuck sensor or a bad start battery at 2 a.m. Use Boolean logic and timers to catch those conditions and act on them instantly.

We talked about this at the HVAC layer (which is really just comfort-control logic applied to uptime):

Me: "With the HVAC Controller, you can say to your system: on commercial power, run as much as you want. But on generator power, you can't run more than one. We can also set heating and cooling windows for start and stop triggers. And if someone gets out to the site and turns on Comfort Mode to be more comfortable during maintenance work, it will time out after a set period so they can't forget."

Even better is tying that "comfort" override to real-world presence:

Client: "Could that Comfort Mode be associated with the motion detector, or a Boolean logic of a closed door and lack of motion?"
Me: "We'd probably just add that to the firmware for you as part of your order. That doesn't sound like a very challenging modification."

That's the pattern: codify the things you already do manually. Use simple AND/OR logic and timers to guard against misuse and corner cases. Do it for generators, HVAC, access, and RF transmitters. Automation doesn't replace judgment - it buys you time by handling routine responses in seconds, not minutes. Automation can also take action locally during communication outages when you and your central NMS aren't able to talk to the local RTU.
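
Here's what that "comfort mode with a timeout, tied to presence" rule looks like as a sketch. The inputs, timeout, and one-unit cap are illustrative assumptions, not actual firmware:

    # Illustrative HVAC override logic - inputs, timeout, and unit cap are assumptions.
    import time

    COMFORT_TIMEOUT_SECONDS = 2 * 60 * 60   # auto-expire after 2 hours

    def hvac_units_allowed(on_generator_power):
        # On commercial power, run whatever the site needs; on generator, cap at one unit.
        return 1 if on_generator_power else 99

    def comfort_mode_active(enabled_at, door_closed, motion_recent):
        """Comfort Mode stays on only while someone is plausibly still on site."""
        if enabled_at is None:
            return False
        if time.time() - enabled_at > COMFORT_TIMEOUT_SECONDS:
            return False   # the tech forgot to turn it off - time it out
        if door_closed and not motion_recent:
            return False   # site appears empty - revert to normal setpoints
        return True

    # Example: override enabled an hour ago, door closed, no motion for a while.
    print(comfort_mode_active(time.time() - 3600, door_closed=True, motion_recent=False))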

Log, review, and tune

With collection, distribution, and some automation in place, you finally have the breathing room to tune. This is where small improvements compound into fewer outages and lower OpEx.

A small example from our walkthrough:

Me: "Notice the thresholds I created for this analog value: 10% and 20% on the low side, then 75% and 85% on the high side. If the value is below 20% or above 75%, you'd get an alert."

That kind of tiered thresholding helps you avoid nuisance pages while still capturing early warnings. And visibility turns maintenance into a scheduled task, not a scramble:

Me: "This is that temperature+airflow sensor, showing only 32%. This is a bad situation. The filter has gotten clogged because 100% would have been normal. You press calibrate when you install, then it ticks down from there if the airflow slows down. If something goes wrong, the blower fails, or filters clog, you'll see it."

As you see airflow decay across multiple sites, you can time filter changes proactively (and stop wasting trips for premature swaps). Do the same with start-battery trends, rectifier voltage, temperature rise times, and RF power trends.

This is also the stage where you prune noisy alarms that teach you nothing. If something wakes someone up at night, it should either be rare (a true emergency) or genuinely actionable. Everything else should be quiet telemetry that informs your weekly review, not an alert on your nightstand phone.

How to work within your constrained budget

The client's reality is familiar to every public agency and most enterprises right now:

Client: "I don't even have the money in the budget this year. At this point it's tire-kicking, but I'd definitely share what we have with you to help me define the project."

This doesn't have to be a problem. It's actually an opportunity to sequence your rollout:

  1. Inventory & standardize your alarm points.
    Client: "Oh yeah, we're very standardized across our sites. We imagine pairs on a 66 block and have them all planned out."
  2. Phase the points in order of risk.
    Start with power chain (commercial power, transfer switch, generator with start-battery analog, rectifier/battery plant), then environment (temp/airflow/leak/smoke), then security (doors/motion), then RF. You'll prevent the most expensive incidents first.
  3. Aggregate alerts, then escalate sanely.
    Get away from islands of web interfaces. Centralize into a master or a unified NMS feed.
    Me: "They've long been able to send northbound SNMP or email. You can send everything to that SNMP manager, or assign it to multiple - up to eight notification devices. We can process SNMP coming into the NetGuardian now, too."
  4. Automate the obvious.
    Start with simple "echo" actions (failed start → retry; after-hours door → camera preset). Add AND/OR logic where it prevents on-generator overrun and dead batteries. Tie "comfort" overrides to motion/door sensors and time them out to avoid human forgetfulness that wastes power.
  5. Lock in small wins with reviews.
    Hold a monthly 30-minute "alarm cleanup" to retire noisy points and adjust thresholds based on what the logs actually show. Capture lessons in your alarm text ("If X, check Y; bring part Z") so future techs don't have to rediscover them.

When you're ready to move, we'll meet you where you are:

Me: "If you send over the spec, we'll do an initial proposal and probably revise it once or twice after talking with you."

And we'll keep the engineering close to the conversation so small changes don't turn into big delays:

Me: "That's the advantage of keeping the engineers on-site at our headquarters. We're more flexible. If we can do something that saves you time, effort, and urgency, that's good."

Being proactive pays off quicker than you think

Our client's generator story is the kind of "near miss" that drives real action. It's also a reminder that good luck during a test is not the same as resilience during a storm. The fix is straightforward:

  • Collect from all reasonable sources across power, environment, security, and RF.
  • Aggregate and distribute details so the right person sees the right data at the right time.
  • Automate the known patterns so routine responses happen in seconds.
  • Log and tune so you steadily reduce noise and cost.

That's how you help a smaller team cover a bigger map with fewer surprises. It's also how you avoid the pattern I described to this client:

Me: "We call that the second event model. We'll do a quote, and sometimes it just sits until the powers that be get a shock to the system. I once had somebody blow the dust off an eight-year-old quote."

Don't wait for a second incident to prove the point. Finish your inventory. Map your points. Centralize your alerts. Automate the obvious. Review and tune. You'll protect radios, people, budgets - and your sleep!

Andrew Erickson

Andrew Erickson is an Application Engineer at DPS Telecom, a manufacturer of semi-custom remote alarm monitoring systems based in Fresno, California. Andrew brings more than 18 years of experience building site monitoring solutions, developing intuitive user interfaces and documentation, and opt...