Building a Health Score That Actually Predicts Churn (Not Just Describes It)

I've seen the same conversation happen at every CS team I've worked with or talked to. Someone builds a health score, usually over a weekend sprint with a spreadsheet and some intuition about what matters. They assign weights — usage gets 40%, NPS gets 20%, support tickets get 15%, engagement gets 25%. They build it into their CS platform. It runs for six months, and during that time it does two things reliably: it turns red on accounts that have already churned (too late), and it stays green on accounts that churn unexpectedly (wrong). Both outcomes erode trust in the score until CSMs start ignoring it.

The diagnosis is almost always the same: the score is descriptive rather than predictive. It measures the current state of an account's health, not the trajectory. And because it's weighted based on intuition rather than actual churn event data, the weights don't reflect which signals genuinely precede churn — they reflect which signals are available and seem important.

Building a health score that actually predicts churn requires a different starting point, a different weighting methodology, and a different evaluation standard.

Start With Churn Forensics, Not Signal Availability

The first step most teams skip is a systematic retrospective analysis of their own churn events. Before you decide which signals to include or how to weight them, you need to know what signals actually moved in your churned accounts and how far in advance.

Pull your last 18-24 months of churn data. For each churned account, trace these questions: What did product usage look like at 90, 60, 30, and 7 days before churn? What did support ticket volume and content look like? Did the primary point of contact go dark at some point, and when? Was there a billing event (failed payment, downgrade inquiry, billing page visit spike) before the churn date? Was there an NPS response in the 90 days before churn, and if so, what was the score?

This forensic work usually surfaces a surprise: the signals that correlate most strongly with churn in your customer base are often not the ones that CS teams assume are most important. At one growing SaaS company I consulted with briefly before starting Vendarix, the forensic analysis revealed that billing page visits in the 30 days before churn had a stronger correlation with upcoming churn than both NPS scores and support ticket volume — and it was the one signal nobody was monitoring.

The output of this analysis is a ranked signal list: which signals, and at what time horizon, showed the strongest correlation with churn outcomes. That ranked list is the input to your weighting model — not intuition, not availability, not what sounds important in a meeting.
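Here's a rough sketch of that ranking step. It assumes you have already exported one snapshot per account for each signal and time horizon, and it uses a plain Pearson correlation against a 0/1 churn label as the ranking measure. The function names, signal names, and the six-account dataset are all made up for illustration; a real analysis needs far more accounts and probably a more robust statistic.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_signals(snapshots, churned):
    """Rank (signal, days_before) pairs by absolute correlation with churn.

    snapshots: {(signal_name, days_before): [one value per account]}
    churned:   [1 if the account churned, else 0], in the same account order
    """
    ranked = [
        (signal, days_before, round(pearson(values, churned), 3))
        for (signal, days_before), values in snapshots.items()
    ]
    ranked.sort(key=lambda row: abs(row[2]), reverse=True)
    return ranked

# Hypothetical data: six accounts, the first three churned.
churned = [1, 1, 1, 0, 0, 0]
snapshots = {
    ("billing_page_visits", 30): [5, 4, 6, 1, 0, 1],
    ("nps_score", 30): [6, 8, 7, 8, 7, 9],
    ("support_tickets", 30): [3, 1, 2, 2, 1, 3],
}
ranking = rank_signals(snapshots, churned)
```

With this toy data the list comes back ordered billing_page_visits, nps_score, support_tickets. Correlation on a small churn sample is noisy, so treat the ranking as a starting hypothesis rather than a verdict.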

The Architecture of a Predictive Health Score

A predictive health score differs from a descriptive one in three structural ways: it incorporates trend direction, not just current state; it weights leading signals more heavily than lagging ones; and it includes a rate-of-change component that catches fast-moving deterioration.

Current state vs. trend direction. A usage score that measures "DAU/WAU ratio = 0.45" tells you the current state. A trend-aware score that measures "DAU/WAU ratio = 0.45, down from 0.72 fourteen days ago" tells you something completely different. The first account might be stable at a low engagement level — low but not changing. The second is in free fall. A scoring model that treats both identically is missing the most important information available.

For every input signal, you want to model not just the current value but the 14-day and 30-day delta. A score that has moved from 72 to 65 over two weeks should be treated as higher risk than a score that has been sitting at 65 for six weeks.
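A minimal sketch of the delta idea, assuming you store one value per day for each signal, oldest first. The function name and the two example histories are hypothetical; the 0.45 and 0.72 figures mirror the DAU/WAU example above.

```python
def trend_features(daily_values, windows=(14, 30)):
    """Return the current value plus the change versus N days ago.

    daily_values: one value per day, oldest first, newest last.
    A delta is None when there is not enough history for that window.
    """
    current = daily_values[-1]
    features = {"current": current}
    for days in windows:
        if len(daily_values) > days:
            past = daily_values[-1 - days]
            features[f"delta_{days}d"] = round(current - past, 4)
        else:
            features[f"delta_{days}d"] = None
    return features

# Two accounts that end at roughly the same DAU/WAU ratio.
stable = [0.45] * 31
falling = [0.72] * 17 + [0.72 - 0.27 * (i / 14) for i in range(1, 15)]

stable_features = trend_features(stable)
falling_features = trend_features(falling)
```

Both accounts end near 0.45, but the stable one has deltas of 0.0 while the falling one shows a 14-day delta of -0.27. Feeding the deltas into the score alongside the current value is what separates the two.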

Leading signal weighting. Based on your churn forensics, assign higher weights to signals that appear earlier in the causal chain. Feature adoption velocity and time-to-value metrics typically show up 60-90 days before churn. Billing page behavior and support ticket semantics typically show up 20-45 days before churn. Session frequency and NPS typically show up 7-21 days before churn.

This means a model that's optimized for lead time — for giving you maximum runway to intervene — should weight the early signals more heavily, even though they're noisier and harder to interpret. Yes, feature adoption stagnation at 60 days sometimes resolves on its own. But a model that fires 60 days early and is right 60% of the time is more valuable to a CS team than a model that fires 10 days early and is right 90% of the time. The CSM can't do much in 10 days. They can do a lot in 60.
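To make the hidden assumption in that trade-off explicit, here is a back-of-the-envelope sketch. The precision figures (60% and 90%) come from the paragraph above; the save rates are numbers I made up purely to illustrate the shape of the argument, and you would need to measure your own.

```python
def expected_saves_per_alert(precision, save_rate):
    """Expected churns prevented per alert fired.

    precision: share of alerts that are real churn risks
    save_rate: ASSUMED share of real risks a CSM can turn around
               given the lead time the model provides
    """
    return precision * save_rate

# Hypothetical save rates: 50% with ~60 days of runway, 10% with ~10 days.
early = expected_saves_per_alert(precision=0.60, save_rate=0.50)
late = expected_saves_per_alert(precision=0.90, save_rate=0.10)
```

Under these assumed save rates the early model prevents roughly 0.30 churns per alert against 0.09 for the late one. If your save rates turn out to be flatter across lead times, the comparison can flip, which is a good reason to track them.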

Rate-of-change detection. Gradual decline is one risk pattern. Sudden deterioration is another, and it needs separate treatment. An account whose health score moves from 71 to 68 over 30 days is very different from one that moves from 71 to 45 over 10 days. The second pattern is the one that requires immediate escalation, not just a routine check-in task.

A velocity component in your health score model — something that measures how fast the score is moving, not just where it is — lets you differentiate these patterns. Rapid deterioration should override tier thresholds: an account whose score drops more than 15 points in seven days should escalate to critical response regardless of its current absolute score.
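One way to sketch the override, assuming daily scores stored oldest first. The 15-point, seven-day rule is from the text above; the absolute tier cutoffs (70, 50, 35) are placeholders I picked for the example, and measuring the drop from the window's peak rather than from the value exactly seven days ago is one possible reading of the rule.

```python
def risk_tier(scores, drop_threshold=15, window_days=7):
    """Assign a tier from daily scores (oldest first, newest last).

    A drop of more than drop_threshold points from the highest score in
    the last window_days overrides the absolute tier cutoffs.
    """
    current = scores[-1]
    window = scores[-(window_days + 1):]  # today plus the previous N days
    drop = max(window) - current
    if drop > drop_threshold:
        return "Critical"
    # Hypothetical absolute cutoffs; tune these to your own tiers.
    if current >= 70:
        return "Healthy"
    if current >= 50:
        return "Watch"
    if current >= 35:
        return "At-Risk"
    return "Critical"

gradual = [71] * 10 + [70] * 10 + [69] * 5 + [68] * 6      # 71 to 68 over ~30 days
sudden = [71, 68, 66, 63, 61, 58, 55, 53, 50, 48, 45]      # 71 to 45 over 10 days
healthy_but_falling = [92, 92, 91, 88, 84, 79, 74]         # still above 70

tiers = [risk_tier(s) for s in (gradual, sudden, healthy_but_falling)]
```

The gradual account lands in "Watch", while the other two come back "Critical". The last case is the interesting one: at 74 it would still read as healthy on absolute score alone.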

Signal Selection: The Full Stack

A health score with enough signal breadth to be genuinely predictive needs to draw from at least three data sources: product analytics, support/communication tools, and billing systems. Most CS teams have all three available, but the majority connect only one or two to their health score model.

Product analytics signals (highest lead time):

DAU/WAU ratio: the share of weekly active users who are active on a given day, a measure of how frequently the team uses the product
Feature adoption breadth — how many core features have been activated vs. available
Feature adoption velocity — rate at which new features are being adopted over time
Power user ratio — what percentage of licensed seats show high-frequency engagement
Time-to-value checkpoint hit rate — did the account reach its first/second value milestone on schedule

Support and communication signals (medium lead time):

Support ticket volume trend (7-day vs. 30-day baseline)
Ticket semantic category — general questions vs. friction escalations vs. intent signals (export, data portability, cancellation, alternatives)
Response latency from primary contact — are they responding to CSM outreach at the same rate they were 30 days ago?
Last touchpoint recency — days since any meaningful CSM-customer interaction
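For the ticket semantic category, a keyword pass is the crudest possible starting point. In the sketch below the intent terms come from the list above; the friction terms, the function name, and the sample tickets are my own assumptions, and a production version would more likely use a trained classifier.

```python
# Intent terms are taken from the signal list; friction terms are assumed.
INTENT_TERMS = ("export", "data portability", "cancel", "alternative")
FRICTION_TERMS = ("broken", "not working", "frustrat", "escalat", "still waiting")

def ticket_category(text):
    """Bucket a ticket as 'intent', 'friction', or 'general' by keyword."""
    lowered = text.lower()
    if any(term in lowered for term in INTENT_TERMS):
        return "intent"
    if any(term in lowered for term in FRICTION_TERMS):
        return "friction"
    return "general"

samples = [
    "How do I export all our data before the contract ends?",
    "The sync is still not working after three tickets",
    "Where do I add a teammate?",
]
categories = [ticket_category(t) for t in samples]
```

Intent is checked first on purpose: a ticket that mentions both a broken feature and cancellation should count as the stronger signal.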

Billing signals (medium to short lead time):

Billing page visit frequency in last 30 days
Pricing page visits (an existing customer revisiting pricing is often comparison shopping)
Payment failure events — even if resolved, a payment failure is a behavioral signal
Downgrade inquiry in support tickets
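Pulling the three sources together might look something like the sketch below. It assumes each signal has already been normalized to a 0.0-1.0 health value. The registry, the weights, and the function name are placeholders, not recommendations; your churn forensics should set the real weights.

```python
# name: (source, weight). Weights here are placeholders for illustration.
SIGNALS = {
    "dau_wau_ratio": ("product", 0.15),
    "feature_adoption_velocity": ("product", 0.20),
    "ticket_intent_signals": ("support", 0.20),
    "contact_response_latency": ("support", 0.15),
    "billing_page_visits": ("billing", 0.20),
    "payment_failures": ("billing", 0.10),
}
REQUIRED_SOURCES = {"product", "support", "billing"}

def health_score(signal_health):
    """Combine per-signal health (0.0 = worst, 1.0 = best) into a 0-100 score.

    Returns (score, missing_sources). Weights are renormalized over the
    signals supplied, so the second value shows which sources the score
    is blind to.
    """
    known = {k: v for k, v in signal_health.items() if k in SIGNALS}
    if not known:
        raise ValueError("no recognized signals supplied")
    total_weight = sum(SIGNALS[k][1] for k in known)
    weighted = sum(SIGNALS[k][1] * v for k, v in known.items())
    covered = {SIGNALS[k][0] for k in known}
    return round(100 * weighted / total_weight, 1), REQUIRED_SOURCES - covered

# An account scored on product analytics alone.
score, missing = health_score(
    {"dau_wau_ratio": 0.8, "feature_adoption_velocity": 0.6}
)
```

Here the score is 68.6 and the missing set contains support and billing. Surfacing the blind spots alongside the number is a cheap way to stop a one-source score from passing as a full picture.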

Calibration: What Good Actually Looks Like

Here's a concrete example from a scoring model we ran during Vendarix's early testing. The test cohort covered 140 accounts at a B2B SaaS company with $5.1M ARR over a 90-day window. We scored each account daily using the multi-source model described above, and compared outcomes to a traditional usage-plus-NPS model the company had been running.

Traditional model performance: of 11 accounts that churned during the window, the traditional model gave 7 of them a "Watch" or "At-Risk" rating within 21 days of churn. Four churned from a "Healthy" status — meaning the model gave zero warning. Average lead time for the 7 it caught: 14 days.

Multi-source predictive model performance: of the same 11 churned accounts, the predictive model gave 9 of them an "At-Risk" or "Critical" rating. Average lead time for those 9: 48 days. Two accounts still churned from "Healthy" status: one due to an organizational restructure that produced no behavioral signals, and one due to an abrupt budget cut. Some churn is structurally unpredictable. The goal is to catch the behavioral churn early, not to claim perfect prediction.

The two metrics that matter for evaluating a health score model are the false negative rate at a meaningful lead time (how many churns did the model miss outright, or flag too late to act on?) and the average lead time for true positives (how early did it catch the ones it caught?). A model that catches 8 out of 10 churns at 45 days average lead time is dramatically more useful than one that catches 9 out of 10 at 12 days average lead time.
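Both metrics are easy to compute once you record, for each churned account, how many days of warning the model gave. A minimal sketch, with a hypothetical function name and with per-account lead times flattened to the averages from the example above (real data would vary account by account):

```python
def evaluate(lead_times, min_lead_days=0):
    """Score a model against a list of churned accounts.

    lead_times: one entry per churned account, either days of warning
                before churn or None if the model never flagged it.
    min_lead_days: warnings shorter than this count as misses.
    Returns (miss_rate, average_lead_time_of_useful_catches).
    """
    useful = [d for d in lead_times if d is not None and d >= min_lead_days]
    miss_rate = (len(lead_times) - len(useful)) / len(lead_times)
    avg_lead = sum(useful) / len(useful) if useful else None
    return miss_rate, avg_lead

model_a = [45] * 8 + [None] * 2   # catches 8 of 10 at 45 days
model_b = [12] * 9 + [None]       # catches 9 of 10 at 12 days

raw_a = evaluate(model_a)
raw_b = evaluate(model_b)
strict_a = evaluate(model_a, min_lead_days=30)
strict_b = evaluate(model_b, min_lead_days=30)
```

With no minimum lead time, model B looks better on misses (0.1 versus 0.2). Require 30 days of runway and model A keeps its 0.2 miss rate while model B misses everything. The 30-day bar is an assumption; set it to however long your team needs to run a save play.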

Common Weighting Mistakes

A few patterns we see regularly that undermine predictive accuracy:

Overweighting NPS. NPS is a valuable relationship signal, but it's event-driven (it only exists when a survey is sent), subject to response bias (disengaged users, the ones most likely to churn, often don't answer), and typically runs on a quarterly cadence that has nothing to do with when churn risk develops. Weighting it above 15% in a predictive model usually hurts precision, because the score leans on stale data at exactly the moments when churn risk is developing between survey cycles.

Ignoring renewal proximity as a risk amplifier. An account with a health score of 62 and a renewal in 90 days is a very different situation from an account with a health score of 62 and a renewal in 300 days. Some models treat these identically. Renewal proximity should function as a weighting multiplier, not a standalone signal — it amplifies the urgency of every other signal as the renewal date approaches.
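Here is one possible shape for that multiplier, sketched under assumptions of my own: risk is expressed as 100 minus the health score, the multiplier ramps linearly from 1.0 at 180 or more days out to 2.0 on the renewal date, and the result is capped at 100. None of those constants come from a real calibration.

```python
def renewal_urgency(risk, days_to_renewal, horizon_days=180, max_multiplier=2.0):
    """Scale a 0-100 risk value by how close the renewal is.

    The multiplier is 1.0 at horizon_days or more, and rises linearly
    to max_multiplier at 0 days. horizon_days and max_multiplier are
    illustrative defaults.
    """
    closeness = max(0.0, 1 - max(days_to_renewal, 0) / horizon_days)
    multiplier = 1 + (max_multiplier - 1) * closeness
    return min(100.0, risk * multiplier)

# Health score of 62 in both cases, so base risk is 38.
near = renewal_urgency(100 - 62, days_to_renewal=90)
far = renewal_urgency(100 - 62, days_to_renewal=300)
```

The same health score of 62 yields an urgency of 57.0 with renewal 90 days out and 38.0 with renewal 300 days out. Because it multiplies the combined risk, it raises the stakes on every underlying signal at once instead of competing with them for weight.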

Static weights. Weights set at implementation don't update as your product, customer base, and churn patterns evolve. We recommend re-running churn forensics every six months and adjusting signal weights accordingly. A SaaS product that launched two major features since the model was built should have those features' adoption signals incorporated into the health score — and the weight adjustments will be counterintuitive if you don't look at actual churn correlation data first.

We're not saying building a predictive health score is simple — the data plumbing alone across product analytics, support tooling, and billing systems requires real infrastructure. But the alternative is a score that tells you what already happened, which is a history book when what you need is a weather forecast.