storage.m-khalifa.com / field notes
Field notes · infrastructure monitoring

The vendor thing on your floor is infrastructure.

Four articles, one argument. Every platform that arrives with its own console and its own definition of "used" ends up monitored by a tool that stops at that console's edge — and the operational picture scatters across portals nobody on the floor team can log into. The fix is always the same: normalize it into the model your team already trusts. These pieces work that idea through Outposts, replication, alerts, and capacity.

The articles

Each is a standalone read and each is WordPress-ready — the body markup below pastes into a Gutenberg custom-HTML block as-is, diagrams included.

Article 01 · Hybrid Cloud

Your AWS Outpost Is On-Prem Infrastructure That Doesn't Know It Yet

Why the gap between the cloud console and the physical rack costs more than the hardware.

Your AWS Outpost arrived six months ago. It's racked in the cage next to your Pure FlashArrays and your Nutanix nodes, running workloads that can't leave the building. The console says it's healthy. Your storage team has never logged in. Your virtualization admin doesn't know it exists. And nobody can tell you, right now, how much raw capacity is left on the physical rack versus how much you've already carved up for tenants.

That last gap is the whole problem. The number exists. It just lives somewhere none of the usual tools look.

The console shows a cloud. The rack is a data center.

AWS built the Outposts console to feel like the rest of AWS: EC2 instances, EBS volumes, CloudWatch graphs, and a green "Available" badge. From the console's point of view, the Outpost is one more Availability Zone. That works fine, right up until you ask a question the cloud model was never built to answer.

How full is the physical rack? The console shows the volumes you've provisioned. It doesn't show the physical capacity underneath them, and that's the number that decides when "add another workload" becomes "issue a purchase order." AWS does publish it, in CloudWatch, on a dashboard somebody has to build. Unless a person went looking for it on purpose, it's on nobody's screen.

Which volumes are orphaned? The console lists volumes by Availability Zone. It won't tell you which ones are attached to nothing, billing since the last migration, serving no workload. Unattached volumes can sit there forever until an administrator hunts them down. That's true on every platform, cloud or on-prem; the difference is whether anything is watching. Mostly, nothing is. Harness, in its 2025 FinOps in Focus report, found that only 39% of teams have real-time visibility into unused or orphaned resources. From the moment that waste starts accruing, identifying and eliminating it takes an average of 31 days.
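As a rough sketch of what "watching" could look like, the snippet below filters a list of volume records for ones attached to nothing and ages them. The record shape borrows the `State`, `Attachments`, `Size`, and `CreateTime` fields of an EC2 DescribeVolumes response; the sample data and the `find_orphans` helper are hypothetical, and creation time is only a stand-in for how long a volume has actually been unattached.

```python
from datetime import datetime, timezone

def find_orphans(volumes, now):
    """Return unattached volumes, oldest first, with their age in days."""
    orphans = []
    for vol in volumes:
        # An unattached EBS volume reports state "available" and no attachments.
        if vol["State"] == "available" and not vol["Attachments"]:
            age_days = (now - vol["CreateTime"]).days
            orphans.append({"id": vol["VolumeId"], "size_gib": vol["Size"], "age_days": age_days})
    return sorted(orphans, key=lambda o: o["age_days"], reverse=True)

# Hypothetical sample data, not real volume IDs.
NOW = datetime(2025, 6, 1, tzinfo=timezone.utc)
VOLUMES = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Attachments": [{"InstanceId": "i-111"}],
     "Size": 500, "CreateTime": datetime(2025, 1, 10, tzinfo=timezone.utc)},
    {"VolumeId": "vol-bbb", "State": "available", "Attachments": [],
     "Size": 2000, "CreateTime": datetime(2025, 3, 3, tzinfo=timezone.utc)},
    {"VolumeId": "vol-ccc", "State": "available", "Attachments": [],
     "Size": 100, "CreateTime": datetime(2025, 5, 22, tzinfo=timezone.utc)},
]

for orphan in find_orphans(VOLUMES, NOW):
    print(orphan)
```

The point of the sketch is the output shape: a short list with an age and a size on every row, which is what a weekly storage review can act on.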

Why are some instances missing memory metrics? Is that a configuration gap or a monitoring failure? You can't tell from the console. CloudWatch doesn't report memory or filesystem usage on its own; that takes the CloudWatch Agent inside each instance. Without the agent, the console shows nothing. And nothing looks exactly like zero. At 2 a.m., that's the difference between "ignore" and "page someone."

You ask these same questions about every platform on the floor. With the Outpost, though, the cloud tools treat it as cloud and the floor team treats it as someone else's problem. The device falls into the gap between them, and the gap is wide enough to hide real money.

Every vendor names the same thing differently

This is where mixed-environment monitoring gets frustrating, and it has nothing to do with AWS specifically.

Your Pure array reports capacity as used, available, and effective. Your VxRail cluster reports datastore capacity, provisioned versus consumed, with VMDK-level detail underneath. Your Nutanix nodes report storage capacity in TB with a separate physical-usage figure. The Outpost reports EBS volume sizes, per-instance-family utilization, and rack capacity as a CloudWatch metric in gigabytes per volume type.

Same concept. Storage capacity. Four vocabularies, four definitions of "used," four thresholds for "full," and three of the four consoles behind credentials your storage admin doesn't have.

[Figure: Four names for the same number. Pure Storage "effective used" (after data reduction), Dell VxRail "datastore consumed" (vCenter), Nutanix "storage pool usage" (Prism), and AWS Outpost EBSVolumeTypeCapacityUtilizationGB feed Visual One Intelligence, which reports a single used-capacity percentage: one definition, one threshold, one trend line on every platform, in place of four consoles, four definitions, and four thresholds for "full."]
Every platform names used capacity differently. Normalization makes them one number.
Platform | Core capacity metric | Granular view | Where it lives
Pure Storage | Usable vs. effective (after data reduction) | Array / volume | Storage console
Dell VxRail | Datastore provisioned vs. consumed | VMDK / host | vCenter
Nutanix | Physical raw vs. storage pool | Container / VM | Prism
AWS Outpost | EBS provisioned (GB per volume type) | Volume / instance family | CloudWatch, IAM-gated
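One way to picture the normalization is an adapter per platform, each resolving that vendor's native fields to the same used-capacity percentage. This is a minimal sketch, not any product's implementation: every field name and reading below is a hypothetical stand-in for what the platform's API returns, and it assumes each reading is internally consistent in its units.

```python
def used_pct(used, total):
    """Used capacity as a percentage of physical capacity, one decimal place."""
    return round(100.0 * used / total, 1)

# One adapter per platform. Field names are hypothetical stand-ins.
ADAPTERS = {
    "pure": lambda r: used_pct(r["physical_used"], r["usable"]),
    "vxrail": lambda r: used_pct(r["datastore_consumed"], r["datastore_capacity"]),
    "nutanix": lambda r: used_pct(r["pool_usage"], r["pool_capacity"]),
    # The Outpost reports per volume type, so sum the types to get the rack.
    "outpost": lambda r: used_pct(sum(r["used_gb"].values()), sum(r["capacity_gb"].values())),
}

def normalize(platform, reading):
    """Translate one platform's native reading into the shared percentage."""
    return ADAPTERS[platform](reading)

# Hypothetical readings, each in its own native units.
READINGS = {
    "pure": {"physical_used": 62, "usable": 100},
    "vxrail": {"datastore_consumed": 142, "datastore_capacity": 200},
    "nutanix": {"pool_usage": 48, "pool_capacity": 120},
    "outpost": {"used_gb": {"gp2": 6050}, "capacity_gb": {"gp2": 11000}},
}

FLOOR = {name: normalize(name, reading) for name, reading in READINGS.items()}
print(FLOOR)
```

Because the result is a ratio, each adapter can stay in its platform's native unit; the unit argument only starts when absolute free space is compared across platforms.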

The translation problem isn't cosmetic. When someone asks "how much free space do we have," the answer depends on which screen they're standing in front of. When leadership asks how much runway is left before procurement lead times become a risk, the answer is a spreadsheet stitched from four sources, by hand, stale before the meeting starts. Now multiply that by IOPS, latency, memory, snapshots, and health. The blind spots stop being a mystery.

The on-prem cloud is still on-prem

Infrastructure teams have started managing Outposts differently than they did two years ago.

Early on, the rack was a cloud outpost. The cloud team owned it, the cloud budget covered it, cloud tools watched it. That held when one rack ran one workload.

It holds less well now. Teams are adding RDS for managed databases, EKS and ECS for containers, S3 on Outposts for local object storage, and Local Gateway network routing, all next to plain EC2 on the same rack.

At that point the rack stops behaving like a remote Region extension. Compute, storage, networking, a host layer, a storage pool, snapshots, health events, performance counters: that's a hyperconverged appliance sitting in your data center. The people who own the data center should be the ones watching it, the same way they watch the Dell VxRail or the HPE Alletra two cabinets down.

Treat the Outpost as cloud and it gets monitored by tools that stop at the console's edge. Treat it as infrastructure and it lands in the same capacity model you already trust for everything else.

The framing changes the outcome. A cloud dashboard will tell you a volume is slow. It won't tell you the rack has three weeks of capacity left at the current growth rate. The infrastructure view puts that number in the same dashboards and under the same alert rules as every other device on the floor.
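The "three weeks left at the current growth rate" figure is plain arithmetic once the physical numbers are in one place. A minimal sketch, assuming weekly samples of physical used capacity and a simple linear extrapolation; the function name and sample values are hypothetical, and a real capacity model would likely fit a trend over a longer history.

```python
def runway_weeks(weekly_used_tib, usable_tib):
    """Weeks until full, extrapolating the average weekly growth linearly."""
    if len(weekly_used_tib) < 2:
        raise ValueError("need at least two samples to estimate growth")
    # Average growth per week between the first and last sample.
    growth = (weekly_used_tib[-1] - weekly_used_tib[0]) / (len(weekly_used_tib) - 1)
    if growth <= 0:
        return None  # flat or shrinking usage: no projected fill date
    return (usable_tib - weekly_used_tib[-1]) / growth

# Hypothetical rack: 52 TiB usable, growing 2 TiB a week.
print(runway_weeks([40.0, 42.0, 44.0, 46.0], usable_tib=52.0))
```

With those sample values the rack has 6 TiB free and grows 2 TiB a week, so the function returns 3.0 weeks, which is the number to hold against the procurement lead time.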

What containers and object storage add to the blind spot

And the gap is widening, because Outposts isn't just block storage and VMs anymore.

S3 on Outposts gives you local object storage with the S3 API, provisioned against the rack's capacity. EKS runs Kubernetes with worker nodes on Outpost EC2 capacity while the control plane stays in-Region. ECS runs container tasks the same way. Each consumes the same finite physical pool, and each reports through a different AWS surface.

Think of the rack as one box with three meters bolted to the outside. The S3 meter watches a fixed allocation you sized when you ordered the rack; it doesn't budge when block storage fills up, and block storage doesn't budge when it does. The EKS console counts pods and nodes, not the EC2 capacity those nodes drain from the same box. ECS counts tasks the same way. Each meter is honest about its own slice. None of them shows the box.

Run Tanzu on vSphere or Karbon on Nutanix and the hyperconverged platform already folds container consumption into the same capacity view as the VMs. Outposts doesn't. The Kubernetes layer, the object layer, and the block layer each report in isolation, and reconciling them into one number is manual work nobody planned for.

What falls through the gap

None of what follows will surprise anyone who manages physical storage. What's surprising is how scattered the source data is, and how little of it reaches the people who need it.

Rack-level capacity isn't in the console view. This is the number that decides when you start the next purchase order, and an Outpost expansion is a procurement and logistics exercise measured in weeks. The console foregrounds the sum of provisioned volumes, which says nothing about what's left on the rack. The real figure sits in CloudWatch, split per volume type, waiting for someone to assemble it.

And over-provisioning is the norm, not the exception. Lucidity's State of Cloud Storage 2026 report, drawn from enterprise block-storage assessments across AWS, Azure, and Google Cloud, put average cloud disk utilization at roughly 30%. Most provisioned capacity is paid-for headroom doing nothing, and you can't manage a number your tooling never shows you.

Orphan detection isn't part of the storage workflow. The console lists volumes; it doesn't flag which are unattached, for how long, or at what cost. AWS does have checks that can surface unattached volumes, Trusted Advisor and AWS Config among them, but those are cloud-team tools behind cloud-team credentials, not alerts in the storage team's weekly review. On a small deployment that's a rounding error. At scale it compounds: avoidable cloud waste has held stubbornly steady for years, and almost no organization escapes it entirely.

Health is fragmented, not consolidated. The console rolls health up to a single status, but the real picture is spread across multiple services: connectivity in one place, per-asset hardware and lifecycle state in another, maintenance and event notifications in a third. The data exists. The unified picture doesn't. Unless someone correlates the signals by hand, the rack's true health never reaches the team that owns the floor.

Latency exists per volume, not across the rack. AWS publishes average read and write latency for individual volumes, and you can chart or alarm on any one of them. What it doesn't give you is the number a storage team reads first: the rack's latency as one device, on one trend line, next to the arrays. That view you assemble yourself, or you operate without it.

Snapshot growth isn't surfaced as a capacity-planning trend. And snapshots are not an edge case. They're AWS's default backup mechanism for EBS, automated through Data Lifecycle Manager policies. On an Outpost they can be stored locally, where they draw from the rack's fixed S3 allocation. That consumption never shows up in bucket metrics. Delete a snapshot and the capacity takes up to 72 hours to come back; AWS's own guidance is to set CloudWatch alarms so you don't run out. A daily snapshot policy with no cleanup stays invisible until it surfaces as a capacity shortfall.

Capacity isn't the only planning problem, either. A rack in your building has constraints in-Region services don't: hardware refresh cycles, maintenance windows, and expansion lead times measured in weeks. In-Region you click a button for more capacity; on an Outpost you start a procurement. Storage teams already run that planning discipline for every array on the floor, and the Outpost belongs in the same plan.

Normalization is the actual product

Anyone can script the CloudWatch API into a spreadsheet. That's a weekend. The part that changes how a team operates is normalization: translating the Outpost's metrics into the same language, units, and severity model the rest of the floor already speaks.

CloudWatch is the telemetry source, not the assembly.

Done properly, an EBS volume reads as a datastore. An EC2 instance reads as a VM. The rack reads as a storage array with usable capacity, used, free, IOPS, latency, and throughput, in the same terms as the FlashArray in the next cabinet. EKS worker consumption, ECS tasks, and S3 on Outposts capacity resolve against the same physical rack, so the picture reflects everything drawing on it.

[Figure: From scattered AWS telemetry to one capacity model. AWS Outposts rack telemetry arrives from three sources: Outposts and EC2 APIs (instances, volumes, snapshots, assets), CloudWatch (capacity, performance, connectivity), and AWS Health (events and scheduled maintenance). Visual One Intelligence normalizes language, units, and severity into one capacity view covering arrays, datastores, VMs, snapshots, and health: Pure FlashArray 62%, NetApp FAS 48%, Dell VxRail 71%, AWS Outpost 55%.]
One integration turns scattered AWS telemetry into the capacity model the rest of the floor already uses.

The payoff: a storage admin who has never opened the AWS console can read the Outpost next to every other device. One place, one vocabulary, one trend line. The rack shows up beside the NetApp FAS and the Unity. Its orphaned volumes appear in the same review that flags over-provisioned LUNs. Its health alerts use the same scale as the SAN. And capacity planning stops requiring someone to reconcile four portals into a two-week-old spreadsheet.

Questions infrastructure leaders are asking

Why not just use CloudWatch?

CloudWatch is the telemetry source, not the assembly. The raw counters, capacity figures, and connectivity state are all there. What isn't there is the work of turning them into one operational model: deriving the metrics a storage team actually reads, pulling the scattered health signals together, and normalizing it all against the other platforms on your floor. You can build that. Most teams don't, because the gap isn't painful enough to fund until the shortfall arrives.

Does this replace AWS monitoring or sit alongside it?

Alongside. CloudWatch stays the source. What changes is who can read it: storage and virtualization teams see the same data in the platform they already use for Pure, NetApp, and VxRail, without translating through the cloud team.

How does it handle multi-account Outposts?

Through standard AWS cross-account patterns: CloudWatch cross-account observability, IAM roles, or AWS Organizations. The monitoring account gets read-only access to each member account's metrics. One integration, one normalized data set, however many accounts share the hardware.

The bottom line

The gap between the console view and the data center floor isn't a defect. AWS built the console to abstract the hardware, and for in-Region EC2 that abstraction is exactly right. For a physical rack in your building, it scatters the operational picture across CloudWatch, AWS Health, and Outposts-specific APIs, precisely where infrastructure teams need one consolidated view.

The fix isn't another dashboard. It's the same fix that worked for every platform that arrived with its own console and its own definition of "used": normalize it into the framework your team already trusts, and let the rack speak the same language as everything else on the floor.

Your Outpost is infrastructure. Monitor it like infrastructure.

See your Outpost alongside the rest of your infrastructure, in one view, with the metrics and terminology you already use.

Book a Visual One Intelligence demo →

Sources: Harness FinOps in Focus 2025; Lucidity State of Cloud Storage 2026; AWS documentation on Amazon EBS local snapshots on Outposts and S3 on Outposts capacity management.

Article 02 · Business Continuity

Your DR Dashboard Is Green. That Doesn't Mean Your RPO Is Met.

A healthy replication status tells you the link is up. It doesn't tell you how far behind your copy actually is.

An auditor asks the one question a DR program exists to answer: if the primary site dropped right now, how much data would you lose on the ERP volumes? You open the replication view. Everything's green. And you still can't answer, because "green" and "how much would we lose" are not the same measurement. On half your arrays they aren't even in the same console.

The status light is honest about what it reports. It just doesn't report the thing the auditor asked for.

Green is the link, not the promise

Two numbers own disaster recovery. RPO is how much data you can afford to lose. RTO is how long you can afford to be down. A health indicator speaks to neither one directly. It tells you the replication relationship is functioning, which is necessary and not sufficient.

The reason the gap matters is timing. Synchronous replication commits on both arrays before the host gets its acknowledgment, so the copy is always current and the RPO is zero. Asynchronous replication acknowledges locally and catches the remote up afterward, so there is always a gap, and the size of that gap is your RPO at this exact minute. SRDF/A ships changes in cycles, defaulting to about 15 seconds on current arrays. SnapMirror runs on a schedule and, to its credit, exposes a lag_time field directly. Hitachi Universal Replicator drains a journal, so your recovery point is journal fill versus drain rate. Pure ActiveDR runs continuous near-sync.

Here's what the light won't tell you: an async pair that's perfectly "healthy" can be seconds behind, or, if the link got throttled at 2 a.m. and nobody watched the lag climb, it can be hours behind. Same green dot either way. The recovery point is a number that moves minute to minute, and it lives in a different field than the status.
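To make the distinction concrete, here is a small sketch that reports status and measured lag as two separate fields. It assumes you can obtain a "last replicated at" timestamp for each pair; the pair names, targets, and the `check_pair` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def check_pair(pair, now):
    """Report status and measured lag side by side; they are separate facts."""
    lag = now - pair["last_replicated_at"]
    return {
        "name": pair["name"],
        "status": pair["status"],
        "lag_seconds": lag.total_seconds(),
        "rpo_met": lag <= pair["rpo_target"],
    }

NOW = datetime(2025, 6, 1, 2, 0, tzinfo=timezone.utc)
# Two hypothetical async pairs. Both show the same green status.
PAIRS = [
    {"name": "erp-async", "status": "healthy",
     "last_replicated_at": NOW - timedelta(seconds=8), "rpo_target": timedelta(minutes=1)},
    {"name": "nas-async", "status": "healthy",
     "last_replicated_at": NOW - timedelta(hours=3), "rpo_target": timedelta(minutes=15)},
]

RESULTS = [check_pair(pair, NOW) for pair in PAIRS]
for result in RESULTS:
    print(result)
```

Both rows carry the same status string. Only the lag field separates the pair that is eight seconds behind from the one that is three hours behind.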

Every vendor names the same states differently

Synchronized, PAIR, snapmirrored-and-healthy, replicating: four vendors, one meaning. Split, PSUS, Broken-off, paused: four vendors, one meaning again, "stopped, whether on purpose or not." SSWS, Failed Over, promoted: the target is serving now.

Concept | Dell SRDF | NetApp SnapMirror | Hitachi (TC / UR) | Pure ActiveDR
Healthy / in sync | Synchronized (S) / Consistent (A) | Snapmirrored, healthy | PAIR | replicating
Stopped | Split / Suspended | Broken-off / Quiesced | PSUS / PSUE | paused
Target serving | Failed Over | Broken-off (writable) | SSWS | promoted
Failover verb | symrdf failover | snapmirror break | horctakeover | purepod promote
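The translation in the table is, mechanically, a lookup. A minimal sketch of that lookup follows, using only the state names listed above; the vendor keys and the three lifecycle labels are hypothetical choices. Note that SnapMirror's Broken-off appears in two rows of the table, so the state alone can't say whether the target is serving; this sketch files it under "stopped" and leaves that distinction to other signals.

```python
# (vendor, native state) -> one shared lifecycle state.
LIFECYCLE = {
    ("srdf", "Synchronized"): "in_sync",
    ("srdf", "Consistent"): "in_sync",
    ("srdf", "Split"): "stopped",
    ("srdf", "Suspended"): "stopped",
    ("srdf", "Failed Over"): "target_serving",
    ("snapmirror", "Snapmirrored"): "in_sync",
    ("snapmirror", "Broken-off"): "stopped",  # may also mean a writable target
    ("snapmirror", "Quiesced"): "stopped",
    ("hitachi", "PAIR"): "in_sync",
    ("hitachi", "PSUS"): "stopped",
    ("hitachi", "PSUE"): "stopped",
    ("hitachi", "SSWS"): "target_serving",
    ("activedr", "replicating"): "in_sync",
    ("activedr", "paused"): "stopped",
    ("activedr", "promoted"): "target_serving",
}

def normalize_state(vendor, native_state):
    """Map a vendor's state name into the shared lifecycle; never guess."""
    return LIFECYCLE.get((vendor, native_state), "unknown")

print(normalize_state("hitachi", "PSUS"), normalize_state("snapmirror", "Broken-off"))
```

Returning "unknown" for an unmapped state is deliberate: a state nobody has classified should show up as a question, not as green.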

An engineer fluent in one stack freezes in front of another, not because the concept changed but because the words did. When one console says PSUS and another says Broken-off, somebody has to already know those are the same state before they can compare the two. Nobody's doing that translation at 2 a.m. during an incident, which is exactly when it matters.

Countable copies versus one living copy

Mechanism decides the math, and the math decides what "compliant" even means. Snapshot-based replication keeps N discrete, immutable copies, so compliance is a real count: N of the M you expected. Journal, delta-set, and streaming replication keep one continuously updated copy, so compliance is binary: synced or not. Active-active metro is binary too: Active or Suspended. Roll all of those into one green dot and you've hidden which kind of protection you actually have.

Local copies and remote copies also protect against different failures, so they get counted separately. Eighteen local snapshots plus twenty remote copies is "18 local, 20 remote." It is never 38. A dashboard that adds them is lying to you politely.
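The counting rule is easy to state in code. A minimal sketch, assuming each copy record carries a location and that policy defines an expected count per location; the expected counts here are hypothetical.

```python
from collections import Counter

def protection_summary(copies, expected):
    """Count copies per location against policy; never sum across locations."""
    have = Counter(copy["location"] for copy in copies)
    return {
        location: {
            "have": have.get(location, 0),
            "expected": want,
            "compliant": have.get(location, 0) >= want,
        }
        for location, want in expected.items()
    }

# The example from the text: 18 local snapshots, 20 remote copies.
COPIES = [{"location": "local"}] * 18 + [{"location": "remote"}] * 20
SUMMARY = protection_summary(COPIES, expected={"local": 24, "remote": 20})
print(SUMMARY)
```

The summary has a "local" entry and a "remote" entry and no total, so there is nowhere for a 38 to appear.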

A replication status is a light. An RPO is a number. Monitor the number.

Consistency groups, or write order you don't actually have

Put a database's data on one volume and its logs on another, and both have to fail over to the same instant or the recovered copy is inconsistent. Every vendor has the construct that guarantees this. Consistency group, protection group, RDF group, journal group, copy group: same idea, different label. A per-volume green light spread across volumes that were never grouped is the most expensive false comfort there is, because it looks protected and it recovers corrupt.
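A check for this can be sketched in a few lines, assuming you already hold two inventories: which volumes belong to which application, and which consistency group each volume sits in. Both mappings and all names below are hypothetical.

```python
def apps_without_write_order(app_volumes, volume_group):
    """Return apps whose volumes do not all sit in one consistency group."""
    at_risk = []
    for app, volumes in app_volumes.items():
        # A missing entry means the volume was never grouped at all.
        groups = {volume_group.get(volume) for volume in volumes}
        if None in groups or len(groups) > 1:
            at_risk.append(app)
    return sorted(at_risk)

APP_VOLUMES = {
    "erp": ["erp-data", "erp-logs"],
    "crm": ["crm-data", "crm-logs"],
}
# erp-logs replicates on its own: green per volume, inconsistent on recovery.
VOLUME_GROUP = {"erp-data": "cg-erp", "crm-data": "cg-crm", "crm-logs": "cg-crm"}

print(apps_without_write_order(APP_VOLUMES, VOLUME_GROUP))
```

Here the ERP application is flagged even though every one of its volumes could be showing a healthy replication status.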

Normalization is measuring the promise, not the light

The honest DR view isn't a color. It's each relationship's actual recovery point, in one unit, sitting next to the RPO you promised the business, with every vendor's native state mapped into one lifecycle so that "healthy" means the same bar on the SRDF pair and the SnapMirror relationship. That's the part that takes work: the state translation, the lag normalization, the local-versus-remote accounting. The green dot is free. The measured number is the product.

And failover is a write-ownership transfer, not a copy operation. Every takeover leaves an inverted relationship that someone has to resync, reverse, or rebuild. A DR view that ends at "site B is up" is half a view. The return leg belongs in it too.

[Figure: Four "green" lights, one measured RPO. Native states (SRDF/A: Consistent, cycle ~15s; SnapMirror: Snapmirrored, lag_time 14m; Universal Replicator: PAIR, journal 12%; Pure ActiveDR: replicating, lag ~2s) map into one lifecycle and are measured against target: ERP block (SRDF/A) 0.2s / 5s, met; NAS (SnapMirror) 14m / 15m, warning; DB (UR journal) ~30s / 60s, met; Web (ActiveDR) 2s / 30s, met. Local copies: 18; remote copies: 20, counted apart. Every pair is measured against its promise.]
Four vendor state names, four ways of expressing lag — normalized into one recovery-point figure per relationship, measured against the RPO you promised.

Questions DR owners are asking

Isn't a healthy status good enough?

Healthy means the mechanism is running. It doesn't tell you the current lag on an async pair, whether a consistency group is complete, or whether your last drill actually hit the target. Those are separate facts, and they're the ones that decide a real recovery.

Do we still need DR drills if everything's green?

Yes. The recovery point and time you actually achieve only show up when you test. The distance between the objective you set and the number you hit at the last drill is precisely the thing a DR program exists to close, and no status light measures it for you.

How do you compare async lag across vendors that report it differently?

Normalize each vendor's signal — SRDF/A cycle time, SnapMirror lag_time, UR journal state, ActiveDR lag — into one recovery-point figure in one unit, then hold it against the promised RPO. On one axis, the pairs finally become comparable.
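A minimal sketch of the "one unit, one axis" step, using the lag and target pairs from the diagram above. It assumes lag has already been extracted from each vendor as a shorthand string such as "14m"; that shorthand, the 80% warning band, and the helper names are all hypothetical choices, not any vendor's format.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

def to_seconds(value):
    """Parse shorthand such as '14m' or '0.2s' into seconds."""
    return float(value[:-1]) * UNIT_SECONDS[value[-1]]

def rpo_status(lag, target, warn_fraction=0.8):
    """Place a measured lag against its promised RPO on one shared scale."""
    lag_s, target_s = to_seconds(lag), to_seconds(target)
    if lag_s > target_s:
        return "missed"
    if lag_s >= warn_fraction * target_s:
        return "at_risk"  # inside the target, but close to it
    return "met"

PAIRS = {
    "ERP block (SRDF/A)": ("0.2s", "5s"),
    "NAS (SnapMirror)": ("14m", "15m"),
    "DB (UR journal)": ("30s", "60s"),
    "Web (ActiveDR)": ("2s", "30s"),
}
STATUS = {name: rpo_status(lag, target) for name, (lag, target) in PAIRS.items()}
print(STATUS)
```

The NAS pair is the interesting one: technically inside its 15-minute promise, but close enough that it deserves a different color than the pair sitting at 4% of its target.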

The bottom line

The green light was never the deliverable. The deliverable is a recovery point you can state as a number and defend at the next audit. Map every vendor's states into one lifecycle, measure the actual lag against the promise, keep local and remote protection where you can see both, and write the failback leg into the plan before you need it.

Treat replication as a recovery guarantee you measure — not a light you glance at.

Read every replication relationship on every array against the RPO you promised — one lifecycle, one unit, one view.

Book a Visual One Intelligence demo →

Sources: Dell Solutions Enabler SRDF documentation; NetApp ONTAP SnapMirror command reference; Hitachi CCI / Universal Replicator guides; Pure Storage ActiveDR documentation. Cross-vendor state and command mappings verified against the companion Storage Field Guide — Replication Rosetta Stone.

Article 03 · Operations

Your Storage Sends 400 Alerts a Day. That's Not Monitoring.

More alerts is less monitoring. Every array screams in its own severity scale, and the union of all of them teaches your team to look away.

The vSAN health service throws a few dozen warnings overnight. It does that most nights. So months ago, the team muted the channel. Then a disk group actually degrades. The warning lands in the muted channel with the other forty, and the first anyone hears of it is a VM that won't power on the next morning.

Nobody ignored a real alert. They ignored a channel that had cried wolf every night since spring. That's the failure mode, and you don't fix it by telling people to care harder.

Volume is the enemy of attention

A monitoring system has exactly one job: move one specific signal to one specific person in time to act on it. Every alert that isn't that erodes the odds the real one gets seen. Four hundred alerts a day isn't four hundred times the coverage. It's a filter, and the filter is a human deciding to stop looking.

So the raw count is the wrong target. A feed that fires constantly and a feed that never fires are the same feed operationally: nobody's reading either one. The number that matters is how many alerts a day actually require a human, and on most floors that number is small and buried.

Every array defines "critical" differently

Pure has its severities, NetApp has EMS levels, VMware paints health green-yellow-red, Dell and IBM have their own scales. A "warning" on one platform means call someone now; on another it means look at it Tuesday; on a third it's pure noise the vendor emits by design. Stack all of them into one feed without translating, and the word "critical" stops meaning anything, because it now means five different things at once.

The fix isn't fewer sources. It's one severity scale that every vendor's levels map into deliberately, so "critical" is the same bar on the SAN as it is on the hyperconverged cluster. Until that mapping exists, comparing urgency across platforms is guesswork wearing a color.
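The mapping itself can be as plain as a table in code. This is a sketch under stated assumptions: the four-level scale is a hypothetical choice, the native level names are illustrative and should be checked against each vendor's documentation, and every placement is a judgment your team would make, not a vendor-defined equivalence.

```python
# One shared scale, lowest to highest urgency (hypothetical labels).
SCALE = ("info", "review", "act_today", "page_now")

# (source, native level) -> shared level. Placements are illustrative.
SEVERITY_MAP = {
    ("pure", "critical"): "page_now",
    ("pure", "warning"): "act_today",
    ("pure", "info"): "info",
    ("netapp_ems", "EMERGENCY"): "page_now",
    ("netapp_ems", "ALERT"): "page_now",
    ("netapp_ems", "ERROR"): "act_today",
    ("netapp_ems", "NOTICE"): "review",
    ("netapp_ems", "INFORMATIONAL"): "info",
    ("vsan_health", "red"): "page_now",
    ("vsan_health", "yellow"): "review",
    ("vsan_health", "green"): "info",
}

def normalize_severity(source, native_level):
    """Translate a native severity into the shared scale."""
    # Unmapped levels land in "review" so they get seen once, not dropped.
    return SEVERITY_MAP.get((source, native_level), "review")

print(normalize_severity("pure", "warning"), normalize_severity("vsan_health", "yellow"))
```

Two sources can both say something like "warning" and land on different rungs, which is the whole point: the placement follows what the level means for the workload, not the word.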

A channel that cries wolf every night gets muted. The fix is to make it worth watching.

Correlated events are one event

A controller reboots and throws thirty downstream alerts: paths down, datastores unreachable, latency spikes, HA warnings. That's one root cause wearing thirty costumes. Page on each and you've paged someone thirty times for a single problem, and buried the cause inside its own symptoms.

Correlation collapses those thirty into one incident, root cause on top, symptoms attached underneath. That's the difference between thirty pager buzzes at 3 a.m. and one message that says what broke and why. The events didn't disappear; they got organized into the shape a human can act on.
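One simple way to get that shape, sketched under two assumptions: you have a topology map from each component to the upstream component it depends on, and events caused by the same root arrive within a short window of the first one. The map, the 120-second window, and the event records are hypothetical; production correlation engines use richer rules.

```python
def correlate(events, upstream, window_s=120):
    """Fold events into incidents keyed by their upstream root component."""
    incidents, open_incident = [], {}
    for event in sorted(events, key=lambda e: e["ts"]):
        # An event with no known upstream is treated as its own root.
        root = upstream.get(event["resource"], event["resource"])
        incident = open_incident.get(root)
        if incident is None or event["ts"] - incident["first_ts"] > window_s:
            incident = {"root": root, "first_ts": event["ts"], "events": []}
            incidents.append(incident)
            open_incident[root] = incident
        incident["events"].append(event["msg"])
    return incidents

UPSTREAM = {"path-1": "ctrl-A", "path-2": "ctrl-A", "ds-01": "ctrl-A", "vm-ha": "ctrl-A"}
EVENTS = [
    {"ts": 0, "resource": "ctrl-A", "msg": "controller reboot"},
    {"ts": 4, "resource": "path-1", "msg": "path down"},
    {"ts": 5, "resource": "path-2", "msg": "path down"},
    {"ts": 9, "resource": "ds-01", "msg": "datastore unreachable"},
    {"ts": 30, "resource": "vm-ha", "msg": "HA warning"},
    {"ts": 45, "resource": "array-B", "msg": "fan speed high"},
]

INCIDENTS = correlate(EVENTS, UPSTREAM)
for incident in INCIDENTS:
    print(incident["root"], len(incident["events"]), incident["events"])
```

Six raw events become two incidents: the controller reboot with its downstream symptoms attached, and the unrelated fan event on the other array. Nothing is discarded; the symptoms ride along under the cause.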

The alert nobody configured

The dangerous gaps aren't loud. They're silent. A metric that stopped reporting looks identical to a healthy zero. An instance with no memory agent reports nothing, and nothing renders as fine. A collector that quietly lost its credentials, a replication pair that stopped without flipping to error, a host that dropped off the poll: none of those trips a threshold, because there's no value to breach.

Absence isn't an alert on any single console. A model that knows what should be reporting can alarm on the silence. A raw event feed can't, because there's nothing in it to forward. This is the class of failure that costs the most and shows up the least, and it only becomes visible once something is tracking expected inputs, not just incoming ones.
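Alarming on silence needs only two things: a list of what should be reporting and a record of when each source last did. A minimal sketch, assuming timestamps in epoch seconds and a 30-minute tolerance; the source names and the threshold are hypothetical.

```python
def silent_sources(expected, last_seen, now, max_silence_s=1800):
    """Return expected sources that have gone quiet or never reported."""
    silent = []
    for source in expected:
        seen = last_seen.get(source)
        # No record at all counts as silence, not as a healthy zero.
        if seen is None or now - seen > max_silence_s:
            silent.append(source)
    return silent

EXPECTED = ["pure-01", "netapp-02", "outpost-collector", "vxrail-03"]
LAST_SEEN = {"pure-01": 9_900, "netapp-02": 9_990, "outpost-collector": 7_000}
NOW = 10_000

print(silent_sources(EXPECTED, LAST_SEEN, NOW))
```

The collector that stopped 50 minutes ago and the host that never checked in both surface, even though neither emitted a single event to forward.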

Normalization brings the signal back

The move here is the same one that works for capacity and replication: stop treating each vendor's output as its own island. Normalize severity into one scale. Correlate related events into one incident. Suppress the informational chatter the vendors emit by default. Alarm on absence, not only on errors. Do that and the channel is worth watching again, because when it fires now, it means something.

[Figure: 400 raw events → the 3 that need a human. The raw feed (every vendor, every severity, unfiltered, ~400 a day) passes through normalization: one severity scale, correlate into one incident, suppress info noise, alarm on absence. What pages a human: disk group degraded (vSAN, 1 root cause, 12 symptoms); replication stopped (async pair silent, no error thrown); collector went quiet (expected input absent for 30 min). Everything else stays queryable.]
The raw events don't disappear — they get normalized, correlated, and ranked, so the three that need a human aren't buried under the ones that don't.

Questions operations teams are asking

Won't filtering hide real problems?

The opposite. Suppressing informational chatter and collapsing duplicate symptoms is what makes a real problem visible. You're not deleting anything — the raw events stay queryable. What changes is which of them earns the right to page a human.

How do you map one vendor's severity to another's?

You define one scale and map each platform's levels into it on purpose, the way you'd map fields in a data migration. A Pure warning and a VxRail warning get placed by what they mean for the workload, not by the accident of sharing a word.

What about an alert a vendor marks critical that isn't, for us?

That's a tuning decision, and it belongs in the normalization layer, not in a vendor console you can't change fleet-wide. Reclassify it once, centrally, and it's reclassified everywhere at once.

The bottom line

You don't beat alert fatigue by caring harder. You beat it by making the feed worth trusting: one severity scale, correlated incidents instead of duplicate symptoms, informational noise suppressed, and silence treated as a signal in its own right. Four hundred honest events become the three that need a human, and the channel stops being the thing everyone mutes.

Monitoring isn't catching everything. It's catching the one thing, in time to act.

One severity scale across every array, correlated incidents instead of symptom storms, and an alarm when a source goes quiet.

Book a Visual One Intelligence demo →

Sources: VMware vSAN Skyline Health, NetApp EMS/AutoSupport, and Pure, Dell, and IBM alerting documentation for per-vendor severity models. Severity-normalization and event-correlation approach drawn from production multi-vendor monitoring work.

Article 04 · Capacity Planning

The Free-Space Number on Your Dashboard Is Probably Wrong

Raw, usable, effective, provisioned — four numbers, four vendors, and the one your dashboard shows is rarely the one your plan needs.

Leadership asks the simplest possible question: how much runway before we buy more? You pull the numbers. The Pure array says one thing, the NetApp says another, the hypervisor a third, and they disagree by double digits. Nobody's lying. They're answering four different questions that all happen to use the word "capacity."

Free space feels like it should be one fact. On a modern array it's at least four, and the console usually shows you the wrong one for the decision you're making.

Base-2 versus base-10, the 7% nobody agreed on

A gibibyte is 2³⁰ bytes. A gigabyte, the way procurement sheets and some tools use it, is 10⁹ bytes. That's about a 7% gap, and it widens as you go up the scale: a TB and a TiB differ by nearly 10%. Arrays, operating systems, and purchase orders mix the two freely, usually without saying which they mean. Convert once, at a known boundary, and label the unit. Skip that and every downstream number inherits an argument nobody can win.
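The arithmetic is worth seeing once. The sketch below computes the gap at the giga and tera scale and converts a decimal purchase-order figure into binary units; the helper names are hypothetical, the constants are the standard definitions.

```python
GB, GIB = 10**9, 2**30    # decimal gigabyte, binary gibibyte
TB, TIB = 10**12, 2**40   # decimal terabyte, binary tebibyte

def gap_pct(binary_unit, decimal_unit):
    """How much larger the base-2 unit is than its base-10 namesake, in percent."""
    return (binary_unit / decimal_unit - 1) * 100

def tb_to_tib(tb):
    """Convert decimal terabytes to binary tebibytes."""
    return tb * TB / TIB

print(f"GiB vs GB: {gap_pct(GIB, GB):.2f}%")
print(f"TiB vs TB: {gap_pct(TIB, TB):.2f}%")
print(f"100 TB = {tb_to_tib(100):.2f} TiB")
```

The gaps come out at 7.37% and 9.95%, and a 100 TB purchase order is about 90.95 TiB on a console that counts in base-2: nine units of "missing" capacity that were never there.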

Usable, effective, raw — three answers to "how much"

Raw is the disk you bought. Usable is what's left after RAID or erasure coding and system overhead take their cut. Effective is usable multiplied by whatever data reduction the array happens to be achieving right now. Quote effective as if it were usable and you've promised runway that only exists while the current reduction ratio holds. And it's a ratio, not a constant: bring on a workload that doesn't dedupe or compress well — encrypted data is the classic one — and the effective number moves the wrong way while the physical disk fills at its own pace.
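A worked example shows how quickly the effective figure moves. This sketch assumes a deliberately simple model: usable is raw minus a single overhead fraction, and effective is usable times the current reduction ratio. The 22% overhead and the ratios are hypothetical values picked for illustration.

```python
def usable_tib(raw_tib, overhead_fraction):
    """Capacity left after RAID or erasure coding and system overhead."""
    return raw_tib * (1 - overhead_fraction)

def effective_tib(usable, reduction_ratio):
    """Capacity the array can hold only while this reduction ratio holds."""
    return usable * reduction_ratio

usable = usable_tib(100, overhead_fraction=0.22)
for ratio in (2.95, 1.5, 1.0):
    print(f"reduction {ratio}:1 -> effective {effective_tib(usable, ratio):.1f} TiB "
          f"(usable stays {usable:.1f} TiB)")
```

At 2.95:1 the array advertises roughly 230 TiB. A workload mix that reduces at 1.5:1 cuts that to 117 TiB, and data that doesn't reduce at all leaves exactly the 78 TiB of usable disk. The physical number never moved.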

A data-reduction ratio is not one number

Data reduction, meaning dedupe plus compression, is not the same figure as total reduction, which also folds in thin-provisioning savings and sometimes snapshots. Different numbers, different denominators. Put the wrong one in a report and the ratio looks better than the array is actually delivering. The first person who checks it against physical consumption stops trusting your numbers, and trust is the whole job of a capacity report.
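The denominator problem shows up clearly with numbers. In this sketch, data reduction is taken as logical data written divided by physical space stored, and total reduction as provisioned capacity divided by physical space stored. Those definitions are an assumption for illustration; vendors differ on the details, including whether snapshots are counted, and the sample figures are hypothetical.

```python
def data_reduction_ratio(logical_written_tib, physical_stored_tib):
    """Dedupe plus compression: what was written versus what it occupies."""
    return logical_written_tib / physical_stored_tib

def total_reduction_ratio(provisioned_tib, physical_stored_tib):
    """Also credits thin provisioning: what was promised versus what is occupied."""
    return provisioned_tib / physical_stored_tib

# Hypothetical array: 300 TiB provisioned, 120 TiB written, 40 TiB on disk.
drr = data_reduction_ratio(120, 40)
total = total_reduction_ratio(300, 40)
print(f"data reduction {drr:.1f}:1, total reduction {total:.1f}:1")
```

Same array, same moment: 3.0:1 or 7.5:1 depending on which figure the report quotes. Only the first says anything about how efficiently written data is being stored.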

The console's favorite number and the purchase order's number are rarely the same one. Plan on the second.

Thin provisioning: the number the console loves and the plan can't use

On a thin-provisioned array, the sum of what you've provisioned can sit far above the physical capacity underneath it, by design. The console foregrounds provisioned, because provisioned is what tenants asked for and what looks like activity. But provisioned tells you nothing about physical runway, which is the number that decides when you start the purchase order.

If that sounds familiar, it's the same trap as the AWS Outpost that shows you the sum of your volumes and not the rack underneath. It isn't a cloud problem. It's every thin-provisioned array on the floor, and the fix is the same: find the physical number and put it where the planning happens.

Normalization: one definition of full, everywhere

Pick one definition of used and one of free. Decide up front whether you present base-2 or base-10. Convert exactly once, at the edge of the pipeline, and label it. Then make every array answer in those same terms, so the Pure array's "effective used," the NetApp's numbers, and the hypervisor's "datastore consumed" all resolve to the same axis. Now "how much runway" has one answer instead of four, and it's the physical answer the purchase order needs rather than the effective answer the marketing sheet prefers.

[Figure: Four rungs, and the console shows the wrong one. Raw purchased: 100 TiB. Usable (after RAID + overhead): ≈ 78 TiB, which is the physical limit. Effective (usable × current DRR): ≈ 230 TiB; this assumes the current reduction ratio holds and is not guaranteed headroom. Provisioned to tenants: 300 TiB. Provisioned exceeds physical, so the console's headline number can't answer "when do we buy?" Capacity planning lives on the usable rung. Most dashboards headline effective or provisioned.]
Raw, usable, effective, and provisioned are four different numbers on one array. Capacity planning needs the usable rung; consoles tend to headline the other three.

Questions capacity owners are asking

Which number should capacity planning use?

Physical usable and physical free, in one consistent unit. Effective capacity is useful for cost conversations, but a purchase-order decision has to rest on the physical number, because the physical number is the one that actually runs out.

Is effective capacity dishonest?

No — it's real, as long as the reduction ratio holds. The dishonesty is quoting it as guaranteed headroom. Report it, label it as reduction-dependent, and keep the physical number right next to it so nobody plans on savings that a future workload can erase.

How do you compare capacity across vendors that define it differently?

Normalize each vendor's raw, usable, effective, and provisioned figures into one model with a single definition of used and free, converted to one unit. Then the Pure array and the NetApp read on the same axis, and "how full is the floor" becomes one number instead of a reconciliation exercise.

The bottom line

"How much space is left" only produces a wrong answer when four consoles each answer a different question. Decide what used and free mean, convert once, label the unit, and lead capacity planning with physical numbers rather than reduction-dependent ones. The runway figure gets boring and correct at the same time, which is exactly what you want it to be when it's sitting in front of a budget.

Pick one definition of full, convert once, and make every array answer in the same units.

One capacity model across every array and hypervisor — physical and effective side by side, in units you chose, on one trend line.

Book a Visual One Intelligence demo →

Sources: Pure Storage (usable vs. effective), NetApp, Dell, and Nutanix capacity documentation. Base-2/base-10 conversion and data-reduction definitions cross-checked against the companion Storage Field Guide capacity converter and its capacity gotchas.
