“We have RAC, so we’re covered for DR.” It’s one of the most expensive sentences in Oracle operations, and I’ve watched variations of it play out more than once. Real Application Clusters (RAC) and Data Guard both live under the “high availability” umbrella, so it’s easy to assume they’re interchangeable — or that having one means you don’t need the other. They are not interchangeable. They solve different failures, and the cost of confusing them is usually discovered at the worst possible time.
This is the long version of how I think about the choice. We’ll start where every good HA design starts — not with a feature, but with the failure you’re trying to survive — then work through what RAC and Data Guard each actually do, what they cost (in licensing and in complexity), how to reason about RTO and RPO, and finally a decision tree you can apply to a real system. Everything here targets Oracle 19c, the enterprise workhorse, with notes on where the newer releases — 23ai and the current 26ai — change the picture. It’s written from general industry practice and lab work — your environment will differ, so test before you trust.
The short version. RAC keeps you running through a node failure, but it’s one copy of your data on shared storage, so it is not disaster recovery. Data Guard keeps you running through site loss and corruption by maintaining an independent standby you fail over to. Neither saves you from a bad DELETE: only backups and Flashback do. Set RTO and RPO with the business, then buy the cheapest combination that meets them.
Start with the failure, not the feature
Before you evaluate any technology, write down the failure modes you actually need to survive. There are four that matter for an Oracle database, and they are genuinely different problems:
- Instance or node failure — a database instance crashes, or the server it runs on dies.
- Site or region loss — a data center, availability zone, or whole region becomes unavailable.
- Data corruption — physical block corruption (bad storage, lost writes) or logical corruption.
- Human error: an accidental DROP TABLE, a bad deploy, a DELETE without a WHERE clause.
No single feature covers all four. That is the entire reason this article exists. Here is the map we’ll spend the rest of the post justifying:
| Failure mode | RAC | Data Guard | Backups + Flashback |
|---|---|---|---|
| Instance / node failure | Yes | failover | No |
| Site / region loss | No | Yes | slow, if offsite |
| Block corruption | No | Yes | Yes |
| Human / logical error | No | No | Yes |
Notice that the bottom row — human error — is covered by neither RAC nor Data Guard. Hold that thought; it’s the mistake I see most often.
What RAC actually solves
RAC runs multiple database instances on multiple servers (nodes) against one shared copy of the database. The instances coordinate through Oracle Grid Infrastructure (Clusterware) and a private interconnect, using Cache Fusion to ship blocks between node memories. Clients connect through the SCAN listener and node VIPs, so a failed node’s sessions are redirected to survivors.
What that buys you:
- Instance and node resilience. If a node dies, the surviving instances keep serving the same database. There’s no “restore” and no “fail over to a copy” — the data was already open on the other nodes.
- Online scale-out for reads and writes. Add a node, add capacity, without re-architecting.
- Rolling maintenance. Patch or relocate one node at a time while the service stays up.
- Brownout masking. With application services and Application Continuity / TAF, in-flight work can be replayed or transparently redirected during a node loss.
You check on it with Clusterware and srvctl:
# Cluster resource overview
crsctl status resource -t
# Is the database up, and on which instances?
srvctl status database -d ORCLCDB
# Service placement (services are how you steer connections across nodes)
srvctl status service -d ORCLCDB
Now the part that the “RAC is our DR” crowd misses: every RAC instance points at the same storage. There is exactly one copy of your data. A storage array failure, a site outage, or a corrupt block is seen identically by all nodes. RAC gives you redundancy of compute, not redundancy of data.
A composite scenario (illustrative). Picture a shop running a healthy 3-node RAC cluster. Uptime dashboards are green for two years; leadership is told the database is “fully redundant.” Then a SAN controller pushes bad firmware and the shared LUNs go offline. All three nodes go down at once, because all three were reading the same storage. The cluster did exactly what it was designed to do — it just was never designed for that failure. That’s not a RAC flaw; it’s a design gap.
Licensing and complexity (the honest cost)
RAC is a separately licensed option on top of Oracle Database Enterprise Edition, priced per processor (or in the cloud, baked into certain shapes/editions). On top of license cost you’re taking on real operational weight: Clusterware, a redundant private interconnect, shared storage (typically ASM), and the skills to run all of it. That complexity is itself a source of outages if the team isn’t staffed for it — a RAC node eviction, where Clusterware fences a node it can’t verify is healthy, is the canonical 3am example.
RAC One Node is the pragmatic middle ground: a single active instance that Clusterware can fail over (or you can online-relocate) to another node, with online rolling patching — most of the availability benefit, far less of the multi-instance complexity, and you can scale up to full RAC later.
# RAC One Node: relocate the running instance to another node, online
srvctl relocate database -d ORCLCDB -node racnode2
What Data Guard actually solves
Data Guard maintains one or more standby databases — independent, physically separate copies of
your primary — kept in sync by shipping redo and applying it. A physical standby applies redo
block-for-block (Redo Apply); a logical standby reconstructs SQL (SQL Apply). For HA/DR, physical
standby is the default and the one I’ll focus on. The Data Guard Broker (dgmgrl) is how you should
manage it — it removes most of the manual ALTER DATABASE foot-guns.
What it buys you:
- Site and region survival. The standby is a different database on different storage, usually in a different location. Lose the primary site and you fail over to the standby.
- Corruption protection. Because the standby is an independent copy with its own writes, it doesn’t inherit the primary’s physical block corruption. With Active Data Guard, Automatic Block Media Recovery can transparently repair a corrupt block on either side from the other.
- A real failover/switchover target. Planned role transitions (switchover) for maintenance, and unplanned ones (failover) for disasters.
- Read offload and more (with Active Data Guard): an open read-only standby for reporting, offloaded backups, and snapshot standbys you can open read-write for testing and then flip back.
You watch role and lag with SQL and the broker:
-- Where am I, and what mode am I in?
SELECT database_role, open_mode, protection_mode, switchover_status
FROM v$database;
-- How far behind is apply? (the number that matters during an incident)
SELECT name, value, time_computed
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
dgmgrl sys@ORCLCDB
DGMGRL> SHOW CONFIGURATION;
DGMGRL> SHOW DATABASE 'ORCLCDB_STBY';
-- Planned role swap (maintenance): primary and standby trade places
DGMGRL> SWITCHOVER TO 'ORCLCDB_STBY';
-- Unplanned (disaster): promote the standby
DGMGRL> FAILOVER TO 'ORCLCDB_STBY';
Protection modes set your RPO
Data Guard’s protection mode is the dial that trades data-loss risk against primary performance:
| Protection mode | Redo transport | Data loss (RPO) | Effect on primary |
|---|---|---|---|
| Maximum Performance (default) | ASYNC | Possible — seconds of redo | None |
| Maximum Availability | SYNC | Zero while in sync; falls back to ASYNC if the standby is unreachable | Small commit latency |
| Maximum Protection | SYNC | Zero, guaranteed | Primary stalls if no standby can acknowledge |
Most enterprises run Maximum Availability with SYNC transport to a nearby standby — zero data loss in normal operation, without the “halt production if the standby is down” behavior of Maximum Protection.
Going further: Fast-Start Failover and Far Sync
- Fast-Start Failover (FSFO) adds automatic failover. A lightweight Observer process (run it on a third, independent host) watches both databases and promotes the standby automatically if the primary disappears, turning a 2am page into an event you read about in the morning.
DGMGRL> ENABLE FAST_START FAILOVER;
DGMGRL> START OBSERVER;
- Far Sync solves the distance problem. SYNC gives you zero data loss but adds latency proportional to distance, so a DR site 2,000 km away can’t be SYNC without hurting production. A Far Sync instance (a tiny control-file-and-redo-only instance placed near the primary) receives redo SYNC (zero loss, low latency) and forwards it ASYNC to the distant standby. You get RPO ≈ 0 and geographic distance.
flowchart LR
P[Primary<br/>Site A] -- SYNC redo, zero loss --> FS[Far Sync<br/>near primary]
FS -- ASYNC redo, over distance --> S[Physical Standby<br/>Site B]
OBS[FSFO Observer] -- watches --> P
OBS -- watches --> S
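To put a rough number on the distance problem, here is a back-of-the-envelope sketch. It assumes light in optical fibre covers about 200 km per millisecond (roughly 5 microseconds per km, one way) and ignores routing and equipment delay, so treat the result as a floor, not a measurement. The function name and the 50 km Far Sync distance are my own illustrative choices.

```shell
# Lower-bound estimate of the network round trip a SYNC commit must wait for.
# Assumption: ~5 microseconds per km one way in fibre; real paths are slower.
sync_rtt_us() (
  km=$1
  # one acknowledgement = there and back: 2 * km * 5 us
  echo $(( km * 2 * 5 ))
)

sync_rtt_us 2000   # SYNC straight to a DR site 2,000 km away
sync_rtt_us 50     # SYNC to a Far Sync instance 50 km away (hypothetical distance)
```

Under those assumptions the distant site adds at least 20,000 µs (20 ms) to every commit, while the nearby Far Sync instance adds about 500 µs. That gap is the whole argument for Far Sync.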
Licensing note
Plain Data Guard (a physical standby in mount mode, doing Redo Apply) is included with Enterprise Edition — there’s no excuse not to have one. Active Data Guard — the open read-only standby, Automatic Block Media Recovery, Far Sync, and friends — is a separately licensed option. Decide deliberately which capabilities you’re actually licensed for.
A composite scenario (illustrative). A team has a standby and a green broker status, so DR is “done.” Nobody has ever run a switchover. During a real failover they discover apply has been lagging for weeks behind a quietly-stuck archive gap, the network team never opened the ports for client redirection, and the runbook references a host that was decommissioned. The technology worked; the operational readiness didn’t. A standby you’ve never failed over to is a hope, not a plan.
The combined topology: RAC + Data Guard
When you genuinely need both local zero-downtime and cross-site survival, you run RAC at each site with Data Guard between them. This is the heart of Oracle’s Maximum Availability Architecture (MAA): local node failures are absorbed by RAC with no failover at all, while a site loss triggers a Data Guard role transition.
It’s the gold standard, and it’s also the most expensive and most complex thing on the menu — you’re paying for (and operating) RAC and Active Data Guard, in two locations. The honest question is whether your RTO/RPO targets and the business cost of downtime justify it. MAA frames this as tiers, so you can match spend to requirement:
| MAA tier | Adds | Protects against |
|---|---|---|
| Bronze | Single instance + RMAN backups + Flashback | Corruption, human error (slow recovery) |
| Silver | + RAC or RAC One Node | Instance/node failure (near-zero RTO locally) |
| Gold | + Active Data Guard | Site loss, corruption; read offload |
| Platinum | + GoldenGate, Application Continuity, Edition-Based Redefinition | Zero-downtime maintenance, app-transparent failover |
A useful way to read this table: you don’t start at Gold. You start at Bronze and climb only as far as your RTO/RPO and budget require.
What MAA Gold actually looks like
It helps to picture the topology. RAC handles failures inside each site; Data Guard handles losing a site; and the Observer — deliberately in a third location — is what makes failover automatic without becoming a casualty of the outage it’s supposed to detect.
flowchart TB
subgraph A[Primary site]
R1[RAC node 1] --- R2[RAC node 2]
R1 --- D1[Shared storage ASM]
R2 --- D1
end
subgraph B[Standby site]
S1[RAC node 1] --- S2[RAC node 2]
S1 --- D2[Shared storage ASM]
S2 --- D2
end
A -- Active Data Guard redo --> B
OBS[FSFO Observer] -- watches --> A
OBS -- watches --> B
Read it as two independent failure domains: lose a node and RAC absorbs it with no role change at all; lose a site and Data Guard promotes the standby. The reporting team can run on the open Active Data Guard standby, and backups can be offloaded there too — so the DR copy earns its keep every day, not just during a disaster.
Don’t forget the two failure modes nobody licensed for
Look back at that first table. RAC and Data Guard together still leave two rows poorly covered, and one of them is the most common cause of “lost data” incidents.
Block corruption is partly handled by Data Guard (independent copy, Automatic Block Media Recovery)
but your baseline defenses are configuration and backups: enable DB_BLOCK_CHECKING and
DB_LOST_WRITE_PROTECT, run periodic RMAN VALIDATE/BACKUP VALIDATE, and keep recoverable RMAN
backups.
Human and logical error is the trap. A DELETE with no WHERE clause is a perfectly valid
transaction — so Data Guard faithfully ships it to the standby and applies it in milliseconds. Your
“redundancy” just replicated the mistake to every copy. The defenses here are a different toolset
entirely:
-- Flashback Database: rewind the whole database to just before the mistake
-- (requires flashback logging / a guaranteed restore point)
SELECT flashback_on FROM v$database;
FLASHBACK DATABASE TO RESTORE POINT before_bad_deploy;
-- Or recover a single object after an accidental drop
FLASHBACK TABLE app.orders TO BEFORE DROP;
Guaranteed restore points before risky changes, Flashback Database/Table/Query, and RMAN point-in-time recovery are what save you here — not replication. If you take one thing from this article beyond “RAC ≠ DR,” take this: replication is not a backup.
Make the decision with RTO and RPO first
Every choice above maps cleanly onto two numbers you should set with the business, not in IT:
- RTO (Recovery Time Objective): how long can you be down? RAC handles node failure in ~seconds with no failover. Data Guard with FSFO recovers a site loss in seconds-to-minutes. Backups mean hours.
- RPO (Recovery Point Objective): how much data can you lose? RAC: zero (same data). Data Guard: zero with SYNC/Far Sync, seconds with ASYNC. Backups: back to your last backup plus available redo.
Get those two numbers agreed and most of the architecture chooses itself. Here’s the tree I walk:
flowchart TD
A([Define RTO and RPO with the business]) --> B{Must survive losing<br/>a whole site or region?}
B -- Yes --> C[Need an independent replica:<br/>Data Guard]
C --> D{Also need zero-downtime<br/>through local node failure?}
D -- Yes --> E[RAC + Active Data Guard<br/>MAA Gold]
D -- No --> F[Data Guard +<br/>Fast-Start Failover]
B -- No --> G{Need zero-downtime through a<br/>node/instance failure at one site?}
G -- Yes --> H[RAC or RAC One Node]
G -- No --> I[Single instance]
E --> Z
F --> Z
H --> Z
I --> Z
Z([In EVERY branch: RMAN backups + Flashback<br/>for corruption and human error])
A side-by-side, for the architecture review
| Dimension | RAC | Data Guard | RAC + DG | Backups only |
|---|---|---|---|---|
| Node/instance failure | instant | failover | instant | No |
| Site/region loss | No | Yes | Yes | slow |
| Block corruption | No | ADG repair | Yes | restore |
| Human/logical error | No | No | No | Flashback/PITR |
| Typical RTO | seconds | seconds–minutes | seconds | hours |
| Typical RPO | 0 | 0 (SYNC) / seconds (ASYNC) | 0 | last backup |
| Read offload | all nodes | Active DG | Yes | No |
| Rolling patching | Yes | standby-first | Yes | No |
| Scale-out writes | Yes | No | Yes | No |
| Cost beyond EE | RAC option ($$) | included; ADG extra | both ($$$) | none |
| Operational complexity | high | medium | highest | low |
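If you want the decision tree in a form you can paste into a review checklist, it reduces to two yes/no questions. The sketch below is only my encoding of the tree from the previous section; the function name and the "yes"/"no" inputs are illustrative, and anything other than "yes" is treated as no.

```shell
# recommend_ha SURVIVE_SITE_LOSS ZERO_DOWNTIME_ON_NODE_FAILURE
recommend_ha() (
  survive_site_loss=$1
  zero_downtime_node_failure=$2
  if [ "$survive_site_loss" = "yes" ]; then
    if [ "$zero_downtime_node_failure" = "yes" ]; then
      choice="RAC + Active Data Guard (MAA Gold)"
    else
      choice="Data Guard + Fast-Start Failover"
    fi
  elif [ "$zero_downtime_node_failure" = "yes" ]; then
    choice="RAC or RAC One Node"
  else
    choice="Single instance"
  fi
  # Every branch of the tree ends at the same node: backups and Flashback.
  echo "$choice, plus RMAN backups + Flashback"
)

recommend_ha yes no
```

The last line of the function is the point: no combination of answers removes backups and Flashback from the result.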
Where GoldenGate fits
GoldenGate is the other tool people reach for, and it’s worth knowing why it’s not usually the answer to this particular question. It does logical replication — capturing changes and applying them elsewhere — which makes it brilliant for things Data Guard can’t do: heterogeneous targets, cross-version and near-zero-downtime migrations and upgrades, active-active multi-master, and replicating a subset of the data. But it’s a separately licensed option, it’s operationally heavier, and for plain “keep an identical standby for DR,” physical Data Guard is simpler and tighter. Use GoldenGate when you need its logical flexibility (it’s a Platinum-tier component for a reason) — not as a default DR mechanism.
A worked switchover (planned, zero data loss)
Choosing the architecture is half the job; the other half is being able to operate it under pressure. A switchover is a planned, lossless role reversal — the primary becomes a standby and a standby becomes the primary. You’ll do this for site maintenance, hardware refreshes, and — critically — as the rehearsal that proves your DR actually works. Always drive it through the Broker.
Step 1 — Validate before you touch anything. Modern Broker gives you a pre-flight check that catches gaps, missing standby redo logs, and flashback problems before you commit:
DGMGRL> SHOW CONFIGURATION; -- expect: Status SUCCESS
DGMGRL> VALIDATE DATABASE 'ORCLCDB_STBY';
A healthy result looks roughly like this (trimmed):
Database Role: Physical standby database
Primary Database: ORCLCDB
Ready for Switchover: Yes
Ready for Failover: Yes (Primary Running)
Flashback Database Status:
ORCLCDB : On
ORCLCDB_STBY : On
Transport-Related Information:
Transport lag: +00 00:00:00
Apply-Related Information:
Apply lag: +00 00:00:00
If “Ready for Switchover” isn’t Yes, stop and fix that first — usually an archive gap, missing standby redo logs, or apply lag.
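In a scripted runbook you can turn that rule into a hard gate. This is a minimal sketch that assumes the VALIDATE DATABASE output contains a "Ready for Switchover:" line shaped like the trimmed sample above; the exact layout can vary by release, so check it against your own output. The function name is mine.

```shell
# Reads VALIDATE DATABASE output on stdin.
# Exit status 0 only if a "Ready for Switchover:" line is present and says Yes.
ready_for_switchover() {
  awk -F': *' '
    /Ready for Switchover:/ { found = 1; ok = ($2 ~ /^Yes/) }
    END { exit !(found && ok) }
  '
}

# Sample text copied from the trimmed output above
sample='Database Role: Physical standby database
Ready for Switchover: Yes
Ready for Failover: Yes (Primary Running)'

if printf '%s\n' "$sample" | ready_for_switchover; then
  echo "proceed with SWITCHOVER"
else
  echo "stop: fix gaps, standby redo logs or apply lag first"
fi
```

On a real system you would pipe the Broker's output into the function instead of the sample, for example from dgmgrl -silent / "VALIDATE DATABASE 'ORCLCDB_STBY'", assuming OS authentication works for the Broker on that host.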
Step 2 — Switch over. One command; the Broker orchestrates both databases:
DGMGRL> SWITCHOVER TO 'ORCLCDB_STBY';
Step 3 — Verify the new roles and that redo is flowing the other way:
-- On the NEW primary (formerly the standby)
SELECT database_role, open_mode, switchover_status FROM v$database;
-- DATABASE_ROLE should now be PRIMARY, OPEN_MODE READ WRITE
-- Confirm the configuration is healthy again
-- DGMGRL> SHOW CONFIGURATION; -> Status SUCCESS
Step 4 — Redirect the application. This is the step people forget. Clients need to land on the new primary — via a role-based service that only starts in the PRIMARY role, or via a connect string that lists both hosts. Test it, don’t assume it.
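For the "connect string that lists both hosts" option, a small helper makes the shape concrete. This is a sketch, not a tuned production descriptor: the host names and the app_rw service name are hypothetical, 1521 is just the usual listener port, and it assumes app_rw is a role-based service that only runs where the database is PRIMARY.

```shell
# connect_descriptor SERVICE HOST_A HOST_B [PORT]
# Prints an Oracle Net descriptor that tries both sites in turn.
connect_descriptor() (
  service=$1
  host_a=$2
  host_b=$3
  port=${4:-1521}
  printf '(DESCRIPTION=(FAILOVER=ON)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=%s)(PORT=%s))(ADDRESS=(PROTOCOL=TCP)(HOST=%s)(PORT=%s)))(CONNECT_DATA=(SERVICE_NAME=%s)))\n' \
    "$host_a" "$port" "$host_b" "$port" "$service"
)

connect_descriptor app_rw dbhost-a dbhost-b
```

Because the service only exists on whichever database currently holds the primary role, the client lands on the right side after a role change without anyone editing configuration. You would still add timeouts and retries to suit your application.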
Failover (unplanned) and reinstate
A failover is what you run when the primary is gone and not coming back soon. It’s faster and more decisive than a switchover, and with asynchronous transport it may cost you a small amount of redo (your RPO):
DGMGRL> FAILOVER TO 'ORCLCDB_STBY';
With Fast-Start Failover enabled, you don’t type that at all — the Observer detects the outage and promotes the standby automatically, typically in seconds. Either way, when the old primary comes back to life, you don’t rebuild it from scratch: if it had Flashback Database enabled, the Broker can rewind and re-enrol it as the new standby in one step:
DGMGRL> REINSTATE DATABASE 'ORCLCDB';
That Flashback-Database prerequisite is exactly why “enable Flashback on both databases” belongs in your standard build — without it, a failover turns a returning primary into a full rebuild.
Monitoring: what to watch, and when to page
A standby silently falling behind is the classic way DR rots. You need two numbers alarmed at all times — transport lag (redo not yet received) and apply lag (redo received but not yet applied) — plus the health of the apply process and, if you use it, the FSFO state.
-- The two numbers that define your real-world RPO/RTO right now
SELECT name, value AS lag, time_computed
FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
-- Is the apply process actually running? (run on the standby)
-- Note: v$managed_standby is deprecated from 12.2 onward; v$dataguard_process is its replacement
SELECT process, status, sequence#
FROM gv$managed_standby
WHERE process LIKE 'MRP%';
-- Fast-Start Failover health (run on the primary)
SELECT fs_failover_status, fs_failover_current_target, fs_failover_observer_present
FROM v$database;
Sensible starting thresholds — tune them to your RPO/RTO, not these defaults:
| Signal | Warning | Critical | Why it matters |
|---|---|---|---|
| Transport lag | > 60s | > your RPO | Redo isn’t reaching the standby — data-loss exposure |
| Apply lag | > 5 min | > your RTO | Standby is “behind”; failover would replay slowly |
| MRP process | not running | absent after retry | Apply has stopped — lag will grow unbounded |
| FSFO status | not SYNCHRONIZED / not within lag limit | observer absent | Automatic failover is not currently possible |
| Archive gap | any persistent gap | growing | A missing sequence blocks all further apply |
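To alarm on the two lag rows you first need the lag as a number. The sketch below assumes the value arrives in the +DD HH:MI:SS form shown in the VALIDATE output earlier (for example "+00 00:01:05"). The function names are mine, and the 300-second critical threshold stands in for a hypothetical RPO.

```shell
# Convert a lag interval such as "+00 00:01:05" (+DD HH:MI:SS) to seconds.
lag_to_seconds() {
  printf '%s\n' "$1" | awk '{
    days = $1
    if (substr(days, 1, 1) == "+") days = substr(days, 2)   # drop the leading sign
    split($2, t, ":")                                       # HH, MI, SS
    print days * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
  }'
}

# classify_lag VALUE WARN CRIT (all in seconds) -> ok | warning | critical
classify_lag() (
  value=$1
  warn=$2
  crit=$3
  if [ "$value" -gt "$crit" ]; then
    echo critical
  elif [ "$value" -gt "$warn" ]; then
    echo warning
  else
    echo ok
  fi
)

transport=$(lag_to_seconds '+00 00:01:05')
classify_lag "$transport" 60 300   # warning at 60s, critical at a hypothetical 300s RPO
```

Here 65 seconds of transport lag is past the 60-second warning line but inside the assumed RPO, so the result is warning. The critical check runs first so that an RPO tighter than the warning threshold still pages.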
Two operational notes: run the Observer on a third, independent host (not on either database server — otherwise the thing that watches for failure can die with the failure), and if you run Oracle Enterprise Manager, its Data Guard metrics wrap all of the above in alerting so you’re not hand-rolling every check.
One subtlety worth calling out: when apply lag grows but transport is healthy and there’s no archive gap, the standby itself is usually the bottleneck — redo is arriving but the apply can’t keep up because the standby is I/O- or CPU-bound. That’s not a Data Guard problem, it’s a performance problem, and you diagnose it the same way you’d diagnose any slow database: pull an AWR report on the standby and read it. If that’s unfamiliar territory, start with How to Read an AWR Report Without Drowning.
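That reasoning is easy to write down as a first-pass triage. It is a sketch of the order I check things in, not a diagnostic tool: the function name and the "ok"/"high"/"yes" inputs are illustrative, and you would derive them from your own thresholds.

```shell
# triage_lag TRANSPORT APPLY GAP
#   TRANSPORT, APPLY: "ok" or "high"; GAP: "yes" if a persistent archive gap exists
triage_lag() (
  transport=$1
  apply=$2
  gap=$3
  if [ "$gap" = "yes" ]; then
    # a missing sequence blocks all further apply, so it comes first
    echo "archive gap: resolve the missing sequence before anything else"
  elif [ "$transport" = "high" ]; then
    echo "transport: check network throughput against redo generation rate"
  elif [ "$apply" = "high" ]; then
    echo "standby performance: pull an AWR report on the standby"
  else
    echo "healthy"
  fi
)

triage_lag ok high no
```

The ordering matters more than the messages: an archive gap or a transport problem will also show up as apply lag, so rule those out before blaming the standby's I/O or CPU.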
Troubleshooting the usual suspects
When Data Guard misbehaves, it’s almost always one of a handful of patterns. The Broker surfaces these as ORA-16xxx messages — always read the Broker’s StatusReport for the specific code and its recommended action rather than guessing:
DGMGRL> SHOW CONFIGURATION; -- look for WARNING/ERROR
DGMGRL> SHOW DATABASE 'ORCLCDB_STBY' StatusReport;
| Symptom | Likely cause | Where to look | Typical fix |
|---|---|---|---|
| Apply lag climbing, sequence stuck | Archive gap — a missing redo sequence | v$archive_gap, gv$archived_log | Broker/FAL usually auto-resolves; if not, ship the missing logs and re-register |
| Standby block corruption after a bulk load | NOLOGGING operation on the primary | alert log, v$database.force_logging | ALTER DATABASE FORCE LOGGING; restore affected datafile from primary |
| Transport lag grows under load | Network throughput < redo rate | v$dataguard_stats, redo generation rate | Tune TCP/socket buffers, enable redo transport compression, or use Far Sync |
| Real-time apply won’t start | Standby redo logs missing/undersized | v$standby_log | Add standby redo logs (one more group than online, same size) |
| Apply stopped after a failover test | Flashback off, can’t reinstate | v$database.flashback_on | Enable Flashback Database; reinstate via the Broker |
The meta-lesson: most “Data Guard is broken” tickets really come down to one of three things: force logging wasn’t set, standby redo logs were never created, or the network can’t keep up with peak redo. Get those three right at build time and you’ll prevent the majority of incidents.
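Those three build-time conditions are simple enough to check mechanically. The sketch below takes the values as arguments so it stays independent of how you collect them; the function name, argument order and MB/s units are my own assumptions, and it treats only a force_logging value of YES as passing.

```shell
# build_check FORCE_LOGGING ONLINE_GROUPS STANDBY_GROUPS NET_MBPS PEAK_REDO_MBPS
# Prints one line per problem; exit status is the number of problems found.
build_check() (
  force_logging=$1      # from v$database.force_logging
  online_groups=$2      # online redo log groups
  standby_groups=$3     # standby redo log groups
  net_mbps=$4           # usable network throughput, MB/s
  peak_redo_mbps=$5     # peak redo generation, MB/s
  problems=0
  if [ "$force_logging" != "YES" ]; then
    echo "force logging is not enabled"
    problems=$((problems + 1))
  fi
  if [ "$standby_groups" -le "$online_groups" ]; then
    echo "standby redo logs: need at least $((online_groups + 1)) groups"
    problems=$((problems + 1))
  fi
  if [ "$net_mbps" -lt "$peak_redo_mbps" ]; then
    echo "network throughput is below peak redo rate"
    problems=$((problems + 1))
  fi
  [ "$problems" -eq 0 ] && echo "build checks passed"
  exit "$problems"
)

build_check NO 3 3 20 40 || echo "fix these before go-live"
```

Running it as part of the standard build, rather than after the first incident, is the cheap way to apply the meta-lesson.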
Test it for real: a DR game-day
A standby you have never failed over to is a hope, not a plan — so put it on a schedule. A practical cadence is a switchover every quarter (it’s lossless and reversible) and a full failover drill at least annually. To exercise the application against standby data without disturbing replication, use a snapshot standby: it opens read-write for testing, then discards its changes and catches back up.
-- Open the standby read-write for application testing
DGMGRL> CONVERT DATABASE 'ORCLCDB_STBY' TO SNAPSHOT STANDBY;
-- ... run your app test suite against it ...
-- Roll it back and resume keeping pace with the primary
DGMGRL> CONVERT DATABASE 'ORCLCDB_STBY' TO PHYSICAL STANDBY;
A repeatable game-day runbook:
- Announce the window and the rollback plan.
- Pre-check with VALIDATE DATABASE (Ready for Switchover = Yes).
- Execute the switchover (or failover, for the annual drill).
- Verify the application actually reconnects through your role-based service — this is the test, not the database role itself.
- Measure the real RTO and RPO and compare them to target. Numbers, not vibes.
- Switch back and confirm the configuration returns to SUCCESS.
- Report: measured RTO/RPO, every gap you hit, and the owner/date for each fix.
That report is also the artifact that turns “I think we’re covered” into something leadership can actually rely on — and it’s how you find the decommissioned-host-in-the-runbook problem in a drill instead of during a real outage.
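"Numbers, not vibes" can be as simple as two timestamps and a comparison. This sketch assumes whoever runs the drill records epoch seconds (for example with date +%s) for the last successful application transaction before the switch and the first one after it. The timestamps and the 120-second target below are made up for illustration.

```shell
# drill_result NAME MEASURED_SECONDS TARGET_SECONDS
drill_result() (
  name=$1
  measured=$2
  target=$3
  if [ "$measured" -le "$target" ]; then
    verdict=PASS
  else
    verdict=FAIL
  fi
  echo "$name measured=${measured}s target=${target}s $verdict"
)

outage_start=1700000000   # hypothetical: last successful app transaction
service_back=1700000095   # hypothetical: first successful app transaction afterwards
drill_result RTO $(( service_back - outage_start )) 120
```

The same function works for RPO if you feed it the age of the last committed change that survived the role change. The output line is what goes into the game-day report.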
Patching and upgrading without downtime
Here’s the payoff most teams undersell: the biggest day-to-day return on HA isn’t surviving disasters — it’s making planned maintenance nearly invisible. The same building blocks let you patch and upgrade with little or no downtime, and that benefit cashes in every single patch cycle.
- Rolling patches with RAC. Most quarterly Release Updates are RAC-rolling: you patch one node at a time while the others keep serving the database. Connections drain off the node you’re working on (via services with a drain timeout, or Application Continuity) and return when it rejoins. No outage, just a brief capacity dip.
- Standby-first patching. For patches that aren’t RAC-rolling, Data Guard gives you another route: apply the patch to the standby first, verify it there, switch over to the patched standby, then patch the old primary. The application sees one short switchover instead of a maintenance window. (Oracle marks which patches are “Standby-First Installable.”)
- Major upgrades with DBMS_ROLLING. A full release upgrade (say 19c → 23ai) normally means real downtime. DBMS_ROLLING converts your physical standby into a transient logical standby, upgrades it while the primary keeps running, and then switches over, so the application’s downtime collapses to a single switchover rather than the whole upgrade window:
-- sketch of a DBMS_ROLLING upgrade, driven from the primary
EXEC DBMS_ROLLING.INIT_PLAN(future_primary => 'ORCLCDB_STBY');
EXEC DBMS_ROLLING.BUILD_PLAN;
EXEC DBMS_ROLLING.START_PLAN; -- standby becomes a transient logical standby
-- ... upgrade the transient logical standby to the new release ...
EXEC DBMS_ROLLING.SWITCHOVER; -- the application flips to the upgraded database
EXEC DBMS_ROLLING.FINISH_PLAN;
The thread tying all three together: planned downtime is a choice, not a law of physics. If your SLA can’t spare a maintenance window, the HA you built for disasters quietly pays for itself every time you patch.
Try it yourself: a runnable lab
Reading about recovery is one thing; doing it is what builds the reflex. I put together a small lab you can run on a laptop with nothing but Docker — no Oracle account required — so you can feel the most important lessons here first-hand. It uses the community Oracle Database Free image and runs every command inside the container, so you don’t even need a local Oracle client.
A quick honesty note about scope, because it maps exactly to this article:
- RAC isn’t something you can meaningfully run on a single laptop. It needs shared storage, a private interconnect, and clusterware across nodes — a real cluster, not a container trick. So the lab doesn’t pretend to.
- Data Guard is an Enterprise Edition feature, and the zero-login Free image doesn’t include it. So the no-setup lab focuses on the failure modes you can reproduce — and which this post argues are the most commonly mishandled: human error, media loss, and corruption. A separate, opt-in Enterprise Edition module covers a real primary/standby switchover and failover for when you want to rehearse those too.
Getting started is three commands:
./run.sh up # pulls the image and creates the database (first run takes a few minutes)
./run.sh setup # enables archivelog and creates a small demo schema
./run.sh all # runs all three drills end to end
The three drills, and the lesson each one drives home:
- Human-error recovery. The lab deletes every row (committed) and then drops the table — two perfectly valid statements a standby would have replicated in milliseconds — and recovers both locally with Flashback Query and Flashback Table. This is the “replication is not a backup” point you can now prove to yourself (and to a skeptical colleague) in thirty seconds.
- RMAN backup & restore. Take a backup, take a datafile offline and delete it from disk to simulate media failure, then restore and recover just that file while the rest of the database stays open. That’s the restore-drill muscle this post keeps insisting you build.
- Block-corruption detection & recovery. Write garbage into a single on-disk block, detect it with RMAN VALIDATE CHECK LOGICAL, and repair it with block media recovery, with no full restore needed.
The full lab — the docker compose file, the run.sh driver, every drill script, and the optional
Enterprise Edition Data Guard module — is the ha/ lab in
github.com/pyaroslav/oracle-labs. Clone it, run it, break
things on purpose. (No spare RAM on your laptop? The repo includes a guide to run the whole thing
free on an OCI Always Free cloud VM.) Discovering that your runbook references a decommissioned host
is a great thing to learn in a lab on a Tuesday afternoon — and a terrible thing to learn at 2am.
What about 23ai and 26ai?
If you’re on or moving to a newer release — 23ai, or the current 26ai — the good news is that none of the decision framework above changes: the failure modes are the same, RAC still protects compute, Data Guard still protects data, and backups + Flashback still own corruption and human error. The “ai”-era releases continue the same Maximum Availability Architecture lineage and add incremental improvements across the stack (redo transport/apply efficiency, manageability, and — notably in 23ai — new in-database capabilities like AI Vector Search that change what you run, not how you protect it). What does shift between releases is the small print: default parameter values, which features are enabled, and option licensing. So when you implement on 23ai or 26ai, confirm the exact behavior and licensing against that release’s documentation rather than assuming 19c defaults carry over — and, if you want a free place to check, the Oracle Database Free image (currently 26ai) and OCI Always Free Autonomous Database both let you verify on a real instance at no cost.
What teams get wrong (the short list)
- Treating RAC as DR. It isn’t. One copy of data, one storage, one site.
- An untested standby. If you haven’t done a real switchover, you don’t have DR — you have a theory. Schedule game-days.
- Assuming replication protects against mistakes. A bad DELETE reaches the standby before you can cancel it. Flashback and backups are your safety net, every time.
- Buying Gold when Bronze/Silver was the requirement. Match the MAA tier to a stated RTO/RPO, not to fear. Complexity you can’t operate is a liability, not insurance.
- Ignoring the licensing line. RAC and Active Data Guard are paid options. Design within what you’re actually licensed for, or get the budget approved on purpose.
Frequently asked questions
Is Oracle RAC a disaster recovery solution?
No. RAC protects against instance and node failure by running multiple instances against one shared copy of the database. Because there is only one copy of the data on shared storage, a site outage, storage failure, or block corruption affects all RAC nodes at once. Disaster recovery requires an independent copy, which is what Data Guard provides.
Do I still need Data Guard if I already have RAC?
Yes, if you need to survive losing a site or region, or to protect against data corruption. RAC and Data Guard solve different failures: RAC handles local node failure, while Data Guard maintains a separate standby database for site loss and corruption protection. Many mission-critical systems run both.
Does Data Guard protect against accidental data deletion?
No. An accidental DELETE or DROP is a valid transaction, so Data Guard faithfully ships and applies it to the standby within seconds. Protection against human and logical errors comes from Flashback Database, Flashback Table, guaranteed restore points, and RMAN point-in-time recovery — not from replication.
What is the difference between a switchover and a failover?
A switchover is a planned, lossless role reversal between the primary and standby, used for maintenance and DR testing. A failover is an unplanned promotion of the standby when the primary is lost; with asynchronous transport it may incur a small amount of data loss. Fast-Start Failover can perform failovers automatically.
Is Data Guard included with Oracle Enterprise Edition?
Basic Data Guard — a physical standby in mount mode doing Redo Apply — is included with Enterprise Edition. Active Data Guard, which adds a read-only open standby, Automatic Block Media Recovery, and Far Sync, is a separately licensed option. RAC is also a separately licensed option.
What RPO can Data Guard achieve?
Zero data loss is achievable using synchronous redo transport in Maximum Availability or Maximum Protection mode, optionally with a Far Sync instance to preserve zero RPO over long distances. Asynchronous transport (Maximum Performance) adds no commit latency on the primary, at the cost of typically losing a few seconds of redo on failover.
What is the difference between RAC and RAC One Node?
Full RAC runs multiple active instances across nodes for both high availability and scale-out. RAC One Node runs a single active instance that Oracle Clusterware can fail over or online-relocate to another node, with rolling patching. RAC One Node offers most of the availability benefit with less complexity, and can be scaled up to full RAC later.
What is Oracle Maximum Availability Architecture (MAA)?
MAA is Oracle's set of best-practice reference architectures for high availability and disaster recovery, organized into tiers: Bronze (a single instance with RMAN backups and Flashback), Silver (adds RAC or RAC One Node for local failure), Gold (adds Active Data Guard for site loss and corruption), and Platinum (adds GoldenGate, Application Continuity, and Edition-Based Redefinition for zero-downtime maintenance). You choose the lowest tier that meets your RTO and RPO targets.
What is an Oracle Data Guard Far Sync instance?
A Far Sync instance is a lightweight Data Guard member — just a control file and redo, no datafiles — placed close to the primary. The primary ships redo to it synchronously (zero data loss, low latency), and Far Sync forwards that redo asynchronously to a distant standby. This achieves zero-data-loss protection (RPO near zero) across long geographic distances without the commit latency that synchronous transport directly to a far-away standby would impose.
The one-paragraph version
Set RTO and RPO with the business. Use RAC (or RAC One Node) to survive instance and node failure at a site with no downtime. Use Data Guard to survive site loss and corruption, with Fast-Start Failover for automatic recovery and Far Sync if you need zero data loss over distance. Use both — MAA Gold — only when your targets genuinely demand it. And in every design, no exceptions, keep RMAN backups and Flashback Database, because that’s the only thing that saves you from the failure RAC and Data Guard can’t: the human one.
Have a question or some feedback?
I write here in a personal capacity and enjoy comparing notes with other Oracle folks. Say hello.