The Oracle HA Decision Tree: RAC vs Data Guard vs Both


“We have RAC, so we’re covered for DR.” It’s one of the most expensive sentences in Oracle operations, and I’ve watched variations of it play out more than once. Real Application Clusters (RAC) and Data Guard both live under the “high availability” umbrella, so it’s easy to assume they’re interchangeable — or that having one means you don’t need the other. They are not interchangeable. They solve different failures, and the cost of confusing them is usually discovered at the worst possible time.

This is the long version of how I think about the choice. We’ll start where every good HA design starts — not with a feature, but with the failure you’re trying to survive — then work through what RAC and Data Guard each actually do, what they cost (in licensing and in complexity), how to reason about RTO and RPO, and finally a decision tree you can apply to a real system. Everything here targets Oracle 19c, the enterprise workhorse, with notes on where the newer releases — 23ai and the current 26ai — change the picture. It’s written from general industry practice and lab work — your environment will differ, so test before you trust.

The short version. RAC keeps you running through a node failure — but it’s one copy of your data on shared storage, so it is not disaster recovery. Data Guard keeps you running through site loss and corruption by maintaining an independent standby you fail over to. Neither saves you from a bad DELETE — only backups and Flashback do. Set RTO and RPO with the business, then buy the cheapest combination that meets them.

Start with the failure, not the feature

Before you evaluate any technology, write down the failure modes you actually need to survive. There are four that matter for an Oracle database, and they are genuinely different problems:

  • Instance or node failure — a database instance crashes, or the server it runs on dies.
  • Site or region loss — a data center, availability zone, or whole region becomes unavailable.
  • Data corruption — physical block corruption (bad storage, lost writes) or logical corruption.
  • Human error — an accidental DROP TABLE, a bad deploy, a DELETE without a WHERE clause.

No single feature covers all four. That is the entire reason this article exists. Here is the map we’ll spend the rest of the post justifying:

| Failure mode | RAC | Data Guard | Backups + Flashback |
|---|---|---|---|
| Instance / node failure | Yes | failover | No |
| Site / region loss | No | Yes | slow, if offsite |
| Block corruption | No | Yes | Yes |
| Human / logical error | No | No | Yes |

Notice that the bottom row — human error — is covered by neither RAC nor Data Guard. Hold that thought; it’s the mistake I see most often.

What RAC actually solves

RAC runs multiple database instances on multiple servers (nodes) against one shared copy of the database. The instances coordinate through Oracle Grid Infrastructure (Clusterware) and a private interconnect, using Cache Fusion to ship blocks between node memories. Clients connect through the SCAN listener and node VIPs, so new connections land on surviving nodes and sessions from a failed node can reconnect there.

What that buys you:

  • Instance and node resilience. If a node dies, the surviving instances keep serving the same database. There’s no “restore” and no “fail over to a copy” — the data was already open on the other nodes.
  • Online scale-out for reads and writes. Add a node, add capacity, without re-architecting.
  • Rolling maintenance. Patch or relocate one node at a time while the service stays up.
  • Brownout masking. With application services and Application Continuity / TAF, in-flight work can be replayed or transparently redirected during a node loss.

You check on it with Clusterware and srvctl:

# Cluster resource overview
crsctl status resource -t

# Is the database up, and on which instances?
srvctl status database -d ORCLCDB

# Service placement (services are how you steer connections across nodes)
srvctl status service -d ORCLCDB

Now the part that the “RAC is our DR” crowd misses: every RAC instance points at the same storage. There is exactly one copy of your data. A storage array failure, a site outage, or a corrupt block is seen identically by all nodes. RAC gives you redundancy of compute, not redundancy of data.

A composite scenario (illustrative). Picture a shop running a healthy 3-node RAC cluster. Uptime dashboards are green for two years; leadership is told the database is “fully redundant.” Then a SAN controller pushes bad firmware and the shared LUNs go offline. All three nodes go down at once, because all three were reading the same storage. The cluster did exactly what it was designed to do — it just was never designed for that failure. That’s not a RAC flaw; it’s a design gap.

Licensing and complexity (the honest cost)

RAC is a separately licensed option on top of Oracle Database Enterprise Edition, priced per processor (or in the cloud, baked into certain shapes/editions). On top of license cost you’re taking on real operational weight: Clusterware, a redundant private interconnect, shared storage (typically ASM), and the skills to run all of it. That complexity is itself a source of outages if the team isn’t staffed for it — a RAC node eviction, where Clusterware fences a node it can’t verify is healthy, is the canonical 3am example.

RAC One Node is the pragmatic middle ground: a single active instance that Clusterware can fail over (or you can online-relocate) to another node, with online rolling patching — most of the availability benefit, far less of the multi-instance complexity, and you can scale up to full RAC later.

# RAC One Node: relocate the running instance to another node, online
srvctl relocate database -d ORCLCDB -node racnode2

What Data Guard actually solves

Data Guard maintains one or more standby databases — independent, physically separate copies of your primary — kept in sync by shipping redo and applying it. A physical standby applies redo block-for-block (Redo Apply); a logical standby reconstructs SQL (SQL Apply). For HA/DR, physical standby is the default and the one I’ll focus on. The Data Guard Broker (dgmgrl) is how you should manage it — it removes most of the manual ALTER DATABASE foot-guns.

What it buys you:

  • Site and region survival. The standby is a different database on different storage, usually in a different location. Lose the primary site and you fail over to the standby.
  • Corruption protection. Because the standby is an independent copy with its own writes, it doesn’t inherit the primary’s physical block corruption. With Active Data Guard, Automatic Block Media Recovery can transparently repair a corrupt block on either side from the other.
  • A real failover/switchover target. Planned role transitions (switchover) for maintenance, and unplanned ones (failover) for disasters.
  • Read offload and more (with Active Data Guard): an open read-only standby for reporting, offloaded backups, and snapshot standbys you can open read-write for testing and then flip back.

You watch role and lag with SQL and the broker:

-- Where am I, and what mode am I in?
SELECT database_role, open_mode, protection_mode, switchover_status
FROM   v$database;

-- How far behind is apply? (the number that matters during an incident)
SELECT name, value, time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag','apply lag');

# From the shell: open the Broker CLI, then run the commands below at its prompt
dgmgrl sys@ORCLCDB
DGMGRL> SHOW CONFIGURATION;
DGMGRL> SHOW DATABASE 'ORCLCDB_STBY';

-- Planned role swap (maintenance): primary and standby trade places
DGMGRL> SWITCHOVER TO 'ORCLCDB_STBY';

-- Unplanned (disaster): promote the standby
DGMGRL> FAILOVER TO 'ORCLCDB_STBY';

Protection modes set your RPO

Data Guard’s protection mode is the dial that trades data-loss risk against primary performance:

| Protection mode | Redo transport | Data loss (RPO) | Effect on primary |
|---|---|---|---|
| Maximum Performance (default) | ASYNC | Possible (seconds of redo) | None |
| Maximum Availability | SYNC | Zero while in sync; falls back to ASYNC if the standby is unreachable | Small commit latency |
| Maximum Protection | SYNC | Zero, guaranteed | Primary shuts down if no standby can acknowledge |

A common choice when the business asks for zero data loss is Maximum Availability with SYNC transport to a nearby standby: zero data loss in normal operation, without the “halt production if the standby is down” behavior of Maximum Protection.
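As a rough sketch of how a Maximum Availability setup is applied through the Broker, the function below only prints the DGMGRL statements so they can be reviewed before being pasted into a dgmgrl session. The function name is made up for illustration, the database names are the ones used in this article, and it assumes standby redo logs already exist on both databases.

```shell
# Print the DGMGRL statements that move a two-database configuration to
# Maximum Availability. Nothing is executed here; review, then paste into dgmgrl.
# The function name and argument order are illustrative, not part of any Oracle tool.
print_max_availability_cmds() {
  primary="$1"
  standby="$2"
  # LogXptMode controls how redo is shipped TO the named database.
  # Setting it on both sides keeps SYNC transport in place after a role change.
  printf "EDIT DATABASE '%s' SET PROPERTY LogXptMode='SYNC';\n" "$standby"
  printf "EDIT DATABASE '%s' SET PROPERTY LogXptMode='SYNC';\n" "$primary"
  printf "EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability;\n"
  printf "SHOW CONFIGURATION;\n"
}

print_max_availability_cmds ORCLCDB ORCLCDB_STBY
```

Printing rather than executing is a deliberate choice for a sketch like this: changing the protection mode affects commit latency on the primary, so it is worth reading the statements once before they run.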

Going further: Fast-Start Failover and Far Sync

  • Fast-Start Failover (FSFO) adds automatic failover. A lightweight Observer process (run it on a third, independent host) watches both databases and promotes the standby automatically if the primary disappears — turning a 2am page into an event you read about in the morning.

    DGMGRL> ENABLE FAST_START FAILOVER;
    DGMGRL> START OBSERVER;
  • Far Sync solves the distance problem. SYNC gives you zero data loss but adds latency proportional to distance, so a DR site 2,000 km away can’t be SYNC without hurting production. A Far Sync instance — a tiny control-file-and-redo-only instance placed near the primary — receives redo SYNC (zero loss, low latency) and forwards it ASYNC to the distant standby. You get RPO ≈ 0 and geographic distance.

flowchart LR
P[Primary<br/>Site A] -- SYNC redo, zero loss --> FS[Far Sync<br/>near primary]
FS -- ASYNC redo, over distance --> S[Physical Standby<br/>Site B]
OBS[FSFO Observer] -- watches --> P
OBS -- watches --> S

Far Sync gives you zero data loss over distance: synchronous redo to a nearby Far Sync instance, then asynchronous onward to a far-off standby. A Fast-Start Failover Observer in a third location promotes the standby automatically.
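To put a rough number on the distance problem, the sketch below estimates the minimum network round trip that every SYNC commit has to wait for. It assumes light travels through fiber at about 200,000 km/s (roughly 5 microseconds per kilometer each way) and ignores routing detours, equipment delay, and the receiver's own write time, so real figures will be higher. The function name is hypothetical.

```shell
# Lower bound on the round trip a synchronous commit waits for, in microseconds.
# Assumption: ~5 us per km one way in fiber, and the cable route equals the distance given.
sync_round_trip_us() {
  km="$1"
  echo $(( km * 5 * 2 ))
}

sync_round_trip_us 2000   # distant DR site
sync_round_trip_us 50     # nearby Far Sync instance
```

Under those assumptions a 2,000 km standby adds at least 20,000 microseconds (20 ms) to every commit, while a Far Sync instance 50 km away adds about 500 microseconds. That gap is the whole case for Far Sync.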

Licensing note

Plain Data Guard (a physical standby in mount mode, doing Redo Apply) is included with Enterprise Edition — there’s no excuse not to have one. Active Data Guard — the open read-only standby, Automatic Block Media Recovery, Far Sync, and friends — is a separately licensed option. Decide deliberately which capabilities you’re actually licensed for.

A composite scenario (illustrative). A team has a standby and a green broker status, so DR is “done.” Nobody has ever run a switchover. During a real failover they discover apply has been lagging for weeks behind a quietly-stuck archive gap, the network team never opened the ports for client redirection, and the runbook references a host that was decommissioned. The technology worked; the operational readiness didn’t. A standby you’ve never failed over to is a hope, not a plan.

The combined topology: RAC + Data Guard

When you genuinely need both local zero-downtime and cross-site survival, you run RAC at each site with Data Guard between them. This is the heart of Oracle’s Maximum Availability Architecture (MAA): local node failures are absorbed by RAC with no failover at all, while a site loss triggers a Data Guard role transition.

It’s the gold standard, and it’s also the most expensive and most complex thing on the menu — you’re paying for (and operating) RAC and Active Data Guard, in two locations. The honest question is whether your RTO/RPO targets and the business cost of downtime justify it. MAA frames this as tiers, so you can match spend to requirement:

| MAA tier | Adds | Protects against |
|---|---|---|
| Bronze | Single instance + RMAN backups + Flashback | Corruption, human error (slow recovery) |
| Silver | + RAC or RAC One Node | Instance/node failure (near-zero RTO locally) |
| Gold | + Active Data Guard | Site loss, corruption; read offload |
| Platinum | + GoldenGate, Application Continuity, Edition-Based Redefinition | Zero-downtime maintenance, app-transparent failover |

A useful way to read this table: you don’t start at Gold. You start at Bronze and climb only as far as your RTO/RPO and budget require.

What MAA Gold actually looks like

It helps to picture the topology. RAC handles failures inside each site; Data Guard handles losing a site; and the Observer — deliberately in a third location — is what makes failover automatic without becoming a casualty of the outage it’s supposed to detect.

flowchart TB
subgraph A[Primary site]
  R1[RAC node 1] --- R2[RAC node 2]
  R1 --- D1[Shared storage ASM]
  R2 --- D1
end
subgraph B[Standby site]
  S1[RAC node 1] --- S2[RAC node 2]
  S1 --- D2[Shared storage ASM]
  S2 --- D2
end
A -- Active Data Guard redo --> B
OBS[FSFO Observer] -- watches --> A
OBS -- watches --> B

MAA Gold: RAC at each site for local node resilience, Active Data Guard between sites for DR + corruption protection + read offload, and an FSFO Observer in a third location for automatic failover.

Read it as two independent failure domains: lose a node and RAC absorbs it with no role change at all; lose a site and Data Guard promotes the standby. The reporting team can run on the open Active Data Guard standby, and backups can be offloaded there too — so the DR copy earns its keep every day, not just during a disaster.

Don’t forget the two failure modes nobody licensed for

Look back at that first table. RAC and Data Guard together still leave two rows poorly covered, and one of them is among the most common causes of “lost data” incidents.

Block corruption is partly handled by Data Guard (independent copy, Automatic Block Media Recovery) but your baseline defenses are configuration and backups: enable DB_BLOCK_CHECKING and DB_LOST_WRITE_PROTECT, run periodic RMAN VALIDATE/BACKUP VALIDATE, and keep recoverable RMAN backups.
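A minimal sketch of those baseline defenses follows, wrapped in shell functions so nothing runs until you call it. It assumes OS authentication as a SYSDBA-capable user on the database host and an spfile in use; the function names are hypothetical, and MEDIUM and TYPICAL are starting values to test against your own workload, not recommendations for every system.

```shell
# Baseline corruption defenses. Each function is defined only; call them deliberately.
# Assumes OS authentication ("/ as sysdba") on the database host.

set_corruption_parameters() {
  sqlplus -s "/ as sysdba" <<'SQL'
WHENEVER SQLERROR EXIT FAILURE
-- Heavier settings catch more but cost more CPU and I/O; measure before raising them.
ALTER SYSTEM SET db_block_checking = MEDIUM SCOPE = BOTH;
ALTER SYSTEM SET db_lost_write_protect = TYPICAL SCOPE = BOTH;
SQL
}

validate_database() {
  # Reads and checks every block without producing a backup piece.
  rman target / <<'RMAN'
BACKUP VALIDATE CHECK LOGICAL DATABASE;
RMAN
}

list_corrupt_blocks() {
  # Blocks flagged by validation or backup are recorded in this view.
  sqlplus -s "/ as sysdba" <<'SQL'
SELECT file#, block#, blocks, corruption_type FROM v$database_block_corruption;
SQL
}
```

A reasonable routine is to schedule `validate_database` periodically and alert whenever `list_corrupt_blocks` returns any rows.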

Human and logical error is the trap. A DELETE with no WHERE clause is a perfectly valid transaction, so Data Guard faithfully ships it to the standby and, with real-time apply, applies it within seconds. Your “redundancy” just replicated the mistake to every copy. The defenses here are a different toolset entirely:

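-- Flashback Database: rewind the whole database to just before the mistake
-- (requires flashback logging / a guaranteed restore point)
SELECT flashback_on FROM v$database;
-- Run the rewind with the database mounted (not open), then open with RESETLOGS
FLASHBACK DATABASE TO RESTORE POINT before_bad_deploy;
ALTER DATABASE OPEN RESETLOGS;

-- Or recover a single object after an accidental drop
-- (works only while the table is still in the recycle bin)
FLASHBACK TABLE app.orders TO BEFORE DROP;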

Guaranteed restore points before risky changes, Flashback Database/Table/Query, and RMAN point-in-time recovery are what save you here — not replication. If you take one thing from this article beyond “RAC ≠ DR,” take this: replication is not a backup.
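Taking a guaranteed restore point before a risky change can be sketched as a pair of shell wrappers. This assumes a fast recovery area is configured and that you can connect with OS authentication as SYSDBA; the function names are hypothetical and the restore point name is whatever you pass in.

```shell
# Create a guaranteed restore point before a risky change, and drop it afterwards.
# Assumes a configured fast recovery area and OS authentication as SYSDBA.

create_guaranteed_restore_point() {
  rp_name="$1"
  sqlplus -s "/ as sysdba" <<SQL
WHENEVER SQLERROR EXIT FAILURE
CREATE RESTORE POINT ${rp_name} GUARANTEE FLASHBACK DATABASE;
SELECT name, guarantee_flashback_database, time FROM v\$restore_point;
SQL
}

drop_restore_point() {
  rp_name="$1"
  # A guaranteed restore point pins flashback logs until it is dropped,
  # so remove it once the change has been verified.
  sqlplus -s "/ as sysdba" <<SQL
WHENEVER SQLERROR EXIT FAILURE
DROP RESTORE POINT ${rp_name};
SQL
}
```

Typical use would be `create_guaranteed_restore_point before_bad_deploy` just before the deploy, then `drop_restore_point before_bad_deploy` once the release is confirmed good. Forgetting the second step is a common way to fill the recovery area.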

Make the decision with RTO and RPO first

Every choice above maps cleanly onto two numbers you should set with the business, not in IT:

  • RTO (Recovery Time Objective): how long can you be down? RAC handles node failure in ~seconds with no failover. Data Guard with FSFO recovers a site loss in seconds-to-minutes. Backups mean hours.
  • RPO (Recovery Point Objective): how much data can you lose? RAC: zero (same data). Data Guard: zero with SYNC/Far Sync, seconds with ASYNC. Backups: back to your last backup plus available redo.

Get those two numbers agreed and most of the architecture chooses itself. Here’s the tree I walk:

flowchart TD
A([Define RTO and RPO with the business]) --> B{Must survive losing<br/>a whole site or region?}
B -- Yes --> C[Need an independent replica:<br/>Data Guard]
C --> D{Also need zero-downtime<br/>through local node failure?}
D -- Yes --> E[RAC + Active Data Guard<br/>MAA Gold]
D -- No --> F[Data Guard +<br/>Fast-Start Failover]
B -- No --> G{Need zero-downtime through a<br/>node/instance failure at one site?}
G -- Yes --> H[RAC or RAC One Node]
G -- No --> I[Single instance]
E --> Z
F --> Z
H --> Z
I --> Z
Z([In EVERY branch: RMAN backups + Flashback<br/>for corruption and human error])

A practical RAC vs Data Guard vs Both decision tree. Backups + Flashback are mandatory in every branch.
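The same tree can be written as a few lines of shell, which is handy if you want to record the answer next to each system in an inventory. This is only an encoding of the diagram above; the function name and the yes/no argument convention are made up for illustration.

```shell
# Encode the decision tree: two yes/no answers in, one recommendation out.
# $1 = must survive losing a whole site or region (yes/no)
# $2 = need zero downtime through a local node/instance failure (yes/no)
recommend_ha() {
  site_loss="$1"
  node_failure="$2"
  if [ "$site_loss" = "yes" ]; then
    if [ "$node_failure" = "yes" ]; then
      choice="RAC + Active Data Guard (MAA Gold)"
    else
      choice="Data Guard + Fast-Start Failover"
    fi
  elif [ "$node_failure" = "yes" ]; then
    choice="RAC or RAC One Node"
  else
    choice="Single instance"
  fi
  # Every branch of the tree ends at the same mandatory baseline.
  echo "$choice, plus RMAN backups + Flashback"
}

recommend_ha yes no
recommend_ha no yes
```

The two sample calls print "Data Guard + Fast-Start Failover, plus RMAN backups + Flashback" and "RAC or RAC One Node, plus RMAN backups + Flashback".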

A side-by-side, for the architecture review

| Dimension | RAC | Data Guard | RAC + DG | Backups only |
|---|---|---|---|---|
| Node/instance failure | instant | failover | instant | No |
| Site/region loss | No | Yes | Yes | slow |
| Block corruption | No | ADG repair | Yes | restore |
| Human/logical error | No | No | No | Flashback/PITR |
| Typical RTO | seconds | seconds–minutes | seconds | hours |
| Typical RPO | 0 | 0 (SYNC) / seconds (ASYNC) | 0 | last backup |
| Read offload | all nodes | Active DG | Yes | No |
| Rolling patching | Yes | standby-first | Yes | No |
| Scale-out writes | Yes | No | Yes | No |
| Cost beyond EE | RAC option ($$) | included; ADG extra | both ($$$) | none |
| Operational complexity | high | medium | highest | low |

Where GoldenGate fits

GoldenGate is the other tool people reach for, and it’s worth knowing why it’s not usually the answer to this particular question. It does logical replication (capturing changes and applying them elsewhere), which makes it brilliant for things Data Guard can’t do: heterogeneous targets, cross-version and near-zero-downtime migrations and upgrades, active-active multi-master, and replicating a subset of the data. But it’s a separately licensed product, it’s operationally heavier, and for plain “keep an identical standby for DR,” physical Data Guard is simpler and tighter. Use GoldenGate when you need its logical flexibility (it’s a Platinum-tier component for a reason), not as a default DR mechanism.

A worked switchover (planned, zero data loss)

Choosing the architecture is half the job; the other half is being able to operate it under pressure. A switchover is a planned, lossless role reversal — the primary becomes a standby and a standby becomes the primary. You’ll do this for site maintenance, hardware refreshes, and — critically — as the rehearsal that proves your DR actually works. Always drive it through the Broker.

Step 1 — Validate before you touch anything. Modern Broker gives you a pre-flight check that catches gaps, missing standby redo logs, and flashback problems before you commit:

DGMGRL> SHOW CONFIGURATION;          -- expect: Status SUCCESS
DGMGRL> VALIDATE DATABASE 'ORCLCDB_STBY';

A healthy result looks roughly like this (trimmed):

  Database Role:       Physical standby database
  Primary Database:    ORCLCDB
  Ready for Switchover:  Yes
  Ready for Failover:    Yes (Primary Running)
  Flashback Database Status:
    ORCLCDB       : On
    ORCLCDB_STBY  : On
  Transport-Related Information:
    Transport lag:   +00 00:00:00
  Apply-Related Information:
    Apply lag:       +00 00:00:00

If “Ready for Switchover” isn’t Yes, stop and fix that first — usually an archive gap, missing standby redo logs, or apply lag.

Step 2 — Switch over. One command; the Broker orchestrates both databases:

DGMGRL> SWITCHOVER TO 'ORCLCDB_STBY';

Step 3 — Verify the new roles and that redo is flowing the other way:

-- On the NEW primary (formerly the standby)
SELECT database_role, open_mode, switchover_status FROM v$database;
-- DATABASE_ROLE should now be PRIMARY, OPEN_MODE READ WRITE

-- Confirm the configuration is healthy again
-- DGMGRL> SHOW CONFIGURATION;   -> Status SUCCESS

Step 4 — Redirect the application. This is the step people forget. Clients need to land on the new primary — via a role-based service that only starts in the PRIMARY role, or via a connect string that lists both hosts. Test it, don’t assume it.
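Both redirection approaches can be sketched in a few lines. The first function registers a service that Clusterware (or Oracle Restart) starts only where the database currently holds the PRIMARY role; the second prints a connect descriptor listing both sites. The service name `app_rw`, the host names, and port 1521 are placeholders, the function names are hypothetical, and on admin-managed RAC you would also pass `-preferred` with your instance names.

```shell
# 1) Role-based service: runs only on the database that is currently PRIMARY.
#    Register it at BOTH sites so it follows the role after a switchover.
add_primary_only_service() {
  db_unique_name="$1"
  srvctl add service -db "$db_unique_name" -service app_rw -role PRIMARY
}

# 2) Connect descriptor listing both sites. The client tries each address
#    until it finds a listener that offers the service.
dg_connect_descriptor() {
  primary_host="$1"
  standby_host="$2"
  service="$3"
  cat <<EOF
(DESCRIPTION=
  (CONNECT_TIMEOUT=5)(RETRY_COUNT=3)
  (ADDRESS_LIST=
    (FAILOVER=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=${primary_host})(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=${standby_host})(PORT=1521)))
  (CONNECT_DATA=(SERVICE_NAME=${service})))
EOF
}

dg_connect_descriptor dbhost-a.example.com dbhost-b.example.com app_rw
```

The two pieces work together: because the service exists only on the current primary, a client walking the address list can only ever connect to the right database, whichever site that happens to be.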

Failover (unplanned) and reinstate

A failover is what you run when the primary is gone and not coming back soon. It’s faster and more decisive than a switchover, and with asynchronous transport it may cost you a small amount of redo (your RPO):

DGMGRL> FAILOVER TO 'ORCLCDB_STBY';

With Fast-Start Failover enabled, you don’t type that at all — the Observer detects the outage and promotes the standby automatically, typically in seconds. Either way, when the old primary comes back to life, you don’t rebuild it from scratch: if it had Flashback Database enabled, the Broker can rewind and re-enrol it as the new standby in one step:

DGMGRL> REINSTATE DATABASE 'ORCLCDB';

That Flashback-Database prerequisite is exactly why “enable Flashback on both databases” belongs in your standard build — without it, a failover turns a returning primary into a full rebuild.

Monitoring: what to watch, and when to page

A standby silently falling behind is the classic way DR rots. You need two numbers alarmed at all times — transport lag (redo not yet received) and apply lag (redo received but not yet applied) — plus the health of the apply process and, if you use it, the FSFO state.

-- The two numbers that define your real-world RPO/RTO right now
SELECT name, value AS lag, time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag','apply lag');

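-- Is the apply process actually running? (run on the standby)
-- Note: v$managed_standby is deprecated from 12.2 in favor of v$dataguard_process,
-- but it is still available in 19c.
SELECT process, status, sequence#
FROM   gv$managed_standby
WHERE  process LIKE 'MRP%';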

-- Fast-Start Failover health (run on the primary)
SELECT fs_failover_status, fs_failover_current_target, fs_failover_observer_present
FROM   v$database;

Sensible starting thresholds — tune them to your RPO/RTO, not these defaults:

| Signal | Warning | Critical | Why it matters |
|---|---|---|---|
| Transport lag | > 60s | > your RPO | Redo isn’t reaching the standby: data-loss exposure |
| Apply lag | > 5 min | > your RTO | Standby is “behind”; failover would replay slowly |
| MRP process | not running | absent after retry | Apply has stopped; lag will grow unbounded |
| FSFO status | not SYNCHRONIZED / not within lag limit | observer absent | Automatic failover is not currently possible |
| Archive gap | any persistent gap | growing | A missing sequence blocks all further apply |
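To turn the lag thresholds into an alert, the lag value first has to become a number. The sketch below converts the `+DD HH:MI:SS` text that `v$dataguard_stats` reports into seconds and classifies it. The function names are hypothetical, the thresholds in the example are the starting values from the table with an assumed 15-minute RTO, and a NULL or empty lag (which the view can return when lag is unknown) is not handled here.

```shell
# Convert a v$dataguard_stats lag value ("+DD HH:MI:SS") to seconds.
lag_to_seconds() {
  # Split on '+', space and ':'. The leading '+' leaves field 1 empty,
  # so days, hours, minutes and seconds are fields 2 through 5.
  printf '%s\n' "$1" | awk -F'[ :+]+' '{ print $2 * 86400 + $3 * 3600 + $4 * 60 + $5 }'
}

# Classify a lag in seconds against warning and critical thresholds.
classify_lag() {
  seconds="$1"; warning="$2"; critical="$3"
  if [ "$seconds" -gt "$critical" ]; then
    echo CRITICAL
  elif [ "$seconds" -gt "$warning" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Example: apply lag of 6m05s, warning above 5 minutes, critical above a 15-minute RTO.
classify_lag "$(lag_to_seconds '+00 00:06:05')" 300 900
```

The example prints WARNING, since 365 seconds is above the 300-second warning line but below the 900-second critical one.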

Two operational notes: run the Observer on a third, independent host (not on either database server — otherwise the thing that watches for failure can die with the failure), and if you run Oracle Enterprise Manager, its Data Guard metrics wrap all of the above in alerting so you’re not hand-rolling every check.

One subtlety worth calling out: when apply lag grows but transport is healthy and there’s no archive gap, the standby itself is usually the bottleneck — redo is arriving but the apply can’t keep up because the standby is I/O- or CPU-bound. That’s not a Data Guard problem, it’s a performance problem, and you diagnose it the same way you’d diagnose any slow database: pull an AWR report on the standby and read it. If that’s unfamiliar territory, start with How to Read an AWR Report Without Drowning.

Troubleshooting the usual suspects

When Data Guard misbehaves, it’s almost always one of a handful of patterns. The Broker surfaces these as ORA-16xxx messages — always read the Broker’s StatusReport for the specific code and its recommended action rather than guessing:

DGMGRL> SHOW CONFIGURATION;                 -- look for WARNING/ERROR
DGMGRL> SHOW DATABASE 'ORCLCDB_STBY' StatusReport;

| Symptom | Likely cause | Where to look | Typical fix |
|---|---|---|---|
| Apply lag climbing, sequence stuck | Archive gap (a missing redo sequence) | v$archive_gap, gv$archived_log | Broker/FAL usually auto-resolves; if not, ship the missing logs and re-register |
| Standby block corruption after a bulk load | NOLOGGING operation on the primary | alert log, v$database.force_logging | ALTER DATABASE FORCE LOGGING; restore affected datafile from primary |
| Transport lag grows under load | Network throughput < redo rate | v$dataguard_stats, redo generation rate | Tune TCP/socket buffers, enable redo transport compression, or use Far Sync |
| Real-time apply won’t start | Standby redo logs missing/undersized | v$standby_log | Add standby redo logs (one more group than online, same size) |
| Apply stopped after a failover test | Flashback off, can’t reinstate | v$database.flashback_on | Enable Flashback Database; reinstate via the Broker |

The meta-lesson: most “Data Guard is broken” tickets really come down to one of three things: force logging wasn’t set, standby redo logs were never created, or the network can’t keep up with peak redo. Get those three right at build time and you’ll prevent the majority of incidents.
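For the standby redo log part of that build checklist, the usual rule of thumb is one more group per thread than you have online groups, at the same size as the online logs. A tiny sketch of the arithmetic, with a hypothetical function name:

```shell
# Number of standby redo log groups to create, using the rule of thumb
# "(online groups per thread + 1) x number of threads".
# Each standby log should be the same size as the online redo logs.
standby_redo_groups_needed() {
  online_groups_per_thread="$1"
  threads="$2"
  echo $(( (online_groups_per_thread + 1) * threads ))
}

standby_redo_groups_needed 3 1   # single instance with 3 online groups
standby_redo_groups_needed 4 2   # 2-node RAC with 4 online groups per thread
```

These print 4 and 10. Compare the result with the row count in `v$standby_log` on both databases, since the current primary needs standby redo logs too in order to act as a standby after a switchover.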

Test it for real: a DR game-day

A standby you have never failed over to is a hope, not a plan — so put it on a schedule. A practical cadence is a switchover every quarter (it’s lossless and reversible) and a full failover drill at least annually. To exercise the application against standby data without disturbing replication, use a snapshot standby: it opens read-write for testing, then discards its changes and catches back up.

-- Open the standby read-write for application testing
DGMGRL> CONVERT DATABASE 'ORCLCDB_STBY' TO SNAPSHOT STANDBY;
-- ... run your app test suite against it ...
-- Roll it back and resume keeping pace with the primary
DGMGRL> CONVERT DATABASE 'ORCLCDB_STBY' TO PHYSICAL STANDBY;

A repeatable game-day runbook:

  1. Announce the window and the rollback plan.
  2. Pre-check with VALIDATE DATABASE (Ready for Switchover = Yes).
  3. Execute the switchover (or failover, for the annual drill).
  4. Verify the application actually reconnects through your role-based service — this is the test, not the database role itself.
  5. Measure the real RTO and RPO and compare them to target. Numbers, not vibes.
  6. Switch back and confirm the configuration returns to SUCCESS.
  7. Report: measured RTO/RPO, every gap you hit, and the owner/date for each fix.
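For step 5 of the runbook, a stopwatch in shell is enough to get a real number. The sketch below keeps running a probe command until it succeeds and reports how long that took, then compares the result with the target. The function names are made up, and `./probe_app.sh` in the usage note is a hypothetical script of your own that exits 0 only when the application can complete a transaction through the role-based service.

```shell
# Time how long a probe command keeps failing, in whole seconds.
# Start it at the moment the primary stops accepting work.
measure_outage_seconds() {
  start=$(date +%s)
  until "$@" >/dev/null 2>&1; do
    sleep 1
  done
  end=$(date +%s)
  echo $(( end - start ))
}

# Compare a measured value with the agreed target, both in seconds.
rto_verdict() {
  measured="$1"; target="$2"
  if [ "$measured" -le "$target" ]; then
    echo "PASS: ${measured}s <= ${target}s"
  else
    echo "FAIL: ${measured}s > ${target}s"
  fi
}
```

A drill might run `rto_verdict "$(measure_outage_seconds ./probe_app.sh)" 120` for a two-minute RTO target. Resolution is about one second, and if the probe is started while the old primary is still serving requests it will succeed immediately and report zero, so timing the start matters.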

That report is also the artifact that turns “I think we’re covered” into something leadership can actually rely on — and it’s how you find the decommissioned-host-in-the-runbook problem in a drill instead of during a real outage.

Patching and upgrading without downtime

Here’s the payoff most teams undersell: the biggest day-to-day return on HA isn’t surviving disasters — it’s making planned maintenance nearly invisible. The same building blocks let you patch and upgrade with little or no downtime, and that benefit cashes in every single patch cycle.

  • Rolling patches with RAC. Most quarterly Release Updates are RAC-rolling: you patch one node at a time while the others keep serving the database. Connections drain off the node you’re working on (via services with a drain timeout, or Application Continuity) and return when it rejoins. No outage, just a brief capacity dip.
  • Standby-first patching. For patches that aren’t RAC-rolling, Data Guard gives you another route: apply the patch to the standby first, verify it there, switch over to the patched standby, then patch the old primary. The application sees one short switchover instead of a maintenance window. (Oracle marks which patches are “Standby-First Installable.”)
  • Major upgrades with DBMS_ROLLING. A full release upgrade (say 19c → 23ai) normally means real downtime. DBMS_ROLLING converts your physical standby into a transient logical standby, upgrades it while the primary keeps running, and then switches over, so the application’s downtime collapses to a single switchover rather than the whole upgrade window:

-- sketch of a DBMS_ROLLING upgrade, driven from the primary
-- (DBMS_ROLLING requires the Active Data Guard option)
EXEC DBMS_ROLLING.INIT_PLAN(future_primary => 'ORCLCDB_STBY');
EXEC DBMS_ROLLING.BUILD_PLAN;
EXEC DBMS_ROLLING.START_PLAN;     -- standby becomes a transient logical standby
-- ... upgrade the transient logical standby to the new release ...
EXEC DBMS_ROLLING.SWITCHOVER;     -- the application flips to the upgraded database
EXEC DBMS_ROLLING.FINISH_PLAN;

The thread tying all three together: planned downtime is a choice, not a law of physics. If your SLA can’t spare a maintenance window, the HA you built for disasters quietly pays for itself every time you patch.

Try it yourself: a runnable lab

Reading about recovery is one thing; doing it is what builds the reflex. I put together a small lab you can run on a laptop with nothing but Docker — no Oracle account required — so you can feel the most important lessons here first-hand. It uses the community Oracle Database Free image and runs every command inside the container, so you don’t even need a local Oracle client.

A quick honesty note about scope, because it maps exactly to this article:

  • RAC isn’t something you can meaningfully run on a single laptop. It needs shared storage, a private interconnect, and clusterware across nodes — a real cluster, not a container trick. So the lab doesn’t pretend to.
  • Data Guard is an Enterprise Edition feature, and the zero-login Free image doesn’t include it. So the no-setup lab focuses on the failure modes you can reproduce — and which this post argues are the most commonly mishandled: human error, media loss, and corruption. A separate, opt-in Enterprise Edition module covers a real primary/standby switchover and failover for when you want to rehearse those too.

Getting started is three commands:

./run.sh up        # pulls the image and creates the database (first run takes a few minutes)
./run.sh setup     # enables archivelog and creates a small demo schema
./run.sh all       # runs all three drills end to end

The three drills, and the lesson each one drives home:

  1. Human-error recovery. The lab deletes every row (committed) and then drops the table, two perfectly valid statements a standby would have replicated within seconds, and recovers both locally with Flashback Query and Flashback Table. This is the “replication is not a backup” point you can now prove to yourself (and to a skeptical colleague) in thirty seconds.
  2. RMAN backup & restore. Take a backup, take a datafile offline and delete it from disk to simulate media failure, then restore and recover just that file while the rest of the database stays open. That’s the restore-drill muscle this post keeps insisting you build.
  3. Block-corruption detection & recovery. Write garbage into a single on-disk block, detect it with RMAN VALIDATE CHECK LOGICAL, and repair it with block media recovery — no full restore needed.

The full lab — the docker compose file, the run.sh driver, every drill script, and the optional Enterprise Edition Data Guard module — is the ha/ lab in github.com/pyaroslav/oracle-labs. Clone it, run it, break things on purpose. (No spare RAM on your laptop? The repo includes a guide to run the whole thing free on an OCI Always Free cloud VM.) Discovering that your runbook references a decommissioned host is a great thing to learn in a lab on a Tuesday afternoon — and a terrible thing to learn at 2am.

What about 23ai and 26ai?

If you’re on or moving to a newer release — 23ai, or the current 26ai — the good news is that none of the decision framework above changes: the failure modes are the same, RAC still protects compute, Data Guard still protects data, and backups + Flashback still own corruption and human error. The “ai”-era releases continue the same Maximum Availability Architecture lineage and add incremental improvements across the stack (redo transport/apply efficiency, manageability, and — notably in 23ai — new in-database capabilities like AI Vector Search that change what you run, not how you protect it). What does shift between releases is the small print: default parameter values, which features are enabled, and option licensing. So when you implement on 23ai or 26ai, confirm the exact behavior and licensing against that release’s documentation rather than assuming 19c defaults carry over — and, if you want a free place to check, the Oracle Database Free image (currently 26ai) and OCI Always Free Autonomous Database both let you verify on a real instance at no cost.

What teams get wrong (the short list)

  • Treating RAC as DR. It isn’t. One copy of data, one storage, one site.
  • An untested standby. If you haven’t done a real switchover, you don’t have DR — you have a theory. Schedule game-days.
  • Assuming replication protects against mistakes. A bad DELETE reaches the standby before you can cancel it. Flashback and backups are your safety net, every time.
  • Buying Gold when Bronze/Silver was the requirement. Match the MAA tier to a stated RTO/RPO, not to fear. Complexity you can’t operate is a liability, not insurance.
  • Ignoring the licensing line. RAC and Active Data Guard are paid options. Design within what you’re actually licensed for, or get the budget approved on purpose.

Frequently asked questions

Is Oracle RAC a disaster recovery solution?

No. RAC protects against instance and node failure by running multiple instances against one shared copy of the database. Because there is only one copy of the data on shared storage, a site outage, storage failure, or block corruption affects all RAC nodes at once. Disaster recovery requires an independent copy, which is what Data Guard provides.

Do I still need Data Guard if I already have RAC?

Yes, if you need to survive losing a site or region, or to protect against data corruption. RAC and Data Guard solve different failures: RAC handles local node failure, while Data Guard maintains a separate standby database for site loss and corruption protection. Many mission-critical systems run both.

Does Data Guard protect against accidental data deletion?

No. An accidental DELETE or DROP is a valid transaction, so Data Guard faithfully ships and applies it to the standby within seconds. Protection against human and logical errors comes from Flashback Database, Flashback Table, guaranteed restore points, and RMAN point-in-time recovery — not from replication.

What is the difference between a switchover and a failover?

A switchover is a planned, lossless role reversal between the primary and standby, used for maintenance and DR testing. A failover is an unplanned promotion of the standby when the primary is lost; with asynchronous transport it may incur a small amount of data loss. Fast-Start Failover can perform failovers automatically.

Is Data Guard included with Oracle Enterprise Edition?

Basic Data Guard — a physical standby in mount mode doing Redo Apply — is included with Enterprise Edition. Active Data Guard, which adds a read-only open standby, Automatic Block Media Recovery, and Far Sync, is a separately licensed option. RAC is also a separately licensed option.

What RPO can Data Guard achieve?

Zero data loss is achievable using synchronous redo transport in Maximum Availability or Maximum Protection mode, optionally with a Far Sync instance to preserve zero RPO over long distances. Asynchronous transport (Maximum Performance) typically loses only seconds of redo but adds no commit latency on the primary.

What is the difference between RAC and RAC One Node?

Full RAC runs multiple active instances across nodes for both high availability and scale-out. RAC One Node runs a single active instance that Oracle Clusterware can fail over or online-relocate to another node, with rolling patching. RAC One Node offers most of the availability benefit with less complexity, and can be scaled up to full RAC later.

What is Oracle Maximum Availability Architecture (MAA)?

MAA is Oracle's set of best-practice reference architectures for high availability and disaster recovery, organized into tiers: Bronze (a single instance with RMAN backups and Flashback), Silver (adds RAC or RAC One Node for local failure), Gold (adds Active Data Guard for site loss and corruption), and Platinum (adds GoldenGate, Application Continuity, and Edition-Based Redefinition for zero-downtime maintenance). You choose the lowest tier that meets your RTO and RPO targets.

What is an Oracle Data Guard Far Sync instance?

A Far Sync instance is a lightweight Data Guard member — just a control file and redo, no datafiles — placed close to the primary. The primary ships redo to it synchronously (zero data loss, low latency), and Far Sync forwards that redo asynchronously to a distant standby. This achieves zero-data-loss protection (RPO near zero) across long geographic distances without the commit latency that synchronous transport directly to a far-away standby would impose.

The one-paragraph version

Set RTO and RPO with the business. Use RAC (or RAC One Node) to survive instance and node failure at a site with no downtime. Use Data Guard to survive site loss and corruption, with Fast-Start Failover for automatic recovery and Far Sync if you need zero data loss over distance. Use both (MAA Gold) only when your targets genuinely demand it. And in every design, no exceptions, keep RMAN backups and Flashback Database, because those are the only things that save you from the failure RAC and Data Guard can’t: the human one.

Have a question or some feedback?

I write here in a personal capacity and enjoy comparing notes with other Oracle folks. Say hello.
