Salesforce.com's Oracle Grid Database cluster crash!

This must have meant a lot of sweaty nights for the Salesforce folks, one of Oracle's customers!
But believe me, I know how things go.

Database is Grid Control, who wants a standby?
10g is easy, who wants a backup DBA? Well, can we fire the DBA too? Sysadmins can do it, no?
Everything is on the SAN, someone will restore it.
Cluster crash? Never heard of it. Funny, the series that I'm writing (part V) spoke about understanding your architecture and proactively working towards its continuity.
MTTR, what is that?
MTTR? We have it set to 3 minutes! (Hey, have you ever tested it? Anywhere? Somewhere?)
Backup restore, have you tested it?
Do you have a valid test environment?
Do you have anything that looks like a test environment? Anything? Something?

I know the management team there is looking hard for someone to blame. I just hope the poor sysadmin or DBA isn't the only one who will take the heat! Management ought to stand up and take its responsibility as well.

We need to understand together:

disks fail
clusters crash (with all kinds of errors which need desperate attention all the time!)
backups fail
restores fail
It always happens when you're asleep
It happens mostly on weekends

Technologies like grid computing and RAC are thoroughly tested. What we do need to realize is that we cannot rely on technology alone; we also need a proper plan for business continuity!

And I don't think hatred had anything to do with it, or did it?
