
Salesforce.com's Oracle Grid Database cluster crash!




This must have caused a lot of sleepless nights for the folks at Salesforce, one of Oracle's customers!
But believe me, I know how things go.
  • The database is on Grid Control, who needs a standby?
  • 10g is easy, who needs a backup DBA? Actually, can we fire the DBA too? The sysadmins can do it, no?
  • Everything is on the SAN, someone will restore it.
  • Cluster crash? Never heard of it. Funny, the series I'm writing (part V) talks about understanding your architecture and proactively working toward its continuity.
  • MTTR? What is that?
  • MTTR? We have it set to 3 minutes! (Hey, have you ever tested it? Anywhere? Somewhere?)
  • Backup restores: have you tested them?
  • Do you have a valid test environment?
  • Do you have anything that looks like a test environment? Anything? Something?
I know the management team there is looking hard for someone to blame. I just hope the poor sysadmin or DBA isn't the only one who takes the heat! Management ought to stand up and take its share of the responsibility as well.

We all need to understand that:
  • disks fail
  • clusters crash (with all kinds of errors that need urgent attention all the time!)
  • backups fail
  • restores fail
  • it always happens when you're asleep
  • it happens most often on weekends
Technologies like grid computing and RAC are thoroughly tested. What we do need to realize is that we cannot rely on technology alone; we also need a proper plan for business continuity!

And I don't think hatred had anything to do with it, or did it?
