Grooveshark’s Growing Pains: Love from Our CIO

Ben Westermann-Clark

June 26, 2008

We try our hardest to bring you a reliable, fun-to-use place for listening to old favorites and discovering new music. As a fairly young company with a small team, we’ve run into some bugs as we’ve worked to keep up with all of your interest in Grooveshark, and we’re so very, very thankful. Grooveshark’s Chief Information Officer, Colin Hostert, has taken some time to explain a bit of what goes on behind the scenes, complete with updates on where we stand now. Take a look!

Over the last week you may have noticed a few hiccups in the availability of our beloved Grooveshark. Most of those hiccups have centered on problems with our storage device, a Sun x4500 [ http://www.sun.com/servers/x64/x4500/ ]. We use this device to store content for our various websites (Grooveshark Beta [ http://beta.grooveshark.com/ ] and Grooveshark [ http://listen.grooveshark.com/ ]), so when it goes down, things go crazy. I would like to share an email I sent to our developers explaining what happened, in the hope that it will answer any questions you might have. If anyone has further questions or insight, please feel free to comment and I will reply as best I can.

OK, so here’s a quick recap of what happened with the x4500 (SAN). There is a bug in ZFS that caused a destroy command to hang, so after it hung I had to wipe the ZFS config file and re-import the pools (partitions). This should be a quick process, but because the box got rebooted during a destroy operation, extra consistency checks had to be run, which made the import take forever. There is supposedly a patch that solves the destroy hang, but I am having trouble getting it to install because of conflicts with other patches Sun wanted installed. I am still working with Sun on this.
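For the curious, the recovery roughly amounted to the following. This is a sketch, not our exact procedure: the pool name (`tank`) and paths are illustrative placeholders.

```shell
# Clear the cached pool configuration that the hung destroy left in a
# bad state, then reboot so ZFS starts with no pools attached.
rm /etc/zfs/zpool.cache
reboot

# Re-import each pool by name. Because the box went down mid-destroy,
# the import has to verify far more on-disk state than usual, which is
# why it took so long against a 30TB working set.
zpool import -f tank

# Confirm the pool came back online and healthy.
zpool status tank
```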

Big point here is that everything is back up and it appears to be stable. Now, to answer a few questions:

A bug? WTF I thought Solaris was supposed to be uber-stable.
This is a fair point. My take is that it is a very new hardware platform running, in filesystem terms, very new software. The 4500 is the first system to routinely push ZFS to 30TB working sets, so issues are popping up here that wouldn’t show on smaller hardware. The platform also has 4 controller cards and 48 hard drives; that is a lot of hardware interaction, which demands that the drivers and firmware on the 4500 be perfect. Finally, the bug was specific to the 4500, and even there it doesn’t show up in every destroy operation.

Does this crash mean ZFS sucks and is not suitable for production usage?
If anything, I think this speaks to the strength of ZFS. There was a hang during an operation that was destroying part of the filesystem (at my request). During that operation the box was cold rebooted, which is close to a worst-case scenario. Even with all of that, there is zero loss or corruption of data, and in theory zero chance of that ever happening, because of the insane consistency checks ZFS does.
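To make those "insane consistency checks" concrete, here is a toy Python sketch of the end-to-end checksumming idea ZFS is built on. This is an illustration of the concept only, not ZFS code: ZFS checksums every block (with fletcher or SHA-256 checksums stored in the parent block pointers), so a block damaged by a crash or bad hardware is caught on read rather than silently returned.

```python
import hashlib

def write_block(data: bytes) -> tuple[bytes, str]:
    """Store a block together with a checksum of its contents."""
    return data, hashlib.sha256(data).hexdigest()

def read_block(data: bytes, checksum: str) -> bytes:
    """Verify the stored checksum before trusting the block."""
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return data

# A clean round trip succeeds.
block, csum = write_block(b"song metadata")
assert read_block(block, csum) == b"song metadata"

# Flip one byte to simulate on-disk corruption: the read is rejected
# instead of handing back bad data.
try:
    read_block(b"sang metadata", csum)
except IOError as err:
    print(err)  # prints: checksum mismatch: block is corrupt
```

The key design point is that the checksum lives *apart* from the data it covers, so a corrupted block cannot vouch for itself.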

Does this mean you suck and should have kept up with patches to keep this from happening? Or: why wasn’t the box patched?
The particular patch that Sun tells me will fix the issue is only available as an IDR [ http://www.auditmypc.com/acronym/IDR.asp ] (it’s not an official patch yet). As an IDR it will not show up in the update manager, nor is it posted anywhere on the net. Sun only hands it out when someone has an issue that requires it, so there was little preventative action we could have taken. Again, the bug only affects a small percentage of destroy operations (we have done dozens of destroys with no issues) and only on the x4500, so I guess they figured it could wait to go through their normal QA/patch process rather than doing an emergency release.

We are not a storage solutions company, so is it really worth messing with the 4500 and getting it right? Why don’t we just store our data in “the cloud”?
There are three reasons we’re not using Amazon S3 exclusively: price, flexibility, and vendor neutrality. To be honest, most of it is price.
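As a back-of-envelope illustration of the price point (the numbers here are assumptions for illustration, not our actual bills: roughly $0.15 per GB-month was S3’s 2008-era storage rate, and transfer fees and hardware amortization are left out entirely):

```python
# Rough monthly cost of keeping our ~30 TB working set on S3,
# storage alone, at an assumed 2008-era rate.
S3_PRICE_PER_GB_MONTH = 0.15   # USD, assumed 2008 S3 storage rate
WORKING_SET_GB = 30_000        # ~30 TB, what the x4500 handles

monthly_s3_storage = WORKING_SET_GB * S3_PRICE_PER_GB_MONTH
print(f"S3 storage alone: ${monthly_s3_storage:,.0f}/month")
# prints: S3 storage alone: $4,500/month
```

And that recurring figure is before per-GB transfer fees, which add up fast for a streaming service, whereas a purchased x4500 is a fixed cost amortized over its lifetime. Hence: most of it is price.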

So are we operating in a crippled mode since we can’t destroy volumes?
Thankfully, no. While we used it a lot in the beginning for testing, the destroy command is not something we have needed much recently. I don’t see the need to destroy any volumes in the next 60 days.

Once the IDR patch is applied will everything be stable and bug free?
Not completely. The Linux kernel has been around forever and it still has bugs; no system will ever be bug free. However, I do think we’ll soon get to the point where the only bugs left are so small we wouldn’t notice them.

Thanks for your support and patience, and if you have any questions or thoughts, leave a comment or send me a message [ http://www.grooveshark.com/contact?id=13 ].

Copyright 2008 http://www.grooveshark.com/community/2008/06/26/groovesharks-growing-pains-love-from-our-cto/