Recent Talkingreef Site outage


Rob
12-02-2007, 12:42 AM
First of all I want to apologize to everyone affected by the recent server outage that caused the TR site to be down for so long.
Second, I want to thank everyone for bearing with me, and being patient while I get things back up and running.

Briefly here is what happened.
Last Thursday there was a hard drive issue that caused us to have to take the server down for a while.
Last Friday, that issue returned. The hard drive totally failed and when replaced, we still had nothing. We then realized that the entire partition had been lost.. Basically all the data on the server was gone.
By Saturday night we had started the long process of data restores.
By Sunday night I cut the cord on that; the restore process was just taking way too long, so we built a new server and I then had to recreate all the accounts on it. (Talkingreef is only 1 of about 20 sites I host here; other sites you may know are Reefreaders, ProjectDIBS, and The DIBS Foundation.)

Late Monday night we finally got the systems back online, starting with Talkingreef and Project DIBS. I then spent the next 2 days working on other sites.

As of now Talkingreef is back up and running. There are a few existing issues that I have to address, and these will require periodic site outages, but they should not be more than 30-60 minutes at a time. These are going to be needed as I work to correct database issues left over from the way we had to restore the database.

NOW, with that said, the restore we did a few days ago left us with an extremely damaged database. I have spent the last few days (and nights) working to get this database back in working order, but because of continuous issues I have been forced to restore from an older backup. This means all posts after 11/19 until now have been lost. This will affect post counts, profile changes, image updates, and thread details, as the newest posts will be gone. I must apologize for this, but these posts are gone and I have no way to retrieve them.

Just to wrap up here, again I want to thank everyone. Hopefully the future outages will be few and far between as I work to clean up these last few issues.

CarmieJo
12-02-2007, 12:59 AM
Thanks Rob for all the hard work you have gone through on this.

Rob
12-02-2007, 01:04 AM
Thanks Carmie..
you know, i knew there was a reason i took this week off work, just didn't realize i would spend the entire week on this shtuff.. Grrrr..

well it should be all back to normal now... just have a few more things to update

(yes i know the images are broke, im working on that too.. ;) )

Reefbaby
12-02-2007, 08:37 AM
gee Rob - major bummer! But, you're awesome for putting up with all those annoying problems!

So...the rest of us...let's get back to re-creating our posts since 11/19!

Phurst
12-02-2007, 09:05 AM
Rob, thanks for putting in all that hard work! I think we can live with a few lost posts, we're just glad the site is back!

Lpkenneys
12-02-2007, 11:20 AM
Just wanted to say thanks as well.. your hard work is appreciated!!!! :-)

Rob
12-02-2007, 01:15 PM
thanks guys, im very happy to see appreciation rather than frustration (i have had enough of that for everyone.. :) )

i should have the last few bugs out soon..

veriann
12-02-2007, 08:45 PM
yea...its rob, welcome back buddy...pweeeh wee,:eek: you need a shower big boy, 6 days in the server dug-outs all for us, where would we be without ya:up:, if it wasn't for you bud, i'd be trapped in some weird german porn site wearing virtual platform shoes & some sort of exotic lace . lol .

nice job on the system wide reinstall, although, its a pity it didn't dump all historical data, id love to start again just quietly. no past to incriminate me:tongue2:

Drimo
12-03-2007, 12:36 AM
Thanks, Rob. I wondered what had happened with my account, but figured it was due to data loss during the server issues. I work in IT and software development so I know how frustrating it can be to lose data. Thanks for the hard work on getting things restored as much as possible. The community has been an asset to me in the past few months (I've been lurking) and I just added my first coral to my tank today (some bullseye mushrooms), so I am glad the site is back up and running.

Amphibious
12-03-2007, 07:26 AM
Wow, Rob, Thanks for the hard work and dedication to TR. I can't even imagine the problems you encountered.

2 - 4 - 6 - 8 - Who do we appreciate? Rob Weatherly! Yeah!!!

Dick (And the rest of the gang, I'm sure!)

mikellini
12-12-2007, 08:43 PM
I know you've probably heard enough of the grief from people, but I want to give you some more... I spent a few hours making a new post in the member's tank projects section, and all for naught. I've been thinking about doing it again, but I don't know if it's worth it quite yet. I think I'll wait for the next crash...

Not to be a pessimist, but these things come in threes. And I know it likely wasn't your fault, but it also likely could have been prevented. When I first saw the title for this thread I thought it said Recent Talkingreef Site Outrage.... which may have been more appropriate

poppin_fresh
12-12-2007, 10:08 PM
:rolleyes:

I cant complain about something that doesn't cost me anything to use.

Drimo
12-12-2007, 11:23 PM
I know you've probably heard enough of the grief from people, but I want to give you some more... I spent a few hours making a new post in the member's tank projects section, and all for naught. I've been thinking about doing it again, but I don't know if it's worth it quite yet. I think I'll wait for the next crash...

Not to be a pessimist, but these things come in threes. And I know it likely wasn't your fault, but it also likely could have been prevented. When I first saw the title for this thread I thought it said Recent Talkingreef Site Outrage.... which may have been more appropriate

You could host your own tank journal or website, then there is no worry about losing it..... right? :roll:

Pescaiolo
12-12-2007, 11:43 PM
I know you've probably heard enough of the grief from people, but I want to give you some more...

Great way to start your post...great way to make friends. Seriously, this is a free website that Rob runs in his spare time. He doesn't have to do this, and an attitude like this would probably deter him from doing it, in turn ruining it for the rest of us that do appreciate what he does. Do you find this site useful? If so, then next time something like this happens (and yes, these things do happen, it's a part of life), sit back and realize that he doesn't have to do this, and if he wanted to pull the plug, he could. Then all your posts would be gone and you would only have your fish to complain to.

Rob thank you for the work you put into this site and thank you for continuing to provide this service to us. I for one appreciate it and I know most of us do as well. Thank you once again!

Amphibious
12-13-2007, 12:10 AM
I too was writing a long post when the site crashed. I lost the material and my train of thought. But rather than bash Rob I chose to accept the loss and move on with life. I met Rob at MACNA XIX in Pittsburgh this Sept. He is dedicated to TR and its members. This is a labor of love, not of money. He could easily pull the plug and let TR go down the drain. I for one would hate to see that happen.

Rob, I too thank you for the dedication you put into this site, the best reefing site on the Internet.

Dick

mikellini
12-13-2007, 03:19 PM
I wasn't bashing Rob, and I'm pretty sure he realizes this... I was just voicing my feelings. Crucify me. I do appreciate that it's a free site (which is, by the way, not unlike many other free reef sites) and is not money-driven. I also appreciate the general attitude of members and the community as a whole. I just wanted to vent and say how I felt. Since when was this a bad thing? I didn't do it in a disrespectful manner, and I wasn't trying to start a fight. But if we aren't allowed to say how we feel in this forum without being judged, that says something about the general attitude of the forum community. Maybe it is changing, I don't know...

CarmieJo
12-13-2007, 11:58 PM
I think everyone was frustrated with the crash. I don't speak for Rob but I would venture to guess that he was the one who was MOST frustrated. I don't know any of the intricacies of hosting a podcast or website but I suppose they are legion.

Computers crash and data gets lost even in a corporate environment. I back up a lot, but I had my HD fail at work and lost a project that I had worked on for 2 days because it hadn't backed up to the server.

I think that one of the things that is great about TR is the fact that we are open and willing to listen. And if you have $3 a month extra, consider a premium membership: http://www.talkingreef.com/forums/site-info-news/189-premium-membership.html

veriann
12-14-2007, 07:40 AM
yep, thanks carmie for drawing my attention to that, im not membered up anymore:unsure: nice fat check after the holidays robbo ok!


Mike, boldly go where no smurf has gone before, Im down with that, cause i do that! ... but vent in maybe a slightly more delicate fashion than "dude you dropped the ball but you could have caught it" type overtones.
Slightly more respect than for the common folk is needed for the hand that feeds you, if you catch the drift.:up:

Ive lost zillions of classic posts in the past, Sh@t happens at the worst of times, but then im glad it happened to some smurfy dude more than us if you know what im saying. Besides, my bad manners alone are enough to shut the site down without tech issues lol

soooooo back to brass tacks people.............who has my beer???

Rob
12-15-2007, 08:36 PM
mike, no hard feelings i do feel the frustration..
lets put this in to perspective..
i had a week off work, i spent my entire week working to recover this site and many others i manage.
i know you and many other people lost posts and data.. believe me.. i spent almost two days trying to prevent that..

the only thing i want to make clear is the part about "And I know it likely wasn't your fault, but it also likely could have been prevented." and to an extent you are right. BUT extreme levels of protection cost extreme amounts of money.

Here are more details as to why. as i noted, there had been a few days of ongoing hardware issues.. failing drives and RAID controllers.. in the beginning they were more "flaky" than actually failing. we worked through that and hoped for the best... well, it turned out that they were quite toasted, and when one drive finally failed we lost the whole data partition on the server. this was caused by the RAID controller writing bad parity blocks in the RAID volume. Because of this the simple RAID disk rebuild did not work. And because of the past few days of data issues, our backups were not valid and were leaving me with corrupted databases.

could i have deployed a clustered front-end web server with high-availability back-end DB servers, daily DB backups (which i do have now) and hourly transaction log backups? sure.. but this type of work is very expensive. lets keep in mind that this website alone sees millions of hits a month, and that doesn't include the podcasts.. how many people realize that Talkingreef podcasts are downloaded thousands of times in the first few days of release, and that i use over 2 Terabytes of bandwidth each month on the podcast server?
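
For anyone curious why bad parity blocks defeat a disk rebuild, here is a minimal sketch. It assumes a RAID 5 style XOR parity layout (the exact RAID level on the server is not stated above), and the block contents and names are made up purely for illustration:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, as RAID 5 style parity does."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks striped across three drives.
data = [b"reef", b"tank", b"fish"]
parity = xor_blocks(data)

# Drive 1 fails: rebuild its block from the surviving data plus good parity.
rebuilt = xor_blocks([data[0], data[2], parity])

# Assumption for illustration: a flaky controller wrote parity with one flipped bit.
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]

# The rebuild still "succeeds", but the recovered block is silently wrong.
corrupt = xor_blocks([data[0], data[2], bad_parity])

print(rebuilt == data[1])
print(corrupt == data[1])
```

The point of the sketch is that the rebuild has no way to notice the problem on its own: it only recombines whatever parity is on disk, so if the parity was written wrong, the "recovered" data is wrong too.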

look, i know everyone was frustrated, and Mike, i know you were just venting.. like i said no hard feelings..
i want to make sure everyone understands the situation..

veriann
12-17-2007, 06:50 AM
Now i didn't understand one word you just said rob, but im suddenly feeling aroused!:huh:

mikellini
12-19-2007, 08:32 PM
Thanks Rob, now I have a little more insight as to what exactly went wrong... I have no doubts that all of this has caused more grief for you than for me. What I am really glad to hear is that you now have taken steps to prevent this from happening again, albeit expensive ones. I appreciate this very much and was not aware of it until I read the last post. I'm going to break the bank and buy a membership, I suppose I really don't have much of a right to complain unless I do ;)

Rob
12-19-2007, 10:55 PM
thank you all for your support and understanding.. and Mike, thanks for your membership, it really does help pay for a lot of these types of things...

oh and V..
just wait until i start talking about SCSI buses and I/O boards.. or even quad port NICS with teamed interfaces and failover paths... :D

poppin_fresh
12-20-2007, 09:14 PM
SCSI? Do they even make that anymore? I haven't used a scsi device in about 5 years! Remember ZIP drives?

CarmieJo
12-21-2007, 06:14 PM
I still have a zip drive at work. Of course, I probably haven't used it in 4 or 5 years. :)

lReef lKeeper
12-21-2007, 06:48 PM
i have not said thanks to Rob ??!! i can not believe that i have not replied in this thread !! THANKS for doing all that you do Rob !!

rroselavy
12-22-2007, 03:13 AM
Here are more details as to why. as i noted there had been a few days of ongoing hardware issues.. failing drives and RAID controllers.. in the beginning they were more "flakey" than actually failing. we worked through that and hoped for the best... well it turned out that they were quite toasted and when one drive finally failed we lost the whole data partition on the server. this was caused by the RAID controller writing bad parity blocks in the RAID volume.

OK. I will take this opportunity to thank Rob for his efforts but also geek out with a few comments and questions. Sys Admin and Integration is part of my job, so I have had the opportunity to work with storage and networking equipment on a daily basis - and understand the limitations, frustrations and liabilities. The only difference is that our setup is primarily Mac OS X based, and I am guessing that Rob runs (or is hosted by) PC or Linux servers.

Geeky Q #1: Our Xserve RAID volumes (RAID 5) have a utility for rebuilding the RAID parity data should it get corrupted, and for reconditioning bad blocks, all while the system is live. Did you have that option? I am not sure what utilities other RAID products ship with...

Geeky Q #2: How much data does TR encompass, and to what medium are backups performed? Does the backup software handle live databases? We perform both disk-to-disk and backup to LTO-3 for our data, but we do not typically backup database files. If you need any advice or have any questions in this area, let me know.

Geeky Q #3: What platform and networking topology does TR use? For our production network we are all Gb with link aggregation for our backbone connections and single Gb links for each client. Our horsepower is for production and local file serving, so we do not require fat pipes to the internet. Also, I am no network guru, but I have put several together and have researched a range of products.

No problem if you prefer to be guarded about these details, I know I would. Just offering any help that I can...

look, i know everyone was frustrated, and Mike, i know you were just venting.. like i said no hard feelings..
i want to make sure everyone understands the situation..

I noticed a few lost posts, but I experienced no frustration. Your attention to detail, sensitivity to user needs, and tireless effort is exceptional and much appreciated.

Pescaiolo
12-22-2007, 11:31 AM
Mike I'm sorry if I came across as judging, I did not mean it that way. The first line in your post got to me when you wanted to add to the grief poor Rob already has! Didn't mean to offend you mate.

SCSI, oh wow haven't seen a SCSI device in years! Daisy chaining was awesome! Chaining 5 to 10 GB harddrives, now we have flash cards with 10gb! Amazing how technology gets better and better!

rroselavy
12-22-2007, 12:58 PM
SCSI, oh wow haven't seen a SCSI device in years! Daisy chaining was awesome! Chaining 5 to 10 GB harddrives, now we have flash cards with 10gb! Amazing how technology gets better and better!

SCSI is considerably less hot-pluggable than USB and Firewire, but it is still a high-performance (Ultra320-SCSI @ 320MBps, Ultra640-SCSI @ 640MBps), reliable technology, still implemented for server side RAID systems, backup tape libraries and other demanding devices. By comparison, we use fibre channel for our RAID arrays, which has max throughputs of 250MB/s (2Gb) and 500MB/s (4Gb). You can see how SCSI is still holding its own.

We've been through many backup, delivery, and sneaker-net mediums over the years, including 3.5" floppy, Magneto-Optical, Syquest, Jaz, Zip, and CD-R, DVD-R and now thumb drives. For our servers we used to back up to DAT and then DLT-7000 and now nearline storage and LTO-3. We sometimes need to resurrect legacy data and are generally pack-rats, so we have kept most of these relic drives in storage just in case. For example, we recently had a bygone client request 13+ year old audio files off of DA-88 tapes!!! I try to migrate more useful data from old medium to new medium when I can, but it is hard to keep up...

As far as hard disk space is concerned, we used to do our production within 450 GB. Back then that was a good amount. Now we have over 15 TB, and I still find it challenging to keep the RAIDs from filling up. Television production is all about HDTV nowadays, so our data requirements have exploded...

I'll shut up now, except to say: Much sympathy to Rob!

Amphibious
12-23-2007, 06:09 AM
just wait until i start talking about SCSI buses and I/O boards.. or even quad port NICS with teamed interfaces and failover paths... :D

SCSI buses, I/O boards, quad port NICS, teamed interfaces, failover paths, Magneto-Optical, Syquest, Jaz, Zip, and CD-R, DVD-R, DAT, DLT-7000, nearline storage, LTO-3, Daisy chaining. What in the world??? That's all greek, no Arabic to me. All I know about the outage is this...

Before the outage I never saw my ad on TR, nor anyone else's. Rob assured mine was in rotation with the others. He could see them, others could see them but I couldn't see them. Then I went on a road trip, took my laptop and broadband card and lo and behold there was all the ads.

Now, since the "fix" Rob had to implement, I'm seeing the ads on my PC.

Thanks Rob.

I'm telling you this computer stuff is wizardry.

Dick

CarmieJo
12-23-2007, 04:09 PM
SCSI buses, I/O boards, quad port NICS, teamed interfaces, failover paths, Magneto-Optical, Syquest, Jaz, Zip, and CD-R, DVD-R, DAT, DLT-7000, nearline storage, LTO-3, Daisy chaining. What in the world??? That's all greek, no Arabic to me. All I know about the outage is this...

Before the outage I never saw my ad on TR, nor anyone else's. Rob assured mine was in rotation with the others. He could see them, others could see them but I couldn't see them. Then I went on a road trip, took my laptop and broadband card and lo and behold there was all the ads.

Now, since the "fix" Rob had to implement, I'm seeing the ads on my PC.

Thanks Rob.

I'm telling you this computer stuff is wizardry.

Dick

Dick, that's not Greek, it's GEEK! ;)

Amphibious
12-23-2007, 05:30 PM
I knew it was something foreign. :rotfl:

veriann
12-25-2007, 08:25 PM
Before the outage I never saw my ad on TR, nor anyone else's. Rob assured mine was in rotation with the others. He could see them, others could see them but I couldn't see them. Then I went on a road trip, took my laptop and broadband card and lo and behold there was all the ads.

Ampage, i touched up your pic file to give you a dark Hitler mo instead of white:huh: i think thats what stopped you from seeing the banner adverts lol

bklynreefdude
05-23-2008, 10:39 AM
Thanks Rob! You da man.

rroselavy
05-23-2008, 11:26 AM
Thanks Rob! You da man.

Yeah, Rob is the man...but alas this is an old thread from December. This most recent outage hasn't been discussed yet, which is weird. Seems like it was at least 2 days long.

I feel bad for Rob. We only hear from the guy when he makes announcements. He doesn't seem to have time to just chat with us anymore.... :cry:

bklynreefdude
05-23-2008, 12:42 PM
Hey Rroselavy:

You are so right! I was so excited to see the site back up I didn't even check the date.


Matt

Rob
05-26-2008, 08:21 PM
yes sorry about that guys.. had an attack on the server on top of some confusion between me and the operators in the data center. the outage was due to an internal DOS attack and having to rebuild the server (was faster than finding the compromise).

Psychojam
05-26-2008, 08:37 PM
The computers were attacked by dissolved organic solids...YUCK! Rob, you need to move your servers away from your fish tank!

IAreef
05-26-2008, 08:42 PM
lol, it's funny how acronyms can have so many meanings

veriann
05-27-2008, 02:54 AM
lol, yeah like " Dont Over Sympathize" .

"Donuts over sweets" :huh:

"Deliver only saturday"


Thanks for the rescue rob, it would be nice to see you more often though.
after seeing that movie the invisible man you thought it was cool to roam the halls in that way didn't you..lol I thought i heard some heavy breathing last time i started undressing! :rotfl:

lReef lKeeper
05-27-2008, 05:39 PM
that wasnt heavy breathing ... it was your date throwing up her/his (i dont know about you sometimes) guts !! lmao

Rob
05-27-2008, 10:09 PM
that wasnt heavy breathing ... it was your date throwing up her/his (i dont know about you sometimes) guts !! lmao
lmao.. you guys are a RIOT!!

Dave1NC
05-27-2008, 10:27 PM
For all the hard work you put into the Talkingreef. :up:

Keep up the good work!

Psychojam
05-27-2008, 10:36 PM
For all the hard work you put into the Talkingreef. :up:

Keep up the good work!

I second that...Thanks Rob.

THEJRC
05-28-2008, 01:40 AM
SCSI is considerably less hot-pluggable than USB and Firewire

I beg to differ, come play with some of the fujitsu and EMC gear we have here in the datacenter he he. But in the entry level / mid grade gear level... I'll agree to a point he he but the high end stuff is nothing short of cool!


but is still a high-performance (Ultra320-SCSI @ 320MBps, Ultra640-SCSI @ 640MBps), reliable technology still implemented for server side RAID systems and backup tape libraries and other demanding devices. By comparison, we use fibre channel for our RAID arrays, which has max throughputs of 250MB/s (2Gb) and 500MB/s (4Gb). You can see how SCSI is still holding its own.


Important to note... Fibre Channel is more like SCSI 2.0; if you read the standards, SCSI itself covers the theory of multiple LUNs and such, hence the basis of fibre channel..... all that said, yep, I'm glad SCSI is holding its own! Even compared against the more popular entry level SAS solutions (cheaper and larger, same speed or somewhat faster, less reliable). In the real world, after you cut your teeth and make mistakes, you learn that SCSI = RELIABLE + FAST; all others fit one of the two categories but not both... hrmm... I love anyone who loves SCSI....



We've been through many backup, delivery, and sneaker-net mediums over the years, including 3.5" floppy, Magneto-Optical, Syquest, Jaz, Zip, and CD-R, DVD-R and now thumb drives. For our servers we used to back up to DAT and then DLT-7000 and now nearline storage and LTO-3. We sometimes need to resurrect legacy data and are generally pack-rats, so we have kept most of these relic drives in storage just in case. For example, we recently had a bygone client request 13+ year old audio files off of DA-88 tapes!!! I try to migrate more useful data from old medium to new medium when I can, but it is hard to keep up...!

oh god, do let me know when you come up with a viable solution... I want one!!! The new LTO libraries out there are nothing short of massive... but looking at the world you're in I dont envy you. Most of my clientele is in the engineering world and cad drawings are simply massive... with compliance requirements to keep full documentation of each revision no matter how small, this gets simply HUGE!!! Even worse... I specialize in compliance... logging file accesses over a day or so doesnt seem big... logging ALL corporate communications, file accesses, online access, and logins (heck even print jobs) means you're setting up a raid for just logging... you mean I gotta store not only the files but the logs now too? Oh wait, and back up all the data in a window thats the approximate size of a walnut... oh wait, you're archiving video and audio streams, lemme think of all the things that would be easier to back up...

in the world of suck... you win!

Used to be this kinda stuff was reserved for NASA....

When did we get so lucky? remember when we wanted to do this as a career track! (hah scary huh....)

/end rant...

all that said, rob, if theres anything I can do to help lemme know. beauty about being the owner... nobody's going to ask me why a server was provisioned for xx.

veriann
05-30-2008, 02:06 AM
you had me at hello! lol

THEJRC
05-30-2008, 02:33 AM
hah (and I'm sure we can all get geekier but I digress..)

The important thing is that the site's back up and working!!!! Rob's put more than his fair share of effort into it at this point for sure so I've got to give him kudos!

on a funny note, I spent so much time formulating my response in that last post that several people had squeezed in between, thus creating an even more confusing mess...

I am now sticking with the Dissolved Organic Solids theory... and will be continuing to use that as an explanation of what a DOS attack is when I get a cockeyed look with the first explanation.

Rob: I bet you never thought TR would get so big as to have a ton of people freaking out when the server went down... oh my god it's turned into work!!! Thanks for all of it, you've done one helluva job

veriann
05-30-2008, 10:36 PM
see problem is, when rob forgets to fill the generators up with juice, the power goes out! :unsure:

lReef lKeeper
05-30-2008, 10:40 PM
V, i guess you missed my post above ?? just wondering cuz there has been no reply from you. you either missed it ... or had no comeback for it ??