Tuesday, September 19, 2006

120986-06

At approximately 5:15 PM last night, the cluster node running our main mail server crashed and rebooted. No big deal, I thought; if it doesn't want to play nice, I'll use the other cluster nodes.

Then the /mail partition won't mount because it needs fsck. No big deal since it's logging... wait... it needs fsck because it's really confused about the logs... fine. This is a huge filesystem with 260ish gig of stuff on it. In maildir format. Lots of tiny little files.

Anyway, I sigh and think: well it'll be about 30 mins to fsck that baddie and all will be right.

Negative. The first fsck took over an hour and a half.

And it failed.

I go to dinner and watch the output on my Treo (which crashed several times, but thankfully screen was there to save the day).

Each time I ran fsck it got a bit faster (down to about 30 mins per run), but it would always exit saying:

# fsck -y -v /dev/rdsk/emcpower4a
** /dev/rdsk/emcpower4a
** Last Mounted on /san/mail/mail
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3a - Check Connectivity
** Phase 3b - Verify Shadows/ACLs
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cylinder Groups
CG 1164: BAD CG MAGIC NUMBER (0x0 should be 0x90255)
WRONG CG NUMBER (0 should be 1164)
IMPOSSIBLE NUMBER OF CYLINDERS IN GROUP (0 is less than 1)
INCORRECT NUMBER OF INODES IN GROUP (0 should be 11648)
INCORRECT NUMBER OF DATA BLOCKS IN GROUP (0 should be 49152)
IMPOSSIBLE BLOCK ALLOCATION ROTOR POSITION (-2147483648 should be at least 0 and less than 49152)
IMPOSSIBLE FRAGMENT ALLOCATION ROTOR POSITION (-17039106 should be at least 0 and less than 49152)
IMPOSSIBLE INODE ALLOCATION ROTOR POSITION (16711420 should be at least 0 and less than 11648)
INCORRECT BLOCK TOTALS OFFSET (3840 should be 168)
BAD FREE BLOCK POSITIONS TABLE OFFSET (1056964608 should 184)
INCORRECT USED INODE MAP OFFSET (0 should be 248)
INCORRECT FREE FRAGMENT MAP OFFSET (16530432 should be 1704)
END OF HEADER POSITION INCORRECT (255819520 should be 7848)

Irreparable cylinder group header problem. Program terminated.


Fine. Ask Google.



Only four responses. [note: this is where panic starts to set in] And to make things better, two of those are the OpenSolaris source code. The other two are the same post from someone asking about this. [note: panic in full swing now]

This is not a good sign. This means I'm out in left field here.

So, after a few more hours of banging my head on the thirty-minute fscks, I start the process to ufsdump and ufsrestore the partition. One problem becomes immediately apparent... It's gonna take 11 hours just to ufsdump the thing. And one could figure about that long to ufsrestore it. TWENTY-TWO HOURS estimated. Whoa... time to look more into the problem.
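For the record, the dump-and-restore route looks roughly like the sketch below. It assumes a freshly made UFS filesystem is already mounted somewhere; the destination mount point and the `copy_ufs` helper name are made up for illustration, and only the source device comes from the fsck output above. It defaults to a dry run that just prints the pipeline, since nobody wants to kick off an 11-hour dump by accident.

```shell
#!/bin/sh
# Sketch only: copy a UFS filesystem via ufsdump piped into ufsrestore.
SRC_DEV=/dev/rdsk/emcpower4a      # raw device from the fsck output
DEST_MNT=/san/mail/mail.new       # hypothetical mount point of the new filesystem

copy_ufs() {
    src=$1
    dest=$2
    # Level-0 (full) dump to stdout, restored recursively from stdin
    # into the destination directory.
    cmd="ufsdump 0f - $src | (cd $dest && ufsrestore rf -)"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: print what would run instead of running it.
        echo "$cmd"
    else
        eval "$cmd"
    fi
}

copy_ufs "$SRC_DEV" "$DEST_MNT"
```

Set `DRY_RUN=0` to actually run it. Piping the two together avoids needing scratch space for a 260ish gig dump file, though both halves still have to walk every one of those tiny maildir files.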

Then I remember something Matt had told me earlier in the evening (like around 10ish)... something about the new version of fsck adding a -v option... heeey.... new version of fsck, eh? Upon prodding around in SunSolve I see that 120986-06 was released on August 18... right before I patched the machines. I see other indications of someone having something like my problem in SunSolve (but not nearly to the degree I do...), but at any rate SunSolve has no real solutions.
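A quick way to confirm a patch is really on a box before blaming it is to look for it in the `showrev -p` listing. This is a minimal sketch: the `patch_installed` helper is a made-up name, and it assumes the usual `Patch: <id> Obsoletes: ...` line format that `showrev -p` prints.

```shell
# Succeeds (exit 0) if the given patch ID appears in showrev -p style
# output read from stdin. Helper name is hypothetical.
patch_installed() {
    # Anchor on "Patch: <id> " so 120986-06 does not match 120986-061.
    grep "^Patch: $1 " >/dev/null
}

# On the Solaris box itself:
#   showrev -p | patch_installed 120986-06 && echo "120986-06 is installed"
```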

So, I took the only step I could think of. I took it out:

# patchrm 120986-06


And, voilà!

# fsck /dev/rdsk/emcpower4a

** /dev/rdsk/emcpower4a
** Last Mounted on /san/mail/mail
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? y

CG 1164: BAD MAGIC NUMBER
6123615 files, 232133573 used, 457986504 free (2902704 frags, 56885475 blocks, 0.4% fragmentation)

***** FILE SYSTEM WAS MODIFIED *****


I've come to the strong conclusion that 120986-06 (mkfs newfs ufs utilities) is [insert not nice thing here].

So, at least without that the [pejorative] fsck finished. Hopefully it did the right thing with Cylinder Group 1164. Like showed it a whole bunch of zeros. At least the fs is mounted and all the spam...er...mail is flowing in like normal.

Which of course was blowing out our SpamAssassin server (when it rains, it pours). I've added more procmailrc locking (now only running once per user) and hopefully life will be happier now. Okay yeah, until procmailrcs time out.

As soon as this is all done then I'm off to bed. That's what I said an hour and a half ago. Mental note: if you get
svc:/system/cron:default: Could not interpret group property.

Check to see that root is uid 0 gid 0. *Sigh*
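That check is easy enough to script. A minimal sketch, assuming a standard colon-separated passwd file; the `root_ids` helper name is made up:

```shell
# Print "uid gid" for root from a passwd-format file
# (defaults to /etc/passwd). Helper name is hypothetical.
root_ids() {
    awk -F: '$1 == "root" { print $3, $4; exit }' "${1:-/etc/passwd}"
}

if [ "$(root_ids)" = "0 0" ]; then
    echo "root is uid 0 gid 0"
else
    echo "root is NOT uid 0 gid 0 (got: $(root_ids))"
fi
```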

3 comments:

Random Sysadmin said...

Well you just saved a critical production filesystem ! (actually you just saved me the pain and the time of restoring it but w/e)

Thank you sir !

Rob said...

Question: what Solaris version are you running and on what arch?

Thanks

Jeff Ballard said...

Solaris Sparc 10