November 3, 2006

A Database Failure Caused by Disk I/O Errors

Author: eygle

Source: http://blog.eygle.com

Just this Monday I mentioned that hardware failures have been frequent lately, and yesterday another database ran into trouble.

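Once again it was a hardware failure: the disk holding the database software and data files developed a problem, and the database went down.
Logging in to the database server and checking showed the following: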

$ df -k
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t10d0s0 494235 95149 349663 22% /
/dev/dsk/c0t10d0s6 4384710 2160661 2180202 50% /usr
/proc 0 0 0 0% /proc
mnttab 0 0 0 0% /etc/mnttab
fd 0 0 0 0% /dev/fd
/dev/dsk/c0t10d0s1 1018191 586987 370113 62% /var
swap 3703192 96 3703096 1% /var/run
swap 4133440 430344 3703096 11% /tmp
/dev/dsk/c4t1d0s0 120514012 100868307 18440565 85% /data1
/dev/dsk/c0t10d0s5 8261393 2365474 5813306 29% /opt
/dev/dsk/c0t11d0s2 17348866 17229 17158149 1% /backup
/dev/dsk/c0t10d0s4 586515 21157 506707 5% /export/home
$ cd /data1
$ ls
.: I/O error

The database mount point /data1 can no longer be accessed. An "I/O error" message like this usually means there is a problem with the disk.
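The manual check above (listing a mount point and watching for an I/O error) can be automated. The following is only a minimal sketch in Python, not something used in the original incident. The function names `probe_mount` and `probe_all` are made up for illustration, and it assumes that a failing mount point surfaces as an `OSError` whose errno is `EIO`.

```python
import errno
import os


def probe_mount(path):
    """Return "ok" if the directory can be listed, otherwise the errno name.

    A mount point on a failed disk is expected to report "EIO",
    the same condition that `ls` printed as "I/O error".
    """
    try:
        os.listdir(path)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


def probe_all(mount_points):
    """Probe several mount points and map each one to its status."""
    return {mp: probe_mount(mp) for mp in mount_points}


if __name__ == "__main__":
    # The mount points are taken from the df output above; adjust as needed.
    for mount_point, status in probe_all(["/", "/data1", "/backup"]).items():
        print(mount_point, status)
```

Any status other than "ok" on a database mount point is a reason to look at the system log before touching the database itself.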

At this point we can use the system command dmesg to review the system messages.
dmesg - collect system diagnostic messages to form error log

The main errors reported by dmesg were:

$ dmesg

Nov 1 23:58:10 stat socal: [ID 403145 kern.info] ID[SUNWssa.socal.link.5010] socal1: port 1: Fibre Channel is OFFLINE
Nov 1 23:58:56 stat scsi: [ID 243001 kern.warning] WARNING: /sbus@3,0/SUNW,socal@0,0/sf@1,0 (sf3):
Nov 1 23:58:56 stat Offline Timeout
Nov 1 23:58:56 stat scsi: [ID 243001 kern.info] /sbus@3,0/SUNW,socal@0,0/sf@1,0 (sf3):
Nov 1 23:58:56 stat target 0x1 al_pa 0xe8 lun 0 offlined
Nov 1 23:58:56 stat scsi: [ID 107833 kern.warning] WARNING: /sbus@3,0/SUNW,socal@0,0/sf@1,0/ssd@w50020f2300009321,0 (ssd0):
Nov 1 23:58:56 stat SCSI transport failed: reason 'reset': retrying command
Nov 1 23:58:56 stat scsi: [ID 107833 kern.warning] WARNING: /sbus@3,0/SUNW,socal@0,0/sf@1,0/ssd@w50020f2300009321,0 (ssd0):
Nov 1 23:58:56 stat transport rejected fatal error
Nov 1 23:58:56 stat ufs: [ID 702911 kern.warning] WARNING: Error writing master during ufs log roll
Nov 1 23:58:56 stat ufs: [ID 127457 kern.warning] WARNING: ufs log for /data1 changed state to Error
Nov 1 23:58:56 stat ufs: [ID 616219 kern.warning] WARNING: Please umount(1M) /data1 and run fsck(1M)

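From this we can see that the I/O channel had a problem (the Fibre Channel link went offline), which ultimately caused the I/O operations to fail.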
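Reading the log by eye works for a dozen lines, but on a busy host it helps to group the warnings by subsystem. The sketch below is an illustration under stated assumptions: it assumes the message format shown above (timestamp, host, tag, `[ID n facility.level]`, text), treats lines without a tag as continuations of the previous message, and uses the hypothetical helper names `parse_messages` and `warnings_by_subsystem`.

```python
import re
from collections import Counter

# Header line, e.g.
# "Nov 1 23:58:56 stat ufs: [ID 127457 kern.warning] WARNING: ..."
HEADER = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"(?P<tag>\w+):\s+\[ID\s+\d+\s+(?P<facility>\w+)\.(?P<level>\w+)\]"
    r"\s*(?P<msg>.*)$"
)


def parse_messages(lines):
    """Merge each header line with its continuation lines into one record."""
    records = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            records.append(match.groupdict())
        elif records:
            # Continuation: drop the "Mon D HH:MM:SS host" prefix, keep the text.
            parts = line.split(None, 4)
            text = parts[4] if len(parts) == 5 else line
            records[-1]["msg"] = (records[-1]["msg"] + " " + text).strip()
    return records


def warnings_by_subsystem(records):
    """Count kern.warning (or any *.warning) records per tag such as scsi or ufs."""
    return Counter(r["tag"] for r in records if r["level"] == "warning")


# Three lines copied from the dmesg output above.
SAMPLE = """\
Nov 1 23:58:56 stat scsi: [ID 243001 kern.warning] WARNING: /sbus@3,0/SUNW,socal@0,0/sf@1,0 (sf3):
Nov 1 23:58:56 stat Offline Timeout
Nov 1 23:58:56 stat ufs: [ID 127457 kern.warning] WARNING: ufs log for /data1 changed state to Error
""".splitlines()

records = parse_messages(SAMPLE)
counts = warnings_by_subsystem(records)
```

Warnings from both scsi and ufs within the same second, as in this incident, suggest that the file system error is a consequence of the transport failure rather than a separate fault.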

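This is no longer a database-level problem. We restarted the host and the disk array and ran a disk check, after which the system returned to normal.
We were fairly lucky!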
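For reference, the kernel message above asks for /data1 to be unmounted and checked with fsck. The sketch below only builds that command sequence as a dry run and executes nothing. It is not the exact procedure used in this incident. The helper names `raw_device` and `recovery_plan` are hypothetical, the device name is taken from the df output, and the script assumes the usual Solaris convention that the raw device for /dev/dsk/X is /dev/rdsk/X.

```python
def raw_device(block_device):
    """Map a Solaris block device path to its raw (character) device path."""
    return block_device.replace("/dev/dsk/", "/dev/rdsk/", 1)


def recovery_plan(mount_point, block_device):
    """Return the commands suggested by the ufs warning, in order.

    Nothing is executed here; the list is meant to be reviewed by an
    administrator once the underlying hardware fault has been fixed.
    """
    return [
        ["umount", mount_point],
        ["fsck", raw_device(block_device)],
        ["mount", block_device, mount_point],
    ]


if __name__ == "__main__":
    for command in recovery_plan("/data1", "/dev/dsk/c4t1d0s0"):
        print(" ".join(command))
```

Running fsck only makes sense after the I/O path is healthy again. While the Fibre Channel link is still offline, the check itself would fail with the same transport errors.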

-The End-

Posted by eygle at 4:13 PM



CopyRight © 2004-2008 eygle.com, All rights reserved.