« Oracle KKS 层-Kernel Kompile Shared (cursors) | Blog首页 | 西宁 青海湖 ,短暂的逃离 »
Oracle数据恢复:RAC系统Redo/Undo损坏恢复
作者:eygle | 【转载请注出处】|【云和恩墨 领先的zData数据库一体机 | zCloud PaaS云管平台 | SQM SQL审核平台 | ZDBM 数据库备份一体机】
链接:https://www.eygle.com/archives/2011/06/recovery_4193_4194_2662.html
昨天,一个客户的数据库系统出现故障,RAC无法启动,大量的错误信息,经过分析检查,最后我们通过强制手段打开数据库,帮助用户挽回了数据损失。链接:https://www.eygle.com/archives/2011/06/recovery_4193_4194_2662.html
故障的起因应该是主机和存储之间出现连接故障,message信息里可以看到大量如下提示:
Jun 1 19:57:18 Oracle-02 kernel: end_request: I/O error, dev sdb, sector 12931135一旦出现IO写失败或者写丢失,则数据库将遭受数据损失,一旦UNDO和REDO损坏,数据库就将会无法启动。
Jun 1 19:57:18 Oracle-02 kernel: sd 6:0:0:1: Device not ready: <6>: Current: sense key: Not Ready
Jun 1 19:57:18 Oracle-02 kernel: Add. Sense: Logical unit not accessible, asymmetric access state transition
Jun 1 19:57:18 Oracle-02 kernel:
Jun 1 19:57:18 Oracle-02 kernel: end_request: I/O error, dev sdb, sector 12931647
Jun 1 19:57:18 Oracle-02 kernel: sd 6:0:0:1: Device not ready: <6>: Current: sense key: Not Ready
Jun 1 19:57:18 Oracle-02 kernel: Add. Sense: Logical unit not accessible, asymmetric access state transition
Jun 1 19:57:18 Oracle-02 kernel:
Jun 1 19:57:18 Oracle-02 kernel: end_request: I/O error, dev sdb, sector 12932671
Jun 1 19:57:18 Oracle-02 kernel: sd 6:0:0:1: Device not ready: <6>: Current: sense key: Not Ready
Jun 1 19:57:18 Oracle-02 kernel: Add. Sense: Logical unit not accessible, asymmetric access state transition
在数据库的告警日志层面,首先出现的错误是如下类型:
WARNING: Read Failed. group:3 disk:0 AU:102 offset:16384 size:16384
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 256 in group [3.3870646521] from disk REDO1 allocation unit 102 reason error; if possible,will try another mirror side
WARNING: Read Failed. group:3 disk:0 AU:102 offset:16384 size:16384
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 256 in group [3.3870646521] from disk REDO1 allocation unit 102 reason error; if possible,will try another mirror side
这些错误首先提示REDO写操作失败,读写AU单位失败,这类错误信息教Oracle 11gR2之前详细了很多,精确到了AU单位及Offset。
在启动过程中出现了ORA-00600错误:
ARC0: STARTING ARCH PROCESSES COMPLETE
Redo thread 2 internally disabled at seq 1 (CKPT)
ARC2: Archiving disabled thread 2 sequence 1
Archived Log entry 785 added for thread 2 sequence 1 ID 0x0 dest 1:
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/diag/rdbms/spsp/SPSP1/trace/SPSP1_ora_19044.trc:
ORA-00600: internal error code, arguments: [2662], [0], [7291890], [0], [7314785], [12583040], [], [], [], [], [], []
Errors in file /oracle/diag/rdbms/spsp/SPSP1/trace/SPSP1_ora_19044.trc:
ORA-00600: internal error code, arguments: [2662], [0], [7291890], [0], [7314785], [12583040], [], [], [], [], [], []
Errors in file /oracle/diag/rdbms/spsp/SPSP2/trace/SPSP2_smon_14335.trc (incident=291384):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/diag/rdbms/spsp/SPSP2/incident/incdir_291384/SPSP2_smon_14335_i291384.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
注意Oracle 11gR2的提示,非常详尽,居然提示你通过My Oracle Support去参考 Note 411.1 学习ADRCI知识。
ORA-600的 4193 和 4194 错误,可以通过重建回滚表空间解决。
此处需要提醒的是,Oracle 11g的缺省UNDO段命名,增加了一个Unix Time的时间戳在回滚段名称里,如下所示:
SQL> select * from v$rollname;
USN NAME
---------- ------------------------------
0 SYSTEM
31 _SYSSMU31_1012201531$
32 _SYSSMU32_3505455792$
33 _SYSSMU33_760762808$
34 _SYSSMU34_2022544229$
35 _SYSSMU35_774751923$
36 _SYSSMU36_3132287946$
37 _SYSSMU37_1634453262$
38 _SYSSMU38_1341527290$
39 _SYSSMU39_2970077311$
40 _SYSSMU40_1403026629$
在设置初始化参数时大致设置如下:
_allow_resetlogs_corruption= TRUE
_offline_rollback_segments= "_SYSSMU11_1202330240$"
_offline_rollback_segments= "_SYSSMU12_1617713323$"
_offline_rollback_segments= "_SYSSMU13_1359816937$"
_offline_rollback_segments= "_SYSSMU14_2078039711$"
_offline_rollback_segments= "_SYSSMU15_1538259293$"
_offline_rollback_segments= "_SYSSMU16_3126366281$"
_offline_rollback_segments= "_SYSSMU17_2553547900$"
_offline_rollback_segments= "_SYSSMU18_1481844821$"
_offline_rollback_segments= "_SYSSMU19_135756661$"
_offline_rollback_segments= "_SYSSMU20_2322289537$"
_corrupted_rollback_segments= "_SYSSMU11_1202330240$"
_corrupted_rollback_segments= "_SYSSMU12_1617713323$"
_corrupted_rollback_segments= "_SYSSMU13_1359816937$"
_corrupted_rollback_segments= "_SYSSMU14_2078039711$"
_corrupted_rollback_segments= "_SYSSMU15_1538259293$"
_corrupted_rollback_segments= "_SYSSMU16_3126366281$"
_corrupted_rollback_segments= "_SYSSMU17_2553547900$"
_corrupted_rollback_segments= "_SYSSMU18_1481844821$"
_corrupted_rollback_segments= "_SYSSMU19_135756661$"
_corrupted_rollback_segments= "_SYSSMU20_2322289537$"
屏蔽了事务的回滚与重做之后,数据库被成功打开。
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE 11.2.0.2.0 Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production
数据库风险随时可能发生,备份必不可少啊!
历史上的今天...
>> 2009-06-02文章:
>> 2008-06-02文章:
>> 2005-06-02文章:
By eygle on 2011-06-02 09:51 | Comments (8) | Backup&Recovery | 2808 |
前面说rac不能启动,如何获得上述_SYSSMU11_1202330240信息?
三种方法:
1、如果是文件系统,可以直接strings system01.dbf | grep _SYSSMU | sort -u
2、在pfile里设置event 10015,即使库打不开,它也可以把所有的回滚段段头给dump出来
3、无论是否使用ASM,用最新版本的ODU都可以把undo$里的记录给unload出来
描述的错误只是11到20回滚段有错误,21到30的状态没有体现呢?10以下的回滚段没有错误?
缺省的只有10个回滚段,之前的可能被删除掉了。
挑错:故障的起因应该是主机和存储之间出现连接“鼓掌”--->> 故障
挑错:故障的起因应该是主机和存储之间出现连接“鼓掌”--->> 故障
改之了,谢谢指出。
挑错别字:屏蔽了事务的回滚与重做之后,数据库备成功打开。 ‘备’-‘被’?