eygle.com   eygle.com
eygle.com eygle
eygle.com  
 

« Oracle KKS 层-Kernel Kompile Shared (cursors) | Blog首页 | 西宁 青海湖 ,短暂的逃离 »

Oracle数据恢复:RAC系统Redo/Undo损坏恢复
modb.pro

昨天,一个客户的数据库系统出现故障,RAC无法启动,大量的错误信息,经过分析检查,最后我们通过强制手段打开数据库,帮助用户挽回了数据损失。

故障的起因应该是主机和存储之间出现连接故障,message信息里可以看到大量如下提示:
Jun  1 19:57:18 Oracle-02 kernel: end_request: I/O error, dev sdb, sector 12931135
Jun  1 19:57:18 Oracle-02 kernel: sd 6:0:0:1: Device not ready: <6>: Current: sense key: Not Ready
Jun  1 19:57:18 Oracle-02 kernel:     Add. Sense: Logical unit not accessible, asymmetric access state transition
Jun  1 19:57:18 Oracle-02 kernel:
Jun  1 19:57:18 Oracle-02 kernel: end_request: I/O error, dev sdb, sector 12931647
Jun  1 19:57:18 Oracle-02 kernel: sd 6:0:0:1: Device not ready: <6>: Current: sense key: Not Ready
Jun  1 19:57:18 Oracle-02 kernel:     Add. Sense: Logical unit not accessible, asymmetric access state transition
Jun  1 19:57:18 Oracle-02 kernel:
Jun  1 19:57:18 Oracle-02 kernel: end_request: I/O error, dev sdb, sector 12932671
Jun  1 19:57:18 Oracle-02 kernel: sd 6:0:0:1: Device not ready: <6>: Current: sense key: Not Ready
Jun  1 19:57:18 Oracle-02 kernel:     Add. Sense: Logical unit not accessible, asymmetric access state transition
一旦出现IO写失败或者写丢失,则数据库将遭受数据损失,一旦UNDO和REDO损坏,数据库就将会无法启动。

在数据库的告警日志层面,首先出现的错误是如下类型:
WARNING: Read Failed. group:3 disk:0 AU:102 offset:16384 size:16384
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 256 in group [3.3870646521] from disk REDO1  allocation unit 102 reason error; if possible,will try another mirror side
WARNING: Read Failed. group:3 disk:0 AU:102 offset:16384 size:16384
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 256 in group [3.3870646521] from disk REDO1  allocation unit 102 reason error; if possible,will try another mirror side

这些错误首先提示REDO写操作失败,读写AU单位失败,这类错误信息教Oracle 11gR2之前详细了很多,精确到了AU单位及Offset。

在启动过程中出现了ORA-00600错误:

ARC0: STARTING ARCH PROCESSES COMPLETE
Redo thread 2 internally disabled at seq 1 (CKPT)
ARC2: Archiving disabled thread 2 sequence 1
Archived Log entry 785 added for thread 2 sequence 1 ID 0x0 dest 1:
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/diag/rdbms/spsp/SPSP1/trace/SPSP1_ora_19044.trc:
ORA-00600: internal error code, arguments: [2662], [0], [7291890], [0], [7314785], [12583040], [], [], [], [], [], []
Errors in file /oracle/diag/rdbms/spsp/SPSP1/trace/SPSP1_ora_19044.trc:
ORA-00600: internal error code, arguments: [2662], [0], [7291890], [0], [7314785], [12583040], [], [], [], [], [], []

Errors in file /oracle/diag/rdbms/spsp/SPSP2/trace/SPSP2_smon_14335.trc  (incident=291384):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/diag/rdbms/spsp/SPSP2/incident/incdir_291384/SPSP2_smon_14335_i291384.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

注意Oracle 11gR2的提示,非常详尽,居然提示你通过My Oracle Support去参考 Note 411.1 学习ADRCI知识。

ORA-600的 4193 和 4194 错误,可以通过重建回滚表空间解决。
此处需要提醒的是,Oracle 11g的缺省UNDO段命名,增加了一个Unix Time的时间戳在回滚段名称里,如下所示:
SQL> select * from v$rollname;

       USN NAME
---------- ------------------------------
         0 SYSTEM
        31 _SYSSMU31_1012201531$
        32 _SYSSMU32_3505455792$
        33 _SYSSMU33_760762808$
        34 _SYSSMU34_2022544229$
        35 _SYSSMU35_774751923$
        36 _SYSSMU36_3132287946$
        37 _SYSSMU37_1634453262$
        38 _SYSSMU38_1341527290$
        39 _SYSSMU39_2970077311$
        40 _SYSSMU40_1403026629$

在设置初始化参数时大致设置如下:
  _allow_resetlogs_corruption= TRUE
  _offline_rollback_segments= "_SYSSMU11_1202330240$"
  _offline_rollback_segments= "_SYSSMU12_1617713323$"
  _offline_rollback_segments= "_SYSSMU13_1359816937$"
  _offline_rollback_segments= "_SYSSMU14_2078039711$"
  _offline_rollback_segments= "_SYSSMU15_1538259293$"
  _offline_rollback_segments= "_SYSSMU16_3126366281$"
  _offline_rollback_segments= "_SYSSMU17_2553547900$"
  _offline_rollback_segments= "_SYSSMU18_1481844821$"
  _offline_rollback_segments= "_SYSSMU19_135756661$"
  _offline_rollback_segments= "_SYSSMU20_2322289537$"
  _corrupted_rollback_segments= "_SYSSMU11_1202330240$"
  _corrupted_rollback_segments= "_SYSSMU12_1617713323$"
  _corrupted_rollback_segments= "_SYSSMU13_1359816937$"
  _corrupted_rollback_segments= "_SYSSMU14_2078039711$"
  _corrupted_rollback_segments= "_SYSSMU15_1538259293$"
  _corrupted_rollback_segments= "_SYSSMU16_3126366281$"
  _corrupted_rollback_segments= "_SYSSMU17_2553547900$"
  _corrupted_rollback_segments= "_SYSSMU18_1481844821$"
  _corrupted_rollback_segments= "_SYSSMU19_135756661$"
  _corrupted_rollback_segments= "_SYSSMU20_2322289537$"

屏蔽了事务的回滚与重做之后,数据库被成功打开。

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

数据库风险随时可能发生,备份必不可少啊



历史上的今天...
    >> 2009-06-02文章:
    >> 2008-06-02文章:
    >> 2005-06-02文章:
           西塘游记之二

By eygle on 2011-06-02 09:51 | Comments (8) | Backup&Recovery | 2808 |

8 Comments

前面说rac不能启动,如何获得上述_SYSSMU11_1202330240信息?

三种方法:
1、如果是文件系统,可以直接strings system01.dbf | grep _SYSSMU | sort -u
2、在pfile里设置event 10015,即使库打不开,它也可以把所有的回滚段段头给dump出来
3、无论是否使用ASM,用最新版本的ODU都可以把undo$里的记录给unload出来

描述的错误只是11到20回滚段有错误,21到30的状态没有体现呢?10以下的回滚段没有错误?

挑错:故障的起因应该是主机和存储之间出现连接“鼓掌”--->> 故障

挑错:故障的起因应该是主机和存储之间出现连接“鼓掌”--->> 故障

挑错别字:屏蔽了事务的回滚与重做之后,数据库备成功打开。 ‘备’-‘被’?


CopyRight © 2004~2020 云和恩墨,成就未来!, All rights reserved.
数据恢复·紧急救援·性能优化 云和恩墨 24x7 热线电话:400-600-8755 业务咨询:010-59007017-7040 or 7037 业务合作: marketing@enmotech.com