eygle.com   eygle.com
eygle.com  
 

« October 5, 2004 | Blog首页 | October 8, 2004 »



October 7, 2004

Oracle诊断案例-SGA与Swap之一

作者:eygle

出处:http://blog.eygle.com

案例描述:

用户报告,服务器启动一段时间以后,无法建立数据库连接
重新启动几分钟以后,再次无法连接

系统无法正常使用.

1.登陆系统

SunOS 5.8

login: root
Password:
Last login: Tue Mar 23 13:56:59 from 172.16.31.41
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
You have new mail.

2.su 为Oracle用户
检查启动的Oracle进程

发现后台进程正常,有一定量的用户连接


 

wapplatform:/>su - oracle
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
You have new mail.
/export/home1/oracle>ls
admin codesyndealt31 exp.sh local.cshrc local.profile oraclebak oui v6_database
app exp.log jre local.login nsmail oradata swan
export/home1/oracle>cd admin
/export/home1/oracle/admin>ps -ef|grep ora
oracle 25269 25258 0 13:58:36 pts/3 0:00 grep ora
oracle 25257 24906 0 13:58:31 pts/4 0:00 vi alert_HSWAPDB.log
oracle 25267 1 1 13:58:34 ? 0:00 oracleHSWAPDB (LOCAL=NO)
oracle 25184 1 0 13:56:57 ? 0:00 ora_p007_HSWAPDB
oracle 25182 1 0 13:56:57 ? 0:00 ora_p006_HSWAPDB
oracle 25193 1 0 13:57:03 ? 0:01 oracleHSWAPDB (LOCAL=NO)
oracle 25209 1 0 13:57:09 ? 0:00 oracleHSWAPDB (LOCAL=NO)
oracle 25176 1 0 13:56:57 ? 0:00 ora_p003_HSWAPDB
oracle 25180 1 0 13:56:57 ? 0:00 ora_p005_HSWAPDB
oracle 25172 1 0 13:56:56 ? 0:00 ora_p001_HSWAPDB
oracle 25178 1 0 13:56:57 ? 0:00 ora_p004_HSWAPDB
oracle 25170 1 0 13:56:56 ? 0:00 ora_p000_HSWAPDB
oracle 24254 24240 0 12:08:25 pts/2 0:00 -ksh
oracle 25174 1 0 13:56:56 ? 0:00 ora_p002_HSWAPDB
oracle 25244 1 1 13:58:23 ? 0:00 oracleHSWAPDB (LOCAL=NO)
oracle 25218 1 0 13:57:23 ? 0:00 oracleHSWAPDB (LOCAL=NO)
oracle 25159 1 0 13:56:42 ? 0:02 ora_qmn0_HSWAPDB
oracle 25230 1 0 13:57:40 ? 0:01 oracleHSWAPDB (LOCAL=NO)
oracle 25161 1 0 13:56:42 ? 0:00 ora_s000_HSWAPDB
oracle 25149 1 0 13:56:41 ? 0:01 ora_lgwr_HSWAPDB
oracle 25157 1 0 13:56:42 ? 0:00 ora_cjq0_HSWAPDB
oracle 24906 3698 0 13:47:47 pts/4 0:00 -ksh
oracle 25153 1 0 13:56:42 ? 0:01 ora_smon_HSWAPDB
oracle 25058 7464 0 13:55:14 pts/1 0:00 -ksh
oracle 25163 1 0 13:56:42 ? 0:00 ora_d000_HSWAPDB
oracle 25155 1 0 13:56:42 ? 0:00 ora_reco_HSWAPDB
oracle 25151 1 0 13:56:41 ? 0:00 ora_ckpt_HSWAPDB
oracle 25145 1 0 13:56:41 ? 0:00 ora_dbw0_HSWAPDB
oracle 25199 1 15 13:57:04 ? 0:49 ora_j000_HSWAPDB
oracle 4149 4146 0 12:05:11 pts/5 0:00 -ksh
oracle 25232 1 0 13:57:41 ? 0:00 oracleHSWAPDB (LOCAL=NO)
oracle 25119 1 0 13:56:29 ? 0:00 oraclehswapdb (LOCAL=NO)
oracle 25075 1 0 13:55:34 ? 0:00 /export/home1/oracle/app/bin/tnslsnr LISTENER -inherit
oracle 24374 4149 0 12:21:56 pts/5 0:00 sqlplus /nolog
oracle 25143 1 0 13:56:41 ? 0:00 ora_pmon_HSWAPDB
oracle 25258 25242 0 13:58:31 pts/3 0:00 -ksh
/export/home1/oracle/admin>ps -ef|grep ora_
oracle 25275 25258 0 13:58:42 pts/3 0:00 grep ora_
oracle 25184 1 0 13:56:57 ? 0:00 ora_p007_HSWAPDB
oracle 25182 1 0 13:56:57 ? 0:00 ora_p006_HSWAPDB
oracle 25176 1 0 13:56:57 ? 0:00 ora_p003_HSWAPDB
oracle 25180 1 0 13:56:57 ? 0:00 ora_p005_HSWAPDB
oracle 25172 1 0 13:56:56 ? 0:00 ora_p001_HSWAPDB
oracle 25178 1 0 13:56:57 ? 0:00 ora_p004_HSWAPDB
oracle 25170 1 0 13:56:56 ? 0:00 ora_p000_HSWAPDB
oracle 25174 1 0 13:56:56 ? 0:00 ora_p002_HSWAPDB
oracle 25159 1 0 13:56:42 ? 0:02 ora_qmn0_HSWAPDB
oracle 25161 1 0 13:56:42 ? 0:00 ora_s000_HSWAPDB
oracle 25149 1 0 13:56:41 ? 0:01 ora_lgwr_HSWAPDB
oracle 25157 1 0 13:56:42 ? 0:00 ora_cjq0_HSWAPDB
oracle 25153 1 0 13:56:42 ? 0:01 ora_smon_HSWAPDB
oracle 25163 1 0 13:56:42 ? 0:00 ora_d000_HSWAPDB
oracle 25155 1 0 13:56:42 ? 0:00 ora_reco_HSWAPDB
oracle 25151 1 0 13:56:41 ? 0:00 ora_ckpt_HSWAPDB
oracle 25145 1 0 13:56:41 ? 0:00 ora_dbw0_HSWAPDB
oracle 25199 1 13 13:57:04 ? 0:51 ora_j000_HSWAPDB
oracle 25143 1 0 13:56:41 ? 0:00 ora_pmon_HSWAPDB

3.检查Alert.log警报日志文件


 

/export/home1/oracle/admin>ls
hswapdb
/export/home1/oracle/admin>cd *
/export/home1/oracle/admin/hswapdb>ls
bdump cdump create pfile udump
/export/home1/oracle/admin/hswapdb>cd bdump
/export/home1/oracle/admin/hswapdb/bdump>

/export/home1/oracle/admin/hswapdb/bdump>ls -l *.log

-rw-r--r-- 1 oracle dba 813396 Mar 23 13:57 alert_HSWAPDB.log
/export/home1/oracle/admin/hswapdb/bdump>vi *.log
"alert_HSWAPDB.log" 18888 lines, 813396 characters (115 null)
Tue Jun 24 21:17:14 2003
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 3
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.3.0.
System parameters with non-default values:
processes = 400
timed_statistics = TRUE
shared_pool_size = 117440512
large_pool_size = 83886080
java_pool_size = 33554432
control_files = /export/home1/oracle/oradata/hswapdb/control01.ctl,

/export/home1/oracle/oradata/hswapdb/control02.ctl,
/export/home1/oracle/oradata/hswapdb/control03.ctl
db_block_size = 8192
db_cache_size = 352321536
compatible = 9.2.0.0.0
db_file_multiblock_read_count= 16
fast_start_mttr_target = 300
undo_management = AUTO
undo_tablespace = UNDOTBS1
undo_retention = 10800
remote_login_passwordfile= EXCLUSIVE
db_domain = eygle.com
instance_name = hswapdb
dispatchers = (PROTOCOL=TCP) (SERVICE=hswapdbXDB)
job_queue_processes = 10
hash_join_enabled = TRUE
background_dump_dest = /export/home1/oracle/admin/hswapdb/bdump
user_dump_dest = /export/home1/oracle/admin/hswapdb/udump
core_dump_dest = /export/home1/oracle/admin/hswapdb/cdump
sort_area_size = 524288
db_name = hswapdb
open_cursors = 300
star_transformation_enabled= FALSE
query_rewrite_enabled = FALSE
pga_aggregate_target = 154140672
aq_tm_processes = 1

.................

Tue Mar 23 13:40:45 2004
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 11, op = fork, loc = skgpspawn5
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
Tue Mar 23 13:42:02 2004
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
Tue Mar 23 13:55:38 2004
Starting ORACLE instance (normal)
Shutting down instance: further logons disabled
Tue Mar 23 13:56:20 2004
Shutting down instance (abort)
License high water mark = 26
Instance terminated by USER, pid = 25112
Tue Mar 23 13:56:37 2004
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 3
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.3.0.
System parameters with non-default values:
processes = 400
timed_statistics = TRUE
shared_pool_size = 117440512
large_pool_size = 83886080
java_pool_size = 33554432
control_files = /export/home1/oracle/oradata/hswapdb/control01.ctl,

/export/home1/oracle/oradata/hswapdb/control02.ctl,
/export/home1/oracle/oradata/hswapdb/control03.ctl
db_block_size = 8192
db_cache_size = 352321536
compatible = 9.2.0.0.0
db_file_multiblock_read_count= 16
fast_start_mttr_target = 300
undo_management = AUTO
undo_tablespace = UNDOTBS1
undo_retention = 10800
remote_login_passwordfile= EXCLUSIVE
db_domain = eygle.com
instance_name = hswapdb
dispatchers = (PROTOCOL=TCP) (SERVICE=hswapdbXDB)
remote_dependencies_mode = SIGNATURE
job_queue_processes = 10
hash_join_enabled = TRUE
background_dump_dest = /export/home1/oracle/admin/hswapdb/bdump
user_dump_dest = /export/home1/oracle/admin/hswapdb/udump
core_dump_dest = /export/home1/oracle/admin/hswapdb/cdump
sort_area_size = 524288
db_name = hswapdb
open_cursors = 300
star_transformation_enabled= FALSE
parallel_automatic_tuning= TRUE
query_rewrite_enabled = FALSE
pga_aggregate_target = 154140672
aq_tm_processes = 1
PMON started with pid=2
DBW0 started with pid=3
LGWR started with pid=4
CKPT started with pid=5
SMON started with pid=6
RECO started with pid=7
CJQ0 started with pid=8
QMN0 started with pid=9
Tue Mar 23 13:56:42 2004
starting up 1 shared server(s) ...
Tue Mar 23 13:56:42 2004
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
Tue Mar 23 13:56:43 2004
ALTER DATABASE MOUNT
Tue Mar 23 13:56:47 2004
Successful mount of redo thread 1, with mount id 3253076635.
Tue Mar 23 13:56:47 2004
Database mounted in Exclusive Mode.
Completed: ALTER DATABASE MOUNT
Tue Mar 23 13:56:47 2004
Current log# 2 seq# 2136 mem# 0: /export/home1/oracle/oradata/hswapdb/redo02.log
Successful open of redo thread 1.
Tue Mar 23 12:24:54 2004
SMON: enabling cache recovery
Tue Mar 23 12:24:56 2004
Undo Segment 1 Onlined
Undo Segment 2 Onlined
Undo Segment 3 Onlined
Undo Segment 4 Onlined
Undo Segment 5 Onlined
Undo Segment 6 Onlined
Undo Segment 7 Onlined
Undo Segment 8 Onlined
Undo Segment 9 Onlined
Undo Segment 10 Onlined
Successfully onlined Undo Tablespace 1.
Tue Mar 23 12:24:56 2004
SMON: enabling tx recovery
Tue Mar 23 12:24:56 2004
Database Characterset is ZHS16GBK
Tue Mar 23 12:25:01 2004
SMON: Parallel transaction recovery tried
Tue Mar 23 12:25:01 2004
replication_dependency_tracking turned off (no async multimaster replication found)
Completed: ALTER DATABASE OPEN
Tue Mar 23 12:28:26 2004
/* OracleOEM */ ALTER DATABASE DATAFILE '/export/home1/oracle/oradata/hswapdb/users01.dbf' RESIZE 2501760K
Tue Mar 23 12:28:26 2004
ORA-3297 signalled during: /* OracleOEM */ ALTER DATABASE DATAFILE '/export/h...
Tue Mar 23 12:28:32 2004
/* OracleOEM */ ALTER DATABASE DATAFILE '/export/home1/oracle/oradata/hswapdb/users01.dbf' RESIZE 2501760K
ORA-3297 signalled during: /* OracleOEM */ ALTER DATABASE DATAFILE '/export/h...
Tue Mar 23 12:28:53 2004
/* OracleOEM */ ALTER DATABASE DATAFILE '/export/home1/oracle/oradata/hswapdb/users01.dbf' RESIZE 3501760K
Tue Mar 23 12:28:53 2004
ORA-3297 signalled during: /* OracleOEM */ ALTER DATABASE DATAFILE '/export/h...
Tue Mar 23 13:40:45 2004
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 11, op = fork, loc = skgpspawn5
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
Tue Mar 23 13:42:02 2004
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3
:q


发现数据库多次重起,并记录了部分错误信息

该提示说明数据库无法spawn a new session.

quote Yong Huang's comment:

The number in "skgpspawn failed:category = 27142" is probably ORA error:

$ oerr ora 27142
27142, 0000, "could not create new process"
// *Cause: OS system call
// *Action: check errno and if possible increase the number of processes


OSD (OS-dependent) errors are almost always shown as an skg... error (probably means "system, kernel generic").

I don't know what "depinfo = 12" means.

 

4.尝试连接数据库

收到错误信息,无法连接数据库

 


$ sqlplus "/ as sysdba"

SQL*Plus: Release 9.2.0.3.0 - Production on 星期二 3月 23 14:14:06 2004

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

ERROR:
ORA-12540: TNS: 超出内部限制

请输入用户名:
ERROR:
ORA-12540: TNS: 超出内部限制

请输入用户名:
ERROR:
ORA-12540: TNS: 超出内部限制

SP2-0157: 在3次尝试之后无法 CONNECT 到 ORACLE, 退出 SQL*Plus

 


内部限制超过,通常说明某些系统资源不足.

 

5.检查监听器

发现部分连接被拒绝

 

/export/home1/oracle>lsnrctl services

LSNRCTL for Solaris: Version 9.2.0.3.0 - Production on 23-3月 -2004 14:37:23

Copyright (c) 1991, 2002, Oracle Corporation. All rights reserved.

正在连接到 (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC)))
服务摘要..
服务 "PLSExtProc" 包含 1 个例程。
例程 "PLSExtProc", 状态 UNKNOWN, 包含此服务的 1 个处理程序...
处理程序:
"DEDICATED" 已建立:0 已被拒绝:0
LOCAL SERVER
服务 "hswapdb.eygle.com" 包含 2 个例程。
例程 "hswapdb", 状态 UNKNOWN, 包含此服务的 1 个处理程序...
处理程序:
"DEDICATED" 已建立:6 已被拒绝:0
LOCAL SERVER
例程 "hswapdb", 状态 READY, 包含此服务的 1 个处理程序...
处理程序:
"DEDICATED" 已建立:21 已拒绝:6 状态:ready
LOCAL SERVER
服务 "hswapdbXDB.eygle.com" 包含 1 个例程。
例程 "hswapdb", 状态 READY, 包含此服务的 1 个处理程序...
处理程序:
"D000" 已建立:0 已被拒绝:0 当前: 0 最大: 972 状态: ready
DISPATCHER <machine: wapplatform, pid: 25839>
(ADDRESS=(PROTOCOL=tcp)(HOST=wapplatform)(PORT=32869))
命令执行成功

 

在listener.log中找到了相关错误信息

 

23-3\324\302 -2004 12:19:40 * (CONNECT_DATA=(SID=hswapdb)(CID=(PROGRAM=C:\WINNT\Microsoft.NET\Framework\v1.1.4322\aspnet_wp.e
xe)(HOST=SWAN)(USER=SYSTEM))) * (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.30.125)(PORT=1291)) * establish * hswapdb * 12500
TNS-12500: TNS\243\272\274\340\314\375\306\367\316\264\304\334\306\364\266\257\327\250\323\303\265\304\267\376\316\361\306\36
7\275\370\263\314
TNS-12540: TNS\243\272\263\254\263\366\304\332\262\277\274\253\317\336\317\336\326\306
TNS-12560: TNS: \320\255\322\351\312\312\305\344\306\367\264\355\316\363
TNS-00510: \263\254\263\366\304\332\262\277\274\253\317\336\317\336\326\306
Solaris Error: 12: Not enough space
23-3\324\302 -2004 12:19:50 * (CONNECT_DATA=(SID=hswapdb)(CID=(PROGRAM=C:\Program Files\PLSQL Developer\PLSQLDev.exe)(HOST=SW
AN)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.30.125)(PORT=1292)) * establish * hswapdb * 12500
TNS-12500: TNS\243\272\274\340\314\375\306\367\316\264\304\334\306\364\266\257\327\250\323\303\265\304\267\376\316\361\306\36
7\275\370\263\314
TNS-12540: TNS\243\272\263\254\263\366\304\332\262\277\274\253\317\336\317\336\326\306
TNS-12560: TNS: \320\255\322\351\312\312\305\344\306\367\264\355\316\363
TNS-00510: \263\254\263\366\304\332\262\277\274\253\317\336\317\336\326\306
Solaris Error: 12: Not enough space

/export/home1/oracle/app/network/log>grep -w 12 /usr/include/sys/errno.h
#define ENOMEM 12 /* Not enough core

 

quote Yong Huang's comment:

$ grep -w 12 /usr/include/sys/errno.h
#define ENOMEM 12 /* Not enough core */

Here "core" means memory, including real RAM memory and swap space.

6.退出Oracle用户检查

检查系统日志信息,发现大量失败的su操作
有swap区不足的报告

 

/export/home1/oracle/admin/hswapdb/bdump>exit
wapplatform:/>dmesg

2004年03月23日 星期二 14时00分32秒 CST
Mar 22 22:52:36 wapplatform elfexec: [ID 700856 kern.notice] ps: Cannot find ^?ELF^A^B^A
Mar 22 22:53:00 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 22:53:09 wapplatform elfexec: [ID 700856 kern.notice] w: Cannot find ^?ELF^A^B^A
Mar 22 22:53:53 wapplatform last message repeated 4 times
Mar 22 22:56:28 wapplatform elfexec: [ID 700856 kern.notice] ipnat: Cannot find ^?ELF^B^B^A
Mar 22 22:58:00 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 22:59:54 wapplatform elfexec: [ID 700856 kern.notice] ipnat: Cannot find ^?ELF^B^B^A
Mar 22 23:02:26 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:03:00 wapplatform last message repeated 1 time
Mar 22 23:08:00 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:08:34 wapplatform elfexec: [ID 700856 kern.notice] ipnat: Cannot find ^?ELF^B^B^A
Mar 22 23:10:27 wapplatform last message repeated 3 times
Mar 22 23:11:49 wapplatform elfexec: [ID 700856 kern.notice] ipnat: Cannot find ^?ELF^B^B^A
Mar 22 23:11:52 wapplatform last message repeated 1 time
Mar 22 23:13:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:18:01 wapplatform last message repeated 1 time
Mar 22 23:23:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:28:01 wapplatform last message repeated 1 time
Mar 22 23:33:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:38:01 wapplatform last message repeated 1 time
Mar 22 23:43:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:48:01 wapplatform last message repeated 1 time
Mar 22 23:53:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:58:01 wapplatform last message repeated 1 time
Mar 23 00:00:00 wapplatform ufs: [ID 213553 kern.notice] NOTICE: realloccg /export/home1: file system full
Mar 23 00:00:00 wapplatform sendmail[3075]: [ID 702911 mail.crit] My unqualified host name (wapplatform) unknown; sleeping

for retry
Mar 23 00:01:00 wapplatform sendmail[3075]: [ID 702911 mail.alert] unable to qualify my own domain name (wapplatform) --

using short name
Mar 23 00:02:36 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 23 00:03:02 wapplatform last message repeated 1 time
Mar 23 00:08:02 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
....

Mar 23 10:18:15 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 23 10:20:41 wapplatform ufs: [ID 213553 kern.notice] NOTICE: realloccg /export/home1: file system full
Mar 23 10:20:47 wapplatform last message repeated 1 time
Mar 23 10:23:15 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 23 10:24:38 wapplatform ufs: [ID 213553 kern.notice] NOTICE: realloccg /export/home1: file system full
Mar 23 10:24:43 wapplatform last message repeated 1 time
Mar 23 10:24:55 wapplatform ufs: [ID 213553 kern.notice] NOTICE: realloccg /export/home1: file system full
Mar 23 10:25:06 wapplatform last message repeated 2 times
Mar 23 11:09:31 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3118 (su)
Mar 23 11:09:39 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3121 (su)
Mar 23 11:10:48 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3137 (su)
Mar 23 11:18:02 wapplatform sshd[3620]: [ID 800047 auth.error] error: grantpt: Not enough space
Mar 23 11:18:02 wapplatform sshd[3620]: [ID 800047 auth.error] error: session_pty_req: session 0 alloc failed
Mar 23 11:18:43 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3636 (su)
Mar 23 11:19:47 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3672 (su)
Mar 23 11:20:20 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3694 (su)
Mar 23 11:22:23 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3736 (sshd)
Mar 23 11:23:17 wapplatform tmpfs: [ID 518458 kern.warning] WARNING: /tmp: File system full, swap space limit exceeded
Mar 23 11:23:40 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3804 (su)
Mar 23 11:23:40 wapplatform last message repeated 8 times
Mar 23 11:23:56 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3806 (ps)
Mar 23 11:23:56 wapplatform last message repeated 12 times
Mar 23 11:24:01 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3808 (w)
Mar 23 11:24:01 wapplatform last message repeated 8 times
Mar 23 13:40:56 wapplatform su: [ID 810491 auth.crit] 'su root' failed for root on /dev/pts/2
Mar 23 13:46:26 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 24888

(sqlplus)
Mar 23 13:49:18 wapplatform su: [ID 810491 auth.crit] 'su oracle' failed for root on /dev/pts/6
Mar 23 13:54:03 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 25035 (su)
Mar 23 13:54:08 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 25036 (su)

 

现在基本可以判断是交换区的问题,当然和Oracle SGA设置有关.

7.检查系统内存及交换区使用

 

/export/home1/oracle/admin/hswapdb/bdump>exit
wapplatform:/>dmesg

2004年03月23日 星期二 14时00分32秒 CST
Mar 22 22:52:36 wapplatform elfexec: [ID 700856 kern.notice] ps: Cannot find ^?ELF^A^B^A
Mar 22 22:53:00 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 22:53:09 wapplatform elfexec: [ID 700856 kern.notice] w: Cannot find ^?ELF^A^B^A
Mar 22 22:53:53 wapplatform last message repeated 4 times
Mar 22 22:56:28 wapplatform elfexec: [ID 700856 kern.notice] ipnat: Cannot find ^?ELF^B^B^A
Mar 22 22:58:00 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 22:59:54 wapplatform elfexec: [ID 700856 kern.notice] ipnat: Cannot find ^?ELF^B^B^A
Mar 22 23:02:26 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:03:00 wapplatform last message repeated 1 time
Mar 22 23:08:00 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:08:34 wapplatform elfexec: [ID 700856 kern.notice] ipnat: Cannot find ^?ELF^B^B^A
Mar 22 23:10:27 wapplatform last message repeated 3 times
Mar 22 23:11:49 wapplatform elfexec: [ID 700856 kern.notice] ipnat: Cannot find ^?ELF^B^B^A
Mar 22 23:11:52 wapplatform last message repeated 1 time
Mar 22 23:13:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:18:01 wapplatform last message repeated 1 time
Mar 22 23:23:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:28:01 wapplatform last message repeated 1 time
Mar 22 23:33:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:38:01 wapplatform last message repeated 1 time
Mar 22 23:43:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:48:01 wapplatform last message repeated 1 time
Mar 22 23:53:01 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 22 23:58:01 wapplatform last message repeated 1 time
Mar 23 00:00:00 wapplatform ufs: [ID 213553 kern.notice] NOTICE: realloccg /export/home1: file system full
Mar 23 00:00:00 wapplatform sendmail[3075]: [ID 702911 mail.crit] My unqualified host name (wapplatform) unknown; sleeping

for retry
Mar 23 00:01:00 wapplatform sendmail[3075]: [ID 702911 mail.alert] unable to qualify my own domain name (wapplatform) --

using short name
Mar 23 00:02:36 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 23 00:03:02 wapplatform last message repeated 1 time
Mar 23 00:08:02 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
....

Mar 23 10:18:15 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 23 10:20:41 wapplatform ufs: [ID 213553 kern.notice] NOTICE: realloccg /export/home1: file system full
Mar 23 10:20:47 wapplatform last message repeated 1 time
Mar 23 10:23:15 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full
Mar 23 10:24:38 wapplatform ufs: [ID 213553 kern.notice] NOTICE: realloccg /export/home1: file system full
Mar 23 10:24:43 wapplatform last message repeated 1 time
Mar 23 10:24:55 wapplatform ufs: [ID 213553 kern.notice] NOTICE: realloccg /export/home1: file system full
Mar 23 10:25:06 wapplatform last message repeated 2 times
Mar 23 11:09:31 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3118 (su)
Mar 23 11:09:39 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3121 (su)
Mar 23 11:10:48 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3137 (su)
Mar 23 11:18:02 wapplatform sshd[3620]: [ID 800047 auth.error] error: grantpt: Not enough space
Mar 23 11:18:02 wapplatform sshd[3620]: [ID 800047 auth.error] error: session_pty_req: session 0 alloc failed
Mar 23 11:18:43 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3636 (su)
Mar 23 11:19:47 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3672 (su)
Mar 23 11:20:20 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3694 (su)
Mar 23 11:22:23 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3736 (sshd)
Mar 23 11:23:17 wapplatform tmpfs: [ID 518458 kern.warning] WARNING: /tmp: File system full, swap space limit exceeded
Mar 23 11:23:40 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3804 (su)
Mar 23 11:23:40 wapplatform last message repeated 8 times
Mar 23 11:23:56 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3806 (ps)
Mar 23 11:23:56 wapplatform last message repeated 12 times
Mar 23 11:24:01 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 3808 (w)
Mar 23 11:24:01 wapplatform last message repeated 8 times
Mar 23 13:40:56 wapplatform su: [ID 810491 auth.crit] 'su root' failed for root on /dev/pts/2
Mar 23 13:46:26 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 24888

(sqlplus)
Mar 23 13:49:18 wapplatform su: [ID 810491 auth.crit] 'su oracle' failed for root on /dev/pts/6
Mar 23 13:54:03 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 25035 (su)
Mar 23 13:54:08 wapplatform genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 25036 (su)


现在基本可以判断是交换区的问题,当然和Oracle SGA设置有关.

7.检查系统内存及交换区使用

 

$ top

last pid: 25456; load averages: 0.67, 0.70, 0.69 

14:10:03
93 processes: 91 sleeping, 2 on cpu
CPU states: 72.7% idle, 14.9% user, 2.7% kernel, 9.7% iowait, 0.0% swap
Memory: 1024M real, 34M free, 752M swap in use, 10M swap free

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
25199 oracle 1 40 0 674M 631M cpu/2 8:03 16.32% oracle
25209 oracle 1 30 0 675M 630M sleep 0:03 0.13% oracle
25159 oracle 1 48 0 674M 628M sleep 0:03 0.06% oracle
25384 oracle 1 58 0 2632K 1736K cpu/0 0:01 0.05% top
25145 oracle 143 58 0 682M 630M sleep 0:01 0.03% oracle
25446 oracle 1 58 0 674M 628M sleep 0:00 0.03% oracle
25149 oracle 15 58 0 682M 626M sleep 0:00 0.02% oracle
25075 oracle 1 48 0 17M 7208K sleep 0:00 0.01% tnslsnr
25151 oracle 11 58 0 676M 624M sleep 0:00 0.01% oracle
25366 oracle 1 10 0 674M 628M sleep 0:00 0.00% oracle
25356 oracle 1 18 0 674M 628M sleep 0:00 0.00% oracle
25360 oracle 1 20 0 674M 628M sleep 0:00 0.00% oracle
25364 oracle 1 20 0 674M 628M sleep 0:00 0.00% oracle
25362 oracle 1 20 0 674M 628M sleep 0:00 0.00% oracle
25330 oracle 1 28 0 674M 628M sleep 0:00 0.00% oracle

            

发现物理内存仅为1G,free部分为34M,交换区使用了752M,仅10M free
系统内存严重不足,Swap区不足

 

8. 检查数据库的SGA设置

发现SGA设置为: 622299344 bytes
接近600M

 


wapplatform:/>su - oracle
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
You have new mail.
/export/home1/oracle>sqlplus "/ as sysdba"

SQL*Plus: Release 9.2.0.3.0 - Production on 星期二 3月 23 14:02:30 2004

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

连接到:
Oracle9i Enterprise Edition Release 9.2.0.3.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.3.0 - Production

SQL> show sga

Total System Global Area 622299344 bytes
Fixed Size 731344 bytes
Variable Size 268435456 bytes
Database Buffers 352321536 bytes
Redo Buffers 811008 bytes
SQL>



对于RAM小于1G的系统,Dedicated模式下,Oracle的SGA一般不应超过1/2物理内存.

 

9.第一步调整
减小SGA,为系统保留足够的内存.

10.增加swap区

 



wapplatform:/>df -k
文件系统 千字节 用了 可用 容量 挂接在
/dev/dsk/c0t1d0s0 3099093 105421 2931691 4% /
/dev/dsk/c0t2d0s0 10325760 8359637 1862866 82% /usr
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
mnttab 0 0 0 0% /etc/mnttab
/dev/dsk/c0t1d0s3 1018382 285914 671366 30% /var
swap 3904 24 3880 1% /var/run
swap 3936 56 3880 2% /tmp
/dev/dsk/c0t1d0s5 1671823 459202 1162467 29% /opt
/dev/dsk/c0t2d0s7 7087473 6068462 948137 87% /export/home
/dev/dsk/c2t1d0s7 17413250 15900222 1338896 93% /export/home2
/dev/dsk/c0t3d0s7 17413250 13749782 3489336 80% /export/home1
/dev/dsk/c0t1d0s1 771110 382410 334723 54% /usr/openwin
/export/home/wapgw/luke
7087473 6068462 948137 87% /home/wap

wapplatform:/var/swap>cd /export/home1
wapplatform:/export/home1>ls
TT_DB lost+found oracle oracli9
wapplatform:/export/home1>mkdir swap
wapplatform:/export/home1>cd swap
wapplatform:/export/home1/swap>mkfile -v 1g swapfile1
swapfile1 1073741824 bytes
wapplatform:/export/home1/swap>id
uid=0(root) gid=1(other)
wapplatform:/export/home1/swap>swap -a /export/home1/swap/swapfile1
wapplatform:/export/home1/swap>swap -s
总数:分配了 623160k 字节 + 保留 162704k = 已使用 785864k,1010936k 可用

 


11.连接测试

系统恢复正常,问题解决

 



wapplatform:/export/home1/swap>su - oracle
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
You have new mail.
/export/home1/oracle>sqlplus "/ as sysdba"

SQL*Plus: Release 9.2.0.3.0 - Production on 星期四 3月 25 11:56:28 2004

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

连接到:
Oracle9i Enterprise Edition Release 9.2.0.3.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.3.0 - Production

SQL> exit
从Oracle9i Enterprise Edition Release 9.2.0.3.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.3.0 - Production中断开
/export/home1/oracle>top

last pid: 5372; load averages: 0.25, 0.22, 0.29

11:57:58
148 processes: 137 sleeping, 9 zombie, 2 on cpu
CPU states: 98.8% idle, 0.2% user, 0.7% kernel, 0.2% iowait, 0.0% swap
Memory: 1024M real, 17M free, 824M swap in use, 934M swap free

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
5363 root 1 58 0 2680K 1736K sleep 0:00 0.24% top
5370 oracle 1 58 0 514M 469M sleep 0:00 0.18% oracle
5366 oracle 1 28 0 514M 469M sleep 0:00 0.11% oracle
5341 oracle 1 58 0 2680K 1736K cpu/2 0:00 0.10% top
5372 oracle 1 48 0 61M 3288K cpu/3 0:00 0.06% oracle
1288 oracle 1 48 0 514M 468M sleep 5:33 0.05% oracle
607 root 12 48 0 2768K 2312K sleep 1:48 0.03% mibiisa
25075 oracle 1 48 0 17M 7208K sleep 0:16 0.02% tnslsnr
1278 oracle 15 58 0 522M 466M sleep 0:49 0.02% oracle
374 root 11 53 0 3504K 2888K sleep 0:16 0.01% nscd
1280 oracle 19 58 0 518M 466M sleep 0:28 0.00% oracle
5361 root 1 46 0 1024K 680K sleep 0:00 0.00% sleep
5362 root 1 46 0 1024K 680K sleep 0:00 0.00% sleep
5469 root 1 36 0 1952K 1176K sleep 30:09 0.00% monithttp
4167 oracle 1 40 0 515M 471M sleep 29:38 0.00% oracle


问题总结:

Oracle数据库问题的解决从来就离不开操作系统

很多时候我们必须通过操作系统一级的手段来诊断并解决问题.

关于操作系统

一般Swap区的推荐值为2XRAM
如果Ram很大,不一定非要把Swap设置为2xSwap
但是通常至少设置Swap = Ram

如果Swap区过小,在系统繁忙期间
产生大量交换无法换到磁盘,就会出现问题.
如本案例就是这样。

另外,如果系统Ram较小
通常设置SGA < 1/2 Ram

要为Server process及OS保留足够的内存空间.

 

Posted by eygle at 11:28 PM | Comments (1)


Oracle诊断案例-SGA与Swap之二

作者:eygle

出处:http://blog.eygle.com

案例描述:

这是一个大型生产系统
问题出现时系统累计大量用户进程
用户请求得不到及时响应,新的进程不断尝试建立连接
连接数很快被用完

数据库版本:9.2.0.3
操作系统:Solaris8

 

1.检查alert文件

日志中记录如下错误信息,说明磁盘异步IO出现问题:

 

WARNING: aiowait timed out 2 times
Tue Aug 26 15:33:32 2003
WARNING: aiowait timed out 2 times
Tue Aug 26 15:33:34 2003
WARNING: aiowait timed out 2 times
Tue Aug 26 15:33:36 2003
WARNING: aiowait timed out 2 times
Tue Aug 26 15:33:38 2003
WARNING: aiowait timed out 2 times
Tue Aug 26 15:33:43 2003
WARNING: aiowait timed out 1 times
Tue Aug 26 15:33:46 2003
WARNING: aiowait timed out 1 times
Tue Aug 26 15:33:49 2003
WARNING: aiowait timed out 1 times
Tue Aug 26 15:33:51 2003
WARNING: aiowait timed out 1 times
Tue Aug 26 15:33:52 2003
WARNING: aiowait timed out 1 times
Tue Aug 26 15:33:53 2003
WARNING: aiowait timed out 1 times
.............

我们知道在SUN的某些版本上异步IO存在问题
而异步IO缺省是打开的


 

SQL> show parameter disk_a

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
disk_asynch_io boolean TRUE

 

针对此问题,我们停用了数据库的异步IO写入。

2.共享内存问题

alert文件中还记录了以下错误信息:

 

  Tue Aug 26 21:37:40 2003
WARNING: EINVAL creating segment of size 0x0000000190400000
fix shm parameters in /etc/system or equivalent

 



该信息说明内核参数设置过小或者和SGA不匹配

我们检查system配置文件

 

$ cat /etc/system
.......................
set shmsys:shminfo_shmmax=4096000000
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=200
set shmsys:shminfo_shmseg=200
set semsys:seminfo_semmap=1024
set semsys:seminfo_semmni=2048
set semsys:seminfo_semmns=2048
set semsys:seminfo_semmnu=2048
set semsys:seminfo_semume=200
set semsys:seminfo_semmsl=2048

 



我们发现最大共享内存设置仅有4G

 

3.检查SGA设置

 


SQL*Plus: Release 9.2.0.3.0 - Production on 星期二 8月 26 21:46:35 2003

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

Connected to:
Oracle9i Enterprise Edition Release 9.2.0.3.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.3.0 - Production

SQL> show sga

Total System Global Area 6695660272 bytes
Fixed Size 740080 bytes
Variable Size 2399141888 bytes
Database Buffers 4294967296 bytes
Redo Buffers 811008 bytes

 




我们发现SGA设置接近7G,这也就是步骤2中错误提示出现的原因


4.交换区问题

我们用top工具检查系统运行状况

 

# /usr/local/bin/top

last pid: 16899; load averages: 0.82, 0.81, 0.83 21:49:05
1230 processes:1228 sleeping, 1 running, 1 on cpu
CPU states: 50.1% idle, 7.4% user, 8.6% kernel, 33.9% iowait, 0.0% swap
Memory: 8192M real, 118M free, 12G swap in use, 11G swap free

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
15751 oracle 11 44 0 6456M 6408M sleep 0:02 0.49% oracle
15725 oracle 11 58 0 6458M 6410M sleep 0:02 0.46% oracle
251 root 12 48 0 7096K 1944K sleep 126:00 0.45% picld
16540 oracle 11 58 0 6458M 6411M sleep 0:01 0.45% oracle
16766 root 1 43 0 3744K 2248K cpu/1 0:01 0.41% top
16408 oracle 11 58 0 6457M 6410M sleep 0:01 0.34% oracle
15989 oracle 11 58 0 6458M 6409M sleep 0:01 0.34% oracle
15919 oracle 11 58 0 6457M 6409M sleep 0:02 0.30% oracle
16404 oracle 11 58 0 6457M 6409M sleep 0:00 0.28% oracle
16327 oracle 11 55 0 6457M 6410M sleep 0:00 0.27% oracle
14870 oracle 11 58 0 6457M 6412M sleep 0:05 0.24% oracle
16851 oracle 11 35 0 6457M 6411M sleep 0:00 0.22% oracle
16467 oracle 11 58 0 6457M 6409M sleep 0:00 0.21% oracle
16163 oracle 11 58 0 6457M 6408M sleep 0:03 0.21% oracle
15159 oracle 11 58 0 6457M 6408M sleep 0:05 0.21% oracle

 


Memory: 8192M real, 118M free, 12G swap in use, 11G swap free

我们发现系统仅有8G RAM,物理内存仅有118M可用
现在SWAP区使用了12G

我们初步作出以下判断:

SGA设置过大(将近7G)导致运行时产生大量交换

大量SWAP交换进而引发磁盘问题
这也就应该是我们第一步看到
WARNING: aiowait timed out 1 times
的原因

大量交换导致数据库性能急剧下降
进而导致用户请求得不到快速响应,堵塞、累积,直至数据库失去响应

 

5.解决方案

此问题主要是由于SGA设置不当引起,我们马上缩小了SGA设置:

SQL> show sga

Total System Global Area 3591870848 bytes
Fixed Size 735616 bytes
Variable Size 1442840576 bytes
Database Buffers 2147483648 bytes
Redo Buffers 811008 bytes

此时,数据库减少了交换,达到了稳定运行,用户请求可以得到快速响应。

问题解决完成.

 

6.系统状态

调整后系统运行状况:


 

$ top
 
last pid: 12745;  load averages:  0.46,  0.79,  0.65           22:22:49
228 processes: 227 sleeping, 1 on cpu
CPU states: 92.3% idle,  5.0% user,  1.6% kernel,  1.1% iowait,  0.0% swap
Memory: 8192M real, 3817M free, 4015M swap in use, 15G swap free
 
   PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 12610 oracle     1  51    0 3511M   22M sleep    0:04  1.96% oracle
 12595 oracle     1  48    0 3511M   22M sleep    0:03  0.92% oracle
 12630 oracle     1  38    0 3511M   21M sleep    0:01  0.84% oracle
 12614 oracle     1  46    0 3511M   22M sleep    0:01  0.64% oracle
 12620 oracle     1  58    0 3511M   22M sleep    0:01  0.53% oracle
 12709 oracle     1  48    0 3511M   21M sleep    0:00  0.45% oracle
   265 root      11  38    0 7032K 1920K sleep    3:16  0.42% picld
 12729 oracle     1   0    0 3511M   20M sleep    0:00  0.26% oracle
 12741 oracle     1  58    0 2768K 1760K cpu/3    0:00  0.19% top
 12745 oracle     1  44    0 3506M   16M sleep    0:00  0.17% oracle
 12711 oracle     1  48    0 3506M   16M sleep    0:00  0.11% oracle
 12738 oracle     1  43    0 3506M   16M sleep    0:00  0.06% oracle
  7606 oracle     1  45    0   17M 6928K sleep    0:07  0.05% tnslsnr
 12721 oracle     1  34    0 3506M   16M sleep    0:00  0.05% oracle
 12723 oracle     1  53    0 3506M   16M sleep    0:00  0.05% oracle
   

该系统调整完以后,一直稳定运行至今.

 

一点总结:

这个案例和前面我提到的另外一个极其相似
同样都是SGA设置不当引起的数据库问题

本身并不复杂
这一类问题应该在数据库规划和建设阶段就避免掉.

其时,该问题对我更像是个心理测试
当所有老板都站在你背后的时候,你能否冷静快速的找到并解决问题.

关于SUN上的aiowait timed out 有很多总情况及诱因
我后面还有相应的案例说明 .

-Eygle

2.共享内存问题

alert文件中还记录了以下错误信息:

 Tue Aug 26 21:37:40 2003
            WARNING: EINVAL creating segment of size 0x0000000190400000
            fix shm parameters in /etc/system or equivalent 

 该信息说明内核参数设置过小或者和SGA不匹配

我们检查system配置文件

$ cat /etc/system
            .......................
            set shmsys:shminfo_shmmax=4096000000
            set shmsys:shminfo_shmmin=1
            set shmsys:shminfo_shmmni=200
            set shmsys:shminfo_shmseg=200
            set semsys:seminfo_semmap=1024
            set semsys:seminfo_semmni=2048
            set semsys:seminfo_semmns=2048
            set semsys:seminfo_semmnu=2048
            set semsys:seminfo_semume=200
            set semsys:seminfo_semmsl=2048 

 我们发现最大共享内存设置仅有4G

3.检查SGA设置

SQL*Plus: Release 9.2.0.3.0 - Production on 星期二 8月 26 21:46:35 2003
            Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
            Connected to:
            Oracle9i Enterprise Edition Release 9.2.0.3.0 - 64bit Production
            With the Partitioning, OLAP and Oracle Data Mining options
            JServer Release 9.2.0.3.0 - Production
            SQL> show sga
            Total System Global Area 6695660272 bytes
            Fixed Size 740080 bytes
            Variable Size 2399141888 bytes
            Database Buffers 4294967296 bytes
            Redo Buffers 811008 bytes 

 我们发现SGA设置接近7G,这也就是步骤2中错误提示出现的原因


4.交换区问题

我们用top工具检查系统运行状况

 

# /usr/local/bin/top
            last pid: 16899; load averages: 0.82, 0.81, 0.83 21:49:05
            1230 processes:1228 sleeping, 1 running, 1 on cpu
            CPU states: 50.1% idle, 7.4% user, 8.6% kernel, 33.9% iowait, 0.0% swap
            Memory: 8192M real, 118M free, 12G swap in use, 11G swap free
            PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
            15751 oracle 11 44 0 6456M 6408M sleep 0:02 0.49% oracle
            15725 oracle 11 58 0 6458M 6410M sleep 0:02 0.46% oracle
            251 root 12 48 0 7096K 1944K sleep 126:00 0.45% picld
            16540 oracle 11 58 0 6458M 6411M sleep 0:01 0.45% oracle
            16766 root 1 43 0 3744K 2248K cpu/1 0:01 0.41% top
            16408 oracle 11 58 0 6457M 6410M sleep 0:01 0.34% oracle
            15989 oracle 11 58 0 6458M 6409M sleep 0:01 0.34% oracle
            15919 oracle 11 58 0 6457M 6409M sleep 0:02 0.30% oracle
            16404 oracle 11 58 0 6457M 6409M sleep 0:00 0.28% oracle
            16327 oracle 11 55 0 6457M 6410M sleep 0:00 0.27% oracle
            14870 oracle 11 58 0 6457M 6412M sleep 0:05 0.24% oracle
            16851 oracle 11 35 0 6457M 6411M sleep 0:00 0.22% oracle
            16467 oracle 11 58 0 6457M 6409M sleep 0:00 0.21% oracle
            16163 oracle 11 58 0 6457M 6408M sleep 0:03 0.21% oracle
            15159 oracle 11 58 0 6457M 6408M sleep 0:05 0.21% oracle 

 

Memory: 8192M real, 118M free, 12G swap in use, 11G swap free

我们发现系统仅有8G RAM,物理内存仅有118M可用
现在SWAP区使用了12G

我们初步作出以下判断:

SGA设置过大(将近7G)导致运行时产生大量交换

大量SWAP交换进而引发磁盘问题
这也就应该是我们第一步看到
WARNING: aiowait timed out 1 times
的原因

大量交换导致数据库性能急剧下降
进而导致用户请求得不到快速响应,堵塞、累积,直至数据库失去响应

 

 

5.解决方案

此问题主要是由于SGA设置不当引起,我们马上缩小了SGA设置:

SQL> show sga
            Total System Global Area 	3591870848 bytes
            Fixed Size 			735616 bytes
            Variable Size 			1442840576 bytes
            Database Buffers 		2147483648 bytes
            Redo Buffers 			811008 bytes 

此时,数据库减少了交换,达到了稳定运行,用户请求可以得到快速响应。

问题解决完成.

 

6.系统状态

调整后系统运行状况:

 

$ top
            last pid: 12745;  load averages:  0.46,  0.79,  0.65           22:22:49
            228 processes: 227 sleeping, 1 on cpu
            CPU states: 92.3% idle,  5.0% user,  1.6% kernel,  1.1% iowait,  0.0% swap
            Memory: 8192M real, 3817M free, 4015M swap in use, 15G swap free
            PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
            12610 oracle     1  51    0 3511M   22M sleep    0:04  1.96% oracle
            12595 oracle     1  48    0 3511M   22M sleep    0:03  0.92% oracle
            12630 oracle     1  38    0 3511M   21M sleep    0:01  0.84% oracle
            12614 oracle     1  46    0 3511M   22M sleep    0:01  0.64% oracle
            12620 oracle     1  58    0 3511M   22M sleep    0:01  0.53% oracle
            12709 oracle     1  48    0 3511M   21M sleep    0:00  0.45% oracle
            265 root      11  38    0 7032K 1920K sleep    3:16  0.42% picld
            12729 oracle     1   0    0 3511M   20M sleep    0:00  0.26% oracle
            12741 oracle     1  58    0 2768K 1760K cpu/3    0:00  0.19% top
            12745 oracle     1  44    0 3506M   16M sleep    0:00  0.17% oracle
            12711 oracle     1  48    0 3506M   16M sleep    0:00  0.11% oracle
            12738 oracle     1  43    0 3506M   16M sleep    0:00  0.06% oracle
            7606 oracle     1  45    0   17M 6928K sleep    0:07  0.05% tnslsnr
            12721 oracle     1  34    0 3506M   16M sleep    0:00  0.05% oracle
            12723 oracle     1  53    0 3506M   16M sleep    0:00  0.05% oracle 

该系统调整完以后,一直稳定运行至今.

 

一点总结:

这个案例和前面我提到的另外一个极其相似
同样都是SGA设置不当引起的数据库问题

本身并不复杂
这一类问题应该在数据库规划和建设阶段就避免掉.

其时,该问题对我更像是个心理测试
当所有老板都站在你背后的时候,你能否冷静快速的找到并解决问题.

关于SUN上的aiowait timed out 有很多总情况及诱因
我后面还有相应的案例说明 .

Posted by eygle at 10:17 PM | Comments (1)


推荐阅读-中国农民调查

作者:eygle

出处:http://blog.eygle.com

中国农民调查

作者:陈桂棣 春桃


   

中国作家陈桂棣、春桃夫妇著的长篇纪实文学《中国农民调查》,在柏林获得了2004年度全球唯一的报导文学大奖 "尤利西斯"国际报告文学奖[Lettre Ulysses Award]。

德新社驻中国记者和《南德意志报》就此发表了评述。德新社驻中国记者说∶要设想中国9亿农民的艰难命运,一般的想像力已经不够用了。"我们看到了你想像不到的贫穷,想像不到的罪恶,想像不到的苦难,想像不到的无奈,想像不到的抗争,想像不到的沈默,想像不到的感动和想像不到的悲壮,┅┅"


《中国农民调查》花了三年时间走访了安徽省几千农民的经历,立即成了畅销书,三个月内就售出了15万册。中国社科院有人称之为农村问题的"白皮书"。
它不仅描述了地方党的头头脑脑们的腐败和农业政策的失败,它也叙述了农民是如何抗议高税费的。农民从中国的经济奇迹中几乎未有得益。只有流动的民工得到低工资,使中国成为全球化经济中充满吸引力的工厂。尽管作者们竭力把问题推给地方首脑们,而赞美北京中央政府,但"中国农民调查"仍然揭示了所谓共产主义体制的不公平。
为了不使农民问题具有当年那样的爆炸力,宣传部门於今年3月下令该书停止出版。此外,一名被描述成腐败人物的党的干部张西德以诬蔑罪起诉作者。张已经上升成为阜阳的政协副主席。


原书封面
 

《中国农民调查》获奖後,《南德意志报》介绍了这部书并评论说,陈吴二人一夜之间从功臣成了不受欢迎的人,直到今天,出版社哪怕提及这本书或作者的名字也受到禁止。陈吴二人虽然在书中尽力称赞中央政府的政策、把责任加在农村恶霸的头上,但是检察官们显然看到,这些精心调查的结果当然也对体制提出了批评。更为危险的是,这是一部文学作品,它能同时触动读者的心灵。书中描写了腐败如何得不到惩治、滥用权力的现象如何无时不在、披著共产党外衣的机构如何象黑社会以及一个接一个的改革如何销声匿迹。
现在,陈吴二人被送上了法庭。美国之音报导说,陈桂棣春桃夫妇已经在柏林出席了颁奖仪式并领取了奖金。他们希望得奖能有助於他们赢得这场官司。被告律师浦志强说,这个报告文学获奖,说明中国民间和国外的读者对此书反映的问题是否真实,对它的震撼力的感受,是一致的,"唯独中宣部和他们不一致"。

浦志强说,这样的报告文学,起到了新闻没有起到的作用,这几件"涉农大案",在任何时候,都是触目惊心的,应该在转瞬之间就通过现代传媒让所有读者都知道。不过,事实并非如此。

他说,"但是,因为中国新闻自由表达不够自由,以及在选题等各方面不够敏感,这样一个问题,只是过了若干年後,采取了报告文学的方式表现出来,这是报告文学有生命力的表现,但同时也是中国没有新闻自由的悲哀。"

陈桂棣 春桃 夫妇
 
 

 

   
  本书下载链接:
Word版下载 Exe版本下载

Posted by eygle at 9:12 AM | Comments (0)



CopyRight © 2004-2008 eygle.com, All rights reserved.