The backbone monitoring solution for Oracle databases in my environment is dbalarm.sh script. It’s a kind of “deploy and forget” script where it can monitor lots of database system components in one go.
The script can report/monitor Database, ASM, Clusterware, Golden Gate and OS main events, including:
– Database Monitoring:
– Monitor the DB instance ALERT LOG file and report ORA and TNS errors.
– Monitor TABLESPACES utilization. – Monitor FLASH RECOVERY AREA (FRA) utilization.
– Monitor ASM Disk Groups utilization.
– Monitor BLOCKING LOCKS.
– Monitor database named SERVICES. [Service names to be provided to SERVICEMON variable]
– Monitor RMAN Backup Failure.
– Monitor the database if it goes OFFLINE or gets HANGED.
When the Paranoid mode is set to ON:
– Startup/Shutdown events of the DB instances will be reported.
– ALTER SYSTEM commands will be reported.
– ALTER DATABASE commands will be reported.
– EXPORT/IMPORT operations will be reported.
– Listener Monitoring:
– Monitor the LISTENERS’ LOG file and report TNS errors which plays a crucial rule in reporting connectivity security breaches, applications misconfiguration and network failures.
– Grid Infrastructure Monitoring:
– Monitor the ASM alert log for ORA and TNS errors.
– Monitor the Grid Infrastructure alert log for ORA and TNS errors.
– Monitor the Grid Infrastructure alert log for Shutdown/Startup events.
– Monitor the Grid Infrastructure alert log for Node eviction events.
– Monitor the Grid Infrastructure alert log for Network IP conflict.
– Monitor the Grid Infrastructure alert log for Heart Beat failures.
– Monitor the Grid Infrastructure alert log for service failure events.
– OS/Hardware Staff:
– Monitor the CPU for high utilization.
– Monitor the Filesystems / Mount Points space utilization.
– Monitor dmesg log for new entries which let you know what is going on on the OS side.
– Golden Gate Monitoring:
– Monitor the Golden Gate log for Errors and Process ABENDED events.
How it works?
The script is coded to send only new errors that are triggered since the last execution of the script, once the error gets reported once, it won’t be reported again unless it appears again in the alert log file.
Step 1:
EMAIL=”youremail@yourcompany.com“
Note: sendmail service should be configured on your server to allow emails to be sent out from the DB machine.
By Oracle user:
# crontab -e
*/5 * * * * /home/oracle/dbalarm.sh
Note: /home/oracle/dbalarm.sh is the full path that points to dbalarm script where /home/oracle is the Oracle user home directory.
In case you will schedule this script to run from root user’s crontab: [Not recommended]
# crontab -e
*/5 * * * * su - oracle -c /home/oracle/dbalarm.sh
Now the only thing left here is to set back and relax and the script will report you all errors and all the breached predefined threshold.
One thing more, you can adjust the threshold inside the script as per your preferences by altering the below red colored values in THRESHOLDS section inside the script:
# #########################
# THRESHOLDS:
# #########################
# Modify the THRESHOLDS to the value you prefer:
HTMLENABLE=Y
# Enable HTML Email Format[DB]
FSTHRESHOLD=95 # THRESHOLD FOR FILESYSTEM %USED [OS]
CPUTHRESHOLD=95 # THRESHOLD FOR CPU %UTILIZATION [OS]
TBSTHRESHOLD=95 # THRESHOLD FOR TABLESPACE %USED [DB]
FRATHRESHOLD=95 # THRESHOLD FOR FRA %USED [DB]
FSITHRESHOLD=95 # THRESHOLD FOR FILESYSTEM INODES %USED [OS]
ASMTHRESHOLD=95 # THRESHOLD FOR ASM DISK GROUPS [DB]
BLOCKTHRESHOLD=1 # THRESHOLD FOR BLOCKED SESSIONS count [DB]
WAIT_FOR_LOCK_THRES=1 # THRESHOLD FOR REPORTING BLOCKED SESSIONS BLOCKED FOR MORE THAN N MINUTES
LAST_MIN_BKP_CHK=5 # REPORT RMAN Backups FAILURE in the last N MINUTES. Should be the same interval as the execution interval of dbalarm script in crontab.
CHKRMANBKP=Y # Enable/Disable Checking of RMAN Backup FAILURE. [Default Enabled]
CHKLISTENER=Y # Enable/Disable Checking Listeners: [Default Enabled]
CHKOFFLINEDB=Y # Enable/Disable Database Down Alert: [Default Enabled]
CHKGOLDENGATE=Y # Enable/Disable Goldengate Alert: [Default Enabled]
CPUDIGMORE=Y # Break down to DB Active sessions when CPU hit the threshold: [RECOMMENDED TO SET =N on VERY BUSY environments]
TIMEOUTDIGMORE=Y # Enable/Disable the display of Network Errors when TIMEOUT error gets detected. [Default Enabled]
PARANOIDMODE=N # Paranoid mode will report more events like export/import, instance shutdown/startup. [Default Disabled]
CHKASMALERTLOG=Y # Enable/Disable Monitoring ASM instance ALERTLOG. [Default Enabled]
CHKCLSALERTLOG=Y # Enable/Disable Monitoring GRID INFRASTRUCTURE ALERTLOG. [Default Enabled]
DEVICEDRIVERLOG=Y # Enable/Disable Check “dmesg” Device Driver log for errors. [Default Enabled]
You can control script features like checking listener/offline databases/Golden Gate alert/show DB active sessions when CPU is high/monitor specific services by using Y to enable or N to disable it.
You can enable the script to do more checks like (reporting instance startup/shutdown, export/import operations, alter system/database commands and other major DB activities) when activating the Paranoid mode by setting its threshold PARANOIDMODE=Y
In addition, the script gives you the option to exclude specific database, tablespace, ASM Diskgroup, filesystm from having the script run against. You can do so by editing the following parameters:
# #######################################
# Excluded INSTANCES:
# #######################################
# Here you can mention the instances dbalarm will IGNORE and will NOT run against:
# Use pipe “|” as a separator between each instance name.
# e.g. Excluding: -MGMTDB, ASM instances:
EXL_DB=”-MGMTDB|ASM|APX” #Excluded INSTANCES [Will not get reported offline].
# #########################
# Excluded TABLESPACES:
# #########################
# Here you can exclude one or more tablespace if you don’t want to be alerted when they hit the threshold:
# e.g. to exclude “UNDOTBS1” modify the following variable in this fashion without removing “donotremove” value:
# EXL_TBS=”donotremove|UNDOTBS1“
EXL_TBS=”donotremove”
# #########################
# Excluded ASM Diskgroups:
# #########################
# Here you can exclude one or more ASM Disk Groups if you don’t want to be alerted when they hit the threshold:
# e.g. to exclude “FRA” DISKGROUP modify the following variable in this fashion without removing “donotremove” value:
# EXL_DISK_GROUP=”donotremove|FRA“
EXL_DISK_GROUP=”donotremove”
Moreover, you can exclude specific ORA- TNS- errors in case you are OK with ignoring them, I’ve already excluded some minor frequent incoming errors to not get bothered about them:
# #########################
# Excluded ERRORS:
# #########################
# Here you can exclude the errors that you don’t want to be alerted when they appear in the logs:
# Use pipe “|” between each error.
EXL_DB_ALERT_ERR=”ORA-2396|TNS-00507|TNS-12502|TNS-12560|TNS-12537|TNS-00505” #Excluded ALERTLOG ERRORS [Will not get reported].
EXL_LSNR_ERR=”TNS-00507|TNS-12502|TNS-12560|TNS-12537|TNS-00505” #Excluded LISTENER ERRORS [Will not get reported].
EXL_GRID_ALERT_ERR=”donotremove” #Excluded GRID INFRA ERRORS [Will not get reported].
EXL_GG_ERR=”donotremove” #Excluded GoldenGate ERRORS [Will not get reported].
EXL_DMESG_ERR=”donotremove|scsmd” #Excluded OS DEVICE DRIVERS ERRORS [Will not get reported].
DISCLAIMER: THIS SCRIPT IS DISTRIBUTED IN THE HOPE THAT IT WILL BE USEFUL BUT WITHOUT ANY WARRANTY. IT IS PROVIDED “AS IS”.
If you’re looking for a script to check the database health check on a daily basis, please follow this link:
http://dba-tips.blogspot.ae/2015/05/oracle-database-health-check-script.html
More and more smart and “easy to use” scripts for database administration tasks can be found in the DBA Bundle:
http://dba-tips.blogspot.ae/2014/02/oracle-database-administration-scripts.html
Finally, If you have an opinion/comment/suggestion please share it with me, your feedback is highly appreciated.
Hi GIThub,
I think this script DBAALARM is for Linux only.Is it compatible with AIX also?
Hari
Excellent work. This is working fine
Thanks,
Balajee
Hi, I'm using your script. But ORA errors that get through the mail can not read.For example: ORA-01110: ТЮИК ДЮММШУ 228.
Hi Hari,
Actually I didn't test this script on an AIX environment, so I cannot give you an answer, but I've a doubt that all the functions inside the script will work there.
This comment has been removed by the author.
Is it possible to sent alert for RMAN failure as below :
STATUS IN('RUNNING WITH WARNINGS','RUNNING WITH ERRORS','COMPLETED WITH WARNINGS','COMPLETED WITH ERRORS','FAILED')
Hi Rahul,
Thanks for your suggestion. Today I'll be publishing a new release of the script including your suggestion.
15-10-18 Added RMAN Backup Failure Check. [Recommended by Rahul Malode]
Also, created a new post explaining this feature inside dbalarm script
https://dba-tips.blogspot.com/2018/10/monitoring-of-failed-rman-backup-jobs.html
Thanks for sharing this script Adel. Excellent work
hello,
service monitoring feature looks not working with me, in the script I've updated SERVICEMON parameter to:
SERVICEMON="SUPPORT_SERVICE" but whenever SUPPORT_SERVICE goes down I don't get an alert.
Hi Zaid,
Please enclose the service name between a single quote along with the double quote, then it will work.
e.g.
SERVICEMON="'SUPPORT_SERVICE'"
The last version which send the alerts in html is not working on my servers. no errors but I'm not receiving the emails, any idea?
I doubt that your SMTP server is blocking the Emails that include HTML content. Please check first with your Network team, if that was the case then you can revert back to the TEXT format by setting the following option "HTMLENABLE" under THRESHOLD section to "N" like this:
# #########################
# THRESHOLDS:
# #########################
# Modify the THRESHOLDS to the value you prefer:
HTMLENABLE=N # Enable HTML Email Format [DB]
….
Hi,
I am getting some error on below line. Database is running on Red Hat Enterprise Linux Server release 7.6 (Maipo). Appreciate your help in this.
# Getting Utilized CPU (100-%IDLE):
CPU_UTL_FLOAT=`echo "scale=2; 100-($CPU_IDLE)"|bc`
dbalerm.sh: line 334: bc: command not found
CPU CHECK Completed.
Hi Abdul,
Inside the script please replace line 334:
CPU_UTL_FLOAT=`echo "scale=2; 100-($CPU_IDLE)"|bc`
with this line:
CPU_UTL_FLOAT=`awk "BEGIN {print 100-($CPU_IDLE)}"`
The script hang in this stage when running it for the first time:
[Grid Infrastructure Setup Detected] Locating Grid Infrastructure ALERTLOG …
Checking GRID INFRASTRUCTURE ALERTLOG …
What is the problem?
Just cancel the current execution by pressing Ctrl+c
And execute the script again, you shouldn't face this issue again.
Hi. I am unable to reach to dropbox link. Could you please share your script to vn6498@gmail.com. Appreciate your prompt help.
Able to copy the script, thank you. However, facing many errors as it is not picking up the email id – line 70 , not picking the environment from .bash_profile (FYI – i am running on 12.2.1) etc. Is there anyway to discuss one to one with you.
Sure, you can Email me on mahmmoudadel@hotmail.com