Oh No! My BI Server has Core Dumped!

As well as supporting client's development teams, we often come in and help support teams in client organizations who are taking on Oracle BI and EPM Suite. One such client I was working with this week wanted me to help them with an OBIEE diagnosis checklist, where their staff could follow a series of steps to find out why their OBIEE environment was no longer available. In this case, they'd followed my instructions on working out which OBIEE component (or set of components) was down, but now they wanted a series of guidelines on why they were down.

To take an example, it's become clear that for whatever reason, the BI Server has gone down and people can no longer log in. The same may happen with the Presentation Server, which usually has the effect of making the J2EE server hosting the Presentation Services Plugin report an error message rather than bring up the dashboard login. Now in most cases you can restart the service or rerun the startup scripts under Linux/Unix, but how do you go about working out why they went down and try to fix things?

First off, unless you've done something obvious or you've got hardware problems, for example, if your BI Server or BI Presentation Server are crashing it's usually time to raise an SR and call in Oracle Support. If you look through metalink3 you'll see a fair few references to the BI Server and Presentation Server crashing, and they are usually due to bugs in the software that are then fixed in subsequent releases of OBIEE. There may well be workarounds or even back-ported fixes, but in most cases you'll need Oracle Support to first diagnose the problem and then either provide an interim patch or tell you to upgrade to the latest version (10.1.3.4.1 at the time of writing this article). If you're on an earlier release, such as one of the 10.1.3.3.x releases there are a whole bunch of issues you can hit over such things as scripts in connection pools, pivot tables and nested composite views that are fixed in 10.1.3.4.1, and even with 10.1.3.4 you find Oracle notes on metalink3 saying that time series functions, for example, are buggy in this release and only properly work in 10.1.3.4.1 (which is why time dimensions only appeared in the BI Apps 7.9.6, for example).

So how do you know if it's time to call Oracle support? Well on the principle of keeping it simple at the start, the first place I'd look for diagnostic information are the logs generated by the applications. Assuming you are on Linux, I'd take a look at the $ORACLEBI/server/log:

[oracle@ora11g Log]$ tail -n20 /u01/app/oracle/product/OracleBI/server/Log/NQServer.log

2009-08-06 22:29:42
     [58001] Query Cache Statistics -- Hits:0(0%), Qualified Misses:1(100%), Unqualified Misses:0(0%).
2009-08-06 22:29:42
     [14058] Unloaded all subject areas.
2009-08-07 07:24:34
     [36007] Loading repository /u01/app/oracle/product/OracleBI/server/Repository/samplesales.rpd.
2009-08-07 07:24:34
     [14055] Loading subject area: Sample Sales ...
2009-08-07 07:24:35
     [14056] Finished loading subject area: Sample Sales.
2009-08-07 07:24:35
     [58002] Query Cache loaded with 1 entries from saved cache files.
2009-08-07 07:24:35
     [43030] :  Oracle BI Server started.  Version: 10.1.3.4.1.090414.1900.
2009-08-07 07:44:09
     [43031] :  Oracle BI Server shutdown.
2009-08-07 07:44:09
     [58001] Query Cache Statistics -- Hits:0(0%), Qualified Misses:0(0%), Unqualified Misses:0(0%).
2009-08-07 07:44:09
     [14058] Unloaded all subject areas.
OK, so looking at this log file shows that the BI Server closed down cleanly, which means that something (actually me, in this case) shut it down gracefully. This doesn't tell us why, but it's at least a clue.

Under Linux and Unix each of the OBIEE server processes has an ".out.log" file that provides console messages from the application. If we take a look at this we can see the closedown again.

[oracle@ora11g Log]$ tail -n20 /u01/app/oracle/product/OracleBI/server/Log/nqsserver.out.log
Oracle BI Server starting...
rm: cannot remove `/u01/app/oracle/product/OracleBIData/tmp/nQS_*.TMP': No such file or directory
nqsserver:      Oracle BI Server started.  Version: 10.1.3.4.1.090414.1900.
nqsserver:      Oracle BI Server shutdown.
Oracle BI Server starting...
nqsserver:      Oracle BI Server started.  Version: 10.1.3.4.1.090414.1900.
nqsserver:      Oracle BI Server shutdown.
So this tells us two things. Firstly, the shutdown was graceful and also that it happened at 07.44am - if we have intermittent shutdowns the first log file will be useful in recording when they happen.

It's a similar story for the BI Presentation Server as well.

[oracle@ora11g Log]$ tail -n20 /u01/app/oracle/product/OracleBIData/web/log/sawlog0.log
Type: Information
Severity: 30
Time: Fri Aug  7 07:24:49 2009
File: project/sawserver/sawserver.cpp Line: 459
Properties: ThreadID-3086690528
Location:
        saw.sawserver

Oracle BI Presentation Services have started successfully.
---------------------------------------
Type: Information
Severity: 30
Time: Fri Aug  7 07:51:31 2009
File: project/sawserver/sawserver.cpp Line: 604
Properties: ThreadID-3086690528
Location:
        saw.sawserver

Oracle BI Presentation Services are shutting down.
---------------------------------------
So again, the Presentation Server has shut down, cleanly and with no error message in the file (again, because I shut it down just before). If there are errors with the Presentation Server you'll often see messages in here, and you can in fact turn up the logging level in this file to see more detail (though typically you'd only really do this on the request of Oracle support).

So far then, we know that the BI Server and Presentation Server have gone down, cleanly, but we don't know why. At this point I'd check that there isn't something in my internal processes that brings these down, and I'd check my product versions against known bugs listed on metalink3 and upgrade to 10.1.3.4.1 if I'm not on that release already.

But what about more sudden crashes, which typically don't leave any entries in the log files as the BI Server doesn't have time to write them? Let's simulate this with a "kill -6" on the nqsserver process:

[oracle@ora11g setup]$ ps -ef | grep nqsserver
oracle    7578     1  1 07:59 pts/2    00:00:01 /u01/app/oracle/product/OracleBI/server/Bin/nqsserver -quiet
oracle    7803  5928  0 08:00 pts/2    00:00:00 grep nqsserver

[oracle@ora11g setup]$ kill -6 7578
First of all I'd do another "ps -ef" to list out the processes and check that in this case, the BI Server has gone down.
[oracle@ora11g setup]$ ps -ef | grep nqsserver
oracle    7809  5928  0 08:00 pts/2    00:00:00 grep nqsserver
Indeed it has. So do the log files show anything?
[oracle@ora11g setup]$ tail -n20 /u01/app/oracle/product/OracleBI/server/Log/nqsserver.out.log
Oracle BI Server starting...
rm: cannot remove `/u01/app/oracle/product/OracleBIData/tmp/nQS_*.TMP': No such file or directory
nqsserver:      Oracle BI Server started.  Version: 10.1.3.4.1.090414.1900.
nqsserver:      Oracle BI Server shutdown.
Oracle BI Server starting...
nqsserver:      Oracle BI Server started.  Version: 10.1.3.4.1.090414.1900.
nqsserver:      Oracle BI Server shutdown.
Oracle BI Server starting...
rm: cannot remove `/u01/app/oracle/product/OracleBIData/tmp/nQS_*.TMP': No such file or directory
nqsserver:      Oracle BI Server started.  Version: 10.1.3.4.1.090414.1900.
No, as far as the BI Server log file is concerned, the server is still up. What about the alert log for the OS?
[root@ora11g setup]# grep nqsserver /var/log/messages
[root@ora11g setup]#
Nothing. Quite often you will find useful information here, but in this case we've just got the mystery "BI Server has crashed and left no trace" error. This is, however, were core dump files come in useful. Now most Linux setups I've worked with don't have core dumps enabled by default, and you can check this by running "ulimit -a" on your system.
[root@ora11g setup]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 32768
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
max rt priority                 (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
What this is saying is that the maximum core dump ("core file size") is 0, which means they won't be produced. To enable these, I need to edit the /etc/security/limits.conf file and ideally set the value to "unlimited". You can also enable core dumps in your session by running the ulimit command directly, like this
[oracle@ora11g ~]$ ulimit -c unlimited
Then I find the PID for the nqsserver process, and then kill it with an invalid memory reference signal to create a core dump.
[oracle@ora11g ~]$ ps -ef | grep nqsserver

oracle    6275     1 18 09:29 pts/2    00:00:00 /u01/app/oracle/product/OracleBI/server/Bin/nqsserver -quiet
oracle    6391  4591  0 09:29 pts/2    00:00:00 grep nqsserver

[oracle@ora11g ~]$ kill -SIGSEGV 6275
[oracle@ora11g ~]$ ps -ef | grep nqsserver

oracle    6411  4591  0 09:30 pts/2    00:00:00 grep nqsserver

[oracle@ora11g ~]$ ls /u01/app/oracle/product/OracleBI/server/Log
core.6275  NQQuery.log  NQServer.log  nqsserver.out.log
And there ("core.6275") is our core dump. Whilst processes generally core dump in the directory that the process is running in, the Oracle BI Server core dumps into the $ORACLEBI/server/Log directory instead.

So, what now? Well generally if you are getting core dumps, as I said it's time to call Oracle support and get them to take a look. When you raise the SR though, what they'll generally get you to do is to analyze the core dump file with the "gdb" utility, which produces diagnostic data that can be useful to product engineering. You can do this yourself if you already have gdb installed, and take a look at what caused previous core dumps.

[oracle@ora11g ~]$ gdb

GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "".
(gdb) core /u01/app/oracle/product/OracleBI/server/Log/core.7225
(no debugging symbols found)
Using host libthread_db library "/lib/libthread_db.so.1".
Core was generated by `/u01/app/oracle/product/OracleBI/server/Bin/nqsserver -quiet'.
Program terminated with signal 11, Segmentation fault.
#0  0x00572402 in __kernel_vsyscall ()

(gdb) quit
[oracle@ora11g ~]$
So as you would expect seeing that I caused the segmentation fault, this is what is listed in the core dump extract. Sometimes this file can be useful even to yourself, as it can reference actions (writing to tables, for example) that you can trace back to something you've just added to the RPD, but usually it's something that Oracle support will use to work out if you've hit a bug.