Some Thoughts on Oracle Portal Sizing and Monitoring

July 6th, 2006 by Mark Rittman

I’m going in to one of our higher education clients next week
to do a healthcheck of their Oracle Application Server 9.0.4
installation, prior to them going live with a new student records
system next September. One of my colleagues took a look at their
Discoverer deployment earlier in the summer, and this time they’re
looking for some validation around their use of Application Server in
general, and in particular their use of Oracle Portal through which
their new system is deployed. On the basis that it’s worth thinking
things through properly before you turn up on day one, I thought I’d
jot down a few ideas here to give things a bit of structure.

The customer has spent quite a bit of time and money on their
new system, and they need to make sure it works, and that
Oracle Portal can handle a large number of students logging in, viewing
their timetables and results, and enrolling on their student records
system. In particular, they want to review:

  1. Their proposed Oracle Application Server architecture, plus
    any supporting infrastructure databases
  2. Review the hardware sizing they’ve carried out, and if
    necessary make recommendations to change this
  3. Make recommendations around Portal and Infrastructure
    performance tuning, together with recommendations around Portal
    deployment and development, and
  4. Review their arrangements for resilience and recovery for
    both the application server and the database.

Doing a performance healthcheck of an Oracle Application
Server deployment can get quite interesting as there are potentially so
many moving parts for you to review and understand. Starting off from
the application itself, it’s usually something built in Java (JSPs,
servlets etc), PL/SQL, Forms or similar, and it’s going to interact
with a database just like any other application – it might scale, it
might not, it might have bottlenecks and so on. When you get to the
application server, in Oracle’s case you’ve got WebCache, OC4J,
mod_plsql, the various application server tiers, the single sign-on
server, the infrastructure database and so on. Then you get down to the
server level, where you can use commands such as vmstat, sar, top, iostat or
utilities such as Orca to monitor CPU load, memory usage or IO load over time.
You can take a more holistic view by using applications such as Grid Control to
present data from multiple tiers in a single view, but it’s very
difficult to monitor and diagnose performance problems across the whole
stack as not everything is instrumented and integrated.

In this case though, we can probably narrow down the
investigation to three main areas:

  1. Oracle Portal, i.e. have they sized their hardware
    sufficiently for the proposed load, and have they deployed it properly
  2. Is their overall Application Server deployment correct,
    does it load balance properly, cache properly, can it handle the load,
    and
  3. Have they deployed their databases correctly, i.e. are they
    resilient enough and can they cope with the load they’ll expect?

Now as this is only going to be a two day exercise, and part
of that will be set aside for the report write-up, we’re not going to
be able to go into too much detail, but if we ask the right questions
and know what to look for, it should be possible to give some initial
worthwhile recommendations fairly early on.

Starting off with their Oracle Portal deployment, it’s worth
taking some time out early on to define a few terms and set out a
scope. As a good starting point, the “Oracle Portal: Performance & Sizing”
page on OTN has a set of pretty useful papers and presentations on the
subject, and there’s a particularly good one by Jason Pepper titled “How
to Effectively Size Hardware for your Portal Implementation” that has a
good glossary and set of initial questions that would be a good way of
starting things off. In particular, some terms that are worth
understanding at the start are:

  • Concurrency – the ability to handle multiple requests at
    the same time. In our instance, the student records portal will need to
    handle many simultaneous users logging in, accessing reports, adding
    their details and so on (but how many at a time, and how does their
    activity break down?)
  • Contention – competition for resources, a familiar issue
    for database developers
  • Failover – a method of allowing one machine or process to
    provide a service if the original one fails
  • Latency – the time that one system component spends waiting
    for another
  • Response time, Service Time – needs little introduction
  • Scalability – the ability of a system to provide throughput
    in proportion to, and limited only by, hardware resources. In our case,
    will the system scale up to handle the expected load (one would hope
    so) – but also, if the load increases, can they just add more hardware
    or is there some limiting factor in the software they’ve deployed or
    developed?
  • Throughput – the number of requests – Portal pages, in our
    case – processed in a set amount of time.

This page on OTN looks in more detail at three of the metrics (peak page
throughput, page cache hit ratio and peak login rate) and shows how the
figures are calculated. See also “Sizing Frequently Asked Questions”.
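To make the page cache hit ratio a bit more concrete, here is a minimal sketch in Python. It assumes the textbook definition (requests answered from cache divided by total requests) and a simplified view in which every cache miss has to be built by the mid-tier; the OTN page may define the metric more precisely, and the function names and numbers are my own, purely for illustration.

```python
def cache_hit_ratio(cache_hits: int, total_requests: int) -> float:
    """Fraction of page requests answered from cache (assumed definition)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return cache_hits / total_requests


def uncached_throughput(pages_per_second: float, hit_ratio: float) -> float:
    """Pages per second the mid-tier still has to build itself.

    Simplifying assumption: a cache hit costs the mid-tier nothing.
    """
    return pages_per_second * (1.0 - hit_ratio)


# Hypothetical figures: 7,500 of 10,000 requests served from cache,
# at a peak throughput of 12 pages per second.
ratio = cache_hit_ratio(cache_hits=7_500, total_requests=10_000)
print(f"hit ratio: {ratio:.0%}")
print(f"uncached load: {uncached_throughput(12.0, ratio):.1f} pages/sec")
```

The point is simply that the same headline throughput can mean very different mid-tier loads depending on how cacheable the pages are.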

Clearly there’s a lot to think about. In Jason’s paper and in
the OTN article, three ways to go about sizing your Portal
implementation are put forward:

  1. Calculator, i.e. have some sort of a spreadsheet or
    ready-reckoner that produces figures for numbers of processors, amount
    of RAM and so on for a given set of requirements (concurrent users,
    page throughput, response time etc). There’s something along these
    lines available for Discoverer on OTN. The trouble with this approach,
    the author argues, is that it’s often the most inaccurate way
    of producing sizings.
  2. Size by Example. This is where you compare your proposed
    layout to some predefined examples, ideally customer case studies
    carried out in the real world. You pick the one nearest to what you’re
    planning and use that as your guideline. This is a bit more accurate
    than calculating using formulas.
  3. A Proof of Concept. This is where you build a
    representative version yourself, scale it up and measure it yourself.
    This is going to be the most accurate way of producing your sizings,
    but it’s also the most costly and takes the most time, as you actually
    have to build something.

Whichever route you go down, there are some intelligent
questions you can ask at the outset to get you off on the right track.
Going back to Jason’s presentation, these can be classified into three
areas: sizing scope, Portal requests/load, and Portal
content/complexity.

  1. Sizing Scope
  • Are we sizing for peak (the highest possible) load, or for
    average loads?
  • Is high availability needed? i.e. is it “mission critical”,
    or can we tolerate an outage of hours or days? Remember high availability will
    require redundant servers or oversized machines, and will increase the
    work required to run and maintain the servers.
  • Are there any particular maintenance issues, i.e. preferred
    / mandatory hardware, vendors, OS, storage etc? Are we re-using old
    hardware?
  2. Portal Requests / Load
  • What is the planned total number of users, and how many of
    them will be concurrent at any one time (max, average)?
  • What is the desired page throughput, i.e. given the
    estimated concurrent users above, how many pages will they request in
    say 10 minutes? Divide this down to pages per minute, or even per
    second.
  • What is the login frequency, say as a percentage of total
    pages requested per user? Logins are a special kind of page request, in
    that any login process will keep the portal machine from delivering
    pages, i.e. it’s a serialization event, just like a lock or a latch in
    the database.
  • What is the desired response time, for remote as well as
    local (intranet) users?
  3. Portal Content / Portal Complexity
  • Where does the portal content come from? Is it “internal”
    portlet content, say from database or Java portlets built using
    Oracle’s SDKs, or is it external content?
  • How cacheable is the portal content? What is their cache
    strategy?
  • What is the average content lifetime?
  • What is the planned integration technology?

More on these questions can be found in Jason Pepper’s
presentation and article.

At this point, we know the scope of the sizing exercise, any
desires and constraints, and the sort of workload the Portal setup will
need to be able to handle. Now we know what we’re aiming at, we can
start to look at the three sizing approaches – calculation, examples
and proof of concept.

The paper suggests a way of carrying out the calculation
approach through using a standard Portal example – Oracle’s
GlobalXChange knowledge management portal, which can be “generated”
using a set of scripts, and then using the benchmark figures as the
constants to feed into a series of calculations. The paper has the
calculations and some details on the benchmarking approach, but being
honest I can’t really see this happening with our client as they’re
looking for something a bit more immediate, more a validation of their
approach. Funnily enough, the third approach, to build a proof of
concept, is sort of possible, as the system is already there, and it
(presumably) runs – but they’re looking more for confirmation that what
they’ve got is “correct” and the numbers roughly check out OK.
Therefore, I’ll be using the second approach: find some examples of
similar systems, compare them to what they’ve got, and see whether
the figures the example systems turn out compare favourably to the
figures our client is producing.

To do this, I need three things. 

  1. A set of example Portal configurations, with figures to
    indicate the capacity that each could be expected to handle,
  2. Details on how they’ve set up their Portal environment
    (topology, use of hardware etc), and
  3. Some diagnostic and performance data from their setup to
    compare against the example configurations.

Looking back at the Performance and Sizing page on OTN, there are in
fact three sample implementations that can be used as reference points
when sizing your own Portal implementation. Looking through them, they
are:

  1. The small implementation, with 10000 registered users, 2.4
    concurrent requestors requesting one page each every 60 seconds,
    equating to 0.081 page requests per second. In this instance, the
    recommended topology is a single load-balancing router (A), two mid-tier
    Portal servers (B,C) with 2GB of RAM (either 2 CPU Intel boxes, or Sun
    V280Rs), and a single infrastructure tier (D) with the same number of
    CPUs but with 4GB of RAM.

    Portal Small Implementation

    This compares well with our implementation, except the number of
    concurrent users might be higher. We’ll have to ask the question and
    find out.

  2. The medium implementation. This has 2,500 users who may be logged
    in at any one time, of which around 15% are active (and therefore
    concurrent). These 375 concurrent requestors generate around 2 page
    requests per minute, giving an overall page throughput of around 12
    pages per second. This configuration works out to be a single
    load-balancing router (A), three Webcache servers (B) with 2 x 1.4GHz
    CPUs (assuming Intel), configured as a cluster (i.e. not using hardware
    load-balancing) and with 4GB of RAM. Then there are 2 Mid-tier servers
    (C), again clustered but this time with 6GB of RAM, and then two
    infrastructure servers (D), clustered using Red Hat Cluster Manager
    server for fail-over support (not load-balancing), with the two of them
    attached to a shared disk (E).

    Portal Medium Implementation

    This could be a viable solution if we need to handle more page requests
    per second, and if we’re looking for some resilience/fail-over.

  3. The large implementation. This has 300,000 registered users, 1%
    concurrent and a total page throughput of around 100 per second (259m
    page requests per month). I won’t list out the details of this solution
    as it’s well beyond what we’d need to cater for, except to say that it
    uses for example 8 webcache servers, separates out identity management
    and the portal repository on to separate servers, and is probably sized
    for something along the lines of OTN or a large company website.
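As a quick sanity check, the arithmetic behind the medium and large figures quoted above can be reproduced in a few lines of Python. This is only a sketch: it assumes a 30-day month for the monthly figure, and it covers just the two examples whose inputs are all stated above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Medium implementation: 2,500 logged-in users, 15% active,
# each active user making around 2 page requests per minute.
medium_concurrent = 2_500 * 15 // 100
medium_pages_per_sec = medium_concurrent * 2 / 60

# Large implementation: 300,000 registered users, 1% concurrent,
# around 100 pages per second (30-day month assumed).
large_concurrent = 300_000 * 1 // 100
large_pages_per_month = 100 * SECONDS_PER_DAY * 30

print(f"medium: {medium_concurrent} concurrent, "
      f"{medium_pages_per_sec:.1f} pages/sec")
print(f"large: {large_concurrent} concurrent, "
      f"{large_pages_per_month / 1e6:.1f}m pages/month")
```

The medium figure comes out at 12.5 pages per second, in line with the “around 12” quoted, and the large one at 259.2 million pages a month.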

It’s also worth reading through the preamble to each of the
examples, especially where it talks about factors such as “hits”, “page
requests”, “concurrency” and “high availability”. When you size a
client-server application it’s fairly easy to define things such as
concurrent users, but on a web application users tend to connect and
disconnect all the time, and make page requests on an irregular basis,
and therefore the examples define concurrency in terms of page requests
per second, which is the most accurate way of expressing how much
simultaneous work the server is going to have to do.

Now that we’ve got some sample implementations to use as “best
practice cases”, the next step is to get hold of the client’s proposed
topology and deployment, and compare this to the examples. One slight
issue is that, like most university customers, they’re using “big box” Sun
hardware to hold multiple application server tiers and database
instances, rather than the single-use Linux Intel commodity hardware
proposed in the examples, but it’s still possible to compare the logical
layouts and use of products between the client and example
implementations.

At this point, we know the following things:

  • What the client’s expectations are of their system
  • What Oracle propose as being best practice deployments for
    a set of example customers, and
  • What our customer is proposing as their application server
    deployment.

It should be fairly straightforward to compare the client
setup to the Oracle setup for a similarly sized organization, to see
whether what they’re proposing sounds realistic. What it’d be nice to
do now though, is to capture some performance metrics on the customer
setup if it’s already running, or point them in the direction of some
ways to do this, to see whether what they’ve set up is performing as
expected.
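The “size by example” comparison itself can be sketched as a simple lookup: take the client’s required page throughput and find the smallest Oracle example that covers it. The throughput figures below are the ones quoted for the three sample implementations; treating them as capacity thresholds is my own simplification, and the client figure in the usage line is hypothetical.

```python
from typing import Optional

# Page throughput (pages/sec) quoted for each sample implementation.
REFERENCE_EXAMPLES = [
    ("small", 0.081),
    ("medium", 12.0),
    ("large", 100.0),
]


def smallest_sufficient_example(required_pages_per_sec: float) -> Optional[str]:
    """Return the name of the smallest example whose quoted throughput
    covers the requirement, or None if even the largest falls short."""
    for name, throughput in sorted(REFERENCE_EXAMPLES, key=lambda e: e[1]):
        if required_pages_per_sec <= throughput:
            return name
    return None


# Hypothetical client requirement of 5 pages per second.
print(smallest_sufficient_example(5.0))
```

This only tells you which example to start reading; the topology, caching and fail-over details still need comparing by hand.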

Just like with the Oracle database, there are many ways to
capture performance and diagnostic data so that you can measure the
performance of your system. You can do it graphically, in real-time,
using Oracle Application Server Control, or you can look at data over
time using Grid Control. Portal itself has some Portlets that provide
information on usage, most popular portlet and so on, and you can
access the logs generated by the underlying OC4J and mod_plsql
processes. You can use Unix command line tools such as vmstat and iostat and
utilities such as Orca to track the performance of the underlying
servers, and if you’re looking for very low level trace data you can
use Application Server utilities such as dmstool and aggrespy to track
the performance of individual components.
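As an example of the server-level end of this, here is a rough Python sketch that averages the CPU idle column from captured vmstat output. It finds the “id” column from the header row rather than assuming a position, since the column layout differs between Linux and Solaris. The sample text follows the Linux layout with made-up numbers, and the function name is my own.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 803412  98752 912340    0    0    12     8  110  220  5  2 92  1  0
 3  0      0 801100  98760 912400    0    0     0    40  350  610 30 10 60  0  0
 2  0      0 800950  98760 912420    0    0     0    16  300  540 20 10 70  0  0
"""


def average_idle_percent(vmstat_text: str, skip_first: bool = True) -> float:
    """Average the 'id' (CPU idle %) column of vmstat output.

    The first sample is skipped by default because vmstat reports
    averages since boot on its first line.
    """
    rows = [line.split() for line in vmstat_text.strip().splitlines()]
    header = next((row for row in rows if "id" in row), None)
    if header is None:
        raise ValueError("no vmstat column header containing 'id' found")
    idle_col = header.index("id")
    # Keep only fully numeric rows with the same width as the header.
    samples = [
        int(row[idle_col])
        for row in rows
        if len(row) == len(header) and all(tok.isdigit() for tok in row)
    ]
    if skip_first and len(samples) > 1:
        samples = samples[1:]
    if not samples:
        raise ValueError("no numeric sample rows found")
    return sum(samples) / len(samples)


idle = average_idle_percent(SAMPLE)
print(f"average idle: {idle:.1f}%, busy: {100 - idle:.1f}%")
```

In practice you would capture the output of something like `vmstat 5` over a representative period and feed the saved text to the function.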

This presentation and paper by Mick Andrew and Jitiner Sethi
go through the facilities available in Oracle Portal and Oracle
Enterprise Manager (AS Control and Grid Control) for
monitoring and diagnostics. Every installation (tier) of Oracle
Application Server has an installation of Management Agent on it, and
this collects performance and diagnostic data which then either gets
fed to the local AS Control application or uploaded to a node running
Grid Control. One key difference between AS Control (free with
Application Server) and Grid Control (an extra license cost) is that AS
Control only displays real-time, “as of now” performance data for one
particular node, whilst Grid Control shows real-time data and
historical data for all nodes that it is managing.

The basic Enterprise Manager portal metrics can be accessed
using AS Control and look like this:

ASControl Portal Metrics

This page provides some general metrics such as status and
portal performance, plus information on the status and version of the
Portal Repository. It’s also useful for viewing the status of Portal
components such as Providers, Syndication Service and Ultrasearch. Using
Grid Control, you can show metrics collected over a period of time, and
use these metrics to monitor historical trends. I think it’s unlikely
our client will have licensed Grid Control though so I won’t spend too
much time on this.

For more detailed reports on Portal usage, you can run
a few post-configuration steps, start loading the logs
generated by Oracle Portal into the database, and then run
something provided by Oracle called the “Portal Performance Reports”,
a set of text-file reports that provide information on:

  • What is the peak login time per day?
  • How many logins per day does the portal receive?
  • How long have portlets been taking to execute?
  • What is the slowest portlet?
  • How many total hits does the portal receive each day?
  • Most/Least popular portlets
  • How often are users viewing a page or portlet?
  • How many unique users have logged in each day?
  • Which portlets were called?
  • How many hits does each page receive each day?
  • How many hits does each portlet receive each day?
  • Request breakdown by IP address or host name

These replace the reporting and monitoring portlets that you
used to get under the “Monitor” tab with Portal, but that are now
obsolete with the advent of WebCache. Using the post-configuration
steps linked to previously, you can set up logging such that the whole
process is automatic and the reports are generated for you on a daily
basis. For more details on how you can monitor Portal using inbuilt
functionality, check out “Monitoring and Administering Oracle Portal”
in the online documentation.

If you’re feeling particularly adventurous, “Oracle Application Server
Tuning Techniques” by John Garmany and Don
Burleson (ODTUG membership required) looks at some database and
application server parameters that are useful when tuning the
infrastructure database, and examines the use of utilities such as the
Dynamic Monitoring Service (dmstool) and Aggrespy, a Java Servlet that
can be used to display metrics for many Application Server 10g
processes. My instinct is to leave the infrastructure database
parameters as they are for the time being – the Oracle example
implementations just used the default settings – and if I’m looking to
suggest any database tuning at all, it’ll be on the database used to
hold the student record application data. It’s useful to know this
facility is out there though, especially if Forms is used at all as
there’s a fairly comprehensive section in the paper about interpreting
the Forms Servlet logs.

So there we go then. I think I’ve managed to cover off all of
the requirements – review proposed setup, compare to “best practices”,
propose a way of monitoring and benchmarking the Portal installation,
and review their arrangements around resilience and high availability.
