Some Thoughts on Oracle Portal Sizing and Monitoring

I'm going in to one of our higher education clients next week to do a healthcheck of their Oracle Application Server 9.0.4 installation, prior to them going live with a new student records system next September. One of my colleagues took a look at their Discoverer deployment earlier in the summer, and this time they're looking for some validation around their use of Application Server in general, and in particular their use of Oracle Portal, through which their new system is deployed. On the basis that it's worth thinking things through properly before you turn up on day one, I thought I'd jot down a few ideas here to give things a bit of structure.

The customer has spent quite a bit of time and money on their new system, and they need to make sure it works, and that Oracle Portal can handle a large number of students logging in, viewing their timetables and results, and enrolling through the new student records system. In particular, they want us to:

  1. Review their proposed Oracle Application Server architecture, plus any supporting infrastructure databases
  2. Review the hardware sizing they've carried out and, if necessary, make recommendations to change it
  3. Make recommendations around Portal and Infrastructure performance tuning, together with recommendations around Portal deployment and development, and
  4. Review their arrangements for resilience and recovery for both the application server and the database.

Doing a performance healthcheck of an Oracle Application Server deployment can get quite interesting as there are potentially so many moving parts for you to review and understand. Starting with the application itself: it's usually something built in Java (JSPs, servlets etc), PL/SQL, Forms or similar, and it's going to interact with a database just like any other application - it might scale, it might not, it might have bottlenecks and so on. When you get to the application server, in Oracle's case you've got WebCache, OC4J, mod_plsql, the various application server tiers, the single sign-on server, the infrastructure database and so on. Then you get down to the server level, where you can use commands such as vmstat, sar, top and iostat, or utilities such as Orca, to monitor CPU load, memory usage or IO load over time. You can take a more holistic view by using applications such as Grid Control to present data from multiple tiers in a single view, but it's very difficult to monitor and diagnose performance problems across the whole stack as not everything is instrumented and integrated.

In this case though, we can probably narrow down the investigation to three main areas:

  1. Oracle Portal, i.e. have they sized their hardware sufficiently for the proposed load, and have they deployed it properly
  2. Is their overall Application Server deployment correct, does it load balance properly, cache properly, can it handle the load, and
  3. Have they deployed their databases correctly, i.e. are they resilient enough, and can they cope with the load they're expecting?

Now as this is only going to be a two-day exercise, and part of that will be set aside for the report write-up, we're not going to be able to go into too much detail, but if we ask the right questions and know what to look for, it should be possible to give some worthwhile initial recommendations fairly early on.

Starting off with their Oracle Portal deployment, it's worth taking some time out early on to define a few terms and set out a scope. As a good starting point, the Oracle Portal: Performance & Sizing page on OTN has a set of pretty useful papers and presentations on the subject, and there's a particularly good one by Jason Pepper titled "How to Effectively Size Hardware for your Portal Implementation" that has a useful glossary and a set of initial questions that make a sensible way of starting things off. In particular, some terms that are worth understanding at the start are:

  • Concurrency - the ability to handle multiple requests at the same time. In our instance, the student records portal will need to handle many simultaneous users logging in, accessing reports, adding their details and so on (but how many at a time, and how does their activity break down?)
  • Contention - competition for resources, a familiar issue for database developers
  • Failover - a method of allowing one machine or process to provide a service if the original one fails
  • Latency - the time that one system component spends waiting for another
  • Response time, Service Time - need little introduction
  • Scalability - the ability of a system to provide throughput in proportion to, and limited only by, hardware resources. In our case, will the system scale up to handle the expected load (one would hope so) - but also, if the load increases, can they just add more hardware or is there some limiting factor in the software they've deployed or developed?
  • Throughput - the number of requests - Portal pages, in our case - processed in a set amount of time.

This page on OTN looks in more detail at three of the metrics - peak page throughput, page cache hit ratio and peak login rate - and shows how the figures are calculated. See also "Sizing Frequently Asked Questions".
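
As a rough illustration of the sort of arithmetic involved (this is my own sketch rather than the OTN formulas themselves, and the input figures below are made up for the purpose of the example):

```python
# Rough sketch of the three sizing metrics discussed above, using made-up figures.
# The OTN page has the definitive formulas; this just illustrates the arithmetic.

total_page_requests = 45_000    # pages served during the busiest hour (assumed figure)
cached_page_requests = 36_000   # of those, pages served straight from WebCache (assumed)
logins = 1_800                  # SSO logins during the same hour (assumed)

peak_page_throughput = total_page_requests / 3600             # pages per second
page_cache_hit_ratio = cached_page_requests / total_page_requests
peak_login_rate = logins / 3600                                # logins per second

print(f"Peak page throughput: {peak_page_throughput:.1f} pages/sec")
print(f"Page cache hit ratio: {page_cache_hit_ratio:.0%}")
print(f"Peak login rate:      {peak_login_rate:.2f} logins/sec")
```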

Clearly there's a lot to think about. In Jason's paper and in the OTN article, three ways to go about sizing your Portal implementation are put forward:

  1. Calculator, i.e. have some sort of a spreadsheet or ready-reckoner that produces figures for numbers of processors, amount of RAM and so on for a given set of requirements (concurrent users, page throughput, response time etc). There's something along these lines available for Discoverer on OTN. The trouble with this approach, the author argues, however, is that it's often the most inaccurate way of producing sizings.
  2. Size by Example. This is where you compare your proposed layout to some predefined examples, ideally customer case studies carried out in the real world. You pick the one nearest to what you're planning and use that as your guideline. This is a bit more accurate than calculating using formulas.
  3. A Proof of Concept. This is where you build a representative version yourself, scale it up and measure it yourself. This is going to be the most accurate way of producing your sizings, but it's also the most costly and takes the most time, as you actually have to build something.

Whichever route you go down, there are some intelligent questions you can ask at the outset to get you off on the right track. Going back to Jason's presentation, these can be classified into three areas: sizing scope, Portal requests/load, and Portal content/complexity.

  1. Sizing Scope
  • Are we sizing for peak (the highest possible) load, or for average loads?
  • Is high availability needed? i.e. is it "mission critical", or can we tolerate an outage of hours or days? Remember high availability will require redundant servers or oversized machines, and will increase the work required to run and maintain the servers.
  • Are there any particular maintenance issues, i.e. preferred / mandatory hardware, vendors, OS, storage etc? Are we re-using old hardware?
  2. Portal Requests / Load
  • What is the planned total number of users, and how many of them will be concurrent at any one time (max, average)?
  • What is the desired page throughput, i.e. given the estimated concurrent users above, how many pages will they request in say 10 minutes? Divide this down to pages per minute, or even per second.
  • What is the login frequency, say as a percentage of total pages requested per user? Logins are a special kind of page request, in that the login process keeps the portal machine from delivering pages, i.e. it's a serialization event, just like a lock or a latch in the database (there's a quick back-of-the-envelope check on this after this list).
  • What is the desired response time, for remote as well as local (intranet) users?
  3. Portal Content / Portal Complexity
  • Where does the portal content come from? Is it "internal" portlet content, say from database or Java portlets built using Oracle's SDKs, or is it external content?
  • How cacheable is the portal content? What is their cache strategy?
  • What is the average content lifetime?
  • What is the planned integration technology?
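
To make that point about logins being a serialization event a bit more concrete, here's a quick back-of-the-envelope check. All of the figures are assumptions for the sake of illustration - the login service time in particular would need to be measured on their kit - but it shows the kind of sanity check worth doing once the load questions above have been answered.

```python
# Back-of-the-envelope check on whether logins could become a bottleneck.
# All figures are illustrative assumptions, not measurements.

page_requests_per_sec = 12.0     # expected peak page throughput (assumed)
login_pct_of_requests = 0.05     # 5% of page requests involve a login (assumed)
login_service_time_sec = 0.4     # serialized work per SSO login (assumed)

login_rate = page_requests_per_sec * login_pct_of_requests     # logins per second
utilisation = login_rate * login_service_time_sec              # fraction of time spent on logins

print(f"Login rate:             {login_rate:.2f} logins/sec")
print(f"Login path utilisation: {utilisation:.0%}")
# Anything approaching 100% means logins start to queue, much like a busy latch.
```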

More on these questions can be found in Jason Pepper's presentation and article.

At this point, we know the scope of the sizing exercise, any desires and constraints, and the sort of workload the Portal setup will need to be able to handle. Now we know what we're aiming at, we can start to look at the three sizing approaches - calculation, examples and proof of concept.

The paper suggests a way of carrying out the calculation approach through using a standard Portal example - Oracle's GlobalXChange knowledge management portal, which can be "generated" using a set of scripts - and then using the benchmark figures as the constants to feed into a series of calculations. The paper has the calculations and some details on the benchmarking approach, but being honest I can't really see this happening with our client as they're looking for something a bit more immediate, more a validation of their approach. Funnily enough, the third approach, to build a proof of concept, is sort of possible, as the system is already there, and it (presumably) runs - but they're more looking for confirmation that what they've got is "correct" and that the numbers roughly check out OK. Therefore, I'll be using the second approach: find some examples of similar systems, compare them to what they've got, and see whether the figures the example systems turn in compare favourably with the figures our client is producing.

To do this, I need three things.

  1. A set of example Portal configurations, with figures to indicate the capacity that each could be expected to handle,
  2. Details on how they've set up their Portal environment (topology, use of hardware etc), and
  3. Some diagnostic and performance data from their setup to compare against the example configurations.

Looking back at the Performance and Sizing page on OTN, there are in fact three sample implementations that can be used as reference points when sizing your own Portal deployment. In outline, they are:

  1. The small implementation, with 10,000 registered users and 2.4 concurrent requestors requesting one page each every 60 seconds, equating to 0.081 page requests per second. In this instance, the recommended topology is a single load-balancing router (A), two mid-tier Portal servers (B,C) with 2GB of RAM (either 2-CPU Intel boxes or Sun V280Rs), and a single infrastructure tier (D) with the same number of CPUs but with 4GB of RAM.
    Portal Small Implementation

    This compares well with our implementation, except that the number of concurrent users might be higher. We'll have to ask the question and find out.

  2. The medium implementation. This has 2,500 users who may be logged in at any one time, of which around 15% are active (and therefore concurrent). These 375 concurrent requestors generate around 2 page requests per minute each, giving an overall page throughput of around 12 pages per second (the arithmetic behind these figures is sketched out just after this list). This configuration works out as a single load-balancing router (A), three WebCache servers (B) with 2 x 1.4GHz CPUs (assuming Intel), configured as a cluster (i.e. not using hardware load-balancing) and with 4GB of RAM. Then there are two mid-tier servers (C), again clustered but this time with 6GB of RAM, and then two infrastructure servers (D), clustered using Red Hat Cluster Manager for fail-over support (not load-balancing), with the two of them attached to a shared disk (E).
    Portal Medium Implementation

    This could be a viable solution if we need to handle more page requests per second, and if we're looking for some resilience/fail-over.

  3. The large implementation. This has 300,000 registered users, 1% concurrent and a total page throughput of around 100 pages per second (259m page requests per month). I won't list out the details of this solution as it's well beyond what we'd need to cater for, except to say that it uses, for example, eight WebCache servers, separates out identity management and the portal repository onto separate servers, and is probably sized for something along the lines of OTN or a large company website.
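
As a quick sanity check on where the example figures come from, the arithmetic for the medium implementation works out like this (the same few lines can then be re-run with whatever user numbers the client comes back with):

```python
# Reproducing the medium implementation's headline figures, so the same
# arithmetic can be applied to the client's own estimates.

logged_in_users = 2_500          # users who may be logged in at any one time
active_fraction = 0.15           # ~15% of those are actively requesting pages
pages_per_user_per_min = 2       # each active user requests ~2 pages a minute

concurrent_requestors = logged_in_users * active_fraction       # ~375
pages_per_min = concurrent_requestors * pages_per_user_per_min  # ~750
pages_per_sec = pages_per_min / 60                              # ~12.5

print(f"Concurrent requestors: {concurrent_requestors:.0f}")
print(f"Page throughput:       {pages_per_sec:.1f} pages/sec")
```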

It's also worth reading through the preamble to each of the examples, especially where it talks about factors such as "hits", "page requests", "concurrency" and "high availability". When you size a client-server application it's fairly easy to define things such as concurrent users, but on a web application users tend to connect and disconnect all the time, and make page requests on an irregular basis. The examples therefore define concurrency in terms of page requests per second, which is the most accurate way of expressing how much simultaneous work the server is going to have to do.

Now that we've got some sample implementations to use as "best practice cases", the next step is to get hold of the client's proposed topology and deployment, and compare this to the examples. One slight issue is that, like most university customers, they're using "big box" Sun hardware to hold multiple application server tiers and database instances, rather than the single-use Linux Intel commodity hardware proposed in the examples, but it's still possible to compare the logical layouts and use of products between the client and example implementations.

At this point, we know the following things:

  • What the client's expectations are of their system
  • What Oracle propose as being best practice deployments for a set of example customers, and
  • What our customer is proposing as their application server deployment.

It should be fairly straightforward to compare the client setup to the Oracle setup for a similarly sized organization, to see whether what they're proposing sounds realistic. What it'd be nice to do now, though, is to capture some performance metrics on the customer setup if it's already running, or point them in the direction of some ways to do this, to see whether what they've set up is performing as expected.

Just like with the Oracle database, there are many ways to capture performance and diagnostic data so that you can measure the performance of your system. You can do it graphically, in real time, using Oracle Application Server Control, or you can look at data over time using Grid Control. Portal itself has some portlets that provide information on usage, most popular portlets and so on, and you can access the logs generated by the underlying OC4J and mod_plsql processes. You can use Unix command-line tools such as vmstat and iostat and utilities such as Orca to track the performance of the underlying servers, and if you're looking for very low-level trace data you can use Application Server utilities such as dmstool and aggrespy to track the performance of individual components.

This presentation and paper by Mick Andrew and Jitiner Sethi go through the facilities available in Oracle Portal and Oracle Enterprise Manager (AS Control and Grid Control) for monitoring and diagnostics. Every installation (tier) of Oracle Application Server has an installation of the Management Agent on it, and this collects performance and diagnostic data which then either gets fed to the local AS Control application or uploaded to a node running Grid Control. One key difference between AS Control (free with Application Server) and Grid Control (an extra license cost) is that AS Control only displays real-time, "as of now" performance data for one particular node, whilst Grid Control shows real-time and historical data for all nodes that it is managing.

The basic Enterprise Manager portal metrics can be accessed using ASControl and look like this:

ASControl Portal Metrics

This page provides some general metrics such as status and portal performance, plus information on the status and version of the Portal Repository. It's also useful for viewing the status of Portal components such as Providers, Syndication Service and Ultrasearch. Using Grid Control, you can show metrics collected over a period of time, and use these metrics to monitor historical trends. I think it's unlikely our client will have licensed Grid Control, though, so I won't spend too much time on this.

For more detailed reports on Portal usage, you can run a few post-configuration steps to start loading the logs generated by Oracle Portal into the database, and then run something provided by Oracle called the "Portal Performance Reports", a set of text-file reports that provide information on:

  • What is the peak login time per day?
  • How many logins per day does the portal receive?
  • How long have portlets been taking to execute?
  • What is the slowest portlet?
  • How many total hits does the portal receive each day?
  • Most/least popular portlets
  • How often are users viewing a page or portlet?
  • How many unique users have logged in each day?
  • Which portlets were called?
  • How many hits does each page receive each day?
  • How many hits does each portlet receive each day?
  • Request breakdown by IP address or host name

These replace the reporting and monitoring portlets that you used to get under the "Monitor" tab with Portal, but that are now obsolete with the advent of WebCache. Using the post-configuration steps linked to previously, you can set up logging such that the whole process is automatic and the reports are generated for you on a daily basis. For more details on how you can monitor Portal using inbuilt functionality, check out "Monitoring and Administering Oracle Portal" in the online documentation.
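
If you just want a rough feel for daily hit and login volumes before those reports are configured, something as simple as the quick sketch below would do. It's my own rough approach rather than anything Oracle provide: it assumes the standard Apache-style access_log that the Oracle HTTP Server writes, and the "/pls/orasso" pattern I've used to spot single sign-on requests is an assumption that would need checking against their actual setup.

```python
# Rough count of hits and SSO logins per hour from an Oracle HTTP Server access log.
# Assumes the standard Apache common/combined log format; the '/pls/orasso' login
# URL pattern is an assumption and needs verifying against the actual configuration.
import re
from collections import Counter

TIMESTAMP = re.compile(r'\[(\d{2}/\w{3}/\d{4}):(\d{2}):')  # e.g. [21/Nov/2005:14:05:59 +0000]

hits_per_hour = Counter()
logins_per_hour = Counter()

with open("access_log") as logfile:
    for line in logfile:
        match = TIMESTAMP.search(line)
        if not match:
            continue
        hour = f"{match.group(1)} {match.group(2)}:00"
        hits_per_hour[hour] += 1
        if "/pls/orasso" in line:          # single sign-on requests (assumed pattern)
            logins_per_hour[hour] += 1

# Note: the sort order is only chronological within a single day's log.
for hour, hits in sorted(hits_per_hour.items()):
    print(f"{hour}  hits={hits:6d}  logins={logins_per_hour[hour]:5d}")
```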

If you're feeling particularly adventurous, "Oracle Application Server Tuning Techniques" by John Garmany and Don Burleson (ODTUG membership required) looks at some database and application server parameters that are useful when tuning the infrastructure database, and examines the use of utilities such as the Dynamic Monitoring Service (dmstool) and Aggrespy, a Java servlet that can be used to display metrics for many Application Server 10g processes. My instinct is to leave the infrastructure database parameters as they are for the time being - the Oracle example implementations just used the default settings - and if I'm looking to suggest any database tuning at all, it'll be on the database used to hold the student record application data. It's useful to know this facility is out there, though, especially if Forms is used at all, as there's a fairly comprehensive section in the paper about interpreting the Forms Servlet logs.

So there we go then. I think I've managed to cover off all of the requirements - review the proposed setup, compare it to "best practice" examples, propose a way of monitoring and benchmarking the Portal installation, and review their arrangements around resilience and high availability.