Testing times

January 26th, 2006 by Peter Scott

As an organisation, my employer firmly believes in working to standards: we are ISO certified, and all of the technical staff in our outsourcing group hold at least ITIL Foundation certification. Many of our guys also hold vendor certifications in the products they use in their day-to-day working lives (OK, we also make sure we have a good headcount of certified professionals to enhance our partner status with various vendors, and it impresses potential customers).
So, we are strong on process, which means that some things take a while to happen. We need to test and document the test; nothing happens on a production system without the electronic equivalent of a piece of paper saying that somebody is happy that the change does exactly what it should and that it can be applied to the production system to resolve some issue or other. Or so the theory goes.
Recently we patched a customer’s Oracle 9.2 test database to the most recent patch set and tested it without finding any problems. We rolled it out to live, hit a bug with star transformation, found a patch on Metalink, retested and patched again. Then we found a problem with our statistics collection routine, which was taking 15 hours instead of two. This one is going through Oracle’s Service Request mechanism: my guys have a bit of work to do to capture trace data and prepare a test case for support. We can see what is going wrong; we just need the evidence so that the problem can be fixed. For now, we have put in a work-around that gathers enough stats for the online day.
Neither problem was caught on the test system: the first because we never ran that user query there; the second did occur, but on a far smaller test system (less than 1 TB) it was not obvious. So do we need to revise the way we test, or accept that sometimes things can get through and have the mechanisms in place to deal with it? I think the latter.

Comments

  1. Howard J. Rogers Says:

    I used to help clients through the ISO 9001 certification process, and was always at pains to stress to them that ISO 9001 does not mean you won’t have problems in future; it simply means that there are procedures in place for responding to trouble when it arises.

    Of course, planned testing (as opposed to a chaotic free-for-all) should indeed mean fewer problems make it through to production.

    But I’m definitely with you: things will always sneak through; the point is that you have mechanisms in place to deal with them more efficiently and effectively than your non-certified equivalents.

  2. Nuno Souto Says:

    Exactly. There is an entire branch of engineering that studies, guess what? Errors! It starts from the premise that error is possible and in fact common, so what is needed is a mechanism to manage and control its impact. It applies to most engineering processes and is one of the main reasons we have ECC memory available.

    Pity so many forget that IT is an engineering branch, no matter how much economists want to trivialize it.

  3. Peter K Says:

    “So do we need to revise the way we test, or accept that sometimes things can get through and have the mechanisms in place to deal with it? I think the latter.”

    I would suggest a combination of both. You will probably need to take another look at the way testing is done, to ensure that nothing has fallen through the cracks, and also firm up the mechanisms for dealing with problems afterwards.