Parsing XML Files in SQL with Custom Aggregation Function

Parsing XML Files in SQL with Custom Aggregation Function

I was recently given a task to parse an XML file in SQL that contained the definition of an OBI RPD file, which had been exported using the biserverxmlgen command line utility. The purpose of this is for quick analysis of the RPD file and subsequent mapping of the data lineage.

I decided to use an external table so that the XML data can be presented to the database without the need of a load process in PL/SQL, and then use the XMLTABLE function to parse the data. This is fine, except the problem is that each line of the XML file appears as a single row in the external table. I would need to pass a CLOB data type containing the entire file contents to XMLTABLE for the XML to be parsed.

The LISTAGG function is not an option here, as it is limited by the size of VARCHAR2 in Oracle - the RPD XML file I am dealing with is 500KB in size. So after discussion with my colleague Petar Simic, I discovered a LISTAGG_CLOB function created by Connor McDonald which is (almost) perfect for my needs. The function works like LISTAGG but is capable of returning the results in a CLOB.   The original code by Connor can be found here.

The code in LISTAGG_CLOB delimits each element in the listagg with a comma (,) by default, which is hard-coded in the function. I needed the element delimiter to be a space (or chr(10)) so that the XML would parse correctly. I had two options to work around this:

  1. Modify the function to hard-code a space as the delimiter character, or
  2. add an optional parameter to the function so the delimiter can be specified at run time, which defaults to a comma.

I chose the second option.

Technical details

The LISTAGG_CLOB function consists of three parts. A Type specification, a Type body and the function to use the Type.

Type specification

CREATE OR REPLACE TYPE listagg_clob_t AS OBJECT (
        t_varchar2 VARCHAR2(32767),
        t_delim    VARCHAR2(255),
        t_clob     CLOB,
        STATIC FUNCTION odciaggregateinitialize (
               sctx  IN OUT listagg_clob_t,
               delim IN VARCHAR2 DEFAULT ','
           ) RETURN NUMBER,
        MEMBER FUNCTION odciaggregateiterate (
               self  IN OUT listagg_clob_t,
               a_val VARCHAR2
           ) RETURN NUMBER,
        MEMBER FUNCTION odciaggregateterminate (
               self        IN OUT listagg_clob_t,
               returnvalue OUT CLOB,
               flags       IN NUMBER
           ) RETURN NUMBER,
        MEMBER FUNCTION odciaggregatemerge (
               self IN OUT listagg_clob_t,
               ctx2 IN OUT listagg_clob_t
           ) RETURN NUMBER
)

Note the new variable T_DELIM and its position on line 3. This becomes the second of three parameters to the Type - see the body code below. There is also a new default parameter in the prototype of the static function.

Type body

The code can now be modified to replace the original static comma delimiter with the newly introduced T_DELIM variable. Note the new parameter added to the listagg_clob_t call on line 7. This is passed into the T_DELIM variable of the type.

💡
A description of the user defined aggregation functions can be found in the Oracle documentation.
CREATE OR REPLACE TYPE BODY listagg_clob_t IS
    STATIC FUNCTION odciaggregateinitialize (
        sctx  IN OUT listagg_clob_t,
        delim IN VARCHAR2 DEFAULT ','
    ) RETURN NUMBER IS
    BEGIN
        sctx := listagg_clob_t(NULL, delim, NULL);
        RETURN odciconst.success;
    END;
--
    MEMBER FUNCTION odciaggregateiterate (
        self  IN OUT listagg_clob_t,
        a_val VARCHAR2
    ) RETURN NUMBER IS

        PROCEDURE add_val (
            p_val VARCHAR2
        ) IS
        BEGIN
            IF nvl(lengthb(self.t_varchar2), 0) + lengthb(p_val) <= 4000 THEN
                IF self.t_varchar2 IS NULL THEN
                    self.t_varchar2 := self.t_varchar2 || p_val;
                ELSE
                    self.t_varchar2 := self.t_varchar2
                                       || self.t_delim
                                       || p_val;
                END IF;

            ELSE
                IF self.t_clob IS NULL THEN
                    dbms_lob.createtemporary(self.t_clob, TRUE, dbms_lob.call);
                    dbms_lob.writeappend(self.t_clob,
                                         length(self.t_varchar2),
                                         self.t_varchar2);

                ELSE
                    dbms_lob.writeappend(self.t_clob,
                                         length(self.t_varchar2),
                                         self.t_delim || self.t_varchar2);
                END IF;

                self.t_varchar2 := p_val;
            END IF;
        END;

    BEGIN
        add_val(a_val);
        RETURN odciconst.success;
    END;
--
    MEMBER FUNCTION odciaggregateterminate (
        self        IN OUT listagg_clob_t,
        returnvalue OUT CLOB,
        flags       IN NUMBER
    ) RETURN NUMBER IS
    BEGIN
        IF self.t_clob IS NULL THEN
            dbms_lob.createtemporary(self.t_clob, 
                                     TRUE,
                                     dbms_lob.call);
        END IF;

        IF self.t_varchar2 IS NOT NULL THEN
            dbms_lob.writeappend(self.t_clob,
                                 length(self.t_varchar2),
                                 self.t_varchar2);

        END IF;

        returnvalue := self.t_clob;
        RETURN odciconst.success;
    END;
--
    MEMBER FUNCTION odciaggregatemerge (
        self IN OUT listagg_clob_t,
        ctx2 IN OUT listagg_clob_t
    ) RETURN NUMBER IS
    BEGIN
        IF self.t_clob IS NULL THEN
            dbms_lob.createtemporary(self.t_clob,
                                     TRUE,
                                     dbms_lob.call);
        END IF;

        IF self.t_varchar2 IS NOT NULL THEN
            dbms_lob.writeappend(self.t_clob,
                                 length(self.t_varchar2),
                                 self.t_varchar2);

        END IF;

        IF ctx2.t_clob IS NOT NULL THEN
            dbms_lob.append(self.t_clob, ctx2.t_clob);
            dbms_lob.freetemporary(ctx2.t_clob);
        END IF;

        IF ctx2.t_varchar2 IS NOT NULL THEN
            dbms_lob.writeappend(self.t_clob,
                                 length(ctx2.t_varchar2),
                                 ctx2.t_varchar2);

            ctx2.t_varchar2 := NULL;
        END IF;

        RETURN odciconst.success;
    END;
--
END;

Aggregation function

Finally, the custom function listagg_clob with two parameters:

  • AGG the column name containing the data to be aggregated
  • DELIM a custom delimiter of your choice. Defaulted to comma.
CREATE OR REPLACE FUNCTION listagg_clob (
    agg   VARCHAR2,
    delim IN VARCHAR2 DEFAULT ','
) RETURN CLOB
    PARALLEL_ENABLE
AGGREGATE USING listagg_clob_t;
💡
This code has been tested on Oracle Database 19c.

Solving the problem

So now that I have a fast and reliable function to aggregate every row from a 500KB file, all I need to complete the build is the external table and the SQL to extract the data. Below is the external table definition.

💡
I am ignoring field delimiters as the data from the XML may contain commas. I don't expect '!~!' to appear in the RPD XML...
create table RPD_DATA_XML_EXT
(
  XMLDATA varchar2(3000)
)
organization external
(
  type oracle_loader
  default directory <import directory> 
  access parameters
  ( 
    RECORDS DELIMITED BY NEWLINE
    NOLOGFILE 
    NOBADFILE 
    NODISCARDFILE 
    FIELDS TERMINATED BY '!~!' OPTIONALLY ENCLOSED BY '!~!' NOTRIM
  )
  location
  (
    '<file name>'
  )
)

The SQL to parse the XML that is contained in multiple rows in the table is below. This is from the RPD export file I loaded into the import directory, and in this example, I am extracting a list of primary keys from the model.

💡
This SQL includes an example of a dynamic file name on the external table.

And the results:‌