Parsing Badly Formatted JSON in Oracle DB with APEX_JSON

Using APEX_JSON to parse badly structured JSON documents and extract all the information they contain.

After some blogging silence due to project work and holidays, I thought it was a good idea to do a write-up about a problem I faced this week. One of the tasks I was assigned was to parse a set of JSON files stored in an Oracle 12.1 DB Table.

As probably all of you already know, JSON (JavaScript Object Notation) is a lightweight data-interchange format, widely used for web services due to its flexibility. In JSON there is no header to define (think of CSV as an example): every field is defined in a format like "field name":"field value", and there is no "set of required columns" for a JSON object. When a new attribute needs to be defined, the related name and value can simply be added to the structure. On top of this "schema-free" definition, the field value can be either

  • a single value
  • an array
  • a nested JSON object

Basically, when you start parsing JSON you feel like

[Image: RunAway]

The Easy Part

The assigned task wasn't too difficult: after reading the relevant documentation, I was able to parse a JSON file like

{
 "field1": "abc",
 "field2": "cde"
}

using a simple SQL query like

select * 
 from TBL_NAME d,
 JSON_TABLE(d.text, '$' COLUMNS (
   field1 VARCHAR2(10) PATH '$.field1',
   field2 VARCHAR2(10) PATH '$.field2'
   )
 )

Parsing arrays is not very complex either. A JSON file like

{
 "field1": "abc",
 "field2": "cde",
 "field3": ["fgh","ilm","nop"]
}

can be easily parsed using the NESTED PATH clause

select * 
 from TBL_NAME d, 
 JSON_TABLE(d.text, '$' COLUMNS (
   field1 VARCHAR2(10) PATH '$.field1',
   field2 VARCHAR2(10) PATH '$.field2',
   NESTED PATH '$.field3[*]' COLUMNS (
     field3 VARCHAR2(10) PATH '$'
   )
 )
)

If the array contains nested objects, those can be parsed using the same syntax as before. For example, field4 and field5 of the following JSON

{
 "field1": "abc",
 "field2": "cde",
 "field3": [
           {
            "field4":"fgh",
            "field5":"ilm"
           },
           {
            "field4":"nop",
            "field5":"qrs"
           }
           ] 
}

can be parsed with

NESTED PATH '$.field3[*]' COLUMNS ( 
   field4 VARCHAR2(10) PATH '$.field4',
   field5 VARCHAR2(10) PATH '$.field5' 
)

...Where things got complicated

All very easy with well-formatted JSON files, but then I faced the following

{ 
"field1": "abc", 
"field2": "cde", 
"field3": [
    {
     "field4": "aaaa", 
     "field5":{ 
           "1234":"8881", 
           "5678":"8893" 
          }
     },
     {
      "field4": "bbbb",  
      "field5":{ 
            "9876":"8881", 
            "7654":"8945",
            "4356":"7777"
          }
      } 
      ] 
}

Basically, the JSON file started including fields whose names are the Ids themselves, representing an association: for example, Product Id (1234) is a member of Brand Id (8881). This immediately triggered my reaction:

[Image: Cat]

After checking the documentation again, I wasn't able to find anything that could help me parse that, since all the calls required a predefined PATH string which, in the case of Ids, I couldn't know beforehand.

I then reached out to my network on Twitter, which generated quite a lot of responses. Initially, the discussion was about the correctness of the JSON structure which, from a purist point of view, should be mapped as

{ 
"field1": "abc",
"field2": "cde",
"field3": [ 
     {
      "field4": "aaaa", 
      "field5":
           { 
             "association": [
                  {"productId":"1234", "brandId":"8881"},
                  {"productId":"5678", "brandId":"8893"}
                  ]
           }
      },
      {
       "field4": "bbbb", 
       "field5":    
           {
             "association": [
                  {"productId":"9876", "brandId":"8881"},
                  {"productId":"7654", "brandId":"8945"},
                  {"productId":"4356", "brandId":"7777"}
                  ]
           }
      } 
      ]
}

basically going back to standard field names like productId and brandId that could be easily parsed. In my case this wasn't possible since the JSON format was already widely used at the client.
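To illustrate why the purist layout is easier to work with, here is a rough sketch of how it could be queried with the same JSON_TABLE syntax shown earlier, simply by nesting one NESTED PATH inside another. It assumes the same TBL_NAME table and text column as the previous queries; the output column names product_id and brand_id are my own choice for illustration.

```sql
-- Sketch only: assumes d.text holds documents in the "purist" layout above.
select jt.*
  from TBL_NAME d,
       JSON_TABLE(d.text, '$' COLUMNS (
         field1 VARCHAR2(10) PATH '$.field1',
         field2 VARCHAR2(10) PATH '$.field2',
         -- one row per element of field3
         NESTED PATH '$.field3[*]' COLUMNS (
           field4 VARCHAR2(10) PATH '$.field4',
           -- one row per product/brand pair inside field5.association
           NESTED PATH '$.field5.association[*]' COLUMNS (
             product_id VARCHAR2(10) PATH '$.productId',
             brand_id   VARCHAR2(10) PATH '$.brandId'
           )
         )
       )) jt;
```

Every path in this query is known in advance, which is exactly what the Id-as-field-name layout takes away.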

Possible Solutions

Since a change in the JSON format wasn't possible, I needed to find a way of parsing it. A few solutions were mentioned in the Twitter thread:

  • Regular Expressions
  • Bash external table preprocessor
  • Java Stored functions
  • External parsing before storing data into the database

All of the above were discarded since I wanted to try achieving a solution based only on existing database functions. Other suggestions included JSON_DATAGUIDE and JSON_OBJECT_T.GET_KEYS, which unfortunately are available only from 12.2 (I was on 12.1).
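For readers who are on 12.2 or later, the key-listing approach could look roughly like the sketch below. I couldn't use it on 12.1, so treat it as an untested illustration built on the PL/SQL JSON object types (JSON_OBJECT_T, JSON_ARRAY_T, JSON_KEY_LIST); the variable names and the shortened inline JSON string are mine.

```sql
-- Sketch for Oracle 12.2+ only; not the solution used in this post.
DECLARE
   doc       JSON_OBJECT_T;
   items     JSON_ARRAY_T;
   item      JSON_OBJECT_T;
   assoc     JSON_OBJECT_T;
   key_list  JSON_KEY_LIST;
BEGIN
   doc   := JSON_OBJECT_T.parse(
              '{"field3":[{"field4":"aaaa","field5":{"1234":"8881","5678":"8893"}}]}');
   items := doc.get_array('field3');
   -- JSON_ARRAY_T positions start at 0
   FOR i IN 0 .. items.get_size - 1 LOOP
      item     := TREAT(items.get(i) AS JSON_OBJECT_T);
      assoc    := item.get_object('field5');
      -- get_keys returns the member names, i.e. the Product Ids
      key_list := assoc.get_keys;
      FOR k IN 1 .. key_list.COUNT LOOP
         dbms_output.put_line('Product Id="' || key_list(k)
                              || '" BrandId="' || assoc.get_string(key_list(k)) || '"');
      END LOOP;
   END LOOP;
END;
/
```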

But, just a second before surrendering, Alan Arentsen suggested using the APEX_JSON.PARSE procedure!

The Chosen One: APEX_JSON

The APEX_JSON package offers a series of procedures and functions to parse JSON from PL/SQL, in particular:

  • PARSE: Parses a JSON-formatted string contained in a VARCHAR2 or CLOB, storing all the members.
  • GET_COUNT: Returns the number of array elements or object members.
  • GET_MEMBERS: Returns the table of member names of an object.

You can already imagine how a combination of those calls can parse the JSON text defined above. Let's have a look at the JSON again:

{ 
"field1": "abc", 
"field2": "cde", 
"field3": [
    {
     "field4": "aaaa", 
     "field5":{ 
           "1234":"8881", 
           "5678":"8893" 
          }
     },
     {
      "field4": "bbbb",  
      "field5":{ 
            "9876":"8881", 
            "7654":"8945",
            "4356":"7777"
          }
      } 
      ] 
}

The parsing process should iterate over the field3 entries (2 in this case) and, for each entry, iterate over the members of field5 to get both the field name and the field value.
The number of field3 entries can be found with

APEX_JSON.GET_COUNT(p_path=>'field3',p_values=>j);

And the list of members of field5 with

APEX_JSON.GET_MEMBERS(p_path=>'field3[%d].field5',p_values=>j,p0=>i);

Note the p_path parameter set to field3[%d].field5, meaning that we want to extract field5 from the nth entry in field3 (counting from 1). The entry number is supplied by p0=>i, with i being the variable we use in our FOR loop.

The complete code is the following

DECLARE
   j              APEX_JSON.t_values;
   r_count        NUMBER;
   field5members  WWV_FLOW_T_VARCHAR2;
   BrandId        VARCHAR2(10);
BEGIN
   APEX_JSON.parse(j, '<INSERT_JSON_STRING>');
   -- Getting number of field3 elements
   r_count := APEX_JSON.GET_COUNT(p_path=>'field3', p_values=>j);
   dbms_output.put_line('Nr Records: ' || r_count);

   -- Looping for each element in field3
   FOR i IN 1 .. r_count LOOP
      -- Getting field5 members for the ith member of field3
      field5members := APEX_JSON.GET_MEMBERS(p_path=>'field3[%d].field5', p_values=>j, p0=>i);
      -- Looping all field5 members
      FOR q IN 1 .. field5members.COUNT LOOP
         -- Extracting BrandId
         BrandId := APEX_JSON.GET_VARCHAR2(p_path=>'field3[%d].field5.'||field5members(q), p_values=>j, p0=>i);
         -- Printing BrandId and Product Id
         dbms_output.put_line('Product Id ="'||field5members(q)||'" BrandId="'||BrandId||'"');
      END LOOP;
   END LOOP;
END;

Note that, in order to extract the BrandId we used

APEX_JSON.GET_VARCHAR2(p_path=>'field3[%d].field5.'||field5members(q) ,p_values=>j,p0=>i);

Specifically, the path is 'field3[%d].field5.'||field5members(q). As you can imagine, we are appending the member name (field5members(q)) to the path described previously, forming a string like field3[1].field5.1234 that correctly extracts the associated value.
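Since the original task was to parse documents stored in a table rather than a hard-coded string, the same logic can be wrapped in a cursor loop. The following is only a sketch: it assumes the TBL_NAME table and text column used in the JSON_TABLE examples, and that every field3 entry actually contains a field5 object (GET_MEMBERS is documented to raise an error otherwise).

```sql
-- Sketch: apply the APEX_JSON parsing to every document stored in TBL_NAME.text
BEGIN
   FOR rec IN (SELECT d.text FROM TBL_NAME d) LOOP
      DECLARE
         -- declared in an inner block so each document starts from an empty structure
         j        APEX_JSON.t_values;
         members  WWV_FLOW_T_VARCHAR2;
      BEGIN
         APEX_JSON.parse(j, rec.text);
         FOR i IN 1 .. NVL(APEX_JSON.GET_COUNT(p_path=>'field3', p_values=>j), 0) LOOP
            members := APEX_JSON.GET_MEMBERS(p_path=>'field3[%d].field5', p_values=>j, p0=>i);
            FOR q IN 1 .. members.COUNT LOOP
               -- prints: field4 value, Product Id (member name), Brand Id (member value)
               dbms_output.put_line(
                  APEX_JSON.GET_VARCHAR2(p_path=>'field3[%d].field4', p_values=>j, p0=>i)
                  || ' ' || members(q) || ' '
                  || APEX_JSON.GET_VARCHAR2(p_path=>'field3[%d].field5.'||members(q), p_values=>j, p0=>i));
            END LOOP;
         END LOOP;
      END;
   END LOOP;
END;
/
```

In a real load you would probably insert the three values into a target table instead of printing them.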

Conclusion

Three things to take away from this experience. The first is the usage of JSON_TABLE: with JSON_TABLE you can parse well-constructed JSON documents, and it's very easy and powerful.
The second: APEX_JSON is a useful package for parsing "not very well" constructed JSON documents, since it allows iteration across the elements of JSON arrays and the members of JSON objects.
The last, which is becoming more relevant every day in my career, is the importance of networking and knowledge sharing: blogging, speaking at conferences and helping others in various channels allow you to get to know other people and be known, with the nice side effect of sometimes being able to get help with a problem from a single tweet!