Convert AdminTools query output to JSON

Binary Adventures

When querying the BusinessObjects CMS repository through the AdminTools web application, the result is always returned as an HTML file full of (nested) tables, which is not an ideal format if you want to process the information further.

The code below takes the HTML, parses it and extracts the data, taking into account nested tables and returning a JSON data structure.

Note: in recent BI 4.2 releases, you could opt to use the REST SDK to execute your query on the repository and immediately have the result returned in JSON format.

Code

from bs4 import BeautifulSoup
from itertools import islice
import json

def recurse_table(table, header=False):
    data = {}
    for row in islice(table('tr', recursive=False), 1 if header else 0, None):
        cells = row.contents

        key, value = [cell.string for cell in cells]
        key = key.strip()

        if cells[1].table is None:
            # Alphanumeric value
            if key == 'SI_TOTAL':
                continue
            value = value if value is None else value.strip()
        else:
            # Nested table
            value = recurse_table(cells[1].table)

        if key.isnumeric():
            # This is an array, not a dict
            if 'data' not in data.keys():
                # Not yet initialized by previous row
                data['data'] = []
            data['data'].append(value)
        else:
            data[key] = value

    return data

infile = 'Example.html'

with open(infile, 'r') as f:
    soup = BeautifulSoup(f, 'lxml')

data = [recurse_table(table, header=True) for table in islice(soup.body('table', recursive=False),2,None,2)]

outfile = str(Path(infile).with_suffix('.json'))
with open(outfile, 'w') as f:
    f.write(json.dumps(data, indent=4))

Sample output

[
    {
        "SI_SPECIFIC_KIND": "User",
        "SI_ENT_USERGROUPS": {
            "data": [
                "2",
                "5750"
            ]
        },
        "SI_PARENT_FOLDER": "19",
        "SI_DATA": {
            "CADENZA_timeOutCheck": "true",
            "AF_Enable508": "false",
        },
        "SI_ID": "2649762",
        "SI_REL_USERGROUPS": {
            "data": [
                "2",
                "5428",
                "5750"
            ]
        },
        "SI_INSTANCE_OBJECT": "false",
        "SI_PARENT_CUID": "AXhmigik4CBKra9ZYzR2ezE",
        "SI_LAST_PASSWORD_CHANGE_TIME": "11/12/15 4:07:57 PM",
        "SI_OWNER": "Administrator",
        "SI_USERGROUPS": {
            "data": [
                "1",
                "2",
                "5428",
                "5750"
            ]
        },
        "SI_INSTANCE": "false",
        "SI_ALIASES": {
            "data": [
                {
                    "SI_DISABLED": "false",
                    "SI_ID": "secWinAD:D67A55DF9F6C20D7E8DCF1:00021776"
                },
                {
                    "SI_DISABLED": "false",
                    "SI_ID": "secEnterprise:#2649762"
                }
            ]
        }
    }
]

Remarks

Table types

The line islice(soup.body('table', recursive=False),2,None,2) ensures that we start reading from the third table, and skip every other table. The first table in the document contains a summary.

<table class='basic' width='100%' border='0'>
    <tr>
        <td align='left'>Number of InfoObject(s) returned: <b>1919</b></td>
        <td align='right'></td>
    </tr>
</table>

Then, each result table is preceded by a table enumerating the results.

<table class='basic' width='100%' border='0'>
    <tr>
        <td align='left'><b>1/1919</b></td>
        <td align='right'><a href='#top-anchor'>top</a></td>
    </tr>
</table>

The line islice(table('tr', recursive=False), 1 if header else 0, None) and accompanying parameter header allows us to remove the header row which only appears on the result tables, not on the nested/embedded tables.

Arrays

Some keys will contain alphanumeric values (e.g. SI_CUID), while others will contain a data structure (nested table, e.g. SI_USERGROUPS or SI_ALIASES). Those data structures can be simple arrays (with numeric keys), associative arrays (with alphanumeric keys) or a mix. This is why we perform the check key.isnumeric(). If it is, we know that we’re dealing with a row that belongs to an array. In that case, we add the value to the data array, instead of copying the array index into our output.

Reference