
Collibra Data Analysis With Pyspark

Rupal Sarkar

Oct. 14, 2024

Basic Data Preparation using Collibra Output Module

Collibra Output Module is a lightweight graph query engine exposed through the public API. It supports multiple output formats, such as JSON, XML, Excel, and CSV, and provides a single API to query most Collibra entities, such as assets, communities, domains, and types, with SQL-like filtering capabilities.

Project structure:

├── configJsonFiles
│   ├── AssetDetailsConfig.json
│   ├── AssetResponsibilitiesConfig.json
├── dataSet
│   ├── AssetDetails.json
│   ├── AssetResponsibilities.json
├── images
│   ├── Business Stewards vs Assets.jpg
│   ├── AssetTypes vs Status.jpg
├── get_assets_details.py
├── get_assets_responsibilities.py
  1. We will be using JSON as our query language. First, we'll retrieve asset details. The Output Module supports a tabular output format through a different kind of ViewConfig, called TableViewConfig. TableViewConfig has a Columns mapping section that assigns each selected field to a column.
    The following JSON config file uses TableViewConfig. It is sent to the same {{domain}}/rest/2.0/outputModule/export/json endpoint, using the TableViewConfig as the JSON payload.

  Replace the highlighted domain IDs with your own domain IDs in the JSON config file:
"Filter": {
  "AND": [
    {
      "Field": {
        "name": "Domain_Id",
        "operator": "IN",
        "value": ["a3560bbf-d25a-4328-8488-e20023a09c7c", "45a29de6-ec1a-49d5-be40-71c8e0baaff0"]
      }
    }
  ]
},

  2. Next, we'll configure the following environment variables:
    1. COLLIBRA_URL = https://<your_collibra_platform_url>/rest/2.0
    2. COLLIBRA_USERNAME = <collibra_username>
    3. COLLIBRA_PASSWORD = <collibra_password>

  3. Then we'll make an HTTP (Hypertext Transfer Protocol) POST request (i.e., submit data to the server for processing) in Python, using the Python requests library.
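As a minimal sketch of this step (roughly what get_assets_details.py could do), the code below loads the ViewConfig, POSTs it to the export endpoint, and saves the result to the dataSet folder. Basic authentication and the exact file paths are assumptions; adapt them to your setup.

```python
# Sketch of get_assets_details.py: POST a TableViewConfig to the
# Collibra Output Module and save the JSON response.
import json
import os

import requests  # pip install requests


def load_view_config(path):
    """Load a TableViewConfig JSON payload from disk."""
    with open(path) as f:
        return json.load(f)


def export_json(base_url, username, password, view_config):
    """POST the ViewConfig to the Output Module export endpoint and return parsed JSON."""
    response = requests.post(
        f"{base_url}/outputModule/export/json",
        json=view_config,
        auth=(username, password),  # basic auth assumed; swap in your auth scheme if needed
        timeout=60,
    )
    response.raise_for_status()
    return response.json()


# Only run the export when the environment variables from the previous step are set.
if __name__ == "__main__" and "COLLIBRA_URL" in os.environ:
    config = load_view_config("configJsonFiles/AssetDetailsConfig.json")
    data = export_json(
        os.environ["COLLIBRA_URL"],  # e.g. https://<your_collibra_platform_url>/rest/2.0
        os.environ["COLLIBRA_USERNAME"],
        os.environ["COLLIBRA_PASSWORD"],
        config,
    )
    with open("dataSet/AssetDetails.json", "w") as f:
        json.dump(data, f, indent=2)
```

The same pattern works for the responsibilities export: point it at AssetResponsibilitiesConfig.json and write the response to dataSet/AssetResponsibilities.json.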
  4. Now we'll extract data about each asset's responsibilities using the following JSON config file:
  Replace the highlighted domain IDs with your own domain IDs in the JSON config file:
"Filter": {
  "AND": [
    {
      "Field": {
        "name": "Domain_Id",
        "operator": "IN",
        "value": ["a3560bbf-d25a-4328-8488-e20023a09c7c", "45a29de6-ec1a-49d5-be40-71c8e0baaff0"]
      }
    }
  ]
},

  5. Again, we'll use Python to make the POST request.

Data Analysis using PySpark

Spark is a unified analytics engine for large-scale data processing. Spark provides a powerful API that makes it look like you’re working with a cohesive, non-distributed source of data, while working hard in the background to optimize your program to use all the power available. PySpark (Spark + Python) provides an entry point to Python in the computational model of Spark. PySpark is fast, expressive and versatile.
Now, using PySpark, we'll analyze the data stored in the dataSet folder.