Basic Data Preparation using Collibra Output Module
The Collibra Output Module is a lightweight graph query engine exposed through the public API. It supports multiple output formats, such as JSON, XML, Excel, and CSV, and it provides a single API to query most Collibra entities, such as assets, communities, domains, and types, using SQL-like filtering capabilities.
Project structure:
├── configJsonFiles
│ ├── AssetDetailsConfig.json
│ ├── AssetResponsibilitiesConfig.json
├── dataSet
│ ├── AssetDetails.json
│ ├── AssetResponsibilities.json
├── images
│ ├── Business Stewards vs Assets.jpg
│ ├── AssetTypes vs Status.jpg
├── get_assets_details.py
├── get_assets_responsibilities.py
We will be using JSON to express our queries. First, we'll retrieve asset details.
The Output Module supports a tabular output format through a different kind of ViewConfig, called TableViewConfig. A TableViewConfig has a Columns mapping section that assigns each selected field to a column.
The following JSON config file uses a TableViewConfig. It is sent to the same {{domain}}/rest/2.0/outputModule/export/json endpoint, with the TableViewConfig as the JSON payload.
Replace the highlighted domain IDs in the JSON file below with your own domain IDs:
"Filter": {
  "AND": [
    { "Field": { "name": "Domain_Id", "operator": "IN",
                 "value": ["a3560bbf-d25a-4328-8488-e20023a09c7c",
                           "45a29de6-ec1a-49d5-be40-71c8e0baaff0"] } }
  ]
},
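The fragment above is only the Filter section. For context, a fuller TableViewConfig payload might look like the sketch below. The field and column names (assetId, assetName) are illustrative, and the exact shape of the Resources mapping varies by Collibra version, so check it against the Output Module documentation before use:

```json
{
  "TableViewConfig": {
    "displayLength": -1,
    "displayStart": 0,
    "Resources": {
      "Asset": {
        "Id": { "name": "assetId" },
        "Signifier": { "name": "assetName" },
        "Domain": { "Id": { "name": "Domain_Id" } }
      }
    },
    "Columns": [
      { "Column": { "fieldName": "assetId" } },
      { "Column": { "fieldName": "assetName" } }
    ],
    "Filter": {
      "AND": [
        { "Field": { "name": "Domain_Id", "operator": "IN",
                     "value": ["a3560bbf-d25a-4328-8488-e20023a09c7c",
                               "45a29de6-ec1a-49d5-be40-71c8e0baaff0"] } }
      ]
    }
  }
}
```

Note that the Filter references the field name (Domain_Id) assigned in the Resources mapping, not the raw entity property.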
- Next, we'll configure the following environment variables:
- COLLIBRA_URL = https://<your_collibra_platform_url>/rest/2.0
- COLLIBRA_USERNAME = <collibra_username>
- COLLIBRA_PASSWORD = <collibra_password>
- Then we'll make an HTTP POST request (submitting the ViewConfig payload to the server) in Python, using the requests library.
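The POST request can be sketched as follows. This is a minimal illustration, not the project's actual script: the endpoint path follows the export endpoint mentioned above, the config and output file names follow the project layout, and the environment variable names match the ones configured earlier. Adjust all of them to your environment.

```python
# Minimal sketch: POST a TableViewConfig to the Output Module export
# endpoint with basic auth, then save the JSON response to disk.
import json
import os

import requests  # third-party: pip install requests

# Placeholder default so the module loads even without the env var set
COLLIBRA_URL = os.environ.get("COLLIBRA_URL", "https://example.collibra.com/rest/2.0")
COLLIBRA_USERNAME = os.environ.get("COLLIBRA_USERNAME", "")
COLLIBRA_PASSWORD = os.environ.get("COLLIBRA_PASSWORD", "")


def export_view(config_path: str, output_path: str) -> None:
    """POST the ViewConfig JSON payload and write the response to disk."""
    with open(config_path) as f:
        view_config = json.load(f)

    response = requests.post(
        f"{COLLIBRA_URL}/outputModule/export/json",
        json=view_config,
        auth=(COLLIBRA_USERNAME, COLLIBRA_PASSWORD),
        timeout=60,
    )
    response.raise_for_status()

    with open(output_path, "w") as f:
        json.dump(response.json(), f, indent=2)
```

Usage would look like `export_view("configJsonFiles/AssetDetailsConfig.json", "dataSet/AssetDetails.json")`; the same helper works for any other ViewConfig in the project.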
- Now we'll extract data about asset responsibilities using the following JSON config file:
Replace the highlighted domain IDs in the JSON file below with your own domain IDs:
"Filter": {
  "AND": [
    { "Field": { "name": "Domain_Id", "operator": "IN",
                 "value": ["a3560bbf-d25a-4328-8488-e20023a09c7c",
                           "45a29de6-ec1a-49d5-be40-71c8e0baaff0"] } }
  ]
},
- Again, we'll use Python to make the POST request.
Data Analysis using PySpark
Spark is a unified analytics engine for large-scale data processing.
Spark provides a powerful API that makes it look like you're working with a single, non-distributed data source, while working hard in the background to optimize your program to use all the compute available.
PySpark (Spark + Python) provides a Python entry point into Spark's computational model.
PySpark is fast, expressive, and versatile.
Now we'll use PySpark to analyze the data stored in the dataSet folder.