Unverified Commit 4ecc65f4 authored by Jin Hyuk Chang, committed by GitHub

Dashboard documentation (#264)

* Added Dashboard model documentation

* Update

* Update

* Update

* Update

* Update

* Update

* Added new lines

* Update

* Update

* Update

* Added transformers doc

* Update

* Update

* Added graph image
parent 134b3293
@@ -18,7 +18,7 @@ LOGGER = logging.getLogger(__name__)
 class ModeDashboardQueriesExtractor(Extractor):
     """
-    A Extractor that extracts run (execution) status and timestamp.
+    An Extractor that extracts Query information.
     """
@@ -13,7 +13,9 @@ LOGGER = logging.getLogger(__name__)
 class TemplateVariableSubstitutionTransformer(Transformer):
     """
-    Transforms dictionary into model
+    Add/replace a field in the record Dict via string.format, based on the given template, with the record Dict provided as the template parameters.
+    https://docs.python.org/3.4/library/string.html#string.Formatter.format
     """
     def init(self, conf):
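As the updated docstring notes, the transformer relies on Python's standard `str.format` templating. Conceptually it does something like the following (a plain-Python illustration of the idea, not the transformer's actual code):

```python
# Add/replace one field in a record dict via str.format, using the
# record itself as the pool of template parameters.
record = {'organization': 'lyft', 'dashboard_id': 'abc123'}
template = 'https://app.mode.com/{organization}/reports/{dashboard_id}'

record['url'] = template.format(**record)
# record now also contains 'url': 'https://app.mode.com/lyft/reports/abc123'
```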
# Dashboard Ingestion guidance
(Currently this guidance is about using Databuilder to ingest Dashboard metadata into Neo4j and Elasticsearch)
Dashboard ingestion consists of multiple Databuilder jobs, and it can be described in four steps:
1. Ingest base data to Neo4j.
2. Ingest additional data and decorate Neo4j over the base data.
3. Update the Elasticsearch index using Neo4j data.
4. Remove stale data.
Note that the Databuilder jobs need to be sequenced as 1 -> 2 -> 3 -> 4. To sequence these jobs, Lyft uses Airflow to orchestrate them, but Databuilder is not limited to Airflow and you can also simply use a Python script to sequence them, as sketched below -- though this is not recommended for production.
Also, steps 1, 3, and 4 are each expected to have one Databuilder job, whereas step 2 is expected to have **multiple** Databuilder jobs, and the number of Databuilder jobs in step 2 is expected to grow as we add more metadata to Dashboard. To improve performance, it is recommended, but not required, to execute the Databuilder jobs in step 2 concurrently.
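A minimal Python-script sequencing sketch (the job names are hypothetical placeholders for the `DefaultJob`s configured in the steps below):

```python
def run_sequenced(jobs):
    """Launch Databuilder DefaultJobs strictly in order (1 -> 2 -> 3 -> 4).
    The step-2 jobs could instead be launched concurrently,
    e.g. via concurrent.futures.ThreadPoolExecutor."""
    for job in jobs:
        job.launch()

# Hypothetical job names; each job is built as shown in the steps below:
# run_sequenced([base_job, *decorating_jobs, es_publish_job, staleness_job])
```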
Once steps 1 and 2 are finished, you will have a graph like this:
![Dashboard graph modeling](./assets/dashboard_graph_modeling.png?raw=true "Dashboard graph modeling")
This documentation uses [Mode Dashboard](https://app.mode.com/) as a concrete example to show how to ingest Dashboard metadata. However, this ingestion process is not limited to Mode Dashboard, and any other Dashboard source can follow this flow.
### 1. Ingest base data to Neo4j.
Use [ModeDashboardExtractor](../README.md#modedashboardextractor) along with [FsNeo4jCSVLoader](../README.md#fsneo4jcsvloader) and [Neo4jCsvPublisher](../README.md#neo4jcsvpublisher) to add base information such as Dashboard group name, Dashboard group id, Dashboard group description, Dashboard name, Dashboard id, and Dashboard description to Neo4j. Use [this job configuration](../README.md#modedashboardextractor) example to configure the job.
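A minimal sketch of such a job is below. It mirrors the usage-extractor example later in this document; the `mode_dashboard_constants` import path and the exact loader/publisher config key constants are assumptions, so treat the linked README example as authoritative:

```python
from pyhocon import ConfigFactory

# Constants module path is an assumption; see the README for the exact import.
from databuilder.extractor.dashboard.mode_analytics.mode_dashboard_constants import (
    MODE_ACCESS_TOKEN, MODE_PASSWORD_TOKEN, ORGANIZATION
)
from databuilder.extractor.dashboard.mode_analytics.mode_dashboard_extractor import ModeDashboardExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.publisher import neo4j_csv_publisher
from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
from databuilder.task.task import DefaultTask

# organization, mode_token, mode_password, neo4j_endpoint, neo4j_user,
# neo4j_password and job_publish_tag are defined elsewhere in your script.
tmp_folder = '/var/tmp/amundsen/dashboard'  # temporary CSV location

extractor = ModeDashboardExtractor()
loader = FsNeo4jCSVLoader()

job_config = ConfigFactory.from_dict({
    '{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
    '{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
    '{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
    # Temporary CSV locations shared between loader and publisher.
    '{}.{}'.format(loader.get_scope(), FsNeo4jCSVLoader.NODE_DIR_PATH): '{}/nodes'.format(tmp_folder),
    '{}.{}'.format(loader.get_scope(), FsNeo4jCSVLoader.RELATION_DIR_PATH): '{}/relationships'.format(tmp_folder),
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NODE_FILES_DIR): '{}/nodes'.format(tmp_folder),
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.RELATION_FILES_DIR): '{}/relationships'.format(tmp_folder),
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_END_POINT_KEY): neo4j_endpoint,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_USER): neo4j_user,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_PASSWORD): neo4j_password,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.JOB_PUBLISH_TAG): job_publish_tag,
})

job = DefaultJob(conf=job_config,
                 task=DefaultTask(extractor=extractor, loader=loader),
                 publisher=Neo4jCsvPublisher())
job.launch()
```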
### 2. Ingest additional data and decorate Neo4j over base data.
Use the other Mode dashboard extractors to create & launch multiple Databuilder jobs. Note that all Databuilder jobs here will use [FsNeo4jCSVLoader](../README.md#fsneo4jcsvloader) and [Neo4jCsvPublisher](../README.md#neo4jcsvpublisher), and their configurations should be almost the same except for the `NODE_FILES_DIR` and `RELATION_FILES_DIR` values that are used as a temporary location to hold data.
A list of the other Extractors can be found [here](../README.md#mode-dashboard-extractor).
#### 2.1. Ingest Dashboard usage data and decorate Neo4j over base data.
Mode provides usage data (view count) per Dashboard, but only as accumulated usage. The main use case for usage is search ranking, and `accumulated usage` is not very useful for Amundsen, as we don't want to surface a Dashboard that was popular years ago and is potentially deprecated.
To get recent usage information, we can `snapshot` the accumulated usage per report daily and then extract recent usage (the past 30, 60, or 90 days, whichever fits our view of recency).
##### 2.1.1. Ingest `accumulated usage` into a Data warehouse (e.g. Hive, BigQuery, Redshift, Postgres, etc.)
In this step, you can use ModeDashboardUsageExtractor to extract `accumulated_view_count` and load it into the Data warehouse of your choice by using GenericLoader.
Note that GenericLoader just takes a callback function, and you need to provide a function that `INSERT`s each record into your Data warehouse.
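For illustration, such a callback could look like the following sketch (the warehouse connection, table name, column names, and the record's attribute names are all hypothetical; check what ModeDashboardUsageExtractor actually emits):

```python
def mode_dashboard_usage_loader_callback_function(record):
    # Hypothetical: snapshot one accumulated view count into a warehouse
    # table via a DB-API 2.0 connection created elsewhere.
    cursor = warehouse_connection.cursor()
    cursor.execute(
        'INSERT INTO dashboard_usage_snapshot '
        '(dashboard_id, accumulated_view_count, snapshot_date) '
        'VALUES (%s, %s, CURRENT_DATE)',
        (record.dashboard_id, record.accumulated_view_count),
    )
    warehouse_connection.commit()
```

The Databuilder job then wires this callback into GenericLoader: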
```python
from pyhocon import ConfigFactory

# Constants module path is an assumption; see the README for the exact import.
from databuilder.extractor.dashboard.mode_analytics.mode_dashboard_constants import (
    MODE_ACCESS_TOKEN, MODE_PASSWORD_TOKEN, ORGANIZATION
)
from databuilder.extractor.dashboard.mode_analytics.mode_dashboard_usage_extractor import ModeDashboardUsageExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.generic_loader import GenericLoader
from databuilder.task.task import DefaultTask

extractor = ModeDashboardUsageExtractor()
loader = GenericLoader()
task = DefaultTask(extractor=extractor, loader=loader)

job_config = ConfigFactory.from_dict({
    '{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
    '{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
    '{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
    '{}.{}'.format(loader.get_scope(), 'callback_function'): mode_dashboard_usage_loader_callback_function,
})

job = DefaultJob(conf=job_config, task=task)
job.launch()
```
##### 2.1.2. Extract the past N days (e.g. 30, 60, or 90) of usage data from your Data warehouse and publish it to Neo4j.
You could use [existing extractors](../README.md#list-of-extractors) to achieve this with the [DashboardUsage model](./models.md#dashboardusage), along with [FsNeo4jCSVLoader](../README.md#fsneo4jcsvloader) and [Neo4jCsvPublisher](../README.md#neo4jcsvpublisher).
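The recent view count can be computed as the difference between today's snapshot and the snapshot from N days ago. A sketch of such a query, reusing the hypothetical `dashboard_usage_snapshot` table from the callback above (the exact columns your extractor must emit depend on the DashboardUsage model; the SQL would then be fed to a SQL-based extractor such as [DBAPIExtractor](../databuilder/extractor/db_api_extractor.py)):

```python
# Recent usage = accumulated count today minus accumulated count 30 days ago.
RECENT_USAGE_SQL = """
SELECT
    cur.dashboard_id,
    cur.accumulated_view_count - COALESCE(prev.accumulated_view_count, 0) AS view_count
FROM dashboard_usage_snapshot cur
LEFT JOIN dashboard_usage_snapshot prev
    ON prev.dashboard_id = cur.dashboard_id
    AND prev.snapshot_date = CURRENT_DATE - INTERVAL '30' DAY
WHERE cur.snapshot_date = CURRENT_DATE
"""
```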
### 3. Update Elasticsearch index using Neo4j data
Once the data is ready in Neo4j, extract it and push it to Elasticsearch using [Neo4jSearchDataExtractor](../databuilder/extractor/neo4j_search_data_extractor.py) and [ElasticsearchPublisher](../databuilder/publisher/elasticsearch_publisher.py):
```python
import uuid

from pyhocon import ConfigFactory

from databuilder.extractor.neo4j_extractor import Neo4jExtractor
from databuilder.extractor.neo4j_search_data_extractor import Neo4jSearchDataExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_elasticsearch_json_loader import FSElasticsearchJSONLoader
from databuilder.publisher import neo4j_csv_publisher
from databuilder.publisher.elasticsearch_constants import DASHBOARD_ELASTICSEARCH_INDEX_MAPPING
from databuilder.publisher.elasticsearch_publisher import ElasticsearchPublisher
from databuilder.task.task import DefaultTask

# neo4j_endpoint, neo4j_user, neo4j_password, elasticsearch_client and
# job_publish_tag are assumed to be defined elsewhere in your script.
tmp_es_file_path = '/var/tmp/amundsen_dashboard/elasticsearch_dashboard_upload/es_data.json'
elasticsearch_new_index_name = 'dashboard_search_index_{ds}_{hex_str}'.format(ds='2020-05-12',
                                                                              hex_str=uuid.uuid4().hex)
elasticsearch_doc_type = 'dashboard'
elasticsearch_index_alias = 'dashboard_search_index'

job_config = ConfigFactory.from_dict({
    'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.GRAPH_URL_CONFIG_KEY):
        neo4j_endpoint,
    'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.MODEL_CLASS_CONFIG_KEY):
        'databuilder.models.dashboard_elasticsearch_document.DashboardESDocument',
    'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_AUTH_USER):
        neo4j_user,
    'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_AUTH_PW):
        neo4j_password,
    'extractor.search_data.{}'.format(Neo4jSearchDataExtractor.CYPHER_QUERY_CONFIG_KEY):
        Neo4jSearchDataExtractor.DEFAULT_NEO4J_DASHBOARD_CYPHER_QUERY,
    'extractor.search_data.{}'.format(neo4j_csv_publisher.JOB_PUBLISH_TAG):
        job_publish_tag,
    'loader.filesystem.elasticsearch.{}'.format(FSElasticsearchJSONLoader.FILE_PATH_CONFIG_KEY):
        tmp_es_file_path,
    'loader.filesystem.elasticsearch.{}'.format(FSElasticsearchJSONLoader.FILE_MODE_CONFIG_KEY):
        'w',
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.FILE_PATH_CONFIG_KEY):
        tmp_es_file_path,
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.FILE_MODE_CONFIG_KEY):
        'r',
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_CLIENT_CONFIG_KEY):
        elasticsearch_client,
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_NEW_INDEX_CONFIG_KEY):
        elasticsearch_new_index_name,
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_DOC_TYPE_CONFIG_KEY):
        elasticsearch_doc_type,
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_MAPPING_CONFIG_KEY):
        DASHBOARD_ELASTICSEARCH_INDEX_MAPPING,
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_ALIAS_CONFIG_KEY):
        elasticsearch_index_alias,
})

job = DefaultJob(conf=job_config,
                 task=DefaultTask(extractor=Neo4jSearchDataExtractor(),
                                  loader=FSElasticsearchJSONLoader()),
                 publisher=ElasticsearchPublisher())
job.launch()
```
Note that `DASHBOARD_ELASTICSEARCH_INDEX_MAPPING` is defined [here](../databuilder/publisher/elasticsearch_constants.py).
### 4. Remove stale data
Dashboard ingestion, like Table ingestion, is an UPSERT (CREATE OR UPDATE) operation, and some data could have been deleted at the source. Not removing it from Neo4j basically leaves stale data in Amundsen.
You can use [Neo4jStalenessRemovalTask](../README.md#removing-stale-data-in-neo4j----neo4jstalenessremovaltask) to remove stale data.
There are two strategies for removing stale data: one is to use `job_publish_tag` and the other is to use `milliseconds_to_expire`.
For example, you could use `job_publish_tag` to remove stale `DashboardGroup`, `Dashboard`, and `Query` nodes, and you could use `milliseconds_to_expire` on the `Timestamp` node and the `READ` and `READ_BY` relations. One of the main reasons to use `milliseconds_to_expire` is to avoid a race condition, which is explained further [here](./README.md#using-publisher_last_updated_epoch_ms-to-remove-stale-data).
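A minimal sketch of a `job_publish_tag`-based removal job, patterned after the README's Neo4jStalenessRemovalTask example (the `task.remove_stale_data.*` config key names here are assumptions; consult the linked README section for the authoritative keys):

```python
from pyhocon import ConfigFactory

from databuilder.job.job import DefaultJob
from databuilder.task.neo4j_staleness_removal_task import Neo4jStalenessRemovalTask

# neo4j_endpoint, neo4j_user, neo4j_password and job_publish_tag
# are defined elsewhere in your script.
task = Neo4jStalenessRemovalTask()
job_config = ConfigFactory.from_dict({
    'job.identifier': 'dashboard_staleness_removal_job',
    'task.remove_stale_data.neo4j_endpoint': neo4j_endpoint,
    'task.remove_stale_data.neo4j_user': neo4j_user,
    'task.remove_stale_data.neo4j_password': neo4j_password,
    # Guardrail: abort if more than 10% of the target data would be removed.
    'task.remove_stale_data.staleness_max_pct': 10,
    # Remove target nodes that were not re-published with the current tag.
    'task.remove_stale_data.target_nodes': ['DashboardGroup', 'Dashboard', 'Query'],
    'task.remove_stale_data.job_publish_tag': job_publish_tag,
})
DefaultJob(conf=job_config, task=task).launch()
```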
# Amundsen Models
## Overview
@@ -17,8 +18,7 @@ without developers needing to know the internals of the neo4j schema.
 ## The Models

-### TableMetadata
-[python class](../databuilder/models/table_metadata.py)
+### [TableMetadata](../databuilder/models/table_metadata.py)

 *What datasets does my org have?*
@@ -32,8 +32,7 @@ In general, for Table and Column Metadata, you should be able to use one of the
 in the [extractor package](../databuilder/extractor)

-### Watermark
-[python class](../databuilder/models/watermark.py)
+### [Watermark](../databuilder/models/watermark.py)

 *What is the earliest data that this table has? What is the latest data?*
 This is NOT the same as when the data was last updated.
@@ -50,8 +49,7 @@ Depending on the datastore of your dataset, you would extract this by:
 - a query for the minimum and maximum record of a given timestamp column

-### ColumnUsageModel
-[python class](../databuilder/models/column_usage_model.py)
+### [ColumnUsageModel](../databuilder/models/column_usage_model.py)

 *How many queries is a given column getting? By which users?*
@@ -70,8 +68,7 @@ In other cases, you may need to use audit logs which could require a custom solu
 Finally, for none traditional data lakes, getting this information exactly maybe difficult and you may need to rely
 on a heuristic.

-### User
-[python class](../databuilder/models/user.py)
+### [User](../databuilder/models/user.py)

 *What users are there out there? Which team is this user on?*
@@ -82,8 +79,7 @@ This is required if you are going to be having authentication turned on.
 #### Extraction
 TODO

-### TableColumnStats
-[python class](../databuilder/models/table_stats.py)
+### [TableColumnStats](../databuilder/models/table_stats.py)

 * What are the min/max values for this column? How many nulls are in this column? *
@@ -100,8 +96,7 @@ For each table you care about:
 For each column you care about:
 Calculate statistics that you care about such as min/max/average etc.

-### Application
-[python class](../databuilder/models/application.py)
+### [Application](../databuilder/models/application.py)

 * What job/application is writing to this table? *
@@ -112,8 +107,7 @@ Currently the model assumes the application has to be in airflow, but in theory
 #### Extraction
 TODO

-### Table Owner
-[python class](../databuilder/models/table_owner.py)
+### [Table Owner](../databuilder/models/table_owner.py)

 * What team or user owns this dataset? *
@@ -126,8 +120,7 @@ Although the main point of entry for owners is through the WebUI, you could in t
 extract this information based on who created a given table.

-### Table Source
-[python class](../databuilder/models/table_source.py)
+### [Table Source](../databuilder/models/table_source.py)

 * Where is the source code for the application that writes to this dataset? *
@@ -140,8 +133,7 @@ You will need a github/gitlab/your repository crawler in order to populate this
 The idea there would be to search for a given table name or something else that is a unique identifier such that you can be confident
 that the source correctly matches to this table.

-### TableLastUpdated
-[python class](../databuilder/models/table_last_updated.py)
+### [TableLastUpdated](../databuilder/models/table_last_updated.py)

 * When was the last time this data was updated? Is this table stale or deprecated? *
@@ -152,4 +144,73 @@ It is a very useful value as it can help users identify if there are tables that
 #### Extraction
 There are some extractors available for this like [hive_table_last_updated_extractor](../databuilder/extractor/hive_table_last_updated_extractor.py)
 that you can refer to. But you will need access to history that provides information on when the last data write happened on a given table.
-If this data isn't available for your data source, you maybe able to approximate it by looking at the max of some timestamp column.
+If this data isn't available for your data source, you may be able to approximate it by looking at the max of some timestamp column.
## Dashboard models
Dashboard models are normalized, meaning the model is split apart so that it can be easily decoupled from how the data is extracted. (If the model were denormalized, with all metadata in one model, then a single extraction would need to be able to pull all of the data, which makes extraction hard and complex.) There is a trade-off in this normalized design: it can be inefficient in the case where a metadata source happens to provide all the needed data and the ingestion could have been done in one job. However, to keep the model flexible for most metadata sources, it is normalized.
### [DashboardMetadata](../databuilder/models/dashboard/dashboard_metadata.py)
#### Description
A baseline of Dashboard metadata that consists of the dashboard group name, dashboard group description, dashboard description, etc. This model needs to be ingested first, as the other models build relations to it.
#### Extraction
[ModeDashboardExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_extractor.py)
### [DashboardOwner](../databuilder/models/dashboard/dashboard_owner.py)
#### Description
A model that encapsulates a Dashboard's owner. Note that it does not create a new User, as it has insufficient information about the user, but it builds a relation between the User and the Dashboard.
#### Extraction
[ModeDashboardOwnerExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_owner_extractor.py)
### [DashboardTable](../databuilder/models/dashboard/dashboard_table.py)
#### Description
A model that links a Dashboard to the Tables used in the various charts of the dashboard. Note that it does not create new Dashboard or Table nodes, as it has insufficient information, but it builds relations between the Tables and the Dashboard.
#### Extraction
Currently there is no open-sourced extractor for this. At Lyft, there is an audit table that records each SQL query and where it came from (with an identifier), along with the tables used in the query. We basically query this table via [DBAPIExtractor](../databuilder/extractor/db_api_extractor.py).
### [DashboardUsage](../databuilder/models/dashboard/dashboard_usage.py)
#### Description
A model that encapsulates Dashboard usage between a Dashboard and a User.
#### Extraction
You can use [ModeDashboardUsageExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_usage_extractor.py). However, Mode currently only provides an accumulated view count, whereas we need recent view counts (past 30, 60, or 90 days). To get recent view counts at Lyft, we use [ModeDashboardUsageExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_usage_extractor.py) to extract the accumulated view count, and [GenericLoader](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/loader/generic_loader.py) to load each record (there is no publisher here, and a publisher is not mandatory in DefaultJob) as an event that is materialized as a daily snapshot. Once daily accumulated view counts are captured, recent view counts are ingested by querying the datastore. At Lyft, we query it via [DBAPIExtractor](../databuilder/extractor/db_api_extractor.py) through Presto.
### [DashboardLastModifiedTimestamp](../databuilder/models/dashboard/dashboard_last_modified.py)
#### Description
A model that encapsulates a Dashboard's last modified timestamp, in epoch.
#### Extraction
[ModeDashboardLastModifiedTimestampExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_last_modified_timestamp_extractor.py)
### [DashboardExecution](../databuilder/models/dashboard/dashboard_execution.py)
#### Description
A model that encapsulates a Dashboard's execution timestamp (in epoch) and execution state. Note that this model supports both last_execution and last_successful_execution by using a [different identifier](../databuilder/models/dashboard/dashboard_execution.py#L23) in the URI.
#### Extraction
[ModeDashboardExecutionsExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_executions_extractor.py), which extracts last_execution.
[ModeDashboardLastSuccessfulExecutionExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_last_successful_executions_extractor.py), which extracts last_successful_execution.
### [DashboardQuery](../databuilder/models/dashboard/dashboard_query.py)
#### Description
A model that encapsulates a Dashboard's query information.
#### Extraction
[ModeDashboardQueriesExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_queries_extractor.py)
### [DashboardChart](../databuilder/models/dashboard/dashboard_chart.py)
#### Description
A model that encapsulates a Dashboard's charts, where each chart is associated with a query.
#### Extraction
[ModeDashboardChartsExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_charts_extractor.py)