Unverified commit 4ecc65f4 authored by Jin Hyuk Chang, committed by GitHub

Dashboard documentation (#264)

* Added Dashboard model documentation

* Update

* Update

* Update

* Update

* Update

* Update

* Added new lines

* Update

* Update

* Update

* Added transformers doc

* Update

* Update

* Added graph image
parent 134b3293
@@ -394,6 +394,205 @@ job = DefaultJob(
job.launch()
```
### [RestAPIExtractor](./databuilder/extractor/restapi/rest_api_extractor.py)
An extractor that utilizes [RestAPIQuery](#rest-api-query) to extract data. A RestAPIQuery needs to be constructed ([example](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_extractor.py#L40)) and injected into RestAPIExtractor.
### Mode Dashboard Extractor
These extractors extract metadata from Mode via Mode's REST API.
Prerequisites:
1. You will need to [create an API access token](https://mode.com/developer/api-reference/authentication/) that has admin privilege.
2. You will need your organization code. You can easily find it in any Mode report's URL:
`https://app.mode.com/<organization code>/reports/report_token`
#### [ModeDashboardExtractor](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_extractor.py)
An extractor that extracts core metadata on Mode dashboards (https://app.mode.com/).
It extracts a list of reports, each consisting of:
* Dashboard group name (Space name)
* Dashboard group id (Space token)
* Dashboard group description (Space description)
* Dashboard name (Report name)
* Dashboard id (Report token)
* Dashboard description (Report description)

Other information, such as report runs, owners, chart names, and query names, is handled by separate extractors.
It calls two APIs ([spaces API](https://mode.com/developer/api-reference/management/spaces/#listSpaces) and [reports API](https://mode.com/developer/api-reference/analytics/reports/#listReportsInSpace)) and joins the results.
You can create a Databuilder job config like this:
```python
task = DefaultTask(extractor=ModeDashboardExtractor(),
loader=FsNeo4jCSVLoader(), )
tmp_folder = '/var/tmp/amundsen/mode_dashboard_metadata'
node_files_folder = '{tmp_folder}/nodes'.format(tmp_folder=tmp_folder)
relationship_files_folder = '{tmp_folder}/relationships'.format(tmp_folder=tmp_folder)
job_config = ConfigFactory.from_dict({
'extractor.mode_dashboard.{}'.format(ORGANIZATION): organization,
'extractor.mode_dashboard.{}'.format(MODE_ACCESS_TOKEN): mode_token,
'extractor.mode_dashboard.{}'.format(MODE_PASSWORD_TOKEN): mode_password,
'loader.filesystem_csv_neo4j.{}'.format(FsNeo4jCSVLoader.NODE_DIR_PATH): node_files_folder,
'loader.filesystem_csv_neo4j.{}'.format(FsNeo4jCSVLoader.RELATION_DIR_PATH): relationship_files_folder,
'loader.filesystem_csv_neo4j.{}'.format(FsNeo4jCSVLoader.SHOULD_DELETE_CREATED_DIR): True,
'task.progress_report_frequency': 100,
'publisher.neo4j.{}'.format(neo4j_csv_publisher.NODE_FILES_DIR): node_files_folder,
'publisher.neo4j.{}'.format(neo4j_csv_publisher.RELATION_FILES_DIR): relationship_files_folder,
'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_END_POINT_KEY): neo4j_endpoint,
'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_USER): neo4j_user,
'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_PASSWORD): neo4j_password,
'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_CREATE_ONLY_NODES): [DESCRIPTION_NODE_LABEL],
'publisher.neo4j.{}'.format(neo4j_csv_publisher.JOB_PUBLISH_TAG): job_publish_tag
})
job = DefaultJob(conf=job_config,
task=task,
publisher=Neo4jCsvPublisher())
job.launch()
```
#### [ModeDashboardOwnerExtractor](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_owner_extractor.py)
An extractor that extracts the Dashboard owner. Mode itself does not have the concept of an owner, so the report creator is used as the owner. Note that if the user has left the organization, the dashboard is skipped.
You can create a Databuilder job config like this. (Configuration related to the loader and publisher is omitted as it is mostly the same. Please take a look at this [example](#modedashboardextractor) for the loader and publisher configuration.)
```python
extractor = ModeDashboardOwnerExtractor()
task = DefaultTask(extractor=extractor,
loader=FsNeo4jCSVLoader(), )
job_config = ConfigFactory.from_dict({
'{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
'{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
'{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
})
job = DefaultJob(conf=job_config,
task=task,
publisher=Neo4jCsvPublisher())
job.launch()
```
#### [ModeDashboardLastSuccessfulExecutionExtractor](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_last_successful_executions_extractor.py)
An extractor that extracts a Mode dashboard's last successful run (execution) timestamp.
You can create a Databuilder job config like this. (Configuration related to the loader and publisher is omitted as it is mostly the same. Please take a look at this [example](#modedashboardextractor) for the loader and publisher configuration.)
```python
extractor = ModeDashboardLastSuccessfulExecutionExtractor()
task = DefaultTask(extractor=extractor,
loader=FsNeo4jCSVLoader(), )
job_config = ConfigFactory.from_dict({
'{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
'{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
'{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
})
job = DefaultJob(conf=job_config,
task=task,
publisher=Neo4jCsvPublisher())
job.launch()
```
#### [ModeDashboardExecutionsExtractor](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_executions_extractor.py)
An extractor that extracts the last run (execution) status and timestamp.
You can create a Databuilder job config like this. (Configuration related to the loader and publisher is omitted as it is mostly the same. Please take a look at this [example](#modedashboardextractor) for the loader and publisher configuration.)
```python
extractor = ModeDashboardExecutionsExtractor()
task = DefaultTask(extractor=extractor,
loader=FsNeo4jCSVLoader(), )
job_config = ConfigFactory.from_dict({
'{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
'{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
'{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
})
job = DefaultJob(conf=job_config,
task=task,
publisher=Neo4jCsvPublisher())
job.launch()
```
#### [ModeDashboardLastModifiedTimestampExtractor](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_last_modified_timestamp_extractor.py)
An extractor that extracts a Mode dashboard's last modified timestamp.
You can create a Databuilder job config like this. (Configuration related to the loader and publisher is omitted as it is mostly the same. Please take a look at this [example](#modedashboardextractor) for the loader and publisher configuration.)
```python
extractor = ModeDashboardLastModifiedTimestampExtractor()
task = DefaultTask(extractor=extractor, loader=FsNeo4jCSVLoader())
job_config = ConfigFactory.from_dict({
'{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
'{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
'{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
})
job = DefaultJob(conf=job_config,
task=task,
publisher=Neo4jCsvPublisher())
job.launch()
```
#### [ModeDashboardQueriesExtractor](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_queries_extractor.py)
An extractor that extracts Mode's query information.
You can create a Databuilder job config like this. (Configuration related to the loader and publisher is omitted as it is mostly the same. Please take a look at this [example](#modedashboardextractor) for the loader and publisher configuration.)
```python
extractor = ModeDashboardQueriesExtractor()
task = DefaultTask(extractor=extractor, loader=FsNeo4jCSVLoader())
job_config = ConfigFactory.from_dict({
'{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
'{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
'{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
})
job = DefaultJob(conf=job_config,
task=task,
publisher=Neo4jCsvPublisher())
job.launch()
```
#### [ModeDashboardChartsExtractor](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_charts_extractor.py)
An extractor that extracts Mode dashboard charts. Currently, the Mode API response schema is undocumented and hard to use, as the schema appears to differ per chart type. For this reason, this extractor only extracts the chart token and chart URL at this point.
You can create a Databuilder job config like this. (Configuration related to the loader and publisher is omitted as it is mostly the same. Please take a look at this [example](#modedashboardextractor) for the loader and publisher configuration.)
```python
extractor = ModeDashboardChartsExtractor()
task = DefaultTask(extractor=extractor, loader=FsNeo4jCSVLoader())
job_config = ConfigFactory.from_dict({
'{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
'{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
'{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
})
job = DefaultJob(conf=job_config,
task=task,
publisher=Neo4jCsvPublisher())
job.launch()
```
#### [ModeDashboardUsageExtractor](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_usage_extractor.py)
An extractor that extracts a Mode dashboard's accumulated view count.
Note that this provides an accumulated view count, which does [not effectively show relevancy](./docs/dashboard_ingestion_guide.md#21-ingest-dashboard-usage-data-and-decorate-neo4j-over-base-data). Thus, fields from this extractor are not directly compatible with the [DashboardUsage](./docs/models.md#dashboardusage) model.
If you are fine with `accumulated usage`, you could transform the Dict payload from [ModeDashboardUsageExtractor](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_usage_extractor.py) to fit [DashboardUsage](./docs/models.md#dashboardusage) by using the [TemplateVariableSubstitutionTransformer](./databuilder/transformer/template_variable_substitution_transformer.py) and [DictToModel](./databuilder/transformer/dict_to_model.py) transformers. ([Example](./databuilder/extractor/dashboard/mode_analytics/mode_dashboard_queries_extractor.py#L36) of how to combine these two transformers.)
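The combination described above can be illustrated with a minimal pure-Python sketch of what the two transformers do. The `DashboardUsage` stand-in, its fields, and the helper functions here are hypothetical illustrations, not the real databuilder classes:

```python
from dataclasses import dataclass

@dataclass
class DashboardUsage:  # hypothetical stand-in for the real model class
    dashboard_id: str
    view_count: int

def template_substitute(record, field_name, template):
    # Mirrors TemplateVariableSubstitutionTransformer: string.format with the record as parameters
    record[field_name] = template.format(**record)
    return record

def dict_to_model(record, model_class):
    # Mirrors DictToModel: instantiate the model class from the dict
    return model_class(**record)

record = {'organization': 'acme', 'report_token': 'abc123', 'view_count': 42}
record = template_substitute(record, 'dashboard_id', '{organization}::{report_token}')
record.pop('organization')
record.pop('report_token')
usage = dict_to_model(record, DashboardUsage)  # DashboardUsage(dashboard_id='acme::abc123', view_count=42)
```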
## List of transformers
#### [ChainedTransformer](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/transformer/base_transformer.py#L41 "ChainedTransformer")
A chained transformer that takes multiple transformers and applies them in sequence.
@@ -414,6 +613,16 @@ job = DefaultJob(
job.launch()
```
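In essence, chaining applies each transformer in order, feeding one transformer's output into the next. A pure-Python sketch of that behavior (the function and record names are illustrative, not the real API):

```python
# Apply each transformer in order; a None result short-circuits the chain,
# mirroring how a chained transformer drops records a stage filters out.
def chain_transform(record, transformers):
    for transform in transformers:
        record = transform(record)
        if record is None:
            return None
    return record

result = chain_transform({'count': 1},
                         [lambda r: {'count': r['count'] + 1},
                          lambda r: {'count': r['count'] * 10}])  # {'count': 20}
```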
#### [TemplateVariableSubstitutionTransformer](./databuilder/transformer/template_variable_substitution_transformer.py)
Adds or replaces a field in a Dict via `string.format`, based on a given template, providing the record Dict as template parameters.
#### [DictToModel](./databuilder/transformer/dict_to_model.py)
Transforms a dictionary into a model.
#### [TimestampStringToEpoch](./databuilder/transformer/timestamp_string_to_epoch.py)
Transforms a string timestamp into an int epoch.
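A minimal sketch of the conversion this transformer performs. The real class is configurable; the format string below is just an assumed example:

```python
from datetime import datetime, timezone

def timestamp_str_to_epoch(ts, fmt='%Y-%m-%dT%H:%M:%SZ'):
    # Parse a UTC timestamp string and return int epoch seconds
    return int(datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc).timestamp())

epoch = timestamp_str_to_epoch('2020-05-12T00:00:00Z')  # 1589241600
```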
## List of loaders
#### [FsNeo4jCSVLoader](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/loader/file_system_neo4j_csv_loader.py "FsNeo4jCSVLoader")
Writes node and relationship CSV file(s) that can be consumed by Neo4jCsvPublisher. It assumes that the records it consumes are instances of Neo4jCsvSerializable.
@@ -432,6 +641,29 @@ job = DefaultJob(
job.launch()
```
#### [GenericLoader](./databuilder/loader/generic_loader.py)
Loader class that calls a user-provided callback function with each record as a parameter.
An example that pushes Mode Dashboard accumulated usage via GenericLoader, where the callback function is expected to insert the record into a data warehouse:
```python
extractor = ModeDashboardUsageExtractor()
task = DefaultTask(extractor=extractor,
loader=GenericLoader(), )
job_config = ConfigFactory.from_dict({
'{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
'{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
'{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
'loader.generic.callback_function': callback_function
})
job = DefaultJob(conf=job_config, task=task)
job.launch()
```
#### [FSElasticsearchJSONLoader](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/loader/file_system_elasticsearch_json_loader.py "FSElasticsearchJSONLoader")
Writes Elasticsearch documents in JSON format which can be consumed by ElasticsearchPublisher. It assumes that the records it consumes are instances of ElasticsearchDocument.
@@ -534,7 +766,7 @@ With this pattern RestApiQuery supports 1:1 and 1:N JOIN relationship.
(GROUP BY or any other aggregation, sub-query join is not supported)
To see it in action, take a peek at [ModeDashboardExtractor](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/extractor/dashboard/mode_analytics/mode_dashboard_extractor.py)
Also, take a look at how it extends to support pagination at [ModePaginatedRestApiQuery](./databuilder/rest_api/mode_analytics/mode_paginated_rest_api_query.py).
### Removing stale data in Neo4j -- [Neo4jStalenessRemovalTask](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/task/neo4j_staleness_removal_task.py):
@@ -18,7 +18,7 @@ LOGGER = logging.getLogger(__name__)
class ModeDashboardQueriesExtractor(Extractor):
"""
An extractor that extracts query information.
"""
@@ -13,7 +13,9 @@ LOGGER = logging.getLogger(__name__)
class TemplateVariableSubstitutionTransformer(Transformer):
"""
Adds or replaces a field in a Dict via string.format, based on a given template, providing the record Dict as template parameters.
https://docs.python.org/3.4/library/string.html#string.Formatter.format
"""
def init(self, conf):
# Dashboard Ingestion guidance
(Currently this guidance is about using Databuilder to ingest Dashboard metadata into Neo4j and Elasticsearch)
Dashboard ingestion consists of multiple Databuilder jobs and can be described in four steps:
1. Ingest base data to Neo4j.
2. Ingest additional data and decorate Neo4j over base data.
3. Update Elasticsearch index using Neo4j data
4. Remove stale data
Note that the Databuilder jobs need to be sequenced as 1 -> 2 -> 3 -> 4. To sequence these jobs, Lyft uses Airflow to orchestrate them, but Databuilder is not limited to Airflow and you could also simply use a Python script to sequence them -- though that is not recommended for production.
Also, steps 1, 3, and 4 are each expected to have one Databuilder job, whereas step 2 is expected to have **multiple** Databuilder jobs, and the number of jobs in step 2 is expected to grow as we add more metadata to Dashboard. To improve performance, it is recommended, but not required, to execute the Databuilder jobs in step 2 concurrently.
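The sequencing above can be sketched in plain Python (not Airflow). The function and parameter names are hypothetical; the job objects are assumed to be databuilder `DefaultJob` instances exposing `launch()`:

```python
from concurrent.futures import ThreadPoolExecutor

def run_dashboard_ingestion(base_job, decorating_jobs, es_publish_job, staleness_removal_job):
    base_job.launch()                        # step 1: ingest base data to Neo4j
    with ThreadPoolExecutor() as pool:       # step 2: decorating jobs may run concurrently
        list(pool.map(lambda job: job.launch(), decorating_jobs))
    es_publish_job.launch()                  # step 3: update the Elasticsearch index
    staleness_removal_job.launch()           # step 4: remove stale data
```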
Once steps 1 and 2 are finished, you will have a graph like this:
![Dashboard graph modeling](./assets/dashboard_graph_modeling.png?raw=true "Dashboard graph modeling")
This documentation uses [Mode Dashboard](https://app.mode.com/) as a concrete example to show how to ingest Dashboard metadata. However, this ingestion process is not limited to Mode Dashboard, and any other Dashboard can follow this flow.
### 1. Ingest base data to Neo4j.
Use [ModeDashboardExtractor](../README.md#modedashboardextractor) along with [FsNeo4jCSVLoader](../README.md#fsneo4jcsvloader) and [Neo4jCsvPublisher](../README.md#neo4jcsvpublisher) to add base information (Dashboard group name, id, and description; Dashboard name, id, and description) to Neo4j. Use [this job configuration](../README.md#modedashboardextractor) example to configure the job.
### 2. Ingest additional data and decorate Neo4j over base data.
Use the other Mode dashboard extractors to create and launch multiple Databuilder jobs. Note that all Databuilder jobs here will use [FsNeo4jCSVLoader](../README.md#fsneo4jcsvloader) and [Neo4jCsvPublisher](../README.md#neo4jcsvpublisher), and their configuration should be almost the same except for the `NODE_FILES_DIR` and `RELATION_FILES_DIR` that are used as temporary locations to hold data.
A list of the other extractors can be found [here](../README.md#mode-dashboard-extractor).
#### 2.1. Ingest Dashboard usage data and decorate Neo4j over base data.
Mode provides usage data (view count) per Dashboard, but it is accumulated usage data. The main use case of usage is search ranking, and `accumulated usage` is not very useful for Amundsen, as we don't want to surface a Dashboard that was popular years ago and is potentially deprecated.
To bring in recent usage information, we can `snapshot` the accumulated usage per report daily and extract recent usage from it (past 30, 60, or 90 days, whichever fits our view of recency).
##### 2.1.1. Ingest `accumulated usage` into Data warehouse (e.g: Hive, BigQuery, Redshift, Postgres, etc)
In this step, you can use ModeDashboardUsageExtractor to extract `accumulated_view_count` and load it into the data warehouse of your choice by using GenericLoader.
Note that GenericLoader just takes a callback function, and you need to provide a function that `INSERT`s the record into your data warehouse.
```python
extractor = ModeDashboardUsageExtractor()
loader = GenericLoader()
task = DefaultTask(extractor=extractor,
loader=loader)
job_config = ConfigFactory.from_dict({
'{}.{}'.format(extractor.get_scope(), ORGANIZATION): organization,
'{}.{}'.format(extractor.get_scope(), MODE_ACCESS_TOKEN): mode_token,
'{}.{}'.format(extractor.get_scope(), MODE_PASSWORD_TOKEN): mode_password,
'{}.{}'.format(loader.get_scope(), 'callback_function'): mode_dashboard_usage_loader_callback_function,
})
job = DefaultJob(conf=job_config, task=task)
job.launch()
```
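A hypothetical implementation of the callback passed above. GenericLoader invokes it with each extracted record (a dict); `sqlite3` stands in for your warehouse client here, and the table and column names are illustrative:

```python
import sqlite3

# Illustrative in-memory "warehouse"; replace with your real warehouse connection.
warehouse_conn = sqlite3.connect(':memory:')
warehouse_conn.execute(
    'CREATE TABLE mode_dashboard_usage_snapshot '
    '(snapshot_date TEXT DEFAULT CURRENT_DATE, dashboard_id TEXT, accumulated_view_count INTEGER)')

def mode_dashboard_usage_loader_callback_function(record):
    # INSERT one accumulated-usage record per dashboard into the warehouse
    warehouse_conn.execute(
        'INSERT INTO mode_dashboard_usage_snapshot (dashboard_id, accumulated_view_count) '
        'VALUES (?, ?)',
        (record['dashboard_id'], record['accumulated_view_count']))
    warehouse_conn.commit()
```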
##### 2.1.2. Extract the past *N* days of usage data from your data warehouse and publish it to Neo4j.
You could use [existing extractors](../README.md#list-of-extractors) to achieve this with the [DashboardUsage model](./models.md#dashboardusage), along with [FsNeo4jCSVLoader](../README.md#fsneo4jcsvloader) and [Neo4jCsvPublisher](../README.md#neo4jcsvpublisher).
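The arithmetic behind "recent usage from daily snapshots" is simply the latest accumulated count minus the count from *N* days earlier. A sketch, assuming snapshots for one dashboard are available as a date-to-count mapping:

```python
from datetime import date, timedelta

def recent_view_count(snapshots, as_of, days=30):
    # snapshots: {snapshot_date: accumulated_view_count} for a single dashboard
    return snapshots[as_of] - snapshots[as_of - timedelta(days=days)]

snaps = {date(2020, 5, 12): 500, date(2020, 4, 12): 420}
recent = recent_view_count(snaps, as_of=date(2020, 5, 12))  # 80 views in the past 30 days
```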
### 3. Update Elasticsearch index using Neo4j data
Once the data is ready in Neo4j, extract it and push it to Elasticsearch using [Neo4jSearchDataExtractor](../databuilder/extractor/neo4j_search_data_extractor.py) and [ElasticsearchPublisher](../databuilder/publisher/elasticsearch_publisher.py):
```python
tmp_es_file_path = '/var/tmp/amundsen_dashboard/elasticsearch_dashboard_upload/es_data.json'
elasticsearch_new_index = 'dashboard_search_index_{ds}_{hex_str}'.format(ds='2020-05-12', hex_str=uuid.uuid4().hex)
elasticsearch_doc_type = 'dashboard'
elasticsearch_index_alias = 'dashboard_search_index'
job_config = ConfigFactory.from_dict({
'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.GRAPH_URL_CONFIG_KEY):
neo4j_endpoint,
'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.MODEL_CLASS_CONFIG_KEY):
'databuilder.models.dashboard_elasticsearch_document.DashboardESDocument',
'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_AUTH_USER):
neo4j_user,
'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_AUTH_PW):
neo4j_password,
'extractor.search_data.{}'.format(Neo4jSearchDataExtractor.CYPHER_QUERY_CONFIG_KEY):
Neo4jSearchDataExtractor.DEFAULT_NEO4J_DASHBOARD_CYPHER_QUERY,
'extractor.search_data.{}'.format(neo4j_csv_publisher.JOB_PUBLISH_TAG):
job_publish_tag,
'loader.filesystem.elasticsearch.{}'.format(FSElasticsearchJSONLoader.FILE_PATH_CONFIG_KEY):
tmp_es_file_path,
'loader.filesystem.elasticsearch.{}'.format(FSElasticsearchJSONLoader.FILE_MODE_CONFIG_KEY):
'w',
'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.FILE_PATH_CONFIG_KEY):
tmp_es_file_path,
'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.FILE_MODE_CONFIG_KEY):
'r',
'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_CLIENT_CONFIG_KEY):
elasticsearch_client,
'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_NEW_INDEX_CONFIG_KEY):
elasticsearch_new_index,
'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_DOC_TYPE_CONFIG_KEY):
elasticsearch_doc_type,
'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_MAPPING_CONFIG_KEY):
DASHBOARD_ELASTICSEARCH_INDEX_MAPPING,
'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_ALIAS_CONFIG_KEY):
elasticsearch_index_alias,
})
job = DefaultJob(conf=job_config,
task=DefaultTask(extractor=Neo4jSearchDataExtractor(),
loader=FSElasticsearchJSONLoader()),
publisher=ElasticsearchPublisher())
job.launch()
```
*Note that `DASHBOARD_ELASTICSEARCH_INDEX_MAPPING` is defined [here](../databuilder/publisher/elasticsearch_constants.py).*
### 4. Remove stale data
Dashboard ingestion, like Table ingestion, is an UPSERT (CREATE OR UPDATE) operation, and some data could be deleted at the source. Not removing it in Neo4j leaves stale data in Amundsen.
You can use [Neo4jStalenessRemovalTask](../README.md#removing-stale-data-in-neo4j----neo4jstalenessremovaltask) to remove stale data.
There are two strategies to remove stale data: one is to use `job_publish_tag`, and the other is to use `milliseconds_to_expire`.
For example, you could use `job_publish_tag` to remove stale `DashboardGroup`, `Dashboard`, and `Query` nodes, and you could use `milliseconds_to_expire` on the `Timestamp` node and the `READ` and `READ_BY` relations. One of the main reasons to use `milliseconds_to_expire` is to avoid a race condition; this is explained more [here](./README.md#using-publisher_last_updated_epoch_ms-to-remove-stale-data)
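A hedged sketch of the `job_publish_tag` strategy. The config key strings below are assumptions, not verified constants; consult `neo4j_staleness_removal_task.py` for the authoritative names before using this:

```python
# Sketch only: the 'task.remove_stale_data.*' key names are assumptions; check the
# Neo4jStalenessRemovalTask module for the exact constants.
task = Neo4jStalenessRemovalTask()
job_config = ConfigFactory.from_dict({
    'job.identifier': 'remove_stale_dashboard_data',
    'task.remove_stale_data.neo4j_endpoint': neo4j_endpoint,
    'task.remove_stale_data.neo4j_user': neo4j_user,
    'task.remove_stale_data.neo4j_password': neo4j_password,
    # Remove only these node types when their publish tag differs from the current run's tag
    'task.remove_stale_data.target_nodes': ['DashboardGroup', 'Dashboard', 'Query'],
    'task.remove_stale_data.job_publish_tag': job_publish_tag,
    # Safety valve: abort if more than 10% of a type would be removed
    'task.remove_stale_data.staleness_max_pct': 10,
})
job = DefaultJob(conf=job_config, task=task)
job.launch()
```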
# Amundsen Models
## Overview
@@ -17,8 +18,7 @@ without developers needing to know the internals of the neo4j schema.
## The Models
### [TableMetadata](../databuilder/models/table_metadata.py)
*What datasets does my org have?*
@@ -32,8 +32,7 @@ In general, for Table and Column Metadata, you should be able to use one of the
in the [extractor package](../databuilder/extractor)
### [Watermark](../databuilder/models/watermark.py)
*What is the earliest data that this table has? What is the latest data?*
This is NOT the same as when the data was last updated.
@@ -50,8 +49,7 @@ Depending on the datastore of your dataset, you would extract this by:
- a query for the minimum and maximum record of a given timestamp column
### [ColumnUsageModel](../databuilder/models/column_usage_model.py)
*How many queries is a given column getting? By which users?*
@@ -70,8 +68,7 @@ In other cases, you may need to use audit logs which could require a custom solution.
Finally, for non-traditional data lakes, getting this information exactly may be difficult, and you may need to rely
on a heuristic.
### [User](../databuilder/models/user.py)
*What users are there out there? Which team is this user on?*
@@ -82,8 +79,7 @@ This is required if you are going to be having authentication turned on.
#### Extraction
TODO
### [TableColumnStats](../databuilder/models/table_stats.py)
*What are the min/max values for this column? How many nulls are in this column?*
@@ -100,8 +96,7 @@ For each table you care about:
For each column you care about:
Calculate statistics that you care about such as min/max/average etc.
### [Application](../databuilder/models/application.py)
*What job/application is writing to this table?*
@@ -112,8 +107,7 @@ Currently the model assumes the application has to be in airflow, but in theory
#### Extraction
TODO
### [Table Owner](../databuilder/models/table_owner.py)
*What team or user owns this dataset?*
@@ -126,8 +120,7 @@ Although the main point of entry for owners is through the WebUI, you could in theory
extract this information based on who created a given table.
### [Table Source](../databuilder/models/table_source.py)
*Where is the source code for the application that writes to this dataset?*
@@ -140,8 +133,7 @@ You will need a github/gitlab/your repository crawler in order to populate this
The idea there would be to search for a given table name or something else that is a unique identifier such that you can be confident
that the source correctly matches to this table.
### [TableLastUpdated](../databuilder/models/table_last_updated.py)
*When was the last time this data was updated? Is this table stale or deprecated?*
@@ -152,4 +144,73 @@ It is a very useful value as it can help users identify if there are tables that
#### Extraction
There are some extractors available for this like [hive_table_last_updated_extractor](../databuilder/extractor/hive_table_last_updated_extractor.py)
that you can refer to. But you will need access to history that provides information on when the last data write happened on a given table.
If this data isn't available for your data source, you may be able to approximate it by looking at the max of some timestamp column.
## Dashboard models
Dashboard models are normalized, meaning the model is split up so that it can be decoupled from how data is extracted. (If the model were denormalized, with all metadata in one model, then a single extraction would need to be able to pull all the data, which makes extraction hard and complex.) There is a trade-off in this normalized design: it can be inefficient when a metadata source happens to provide all the data needed, as the ingestion could otherwise be done in one job. However, to make the model flexible for most metadata sources, it is normalized.
### [DashboardMetadata](../databuilder/models/dashboard/dashboard_metadata.py)
#### Description
A baseline of Dashboard metadata that consists of dashboard group name, dashboard group description, dashboard description, etc. This model needs to be ingested first, as the other models build relations to it.
#### Extraction
[ModeDashboardExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_extractor.py)
### [DashboardOwner](../databuilder/models/dashboard/dashboard_owner.py)
#### Description
A model that encapsulates a Dashboard's owner. Note that it does not create a new user, as it has insufficient information about the user, but it builds a relation between the User and the Dashboard.
#### Extraction
[ModeDashboardOwnerExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_owner_extractor.py)
### [DashboardTable](../databuilder/models/dashboard/dashboard_table.py)
A model that links a Dashboard with the tables used in the various charts of the dashboard. Note that it does not create new dashboards or tables, as it has insufficient information, but it builds relations between the Tables and the Dashboard.
Supporting extractor: Currently there is no open-sourced extractor for this. At Lyft, there is an audit table that records each SQL query and where it came from (with an identifier), along with the tables used in the SQL query. We basically query this table via [DBAPIExtractor](../databuilder/extractor/db_api_extractor.py)
### [DashboardUsage](../databuilder/models/dashboard/dashboard_usage.py)
#### Description
A model that encapsulates Dashboard usage between a Dashboard and a User.
#### Extraction
You can use [ModeDashboardUsageExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_usage_extractor.py). However, Mode currently only provides an accumulated view count, whereas we need recent view counts (past 30, 60, or 90 days). To get recent view counts at Lyft, we use [ModeDashboardUsageExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_usage_extractor.py) to extract the accumulated view count and [GenericLoader](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/loader/generic_loader.py) to load each record (there is no publisher here; a publisher is not mandatory in DefaultJob) as an event, materialized as a daily snapshot. Once daily accumulated view counts are captured, recent view counts are ingested by querying the datastore. At Lyft, we query via [DBAPIExtractor](../databuilder/extractor/db_api_extractor.py) through Presto.
### [DashboardLastModifiedTimestamp](../databuilder/models/dashboard/dashboard_last_modified.py)
#### Description
A model that encapsulates a Dashboard's last modified timestamp in epoch.
#### Extraction
[ModeDashboardLastModifiedTimestampExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_last_modified_timestamp_extractor.py)
### [DashboardExecution](../databuilder/models/dashboard/dashboard_execution.py)
A model that encapsulates a Dashboard's execution timestamp in epoch and its execution state. Note that this model supports both last_execution and last_successful_execution by using a [different identifier](../databuilder/models/dashboard/dashboard_execution.py#L23) in the URI.
#### Extraction
[ModeDashboardExecutionsExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_executions_extractor.py), which extracts last_execution.
[ModeDashboardLastSuccessfulExecutionExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_last_successful_executions_extractor.py), which extracts last_successful_execution.
### [DashboardQuery](../databuilder/models/dashboard/dashboard_query.py)
#### Description
A model that encapsulates a Dashboard's query information.
Supporting extractor: [ModeDashboardQueriesExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_queries_extractor.py)
### [DashboardChart](../databuilder/models/dashboard/dashboard_chart.py)
#### Description
A model that encapsulates a Dashboard's charts, where each chart is associated with a query.
#### Extraction
[ModeDashboardChartsExtractor](../databuilder/extractor/dashboard/mode_analytics/mode_dashboard_charts_extractor.py)