Amundsen is a metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a PageRank-style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as a Google search for data. The project is named after Norwegian explorer [Roald Amundsen](https://en.wikipedia.org/wiki/Roald_Amundsen), the first person to reach the South Pole.
It includes three microservices and a data ingestion library.
The frontend service leverages a separate [search service](https://github.com/lyft/amundsensearchlibrary) to allow users to search for data resources, and a separate [metadata service](https://github.com/lyft/amundsenmetadatalibrary) to view and edit metadata for a given resource. It is a Flask application with a React frontend.
- [amundsenfrontendlibrary](https://github.com/lyft/amundsenfrontendlibrary): Frontend service which is a Flask application with a React frontend.
- [amundsensearchlibrary](https://github.com/lyft/amundsensearchlibrary): Search service, which leverages Elasticsearch for search capabilities, is used to power frontend metadata searching.
- [amundsenmetadatalibrary](https://github.com/lyft/amundsenmetadatalibrary): Metadata service, which leverages Neo4j or Apache Atlas as the persistent layer to provide various metadata.
- [amundsendatabuilder](https://github.com/lyft/amundsendatabuilder): Data ingestion library for building the metadata graph and search index.
Users can load the data either with [a Python script](https://github.com/lyft/amundsendatabuilder/blob/master/example/scripts/sample_data_loader.py) that uses the library
or with an [Airflow DAG](https://github.com/lyft/amundsendatabuilder/blob/master/example/dags/sample_dag.py) that imports the library.
For information about Amundsen and our other services, visit the [main repository](https://github.com/lyft/amundsen). Please also see our instructions for a [quick start](https://github.com/lyft/amundsen/blob/master/docs/installation.md#bootstrap-a-default-version-of-amundsen-using-docker) setup of Amundsen with dummy data, and an [overview of the architecture](https://github.com/lyft/amundsen/blob/master/docs/architecture.md).
## Requirements
- Python >= 3.5


## Get Involved in the Community
Want help or want to help?
Use the button in our [header](https://github.com/lyft/amundsenfrontendlibrary#amundsen) to join our Slack channel. Please join our [mailing list](https://groups.google.com/forum/#!forum/amundsen-dev) as well.
## Powered By
Here is the list of organizations that are using Amundsen today. If your organization uses Amundsen, please file a PR and update this list.
Please visit the Amundsen documentation for help with [installing Amundsen](https://github.com/lyft/amundsenfrontendlibrary/blob/master/docs/installation.md#install-standalone-application-directly-from-the-source)
and getting a [quick start](https://github.com/lyft/amundsenfrontendlibrary/blob/master/docs/installation.md#bootstrap-a-default-version-of-amundsen-using-docker) with dummy data
or an [overview of the architecture](docs/architecture.md).
## Architecture Overview
Please visit [Architecture](docs/architecture.md) for Amundsen architecture overview.
## Installation
Please visit the [Installation guideline](docs/installation.md) for how to install Amundsen.
Please visit [Developer guidelines](docs/developer_guide.md) if you want to build Amundsen in your local environment.
## Roadmap
Please visit [Roadmap](docs/roadmap.md) if you are interested in Amundsen upcoming roadmap items.
## Publications
- [Disrupting Data Discovery](https://www.slideshare.net/taofung/strata-sf-amundsen-presentation) (Strata SF 2019)
- [Amundsen - Lyft's data discovery & metadata engine](https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9) (Lyft Engineering blog)
- [Amundsen: A Data Discovery Platform from Lyft](https://www.slideshare.net/taofung/data-council-sf-amundsen-presentation) (Data Council SF 2019)
- [Software Engineering Daily podcast on Amundsen](https://softwareengineeringdaily.com/2019/04/16/lyft-data-discovery-with-tao-feng-and-mark-grover/) (April 2019)
- [Disrupting Data Discovery](https://www.slideshare.net/markgrover/disrupting-data-discovery) (Strata London 2019)
- [Disrupting Data Discovery (video)](https://www.youtube.com/watch?v=m1B-ptm0Rrw) (Strata SF 2019)
- ING Data Analytics Platform (Amundsen is mentioned) { [slides](https://static.sched.com/hosted_files/kccnceu19/65/ING%20Data%20Analytics%20Platform.pdf), [video](https://www.youtube.com/watch?v=8cE9ppbnDPs&t=465) } (Kubecon Barcelona 2019)
The following diagram shows the overall architecture for Amundsen.

The frontend service serves as the web UI portal for user interaction.
It is a Flask-based web app whose presentation layer is built with React with Redux, Bootstrap, Webpack, and Babel.
The search service leverages Elasticsearch's search functionality and
provides a RESTful API to serve search requests from the frontend service.
Currently only [table resources](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/models/elasticsearch_document.py) are indexed and searchable.
The search index is built with the [elasticsearch publisher](https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/publisher/elasticsearch_publisher.py).
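As a rough sketch of how a client might call the search service, the snippet below builds a search request URL and decodes the JSON response. The endpoint path (`search`) and parameter names (`query_term`, `page_index`) are illustrative assumptions, not a documented contract; check the search service code for the actual API.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

def build_search_url(base_url, query_term, page_index=0):
    # The "search" path and parameter names here are assumptions
    # for illustration only.
    query = urlencode({"query_term": query_term, "page_index": page_index})
    return urljoin(base_url, "search") + "?" + query

def search(base_url, query_term):
    # Requires a running search service at base_url.
    with urlopen(build_search_url(base_url, query_term)) as resp:
        return json.load(resp)

print(build_search_url("http://localhost:5001/", "test"))
```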
The metadata service currently uses a Neo4j proxy to interact with the Neo4j graph database and serves metadata to the frontend service.
The metadata is represented as a graph model:

The above diagram shows how metadata is modeled in Amundsen.
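As a rough illustration of the graph model, a table, its columns, and its description might be queried as below. The node labels and relationship names (`COLUMN`, `DESCRIPTION`) are assumptions based on the databuilder models, not a documented schema.

```cypher
// Illustrative only: check the databuilder models for the
// authoritative node labels and relationship names.
MATCH (t:Table {name: 'test_table1'})-[:COLUMN]->(c:Column)
OPTIONAL MATCH (t)-[:DESCRIPTION]->(d:Description)
RETURN t.name, collect(c.name), d.description
```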
Amundsen provides a [data ingestion library](https://github.com/lyft/amundsendatabuilder) for building the metadata. At Lyft, we build the metadata once a day
using an Airflow DAG ([example](https://github.com/lyft/amundsendatabuilder/blob/master/example/dags/sample_dag.py)).
See [this doc](https://github.com/lyft/amundsen/blob/master/docs/authentication/oidc.md) in our main repository for information on how to set up end-to-end authentication using OIDC.
Setting up end-to-end authentication using OIDC is fairly simple and can be done using a Flask wrapper, i.e., [flaskoidc](https://github.com/verdan/flaskoidc).
`flaskoidc` leverages Flask's `before_request` functionality to authenticate each request before passing it to
the views. It also accepts headers on each request, if available, in order to validate the bearer token from incoming requests.
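The `before_request` idea can be sketched independently of Flask roughly as follows. All names below are illustrative, not flaskoidc's actual API; in flaskoidc, a real implementation validates the token against the OIDC provider rather than merely checking for its presence.

```python
# Minimal sketch of the before_request pattern: every incoming request
# is checked for credentials before any view runs. Names are
# illustrative; see flaskoidc for the real implementation.

WHITELISTED_PATHS = {"/healthcheck"}

def authenticate(path, headers, session):
    """Return True if the request may proceed to a view."""
    if path in WHITELISTED_PATHS:
        return True  # healthchecks skip authentication
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        # A real implementation would validate the token against the
        # OIDC provider; here we only check that one was supplied.
        return True
    # Fall back to an existing authenticated (cookie-based) session.
    return session.get("user") is not None
```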
## Installation
Please refer to the [flaskoidc documentation](https://github.com/verdan/flaskoidc/blob/master/README.md)
for the installation and the configurations.
Note: You need to install and configure `flaskoidc` for each Amundsen microservice,
i.e., frontendlibrary, metadatalibrary, and searchlibrary, in order to secure each of them.
## Amundsen Configuration
Once you have `flaskoidc` installed and configured for each microservice, please set the following environment variables:
- amundsenfrontendlibrary:
```bash
APP_WRAPPER: flaskoidc
APP_WRAPPER_CLASS: FlaskOIDC
```
- amundsenmetadatalibrary:
```bash
FLASK_APP_MODULE_NAME: flaskoidc
FLASK_APP_CLASS_NAME: FlaskOIDC
```
- amundsensearchlibrary: _(Needs to be implemented)_
```bash
FLASK_APP_MODULE_NAME: flaskoidc
FLASK_APP_CLASS_NAME: FlaskOIDC
```
By default, `flaskoidc` whitelists the healthcheck URLs so that they are not authenticated. For metadatalibrary and searchlibrary
we may want to whitelist the healthcheck APIs explicitly using the following environment variable.
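As an illustration only (the variable name and values below are taken from the flaskoidc README at the time of writing and may change between versions; verify against the version you install):

```bash
FLASK_OIDC_WHITELISTED_ENDPOINTS: status,healthcheck,health
```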
# Deployment of non-production Amundsen on AWS ECS using aws-cli
The following is a set of instructions to run Amundsen on AWS Elastic Container Service (ECS). The current configuration is basic but working; it is a migration of `docker-amundsen.yml` to run on AWS ECS.
## Install ECS CLI
The first step is to install the ECS CLI; please follow the instructions in the AWS [documentation](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_CLI_installation.html).
### Get your access and secret keys from IAM
```bash
# in ~/<your-path-to-cloned-repo>/amundsenfrontendlibrary/docs/instalation-aws-ecs
$ export AWS_ACCESS_KEY_ID=xxxxxxxx
$ export AWS_SECRET_ACCESS_KEY=xxxxxx
$ export AWS_PROFILE=profilename
```
These instructions follow the [tutorial](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-cli-tutorial-ec2.html#ECS_CLI_tutorial_compose_create) in the AWS documentation.
## STEP 1: Create a cluster configuration
```bash
# in ~/<your-path-to-cloned-repo>/amundsenfrontendlibrary/docs/instalation-aws-ecs
$ ecs-cli compose --cluster-config amundsen --file docker-ecs-amundsen.yml up --create-log-groups
```
You can use the ECS CLI to see what tasks are running.
```bash
$ ecs-cli ps
```
### STEP 5: Open the EC2 instance
Edit the security group to allow traffic from your IP. You should then be able to reach the frontend, Elasticsearch, and Neo4j by visiting the URLs:
- http://xxxxxxx:5000/
- http://xxxxxxx:9200/
- http://xxxxxxx:7474/browser/
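To quickly confirm those ports are reachable from your machine, a small stdlib check like the following can help (the host in the commented-out example is a placeholder for your instance's public address):

```python
import socket

def port_open(host, port, timeout=3.0):
    # Return True if a TCP connection to host:port succeeds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace the placeholder with your EC2 instance's public DNS or IP:
# for port in (5000, 9200, 7474):
#     print(port, port_open("xxxxxxx", port))
```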
## TODO
- Configuration sent to services not working properly (amundsen.db vs graph.db)
- Create a persistent volume for graph/metadata storage. [See this](https://aws.amazon.com/blogs/compute/amazon-ecs-and-docker-volume-drivers-amazon-ebs/)
- Refactor the VPC and default security group permissions
```bash
$ python3 amundsen_application/wsgi.py
```
## Bootstrap a default version of Amundsen using Docker
The following instructions are for setting up a version of Amundsen using Docker. At the moment, we only support a bootstrap for connecting the Amundsen application to an example metadata service.
1. Install `docker` and `docker-compose`.
2. Clone [this repo](https://github.com/lyft/amundsenfrontendlibrary) or download the [docker-amundsen.yml](https://github.com/lyft/amundsenfrontendlibrary/blob/master/docker-amundsen.yml) file directly.
3. Enter the directory where the `docker-amundsen.yml` file is and then:
```bash
$ docker-compose -f docker-amundsen.yml up
```
4. Ingest dummy data into Neo4j by doing the following:
* Run the following commands in the `amundsendatabuilder` directory:
```bash
$ python3 -m venv venv
$ source venv/bin/activate
$ pip3 install -r requirements.txt
$ python3 setup.py install
$ python3 example/scripts/sample_data_loader.py
```
5. View the UI at [`http://localhost:5000`](http://localhost:5000) and search for `test`; it should return some results.
### Verify setup
1. You can verify dummy data has been ingested into Neo4j by visiting [`http://localhost:7474/browser/`](http://localhost:7474/browser/) and running `MATCH (n:Table) RETURN n LIMIT 25` in the query box. You should see two tables:
    1. `hive.test_schema.test_table1`
    2. `dynamo.test_schema.test_table2`
2. You can verify the data has been loaded into the metadataservice by visiting:
**Mission**: To organize all information about data and make it universally actionable<br/>
**Vision (2020)**: Centralize a comprehensive and actionable map of all our data resources that can be leveraged to solve a growing number of use cases and workflows
The following roadmap gives an overview of what we are currently working on and what we want to tackle next. We share it so that the community can plan work together. Let us know in the Slack channel if you are interested in taking a stab at leading the development of one of these features (or a feature not listed here!).
## Current focus
**Search & Resource page redesign**<br/>
*What*: Redesign the search experience and the resource page to make them scalable in the number of resource types and the amount of metadata<br/>
*Status*: Designs are ready, engineering work has started<br/>
**Email notifications**<br/>
*What*: We are creating an email notification system to reach Amundsen’s users. The primary goal is to use this system to help solve the lack of ownership for data assets at Lyft. The secondary goal is to engage with users for general purposes.<br/>
*Status*: Designs are ready, engineering work has started
## Next steps
**Index Dashboards**<br/>
*What*: We want to help with the discovery of existing analysis work, such as dashboards. This is going to help avoid reinventing the wheel, create value for less technical users, and give context on how tables are used.<br/>
*Status*: Product + technical specs are ready, designs are ready, implementation has not started<br/>
**Native lineage integration**<br/>
*What*: We want to create a native lineage integration in Amundsen, to better surface how data assets interact with each other<br/>
*Status*: implementation has not started
**Landing page**<br/>
*What*: We are creating a proper landing page to provide more value, with an emphasis on helping users find data when they don’t really know what to search for (exploration)<br/>
*Status*: being spec’d out
**Push ingest API**<br/>
*What*: We want to create a push API so that it is as easy as possible for a new data resource type to be ingested<br/>
*Status*: implementation has started (around 80% complete)
**GET Rest API**<br/>
*What*: enable users to access our data map programmatically through a Rest API<br/>
*Status*: implementation has started
**Index Druid tables and S3 buckets**<br/>
*What*: add these new resource types to our data map and create resource pages for them<br/>
*Status*: implementation has not started
**Granular Access Control**<br/>
*What*: we want to have a more granular control of the access. For example, only certain types of people would be able to see certain types of metadata/functionality<br/>
*Status*: implementation has not started
**Show distinct column values**<br/>
*What*: When a column has a limited set of possible values, we want to make them easily discoverable<br/>
*Status*: implementation has not started
**“Order by” for columns**<br/>
*What*: we want to help users make sense of which columns people use in the tables we index. Within a frequently used table, a column might not be used anymore because it is known to be deprecated<br/>
*Status*: implementation has not started
**Index online datastores**<br/>
*What*: We want to make our DynamoDB and other online datastores discoverable by indexing them. For this purpose, we will probably leverage the fact that we have a centralized IDL (interface definition language)<br/>
*Status*: implementation has not started
**Integration with BI Tools**<br/>
*What*: get the richness of Amundsen’s metadata to where the data is used: in BI tools such as Mode, Superset, and Tableau<br/>
*Status*: implementation has not started
**Index Processes**<br/>
*What*: we want to index ETLs and pipelines from our Machine Learning Engine<br/>
*Status*: implementation has not started
**Versioning system**<br/>
*What*: We want to create a versioning system for our indexed resources, to be able to index different versions of the same resource. This is especially required for machine learning purposes.<br/>
*Status*: implementation has not started
**Index Teams**<br/>
*What*: We want to add team pages to enable users to see the important tables and dashboards a team uses<br/>
*Status*: implementation has not started
**Index Services**<br/>
*What*: With our microservices architecture, we want to index services and show how these services interact with data artifacts<br/>
*Status*: implementation has not started
**Index Pub/Sub systems**<br/>
*What*: We want to make our pub/sub systems discoverable<br/>