Unverified commit 0c77114a authored by Tamika Tannis, committed by GitHub

Configure submodules to child repos (#4)

* Add submodules

* Update doc with recent changes

* Add a PR template
parent b9e3a6e3
### Summary of Changes
_Include a summary of changes then remove this line_
### Documentation
_What documentation did you add or modify and why? Add any relevant links then remove this line_
### Checklist
Make sure you have checked **all** steps below to ensure a timely review.
- [ ] PR title addresses the issue accurately and concisely.
- [ ] PR includes a summary of changes.
- [ ] I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
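For example, one common way to squash the last few commits before opening the PR is an interactive rebase; the commit count and branch name below are placeholders:

```bash
# combine the last 3 commits into one; keep "pick" on the first line
# in the editor and change the rest to "squash" (or "s")
git rebase -i HEAD~3

# update the already-pushed branch safely
git push --force-with-lease origin my-feature-branch
```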
[submodule "amundsendatabuilder"]
path = amundsendatabuilder
url = https://github.com/lyft/amundsendatabuilder
[submodule "amundsenfrontendlibrary"]
path = amundsenfrontendlibrary
url = https://github.com/lyft/amundsenfrontendlibrary
[submodule "amundsenmetadatalibrary"]
path = amundsenmetadatalibrary
url = https://github.com/lyft/amundsenmetadatalibrary
[submodule "amundsensearchlibrary"]
path = amundsensearchlibrary
url = https://github.com/lyft/amundsensearchlibrary
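With these entries in place, the child repositories can be fetched with standard git submodule commands (nothing below is specific to Amundsen):

```bash
# clone the parent repo and all pinned submodules in one go
git clone --recursive https://github.com/lyft/amundsen

# or initialize submodules inside an existing checkout
git submodule update --init

# show each submodule and the commit it is pinned to
git submodule status
```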
@@ -77,6 +77,8 @@ Please visit [Roadmap](docs/roadmap.md) if you are interested in Amundsen's upcoming roadmap
- [Amundsen: A Data Discovery Platform from Lyft](https://www.slideshare.net/taofung/data-council-sf-amundsen-presentation) (Data council 19 SF)
- [Software Engineering Daily podcast on Amundsen](https://softwareengineeringdaily.com/2019/04/16/lyft-data-discovery-with-tao-feng-and-mark-grover/) (April 2019)
- [Disrupting Data Discovery](https://www.slideshare.net/markgrover/disrupting-data-discovery) (Strata London 2019)
- [Disrupting Data Discovery (video)](https://www.youtube.com/watch?v=m1B-ptm0Rrw) (Strata SF 2019)
- [ING Data Analytics Platform (Amundsen is mentioned)](https://static.sched.com/hosted_files/kccnceu19/65/ING%20Data%20Analytics%20Platform.pdf) (Kubecon Barcelona 2019)
# License
[Apache 2.0 License.](/LICENSE)
Subproject commit f73c8128671b37020503558e6cd00ac02fd26306
Subproject commit 525d4323854f8f74f4c5198cc4efdf0283ebb13b
Subproject commit 2b33102d3f9511537656f60f987e3e79caef0c72
Subproject commit 46513a881e7b49f862b2a8b67131135d9026aed2
@@ -4,8 +4,7 @@ services:
image: neo4j:3.3.0
container_name: neo4j_amundsen
environment:
- NEO4J_AUTH=neo4j/test
ulimits:
nofile:
soft: 40000
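Once the stack is up, a quick way to confirm the `NEO4J_AUTH` credentials took effect is an authenticated request against Neo4j's HTTP port (the host below assumes the docker-machine IP used elsewhere in this doc):

```bash
# 200 with valid credentials, 401 otherwise (Neo4j 3.x REST root)
curl -i -u neo4j:test http://YOUR-DOCKER-HOST-IP:7474/db/data/
```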
@@ -46,7 +45,6 @@ services:
- amundsennet
environment:
- PROXY_HOST=bolt://neo4j_amundsen
amundsenfrontend:
image: amundsendev/amundsen-frontend:1.0.5
container_name: amundsenfrontend
# OIDC (Keycloak) Authentication
Setting up end-to-end authentication using OIDC is fairly simple and can be done using a Flask wrapper, i.e., [flaskoidc](https://github.com/verdan/flaskoidc).
`flaskoidc` leverages Flask's `before_request` functionality to authenticate each request before passing it to
the views. It also accepts headers on each request, if available, in order to validate the bearer token from incoming requests.
## Installation
Please refer to the [flaskoidc documentation](https://github.com/verdan/flaskoidc/blob/master/README.md)
for installation and configuration.
Note: You need to install and configure `flaskoidc` for each microservice of Amundsen,
i.e., for frontendlibrary, metadatalibrary and searchlibrary, in order to secure each of them.
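For reference, `flaskoidc` is published on PyPI, so installing it into each service's environment is typically a one-liner (pin a version as appropriate for your deployment):

```bash
pip3 install flaskoidc
```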
## Amundsen Configuration
@@ -19,7 +19,7 @@ Once you have `flaskoidc` installed and configured for each microservice, please set the following environment variables:
- amundsenfrontendlibrary:
```bash
APP_WRAPPER: flaskoidc
APP_WRAPPER_CLASS: FlaskOIDC
```
- amundsenmetadatalibrary:
```bash
FLASK_APP_MODULE_NAME: flaskoidc
FLASK_APP_CLASS_NAME: FlaskOIDC
```
- amundsensearchlibrary:
```bash
FLASK_APP_MODULE_NAME: flaskoidc
FLASK_APP_CLASS_NAME: FlaskOIDC
```
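To illustrate how these variables are consumed, a wrapped service can be launched with the wrapper selected entirely through the environment; the entrypoint path below is an assumption for a local metadatalibrary checkout, so adjust it to your setup:

```bash
# hypothetical local run of metadatalibrary with flaskoidc as the app class
export FLASK_APP_MODULE_NAME=flaskoidc
export FLASK_APP_CLASS_NAME=FlaskOIDC
python3 metadata_service/metadata_wsgi.py
```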
By default `flaskoidc` whitelists the healthcheck URLs so that they are not authenticated. In the case of metadatalibrary and searchlibrary,
we may want to whitelist the healthcheck APIs explicitly using the following environment variable.
```bash
FLASK_OIDC_WHITELISTED_ENDPOINTS: 'api.healthcheck'
```
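A quick sanity check that the whitelisted endpoint really skips authentication (the port is an assumption for a locally running metadatalibrary; adjust as needed):

```bash
# should return 200 with no Authorization header attached
curl -i http://localhost:5002/healthcheck
```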
## Setting Up Request Headers
To communicate securely between the microservices, you need to pass the bearer token from the frontend in each request
to metadatalibrary and searchlibrary. This should be done using the `REQUEST_HEADERS_METHOD` config variable in frontendlibrary.
- Define a function to add the bearer token in each request in your config.py:
```python
def get_access_headers(app):
    """
    Retrieves the access token and formats it as an Authorization
    header that can be passed to the metadata and search services.
    :param app: The instance of the current app.
    :return: A dictionary with the access token as a Bearer Authorization header.
    """
    try:
        # assumes the flaskoidc-wrapped app exposes its OIDC client as `app.oidc`
        access_token = app.oidc.get_access_token()
        return {'Authorization': 'Bearer {}'.format(access_token)}
    except Exception:
        return None
```
- Set the method as the request header method in your config.py:
```python
REQUEST_HEADERS_METHOD = get_access_headers
```
This function will be called with the current `app` instance to add the headers to each request when calling any endpoint of
metadatalibrary and searchlibrary, as implemented [here](https://github.com/lyft/amundsenfrontendlibrary/blob/master/amundsen_application/api/utils/request_utils.py).
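To replay manually what the frontend does here, you can attach a bearer token to a direct call against metadatalibrary (the token value, host, and port below are placeholders):

```bash
# call a metadatalibrary endpoint directly with a bearer token
TOKEN="<your-access-token>"
curl -H "Authorization: Bearer ${TOKEN}" http://localhost:5002/popular_tables/
```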
## Setting Up Auth User Method
In order to get the current authenticated user (which is used in Amundsen for many operations), we need to set
the `AUTH_USER_METHOD` config variable in frontendlibrary.
This function should return the email address, user id, and any other required information.
- Define a function to fetch the user information in your config.py:
```python
def get_auth_user(app):
"""
    Retrieves the user information from the OIDC token, and then builds
    a 'UserInfo' class from the token information dictionary.
    We need to convert it to a class in order to use the information
    in the rest of the Amundsen application.
:param app: The instance of the current app.
:return: A class UserInfo
"""
from flask import g
user_info = type('UserInfo', (object,), g.oidc_id_token)
# noinspection PyUnresolvedReferences
user_info.user_id = user_info.preferred_username
return user_info
```
- Set the method as the auth user method in your config.py:
```python
AUTH_USER_METHOD = get_auth_user
```
Once done, you'll have end-to-end authentication in Amundsen without any proxy or code changes.
# Installation
## Bootstrap a default version of Amundsen using Docker
The following instructions are for setting up a version of Amundsen using Docker.
1. Install `docker` and `docker-compose`.
2. Clone [amundsenfrontendlibrary](https://github.com/lyft/amundsenfrontendlibrary) or download the [docker-amundsen.yml](https://github.com/lyft/amundsenfrontendlibrary/blob/master/docker-amundsen.yml) file directly.
3. Enter the directory containing the `docker-amundsen.yml` file and run:
```bash
$ docker-compose -f docker-amundsen.yml up
```
4. Ingest dummy data into Neo4j by doing the following:
* Clone [amundsendatabuilder](https://github.com/lyft/amundsendatabuilder).
* Run the following commands in the `amundsendatabuilder` directory:
```bash
$ python3 -m venv venv
$ source venv/bin/activate
$ pip3 install -r requirements.txt
$ python3 setup.py install
$ python3 example/scripts/sample_data_loader.py
```
5. View the UI at [`http://localhost:5000`](http://localhost:5000) and try searching for `test`; it should return some results.
### Verify setup
1. You can verify dummy data has been ingested into Neo4j by visiting [`http://localhost:7474/browser/`](http://localhost:7474/browser/) and running `MATCH (n:Table) RETURN n LIMIT 25` in the query box. You should see two tables:
1. `hive.test_schema.test_table1`
2. `dynamo.test_schema.test_table2`
2. You can verify the data has been loaded into the metadataservice by visiting:
1. [`http://localhost:5000/table_detail/gold/hive/test_schema/test_table1`](http://localhost:5000/table_detail/gold/hive/test_schema/test_table1)
2. [`http://localhost:5000/table_detail/gold/dynamo/test_schema/test_table2`](http://localhost:5000/table_detail/gold/dynamo/test_schema/test_table2)
### Troubleshooting
1. If the docker host doesn't have enough virtual memory configured for Elasticsearch, `es_amundsen` will fail during `docker-compose`.
    1. docker-compose error: `es_amundsen | [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]`
    2. Increase `vm.max_map_count` ([detailed instructions here](https://www.elastic.co/guide/en/elasticsearch/reference/7.1/docker.html#docker-cli-run-prod-mode)); the commands are shown in the sketch after this list:
        1. Edit `/etc/sysctl.conf`
        2. Add the entry `vm.max_map_count=262144`. Save and exit.
        3. Reload settings: `$ sysctl -p`
        4. Restart `docker-compose`
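For convenience, the equivalent shell commands (run as root on the docker host; standard Linux paths assumed):

```bash
# persist the setting across reboots
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
sysctl -p

# or apply it to the running kernel only
sysctl -w vm.max_map_count=262144
```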