Unverified Commit 0c77114a authored by Tamika Tannis's avatar Tamika Tannis Committed by GitHub

Configure submodules to child repos (#4)

* Add submodules

* Update doc with recent changes

* Add a PR template
parent b9e3a6e3
### Summary of Changes
_Include a summary of changes then remove this line_
### Documentation
_What documentation did you add or modify and why? Add any relevant links then remove this line_
### Checklist
Make sure you have checked **all** steps below to ensure a timely review.
- [ ] PR title addresses the issue accurately and concisely.
- [ ] PR includes a summary of changes.
- [ ] I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
[submodule "amundsendatabuilder"]
path = amundsendatabuilder
url = https://github.com/lyft/amundsendatabuilder
[submodule "amundsenfrontendlibrary"]
path = amundsenfrontendlibrary
url = https://github.com/lyft/amundsenfrontendlibrary
[submodule "amundsenmetadatalibrary"]
path = amundsenmetadatalibrary
url = https://github.com/lyft/amundsenmetadatalibrary
[submodule "amundsensearchlibrary"]
path = amundsensearchlibrary
url = https://github.com/lyft/amundsensearchlibrary
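For reference, entries like these are produced by `git submodule add`; the sketch below builds a throwaway parent/child pair to show the resulting `.gitmodules` (repo names are placeholders, not the real Amundsen repos):

```bash
set -e
# Create a child repo to act as the submodule
tmp=$(mktemp -d)
cd "$tmp"
git init -q child
cd child
git config user.email "you@example.com"
git config user.name "You"
git commit -q --allow-empty -m "init"
cd ..
# Create the parent repo and register the child as a submodule
git init -q parent
cd parent
git config user.email "you@example.com"
git config user.name "You"
# Newer git versions require explicitly allowing file-protocol submodules
git -c protocol.file.allow=always submodule add "$tmp/child" child
cat .gitmodules
```

In a fresh clone of the umbrella repo itself, `git submodule update --init --recursive` checks out the commits pinned by the `Subproject commit` lines further down.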
...@@ -77,6 +77,8 @@ Please visit [Roadmap](docs/roadmap.md) if you are interested in Amundsen upcomi
- [Amundsen: A Data Discovery Platform from Lyft](https://www.slideshare.net/taofung/data-council-sf-amundsen-presentation) (Data Council 19 SF)
- [Software Engineering Daily podcast on Amundsen](https://softwareengineeringdaily.com/2019/04/16/lyft-data-discovery-with-tao-feng-and-mark-grover/) (April 2019)
- [Disrupting Data Discovery](https://www.slideshare.net/markgrover/disrupting-data-discovery) (Strata London 2019)
- [Disrupting Data Discovery (video)](https://www.youtube.com/watch?v=m1B-ptm0Rrw) (Strata SF 2019)
- [ING Data Analytics Platform (Amundsen is mentioned)](https://static.sched.com/hosted_files/kccnceu19/65/ING%20Data%20Analytics%20Platform.pdf) (Kubecon Barcelona 2019)
# License
[Apache 2.0 License.](/LICENSE)
Subproject commit f73c8128671b37020503558e6cd00ac02fd26306
Subproject commit 525d4323854f8f74f4c5198cc4efdf0283ebb13b
Subproject commit 2b33102d3f9511537656f60f987e3e79caef0c72
Subproject commit 46513a881e7b49f862b2a8b67131135d9026aed2
...@@ -4,8 +4,7 @@ services:
    image: neo4j:3.3.0
    container_name: neo4j_amundsen
    environment:
      - NEO4J_AUTH=neo4j/test
    ulimits:
      nofile:
        soft: 40000
...@@ -46,7 +45,6 @@ services:
      - amundsennet
    environment:
      - PROXY_HOST=bolt://neo4j_amundsen
  amundsenfrontend:
    image: amundsendev/amundsen-frontend:1.0.5
    container_name: amundsenfrontend
...
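The credential change above collapses the two proxy variables into Neo4j's single `NEO4J_AUTH` variable, which takes `username/password` in one string. A minimal sketch of the resulting service entry, assuming the image and container name from the snippet above:

```yaml
services:
  neo4j:
    image: neo4j:3.3.0
    container_name: neo4j_amundsen
    environment:
      # user and password in a single variable, separated by '/'
      - NEO4J_AUTH=neo4j/test
```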
# OIDC (Keycloak) Authentication
Setting up end-to-end authentication using OIDC is fairly simple and can be done using a Flask wrapper, i.e., [flaskoidc](https://github.com/verdan/flaskoidc).
`flaskoidc` leverages Flask's `before_request` functionality to authenticate each request before passing it to
the views. It also accepts headers on each request, if available, in order to validate the bearer token of incoming requests.
## Installation
Please refer to the [flaskoidc documentation](https://github.com/verdan/flaskoidc/blob/master/README.md)
for installation and configuration.
Note: You need to install and configure `flaskoidc` for each microservice of Amundsen,
i.e., for frontendlibrary, metadatalibrary, and searchlibrary, in order to secure each of them.
## Amundsen Configuration
...@@ -19,7 +19,7 @@ Once you have `flaskoidc` installed and configured for each microservice, please
APP_WRAPPER: flaskoidc
APP_WRAPPER_CLASS: FlaskOIDC
```
- amundsenmetadatalibrary:
```bash
FLASK_APP_MODULE_NAME: flaskoidc
...@@ -31,16 +31,16 @@ Once you have `flaskoidc` installed and configured for each microservice, please
FLASK_APP_MODULE_NAME: flaskoidc
FLASK_APP_CLASS_NAME: FlaskOIDC
```
By default `flaskoidc` whitelists the healthcheck URLs so that they are not authenticated. In the case of metadatalibrary and searchlibrary
we may want to whitelist the healthcheck APIs explicitly using the following environment variable:
```bash
FLASK_OIDC_WHITELISTED_ENDPOINTS: 'api.healthcheck'
```
## Setting Up Request Headers
To communicate securely between the microservices, you need to pass the bearer token from the frontend in each request
to metadatalibrary and searchlibrary. This should be done using the `REQUEST_HEADERS_METHOD` config variable in frontendlibrary.
- Define a function to add the bearer token in each request in your config.py:
...@@ -58,15 +58,42 @@ def get_access_headers(app):
        return {'Authorization': 'Bearer {}'.format(access_token)}
    except Exception:
        return None
```
- Set the method as the request header method in your config.py:
```python
REQUEST_HEADERS_METHOD = get_access_headers
```
This function will be called with the current `app` instance to add the headers to each request when calling any endpoint of
metadatalibrary and searchlibrary, as done [here](https://github.com/lyft/amundsenfrontendlibrary/blob/master/amundsen_application/api/utils/request_utils.py).
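The headers function above can be exercised in isolation. In this sketch, `FakeApp` and the token value are made-up stand-ins (in a real deployment `flaskoidc` supplies the OIDC access token); the try/except mirrors the snippet in the diff, returning `None` when no token is available:

```python
def get_access_headers(app):
    """Return auth headers for downstream requests, or None on failure."""
    try:
        access_token = app.access_token  # stand-in for a real OIDC token lookup
        return {'Authorization': 'Bearer {}'.format(access_token)}
    except Exception:
        return None

class FakeApp:
    """Hypothetical app object; only exists for this illustration."""
    access_token = 'abc123'

print(get_access_headers(FakeApp()))   # a headers dict with the bearer token
print(get_access_headers(object()))    # None: no token attribute available
```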
## Setting Up Auth User Method
In order to get the current authenticated user (which is being used in Amundsen for many operations), we need to set
`AUTH_USER_METHOD` config variable in frontendlibrary.
This function should return the email address, user id, and any other required information.
- Define a function to fetch the user information in your config.py:
```python
def get_auth_user(app):
"""
    Retrieves the user information from the oidc token, and then builds
    a 'UserInfo' class from the token information dictionary.
    We need a class rather than a plain dictionary in order to use the
    information in the rest of the Amundsen application.
:param app: The instance of the current app.
:return: A class UserInfo
"""
from flask import g
user_info = type('UserInfo', (object,), g.oidc_id_token)
# noinspection PyUnresolvedReferences
user_info.user_id = user_info.preferred_username
return user_info
```
- Set the method as the auth user method in your config.py:
```python
AUTH_USER_METHOD = get_auth_user
```
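The three-argument `type(...)` call in `get_auth_user` builds a class whose attributes come from the token dictionary's keys; a stand-alone illustration with made-up token contents:

```python
# Hypothetical token payload; a real OIDC id token carries the user's claims
token = {'email': 'jane@example.com', 'preferred_username': 'jane'}

# type(name, bases, namespace) creates a class: each dict key becomes
# a class attribute, which is what the rest of Amundsen expects
UserInfo = type('UserInfo', (object,), token)
UserInfo.user_id = UserInfo.preferred_username

print(UserInfo.user_id)
print(UserInfo.email)
```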
Once done, you'll have end-to-end authentication in Amundsen without any proxy or code changes.
# Installation
## Bootstrap a default version of Amundsen using Docker
The following instructions are for setting up a version of Amundsen using Docker.
1. Install `docker` and `docker-compose`.
2. Clone [amundsenfrontendlibrary](https://github.com/lyft/amundsenfrontendlibrary) or download the [docker-amundsen.yml](https://github.com/lyft/amundsenfrontendlibrary/blob/master/docker-amundsen.yml) file directly.
3. Enter the directory where the `docker-amundsen.yml` file is and then run:
```bash
$ docker-compose -f docker-amundsen.yml up
```
4. Ingest dummy data into Neo4j by doing the following:
   * Clone [amundsendatabuilder](https://github.com/lyft/amundsendatabuilder).
   * Run the following commands in the `amundsendatabuilder` directory:
```bash
$ python3 -m venv venv
$ source venv/bin/activate
$ pip3 install -r requirements.txt
$ python3 setup.py install
$ python3 example/scripts/sample_data_loader.py
```
5. View the UI at [`http://localhost:5000`](http://localhost:5000) and search for `test`; it should return some results.
### Verify setup
1. You can verify dummy data has been ingested into Neo4j by visiting [`http://localhost:7474/browser/`](http://localhost:7474/browser/) and running `MATCH (n:Table) RETURN n LIMIT 25` in the query box. You should see two tables:
1. `hive.test_schema.test_table1`
2. `dynamo.test_schema.test_table2`
2. You can verify the data has been loaded into the metadataservice by visiting:
1. [`http://localhost:5000/table_detail/gold/hive/test_schema/test_table1`](http://localhost:5000/table_detail/gold/hive/test_schema/test_table1)
2. [`http://localhost:5000/table_detail/gold/dynamo/test_schema/test_table2`](http://localhost:5000/table_detail/gold/dynamo/test_schema/test_table2)
### Troubleshooting
1. If the Docker host's virtual memory limit is too low for Elasticsearch, `es_amundsen` will fail during `docker-compose up`.
   1. docker-compose error: `es_amundsen | [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]`
   2. Increase `vm.max_map_count` ([detailed instructions here](https://www.elastic.co/guide/en/elasticsearch/reference/7.1/docker.html#docker-cli-run-prod-mode)):
      1. Edit `/etc/sysctl.conf`.
      2. Add the entry `vm.max_map_count=262144`. Save and exit.
      3. Reload settings: `$ sysctl -p`
      4. Restart `docker-compose`.
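On Linux, the same limit can also be raised for the current boot without editing `/etc/sysctl.conf`; this is the one-shot form from the Elasticsearch Docker guide (requires root, does not persist across reboots):

```bash
sudo sysctl -w vm.max_map_count=262144
```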