Story :- Persist the list of B2C users, along with their card details, who have opted for "VISA" cards to purchase products.
Algorithm:
a.) Read user information from the MongoDB (DB#1) collections named "users_info" and "users_cards_info".
b.) Join the information from the two collections ("users_info" and "users_cards_info") and publish the combined records to a Kafka topic.
c.) Spark Structured Streaming, acting as a consumer, consumes the records from the Kafka topic.
d.) Apply a transformation at the Spark layer to keep only the users holding VISA-type cards.
e.) Via the Mongo Spark helper, persist the filtered user data into a separate collection in a different MongoDB instance (DB#2).
Requirement :-
Retrieve the list of B2C users with card information from the MongoDB Database #1 collections named "users_info" and "users_cards_info". Then persist the output (users holding VISA-type cards) into the MongoDB Database #2 collection named "users_cards_info_list".
The provided algorithm outlines the steps for persisting user information with VISA card details from MongoDB Database #1 to MongoDB Database #2 using Kafka and Spark Structured Streaming. Here's a breakdown of each step:
a.) Read user information from the MongoDB (DB#1) collections named "users_info" and "users_cards_info":
In this step, you retrieve user information from two collections in MongoDB Database #1: "users_info" and "users_cards_info". The user information likely includes details such as names, addresses, and other relevant attributes, while the card collection holds each user's card details.
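Below is a minimal sketch of this step, assuming a Scala Spark application and the MongoDB Spark Connector 10.x; the connection URI and database name are placeholders for your own setup:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("VisaUsersPipeline")
  .getOrCreate()

// Read the two DB#1 collections through the MongoDB Spark Connector.
// Replace the URI and database name with the values for your deployment.
val usersDf = spark.read
  .format("mongodb")
  .option("connection.uri", "mongodb://localhost:27017")
  .option("database", "DB1")
  .option("collection", "users_info")
  .load()

val cardsDf = spark.read
  .format("mongodb")
  .option("connection.uri", "mongodb://localhost:27017")
  .option("database", "DB1")
  .option("collection", "users_cards_info")
  .load()
```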
b.) Join the two collections' information and publish it into a Kafka topic:
After retrieving the user information from both collections, you perform a join operation to combine the relevant data. The purpose of this join is to associate each user with their corresponding card details. Once the join is complete, you publish the combined information into a Kafka topic, which allows downstream systems or processes to consume the data.
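Continuing the sketch above, the two DataFrames can be joined on a shared key (user_id here is an assumed field name) and the combined rows serialized as JSON into the Kafka record value; the broker address and topic name (users_cards_topic) are placeholders:

```scala
// Join users with their card details; "user_id" is an assumed join key --
// use whichever field actually links the two collections in your schema.
val joinedDf = usersDf.join(cardsDf, Seq("user_id"), "inner")

// Kafka records need a "value" column, so serialize each joined row as JSON.
joinedDf
  .selectExpr("to_json(struct(*)) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "users_cards_topic")
  .save()
```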
c.) Spark Structured Streaming consumes the information from the Kafka topic:
Spark Structured Streaming, Spark's stream processing engine, acts as the consumer in this step. It reads the data that was published to the Kafka topic in the previous step and processes it incrementally as it arrives, enabling near-real-time analysis and transformations.
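A sketch of the consuming side, reusing the SparkSession from the earlier sketch and the same placeholder broker and topic; the JSON payload is parsed back into columns with an explicit schema, whose fields are illustrative only:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Schema of the JSON payload published in the previous step; adjust the
// field names and types to match your actual joined documents.
val payloadSchema = new StructType()
  .add("user_id", StringType)
  .add("name", StringType)
  .add("address", StringType)
  .add("card_type", StringType)
  .add("card_number", StringType)

// Subscribe to the topic and turn the Kafka "value" bytes back into columns.
val kafkaStreamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "users_cards_topic")
  .option("startingOffsets", "earliest")
  .load()
  .select(from_json(col("value").cast("string"), payloadSchema).alias("data"))
  .select("data.*")
```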
d.) Apply transformations at the Spark layer to keep only the user data with VISA cards:
In this step, you apply transformations to the data received from the Kafka topic using Spark. The purpose is to keep only the users who hold VISA cards. The filtering condition is likely based on a card-type field or attribute in the user's card details. Once the filter is applied, the resulting dataset contains only the users who meet the VISA-card criterion.
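The filter itself is a single transformation on the streaming DataFrame; card_type is an assumed attribute name for the card type:

```scala
import org.apache.spark.sql.functions.col

// Keep only the records whose card type is VISA; substitute the attribute
// name your card documents actually use.
val visaUsersDf = kafkaStreamDf.filter(col("card_type") === "VISA")
```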
e.) Persist the filtered user data into a separate collection in MongoDB Database #2:
Finally, using the Mongo Spark Helper library, you persist the filtered user data into a separate collection in MongoDB Database #2. This separate collection will contain the user information of individuals who have opted for VISA cards for their purchases. The Mongo Spark Helper library provides a convenient way to interact with MongoDB from Spark and enables seamless data transfer between the two systems.
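One way to sketch this final step with the MongoDB Spark Connector 10.x (whose DataFrame writer plays the role of the Mongo Spark helper here) is to persist each micro-batch of the filtered stream via foreachBatch; the connection URI and checkpoint path are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Write one micro-batch of filtered users into the DB#2 target collection.
val writeVisaBatch: (DataFrame, Long) => Unit = (batchDf, _) =>
  batchDf.write
    .format("mongodb")
    .mode(SaveMode.Append)
    .option("connection.uri", "mongodb://localhost:27017")
    .option("database", "DB2")
    .option("collection", "users_cards_info_list")
    .save()

// Start the streaming query; the checkpoint tracks the consumed Kafka offsets.
val query = visaUsersDf.writeStream
  .option("checkpointLocation", "/path/to/checkpoint")
  .foreachBatch(writeVisaBatch)
  .start()

query.awaitTermination()
```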
By following these steps, you can achieve the goal of persisting user information with VISA card details from MongoDB Database #1 to MongoDB Database #2, leveraging Kafka for data transport and Spark Structured Streaming for stream processing and filtering.
Adjust the MongoDB configurations, Kafka setup, and data schema to match your environment, and make sure to replace the placeholder values (localhost:27017, DB1, DB2, /path/to/checkpoint, etc.) with the appropriate values for your setup.
The sketches above also assume that you have the necessary dependencies and libraries set up correctly, including the MongoDB Spark Connector and the Kafka integration for Spark. You may need to add the required dependencies to your project configuration.
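For example, a hypothetical build.sbt fragment for a Scala build; the artifact versions are illustrative and must be aligned with your Spark and Scala versions:

```scala
// Example dependencies only; align the versions with your cluster.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"             % "3.4.1" % "provided",
  "org.apache.spark"  %% "spark-sql-kafka-0-10"  % "3.4.1",
  "org.mongodb.spark" %% "mongo-spark-connector" % "10.2.1"
)
```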