Story :- As an e-commerce website analyst, I want to verify the list of products that have been shipped.
Algorithm:
a.) Read data from multiple files in different formats, or from a single file (.csv, .txt, or .json), and publish the data to a Kafka topic (see the producer sketch after this list).
b.) A Spark Structured Streaming consumer connects to this Kafka topic at the configured trigger interval (10 seconds) and feeds the data into the Spark job.
c.) The Spark job then transforms the data and persists the output (sink) to another file (.csv), as in the consumer sketch below.
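
A minimal producer sketch for step a.), assuming a local broker at localhost:9092; the topic name "product-events" and the file path "data/products.csv" are illustrative, not fixed by this spec:

```scala
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object ProductFileProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val source   = Source.fromFile("data/products.csv") // assumed source file
    try {
      // Skip the header line and publish each product record as one message value.
      source.getLines().drop(1).foreach { line =>
        producer.send(new ProducerRecord[String, String]("product-events", line))
      }
    } finally {
      source.close()
      producer.close()
    }
  }
}
```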
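And a consumer sketch for steps b.) and c.), assuming the same broker/topic names as above and a comma-separated message value of productId,productName,productPrice,deliveryStatus,timestamp; the output paths are placeholders. The "Shipped" filter from the requirement below can be applied to `products` before `writeStream` (see the filter sketch at the end):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

object ProductStreamConsumer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ProductStatusReport")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Expected shape of each CSV record published by the producer.
    val schema = StructType(Seq(
      StructField("productId", StringType),
      StructField("productName", StringType),
      StructField("productPrice", DoubleType),
      StructField("deliveryStatus", StringType),
      StructField("timestamp", TimestampType)
    ))

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "product-events")
      .load()

    // Kafka delivers the payload as bytes; cast to string and parse the CSV columns.
    val products = raw
      .selectExpr("CAST(value AS STRING) AS csv")
      .select(from_csv($"csv", schema, Map.empty[String, String]).as("p"))
      .select("p.*")

    val query = products.writeStream
      .format("csv")
      .option("path", "output/products")             // assumed sink directory
      .option("checkpointLocation", "output/_chk")   // required by file sinks
      .trigger(Trigger.ProcessingTime("10 seconds")) // the configured interval
      .start()

    query.awaitTermination()
  }
}
```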
Requirement :- Product Status Report - Real-Time Processing
1.) The source file should contain product information together with a delivery status.
2.) The Spark job should then filter the product data by the status "Shipped" within a time range given in hours [e.g. from 10 am to 12 pm, keep only the products with the status "Shipped"].
3.) Once the Spark job has filtered the product data within the time range, persist the output,
[productId : P1001, productName : Mobile, productPrice : 1000.00, deliveryStatus : Shipped, timestamp], to another file (.csv), so that this file contains only
the products with the status "Shipped" within the time range, based on the timestamp (see the filter sketch after this list).
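
A sketch of the filtering step, written as a helper that could be applied to the `products` DataFrame in the consumer above before `writeStream`; the object and method names are hypothetical, and the hour bounds (10 to 12 for the 10 am to 12 pm example) are taken from the requirement:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object ShippedFilter {
  // Keeps only rows whose deliveryStatus is "Shipped" and whose timestamp's
  // hour falls in [fromHour, toHour) - e.g. 10 and 12 for 10 am to 12 pm.
  def shippedWithin(products: DataFrame, fromHour: Int, toHour: Int): DataFrame =
    products
      .filter(col("deliveryStatus") === "Shipped")
      .filter(hour(col("timestamp")) >= fromHour && hour(col("timestamp")) < toHour)
}
```

Usage, matching the example range: `ShippedFilter.shippedWithin(products, 10, 12)`; the resulting stream is then written to the .csv sink exactly as in the consumer sketch.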