Scaling and Improving Data Pipelines for an E-commerce Platform
The client, an online marketplace, started with online grocery shopping and later expanded into categories such as fashion, home decor, electronics, and lifestyle. Having already launched in more than 200 cities, they built the business on rapid delivery, shipping orders within an hour. With operations in four of Poland's largest cities and ambitions to keep growing, the client wants to make consumers' lives easier and more efficient. To drive growth and cut operational costs, they needed a reliable data pipeline to better understand and optimize their business.
Challenge
One of our renowned e-commerce clients was experiencing major performance and cost issues with its data processing workflows. The client had a PySpark pipeline stack built to handle large volumes of data on Google Cloud Platform (GCP). However, these pipelines were inadequate: they took a long time to execute and produced inconsistent results. Slow performance and long wait times were not only driving up infrastructure costs but also delaying the client's access to timely insights. The pipelines also suffered from several underlying problems, such as unreliable Spark and Dataproc cluster configuration, inconsistent data formats, and poorly optimized code. As data volumes and processing demands grew, the client realized they needed to improve the code and adopt best practices for storing and processing data.
Our Approach
We solved the problems in several ways:
Code Optimization: We began with a thorough analysis of the client's existing PySpark code and found that several operations were executed inefficiently, with excessive data shuffling in particular. We refactored the code to eliminate the unnecessary operations, which reduced shuffling and improved memory management.
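For example, broadcasting a small dimension table into a join removes one of the most common shuffle-heavy patterns, and caching a reused aggregate avoids recomputation. The sketch below is illustrative only; the table names, columns, and bucket paths are hypothetical, not the client's actual code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline-optimization").getOrCreate()

# Hypothetical inputs: a large fact table of orders and a small product dimension.
orders = spark.read.parquet("gs://example-bucket/orders/")
products = spark.read.parquet("gs://example-bucket/products/")

# Broadcasting the small dimension table avoids shuffling the large orders table.
enriched = orders.join(F.broadcast(products), on="product_id", how="left")

# Aggregate once and cache the result instead of recomputing it in later stages.
daily_revenue = (
    enriched
    .groupBy("order_date", "category")
    .agg(F.sum("amount").alias("revenue"))
    .cache()
)
daily_revenue.count()  # materialize the cache before it is reused downstream
```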
Optimizing Formats: The client was using inefficient data formats, which made reads and writes slow. We recommended migrating to scalable columnar formats, which speed up reads through compression and partitioning. This approach not only accelerates processing but also reduces storage costs.
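As a rough sketch of this migration pattern, assuming Parquet with Snappy compression as the target columnar format (the bucket paths and partition column below are placeholders, not the client's actual layout):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("format-migration").getOrCreate()

# Hypothetical source: row-oriented JSON dumps of order data.
raw = spark.read.json("gs://example-bucket/raw/orders/")

# Rewrite as partitioned, compressed Parquet so readers can prune partitions
# and benefit from columnar compression.
(
    raw.repartition("order_date")
    .write
    .mode("overwrite")
    .partitionBy("order_date")
    .option("compression", "snappy")
    .parquet("gs://example-bucket/curated/orders/")
)

# Downstream jobs now read only the partitions they need.
recent = (
    spark.read.parquet("gs://example-bucket/curated/orders/")
    .where(F.col("order_date") >= "2024-01-01")
)
```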
Real-Time Streaming & Processing: The client needed streaming data to be processed in real time. We added Kafka as the message broker for event streams coming from other sources. With Spark Structured Streaming in Databricks, we process that data in real time and deliver near-instant insights.
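A minimal Structured Streaming sketch of this pattern is shown below; the broker address, topic name, and event schema are assumptions for illustration, and the console sink stands in for the real output table.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

# Requires the Spark-Kafka connector package to be available on the cluster.
spark = SparkSession.builder.appName("realtime-orders").getOrCreate()

# Hypothetical schema for order events published to Kafka.
event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("category", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Broker address and topic name are placeholders.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "order-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Continuously updated revenue per category over 5-minute windows.
revenue = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "category")
    .agg(F.sum("amount").alias("revenue"))
)

query = (
    revenue.writeStream
    .outputMode("update")
    .format("console")  # stand-in sink; production would write to a table or topic
    .start()
)
query.awaitTermination()
```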
Leveraging BigQuery and Cloud SQL: We used BigQuery and Cloud SQL to store and analyze data at scale: BigQuery for analytics on structured data, and Cloud SQL for transactional workloads. Together, these tools let us run complex queries in near real time and extract, transform, and load data from multiple sources seamlessly.
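The sketch below shows one way Spark can feed both systems, assuming the spark-bigquery connector and a JDBC driver are available on the cluster; the dataset, table, bucket, and connection details are placeholders rather than the client's configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-loads").getOrCreate()

# Load curated results into BigQuery for analytics. Requires the spark-bigquery
# connector; dataset, table, and bucket names are hypothetical.
daily_revenue = spark.read.parquet("gs://example-bucket/curated/daily_revenue/")
(
    daily_revenue.write
    .format("bigquery")
    .option("table", "analytics.daily_revenue")
    .option("temporaryGcsBucket", "example-staging-bucket")
    .mode("append")
    .save()
)

# Read transactional reference data from Cloud SQL over standard JDBC;
# the connection URL and credentials are hypothetical.
customers = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://10.0.0.5:5432/shop")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "customers")
    .option("user", "etl_user")
    .option("password", "change-me")
    .load()
)
```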
Impactful Storage and Management Solutions: We improved the data storage and management setup by using Cloud Datastore, ensuring the data is well structured and available in a consistent format for further processing.
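As an illustration of this storage pattern, here is a minimal sketch using the Python Cloud Datastore client; the "ProcessedOrder" kind and its properties are hypothetical.

```python
from google.cloud import datastore

client = datastore.Client()

# Store a processed record in a structured, queryable form.
key = client.key("ProcessedOrder", "order-12345")
entity = datastore.Entity(key=key)
entity.update({
    "category": "grocery",
    "amount": 42.50,
    "status": "delivered",
})
client.put(entity)

# Fetch records for further processing.
query = client.query(kind="ProcessedOrder")
query.add_filter("status", "=", "delivered")
delivered = list(query.fetch(limit=100))
```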
Technologies We’ve Used
For the project, we used technologies including:
Google Cloud Platform to create and manage data storage
GCP services such as BigQuery, Cloud Datastore, and Cloud SQL, alongside Apache Kafka
Spark SQL and Structured Streaming in Databricks
Final Outcome
The client now has robust, faster, and more efficient data pipelines. Our solution automated the processing workflows, resulting in a 15% reduction in cloud infrastructure costs. Redesigning the Spark cluster configuration, adopting high-performance columnar formats, and integrating with GCP tools like BigQuery and Cloud SQL made all the difference in performance and storage efficiency. Insights reach the management team in real time through Kafka and Spark Structured Streaming, keeping them informed quickly. In sum, the client has a scalable, cost-effective data processing infrastructure that delivers operational value, better business performance, and more informed business decisions.