New Hadoop Business Analytics Solution Gives An E-Commerce Firm A Competitive Edge And Boosts Sales
Algoworks helped a US-based e-commerce business implement a new business analytics solution that is both cost-effective and powerful enough to handle its ever-growing data analytics requirements.
Our client was a US-based e-commerce business with an active and established user base all over the country. The scale of the business is evident from its numbers: more than 250 thousand products listed on the website and more than a million active users. The client was scaling its business very aggressively and needed a new business intelligence solution that could keep pace with that growth. Impressed by our expertise in creating scalable and flexible business solutions, the e-commerce firm approached Algoworks.
The client was originally using a combination of Informatica and traditional MySQL databases for managing and maintaining data. As the business expanded, the business analytics solution had to expand with it. Scaling up the Informatica-based solution and upgrading it to current versions would have required substantial investment. After evaluating these costs, we suggested building the new solution on Cloudera Hadoop, Spark, and the related technology stack. A new Cloudera Hadoop-based solution would not only save resources but also be more scalable and flexible, turning the final analytics engine into a true Big Data solution.
How We Did It
- Migration From Informatica to Cloudera Hadoop
- Creating Data Processing Engines
- Incremental Data Transformation
- Advanced Analytics
- Integration with Tableau
Migration From Informatica to Cloudera Hadoop
Challenge: How to migrate terabytes of complex data
The client was initially using an Informatica-based legacy business intelligence and data transformation solution. The old solution was not efficient at handling the terabytes of data generated by multiple data channels, and it was comparatively expensive to scale up. The new solution had to address both problems. The first challenge, however, was migrating the data without interrupting the incoming data streams, so that no real-time data would be lost.
Solution: Migrating data to Hadoop and integrating it with data channels
Our first priority was to configure and set up the Cloudera Hadoop solution and the new data warehouse based on the researched requirements. We then integrated the various types of data streams with the new solution, making sure that no real-time data such as clickstream or purchase events was lost. In the next step, we used Sqoop scripts to migrate all legacy data from the existing Oracle data warehouse to the new Hadoop Distributed File System (HDFS) based data warehouse clusters. We built custom Sqoop scripts to automate the one-time historical load of 15+ terabytes of complexly structured data to different HDFS clusters.
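As a rough illustration of how such a one-time historical load can be scripted, the sketch below builds one `sqoop import` command per legacy table. It is not the scripts used in the project: the JDBC URL, user name, password file, table names, split columns, and target directories are all hypothetical placeholders, and the code only assembles the command lines rather than running them.

```python
from typing import List


def build_sqoop_import(
    jdbc_url: str,
    username: str,
    password_file: str,
    table: str,
    target_dir: str,
    split_by: str,
    num_mappers: int = 4,
) -> List[str]:
    """Return the argument list for a one-time `sqoop import` of one table."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory that receives the rows
        "--split-by", split_by,            # column used to divide work among mappers
        "--num-mappers", str(num_mappers),
    ]


# Hypothetical legacy tables mapped to the column each import is split on.
LEGACY_TABLES = {"ORDERS": "ORDER_ID", "CUSTOMERS": "CUSTOMER_ID"}

commands = [
    build_sqoop_import(
        jdbc_url="jdbc:oracle:thin:@//legacy-dw.example.com:1521/DWH",  # placeholder
        username="etl_user",                                            # placeholder
        password_file="/user/etl/.sqoop_password",                      # placeholder
        table=table,
        target_dir=f"/warehouse/raw/{table.lower()}",
        split_by=split_col,
    )
    for table, split_col in LEGACY_TABLES.items()
]

for argv in commands:
    print(" ".join(argv))
```

Each argument list could be handed to `subprocess.run` on a machine where Sqoop is installed. Generating the commands from a table list keeps a large migration repeatable and easy to audit.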
Creating Data Processing Engines
Challenge: How to design a data processing and storage cluster
As business operations scaled, the amount of data generated increased exponentially, to the point where it fell squarely under the term Big Data. The legacy solution could not analyze this volume of data and extract insights from it in a timely manner. In addition, the e-commerce firm was expanding globally, adding new business initiatives in different parts of the world, each generating large amounts of data of its own. Timely insights are especially important for an organization whose day-to-day activities depend on them directly. The e-commerce firm needed a new data processing engine capable of working with large amounts of data.
Solution: Design, develop, and deploy a Hadoop and Spark stack BI cluster
We designed, developed, and deployed a business intelligence and analytics solution based on a multi-node Hadoop cluster. We started by developing and deploying location-based clusters, where each node handled data from a single location. We integrated multiple data sources with each cluster node using custom Sqoop scripts and APIs, and used Hive and Pig scripts to transform raw data. The transformation was performed through an Apache Spark-based data processing engine, and the large data load was handled in a specially designed data warehouse. In addition, we implemented load balancing and disaster recovery measures between the clusters to maintain data integrity and make efficient use of available resources. Cluster resources were managed using Hadoop YARN.
Incremental Data Transformation
Challenge: How to extract updated data and merge with existing data sets
Efficient storage of big data is an important challenge. However, extracting meaningful insights from the stored data was just as important. The client was saving multiple types of data, such as user information, product inventory, product delivery, logistics, employee details, and end-to-end accounting data. Filtering and transforming this large amount of data and then loading it into the data warehouse in a timely manner was a challenge in itself. In addition, it was business critical to automate ETL transformation for some specific types of data in order to accelerate report generation.
Solution: Pig and Hive Based Data Transformation Layer
Hadoop data warehouses are very efficient at storing large amounts of information. They also come with an added advantage: Apache Hive, a data warehouse infrastructure built on top of Hadoop that aids in data summarization, querying, and big data analysis. We used Apache Hive over the Hadoop warehouse to accelerate data queries, and created custom User Defined Functions (UDFs) to handle business-specific use cases and speed up specific queries. Hive was complemented by Apache Pig scripts, custom execution scripts that automatically transformed and loaded incoming data to feed various business reports.
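The core idea behind incremental transformation, merging freshly extracted rows into an existing data set, can be sketched in a few lines. The example below is a plain Python illustration of that merge logic, not the Hive or Pig code used in the project. The field names `id`, `price`, and `updated_at` and the sample rows are assumptions made for the illustration.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def merge_incremental(
    base: Iterable[Record],
    delta: Iterable[Record],
    key: str = "id",
    version: str = "updated_at",
) -> List[Record]:
    """Upsert delta rows into base, keeping the newest version of each key."""
    latest: Dict[object, Record] = {}
    for row in list(base) + list(delta):
        current = latest.get(row[key])
        # Keep the row with the greatest version value. On a tie the later row
        # wins, so a delta row replaces a base row with the same timestamp.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical existing warehouse rows and a newly extracted batch.
base_rows = [
    {"id": 1, "price": 10.0, "updated_at": "2020-01-01"},
    {"id": 2, "price": 20.0, "updated_at": "2020-01-01"},
]
delta_rows = [
    {"id": 2, "price": 18.5, "updated_at": "2020-01-02"},  # updated product
    {"id": 3, "price": 7.0, "updated_at": "2020-01-02"},   # new product
]

merged = merge_incremental(base_rows, delta_rows)
for row in merged:
    print(row)
```

On a Hadoop warehouse the same outcome is commonly reached by combining the base and delta tables and keeping only the most recent row per key, which avoids reprocessing the full history on every load.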
Advanced Analytics
Challenge: Lower latency and advanced analytics
Getting useful insights in a timely manner helps an organization maintain a competitive edge. The client needed an advanced analytics solution that could extract information much faster from data such as user clickstream, real-time transactions, and inventory, and produce predictive insights for areas such as user engagement, buying trends, promotion scheme performance, and even price optimization. The new solution also had to learn from regular data extraction and manipulation tasks, so that it could speed up insight extraction and support decision making.
Solution: Apache Spark Advanced Analytics Engine
For advanced analysis and low latency data processing, we deployed a custom Apache Spark engine built upon the standard HDFS warehouse. Apache Spark’s in-memory computing capabilities, Spark SQL, and DataFrames allow it to deliver faster model executions and faster runtime on Hadoop Hive queries.
For real-time streaming data, Spark Streaming was used to analyze and present live reports on critical business data like user behavior, clickstream, scheme performance, transactions, etc.
For machine learning, we used Spark’s MLlib machine learning framework over Spark Core. The resulting models help decision makers make informed decisions and optimize running campaigns.
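The live clickstream reporting described above rests on windowed aggregation: events are grouped into fixed time windows and summarized per window. The sketch below shows that logic in plain Python for clarity. It does not use the Spark Streaming API, and the event format (epoch seconds plus page path), the sample events, and the 60-second window are assumptions chosen for the illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str]  # (timestamp in epoch seconds, page path)


def count_by_window(
    events: Iterable[Event], window_seconds: int = 60
) -> Dict[Tuple[int, str], int]:
    """Count page views per (window start, page) using tumbling windows."""
    if window_seconds < 1:
        raise ValueError("window_seconds must be at least 1")
    counts: Counter = Counter()
    for ts, page in events:
        # Round the timestamp down to the start of its window.
        window_start = ts - (ts % window_seconds)
        counts[(window_start, page)] += 1
    return dict(counts)


# Hypothetical clickstream events.
events = [(0, "/home"), (15, "/cart"), (59, "/home"), (60, "/home"), (125, "/cart")]

report = count_by_window(events)
for (window_start, page), views in sorted(report.items()):
    print(window_start, page, views)
```

A streaming engine applies the same grouping continuously to small batches of incoming events, which is what lets a dashboard show per-minute figures shortly after the events occur.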
Integration with Tableau
Challenge: How to visualize data in a user-intuitive manner
Data analysis at the backend would be useless without an effective way to present the results. The client was using an outdated version of Tableau, and their team was well versed in its usage. They needed the new solution to integrate seamlessly with Tableau. They also wanted us to update Tableau and create new dashboards and reports in the tool.
Solution: Integration with Tableau at Virtualization Layer
The first step was to update Tableau to the latest edition without losing the existing dashboards and report structure. Once that was completed, we integrated the new Hadoop-based data analytics solution with Tableau through Tableau APIs. Integrating at the virtualization layer allowed real-time generation of reports and summaries of processed Hadoop data. Next, we created new dashboards and reports to showcase all the new types of raw and processed data provided by the new Hadoop solution. We also modernized the existing reports and dashboards, making them more visually appealing and intuitive to use.
Migrated 3+ TB Data In 9 Days
Migrated more than 3 terabytes of complex data from the old Informatica data warehouse to the new Hadoop data warehouse in 9 days.
100% Improvement In Campaign Performance
New business intelligence and predictive analysis gave a competitive edge that improved campaign performance by as much as 100%.
2x Improvement In CTR Performance
Metrics related to the e-commerce platform's daily performance, such as click-through rate, user engagement, sales, and user feedback, improved by as much as 100%.
Campaign Cost Reduced By Up To 20%
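A "2x" improvement and a "100%" improvement describe the same change, as the short calculation below shows. The click and impression counts are hypothetical numbers picked only to make the arithmetic concrete. They are not figures from the project.

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate as a fraction of impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def relative_improvement(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100.0


# Hypothetical figures: the same traffic, with twice as many clicks afterwards.
baseline_ctr = ctr(clicks=200, impressions=10_000)
improved_ctr = ctr(clicks=400, impressions=10_000)

ratio = improved_ctr / baseline_ctr
improvement = relative_improvement(baseline_ctr, improved_ctr)
print(f"CTR ratio: {ratio:.1f}x, relative improvement: {improvement:.0f}%")
```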
New insights allowed decision makers to root out non-performing campaigns early, saving overall campaign costs.