Build Your Own “Best-fit” Data Repository

Published by admin

The CDS technical team routinely combines commercially licensed and open-source software in the solutions and professional services we deliver to clients.

With the rapid development of open-source technology, there are more choices than ever for building a data warehouse and/or Big Data platform. Our technical team has delivered several real-world applications of these choices:

  1. Modern Data Warehouse
  2. NoSQL Database for IoT
  3. Big Data Repository
  4. Data Lake

Modern Data Warehouse

This year, we used a PostgreSQL database to build a modern data warehouse for an asset management company. The PostgreSQL cluster runs on SSD storage for the speed needed to store commodity prices and transaction history. For a modern data warehouse, it is better to be able to integrate with Apache Cassandra or Hadoop for semi-structured and even unstructured data as a future-proof solution; details are given later in this article. PostgreSQL is a very good fit here, and MariaDB is another possible candidate, with enterprise-class technical support subscriptions available. However, the latest version of MariaDB no longer officially supports a direct connection to Apache Cassandra NoSQL.
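As a sketch of how such a warehouse might lay out its price history, the snippet below generates PostgreSQL declarative-partitioning DDL for a monthly range-partitioned table. The table and column names are hypothetical illustrations, not the schema from the actual project.

```python
from datetime import date


def monthly_partition_ddl(table: str, year: int, month: int) -> str:
    """Generate DDL for one monthly partition of a range-partitioned table."""
    start = date(year, month, 1)
    if month == 12:
        end = date(year + 1, 1, 1)
    else:
        end = date(year, month + 1, 1)
    return (
        f"CREATE TABLE {table}_{start:%Y_%m} PARTITION OF {table}\n"
        f"    FOR VALUES FROM ('{start}') TO ('{end}');"
    )


# Hypothetical parent table: commodity prices partitioned by trade timestamp.
parent_ddl = """\
CREATE TABLE commodity_price (
    commodity_id  integer      NOT NULL,
    trade_ts      timestamptz  NOT NULL,
    price         numeric(18,4),
    PRIMARY KEY (commodity_id, trade_ts)
) PARTITION BY RANGE (trade_ts);"""

if __name__ == "__main__":
    print(parent_ddl)
    for m in (1, 2, 12):
        print(monthly_partition_ddl("commodity_price", 2019, m))
```

Partitioning by month keeps each partition small enough to scan quickly on SSD while letting old partitions be detached or archived wholesale.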

A Typical Architecture for a Data Warehouse

NoSQL Database for IoT

We are helping a courier corporation track its trucks on the road, with a GPS sensor module fitted to every truck. Each truck's route is recorded and uploaded to an Apache Cassandra NoSQL database for further analysis of individual driver performance and traffic conditions. On the NoSQL side we also use Redis, storing temperature and humidity readings for different data center and computer lab owners so that air-conditioning can be monitored and auto-adjusted through integration with their air-conditioning systems.
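A common Cassandra data-model pattern for this kind of workload is to partition GPS readings by truck and day, so that one day's route for one truck is a single, sequentially ordered partition. The sketch below uses hypothetical names, not the production schema:

```python
from datetime import datetime, timezone
from typing import NamedTuple


class GpsReading(NamedTuple):
    truck_id: str
    ts: datetime   # UTC timestamp from the GPS module
    lat: float
    lon: float


def cassandra_key(reading: GpsReading):
    """Map a reading to a (partition key, clustering key) pair.

    Partitioning by (truck_id, day) keeps a day's route together and
    bounds partition size; clustering by timestamp keeps the points in
    order, so replaying a driver's day is one sequential read.
    """
    partition = (reading.truck_id, reading.ts.strftime("%Y-%m-%d"))
    clustering = reading.ts
    return partition, clustering


# The matching (hypothetical) CQL schema would be:
# CREATE TABLE truck_route (
#     truck_id text, day text, ts timestamp,
#     lat double, lon double,
#     PRIMARY KEY ((truck_id, day), ts)
# );

r = GpsReading("TRUCK-042",
               datetime(2019, 6, 1, 8, 30, tzinfo=timezone.utc),
               22.3193, 114.1694)
print(cassandra_key(r))
```

The same bucketing idea keeps any single partition from growing without bound as trucks report positions around the clock.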

GPS Sensor

Big Data Repository

To handle large amounts of unstructured data, one of the most efficient approaches is to store it in Apache Hadoop. We are helping several clients, mostly retailers, store social media logs and apply sentiment analysis for better customer service. They feed the unstructured data from different social media platforms into HDFS, then stream it through Spark for near-real-time analytics. This makes it possible to respond promptly to customers both online and offline, which in turn leads to better decisions in logistics and inventory management.
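In production this pipeline runs on Spark, but the micro-batch idea can be shown in plain Python: score each incoming batch of posts and keep a running per-brand sentiment tally. The wordlist scorer here is a deliberately naive stand-in for a real sentiment model, and all names are illustrative.

```python
from collections import defaultdict

# Toy sentiment wordlists -- a real pipeline would use a trained model.
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"slow", "broken", "bad"}


def score(text: str) -> int:
    """Naive wordlist sentiment: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def process_batch(batch, totals):
    """One micro-batch step: fold a batch of (brand, text) into running totals."""
    for brand, text in batch:
        totals[brand] += score(text)
    return totals


totals = defaultdict(int)
process_batch([("acme", "love the fast delivery"),
               ("acme", "broken zipper bad")], totals)
process_batch([("acme", "great service")], totals)
print(dict(totals))  # {'acme': 1}
```

Spark Structured Streaming applies the same fold continuously over arriving HDFS data, which is what makes the prompt online/offline responses possible.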

Big Data Repository Design – available for near real-time analytics

Data Lake

A large-scale corporation can easily have hundreds of systems and data silos with different owners, which makes it difficult to analyze its data as a whole. There are several options for building a data lake:

  • Apache Hadoop – a Big Data platform that can answer everything, but the hard way
  • Data Virtualization – fast answers without any fundamental changes to the original data silos
  • Database / Big Data direct linkage – some databases, such as PostgreSQL, can connect directly to Apache Hadoop

If you would like to save time and effort, Data Virtualization is the better choice. However, if there are only a few source systems (fewer than three), putting Apache Hadoop in place is not too hard. Another option is to run a database that can integrate with Hadoop or Cassandra, such as PostgreSQL. We are helping one of our clients in luxury trading build their Data Lake on top of Apache Hadoop. All three options have been implemented, depending on each client's unique requirements and environmental constraints: a Fortune 500 insurance corporation uses Apache Hadoop, large manufacturers use Database / Big Data direct linkage, and a statutory body uses data virtualization.
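The data-virtualization option can be illustrated with a toy federation layer: one query interface fans out to several silos and joins the results at query time, leaving the silos themselves untouched. Everything below is a conceptual sketch, not any particular virtualization product.

```python
# Toy data silos: in reality these would be separate databases or APIs
# owned by different teams.
crm_silo = [{"customer": "C1", "region": "HK"},
            {"customer": "C2", "region": "SG"}]
erp_silo = [{"customer": "C1", "orders": 12},
            {"customer": "C2", "orders": 3}]


def federated_query(region: str):
    """Join across silos at query time; no silo is modified or copied."""
    regions = {row["customer"]: row["region"] for row in crm_silo}
    return [
        {"customer": row["customer"],
         "region": regions[row["customer"]],
         "orders": row["orders"]}
        for row in erp_silo
        if regions.get(row["customer"]) == region
    ]


print(federated_query("HK"))  # [{'customer': 'C1', 'region': 'HK', 'orders': 12}]
```

This is why virtualization is the fast path: the join logic lives in the virtualization layer, so the original systems need no fundamental changes.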

Apache Hadoop (Ecosystem)

Data Virtualization

Database / Big Data direct linkage (PostgreSQL as an example)

Conclusion

There are some common misconceptions about handling and storing data. For instance, many people take Apache Hadoop everywhere as the one repository for all data analytics. With the examples above, we have shared different use cases along with best practices and the appropriate way to apply each tool. If one tool really fit every purpose, it would make no sense for so many tools and technologies to be so widely used. Another concern is on-premises versus cloud services. For data storage and analytics, cloud services tend to be expensive and extremely difficult to migrate away from, so it is important to conduct a comprehensive review before making the decision. For a data science platform at reasonable production scale, we would suggest a private cloud or on-premises deployment rather than public cloud services.

If you would like to raise a question or discuss with us, please contact us here.