What To Consider When Moving To A Big Data Platform
Dealing with the massive amounts of data that power big data applications requires you to weigh several issues before committing to a permanent platform. These include data management, a shortage of skilled manpower and synchronising data across different sources.
The Type of Big Data Platform
When selecting a big data platform, the foremost consideration should be the platform’s technology heritage. Most vendors pick a single offering and stick with it as their core technology for the long term.
For big data heritage, there are three primary categories: Hadoop-like infrastructure, relational databases and cloud-managed services. These categories aren’t necessarily always distinct, and there is often overlap between them. For instance, a vendor may provide Hadoop distribution through the cloud.
Hadoop-Like Distributions: Offered by vendors such as MapR, these aren’t strictly the same as cloning Hadoop from GitHub and hosting it yourself, since they come with performance tweaks and security improvements. Their greatest advantage is that users can dump any kind of data into them right away, rather than spending months modelling and loading it.
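The "dump first, structure later" approach is often called schema-on-read. A minimal sketch, assuming hypothetical raw JSON records with varying fields, of how a schema can be imposed only at read time:

```python
import json

# Hypothetical raw records "dumped" into storage with no upfront schema;
# fields vary from record to record (schema-on-read).
raw_records = [
    '{"user": "alice", "clicks": 12}',
    '{"user": "bob", "clicks": 7, "country": "DE"}',
    '{"event": "signup", "user": "carol"}',
]

def read_with_schema(lines, fields):
    """Apply a schema only when reading, defaulting missing fields to None."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

rows = list(read_with_schema(raw_records, ["user", "clicks"]))
print(rows)
# → [{'user': 'alice', 'clicks': 12}, {'user': 'bob', 'clicks': 7},
#    {'user': 'carol', 'clicks': None}]
```

Nothing had to be rejected or remodelled up front; the cost of handling inconsistent records is simply deferred until query time.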
Hadoop nevertheless suffers from the curse of sheer size. Overhauling the whole system and rewriting it from the ground up is impractical, and as a result it misses out on some of the more advanced developments of recent times.
For instance, by the time Hive finally gained in-memory processing, the Hadoop vs Spark battle had all but been decided.
Big Data Relational Databases: Hadoop is really good at storing almost every kind of data in large lakes. However, these can easily break down into unmanageable data swamps, which is why relational databases exist.
They accept only structured data, but at scale they outperform their distant cousins such as Postgres. This route is more viable for companies that already rely on relational databases than for one moving from, say, Hadoop.
You also have to consider the additional cost, since these databases are normally orders of magnitude more expensive than Hadoop. They rely on proprietary software, so companies must sign licensing agreements that can run into millions of dollars. Hadoop, in contrast, is open source and costs far less.
Cloud-Managed Services: The biggest difference between Hadoop-like services and cloud-managed ones is the fact that cloud services are almost always proprietary software. While MapR technically offers cloud solutions, it does so based on open-source software.
Cloud-managed software is more specialized than tools like Hadoop, which are basically Swiss Army knives for all things big data. It gives companies a chance to try out modern big data before fully committing, and it removes the need for skilled manpower to set up and maintain clusters.
Privacy, Security & Compliance Requirements
No company operates in a vacuum. Any platform with users is answerable to those users and to the countries in which it operates. Different laws apply depending on where the company is based, whose data it collects and how that data will be used.
To minimize liability, users should always be given an option to opt out of data collection and another to have their data deleted. Other requirements, such as processing data in the region where it originates, should also be followed.
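Those two user rights can be sketched in code. The class and method names below are purely illustrative, not from any real compliance library, and assume an in-memory store:

```python
# A minimal, hypothetical sketch of the two user rights mentioned above:
# opting out of collection and requesting deletion of stored data.

class UserDataStore:
    def __init__(self):
        self._data = {}         # user_id -> list of collected events
        self._opted_out = set()

    def collect(self, user_id, event):
        """Record an event only if the user has not opted out."""
        if user_id in self._opted_out:
            return False
        self._data.setdefault(user_id, []).append(event)
        return True

    def opt_out(self, user_id):
        """Stop collecting data for this user from now on."""
        self._opted_out.add(user_id)

    def delete_user_data(self, user_id):
        """Honor a deletion request by removing all stored data."""
        self._data.pop(user_id, None)

store = UserDataStore()
store.collect("u1", "page_view")
store.opt_out("u1")
print(store.collect("u1", "click"))  # → False, collection now refused
store.delete_user_data("u1")
print("u1" in store._data)           # → False, data is gone
```

A real system would also have to propagate deletion to backups and downstream copies, which is where platform choice matters.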
Regarding security, different platforms are built with different goals in mind. Hadoop, for instance, is infamous for being so insecure by default that anyone with access to a web browser can gain administrative access to the data.
The potential for abuse on such a platform is massive. Companies should adopt permission-granting technology, whether legacy or new, to keep unauthorized parties out of sensitive areas of the application.
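At its core, permission-granting comes down to a deny-by-default check before any sensitive action. A minimal sketch, with made-up roles and actions for illustration:

```python
# Hypothetical role-based access control: roles, actions and the
# permission table are illustrative, not from any specific product.

ROLE_PERMISSIONS = {
    "admin":   {"read", "write", "configure"},
    "analyst": {"read"},
    "guest":   set(),
}

def is_authorized(role, action):
    """Deny by default: unknown roles or actions get no access."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_authorized("admin", "configure"))  # → True
print(is_authorized("analyst", "write"))    # → False
print(is_authorized("unknown", "read"))     # → False
```

Real deployments layer this on top of authentication (e.g. Kerberos in the Hadoop world), but the deny-by-default principle is the same.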
The Types of Data You Deal With
Companies today deal with data in all forms and sizes: from incredibly large, difficult-to-process videos that can reach hundreds of gigabytes down to the lowly text file.
Before settling on a platform, companies should identify the server specifications, cloud services and platforms needed to support the estimated workload. Most big data platforms support at least two of the three broad stages of working with data: storage, processing and deployment. The best platform is one that simplifies all three.
Hadoop was considered revolutionary for its ability to go beyond SQL’s structured nature. It can handle images, audio files, PDF documents and more, all at the same time, with little room for performance drops.
The one area where it falls hopelessly flat is the very tenet it sought to replace in the first place: structured data. Tools such as Pig and Hive can be attached to a Hadoop cluster to give the programmer the power of SQL, but the implementation is far from perfect.
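Hive’s query language, HiveQL, closely resembles standard SQL. To show the shape of the structured queries being discussed, here is an illustrative aggregation using SQLite (not Hive itself; the table and data are invented):

```python
import sqlite3

# Illustrative only: HiveQL resembles standard SQL, so a query like the
# one below could run over a Hadoop-backed table via Hive. SQLite is
# used here just to demonstrate the structured-query shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("alice", 12), ("bob", 7), ("alice", 3)])

total = conn.execute(
    "SELECT user, SUM(views) FROM page_views GROUP BY user ORDER BY user"
).fetchall()
print(total)  # → [('alice', 15), ('bob', 7)]
```

On a true relational database this query benefits from decades of optimizer work; translated to a Hadoop cluster via Hive, the same query is compiled into distributed jobs, which is where the performance gap shows.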
Author’s Bio
Edward Huskin is a freelance data and analytics consultant. He specializes in finding the best technical solution for companies to manage their data and produce meaningful insights. You can reach him via his LinkedIn profile.