Published by Krishna
With the advent of fast genome sequencing techniques, biological datasets worldwide have grown to tremendous sizes. For instance, a single patient’s sample, after sequencing and several stages of data processing and analysis, could run to over a terabyte! Raw sequencing data coming off the sequencing machine holds only potentially useful information, and it requires significant processing to be converted into a meaningful form that can drive genomics research.
Because some of the data conversion steps are highly computation-intensive and/or require specialized bioinformatics algorithms, a large portion of the bio-informatics data processing pipeline is implemented in the cloud today. However, once the data resident in the “genomics cloud” reaches the hands of the researcher, it is only as good for research as the analytics and visualization capabilities available to explore it.
Visualization is a graphical representation of data intended to provide the user a qualitative understanding of information. Data visualization techniques greatly enhance the user’s understanding and interpretation of these massive data sets. A visualization-integrated bio-informatics pipeline provides researchers with the ability to explore genomics data and enables them to progressively iterate, backtrack or zero-in on their analysis steps, thereby enabling them to infer high-impact conclusions with an improved degree of confidence within a reasonable time.
The two essential attributes of a successful data visualization framework are:
1) High interactivity
2) Performance at the speed of analysis
Interactivity implies the ability to manipulate graphical entities to derive intuitive data representations. Interactive graphics involve detecting, measuring and comparing the points, lines, shapes and images being represented, with a view to effective user interpretation, accurate quantitative evaluation, aesthetics and adaptability. Varying the views, labelling points to retrieve the original data, zooming in to bring the data into focus, exploring neighboring points and offering user-adjustable mappings together create a good data exploration experience for the user.
Consequently, as the user continuously manipulates data (applies filters, adjusts thresholds, tunes parameters like scale and dynamic range of values) to make “research sense” out of the data, the visualization framework should permit
1) Discrete or continuously variable settings with user-friendly controls like text boxes, selection drop-downs, sliders, knobs etc. and
2) Quick redrawing of the updated graphical representation after every change is made in user settings.
General-purpose and traditional analytics software packages that have been adopted in bio-informatics often come with add-on packages for interactive visualization that offer a basic level of utility for research. With an easy non-programmer model that appeals very much to researchers, these packages provide interactive graphs and plots. An in-built web server eliminates the need to install any client application; all that the user needs is a browser and a URL to point it to.
However, when it comes to enormous datasets that run to millions of data points, these in-built/add-on visualization frameworks are found to be incapable of giving the user an acceptable (ideally sub-second) response each time a user setting is changed. Therefore, guaranteeing an analysis-continuum to the users remains challenging. Besides, the rendering stability of these in-built/add-on packages is often found to be problematic when large data sets are thrown at them with statistical methods applied on the data. Rendering inaccuracies, including gross misrepresentations of data, are frequently encountered and expose the limits of their scalability.
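To make the scale problem concrete, here is a minimal sketch of one common way to keep redraws fast: reducing a long series to one (min, max) pair per horizontal bucket before drawing. This is an illustration under our own assumptions, not the rendering technique evaluated in the experiments described in this article; the function name `minmax_downsample` and its parameters are hypothetical.

```python
from typing import List, Tuple


def minmax_downsample(values: List[float], buckets: int) -> List[Tuple[float, float]]:
    """Reduce a long series to (min, max) pairs, one per horizontal bucket.

    Drawing one vertical segment per bucket preserves peaks and troughs
    while the number of drawn primitives stays bounded by `buckets`.
    """
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    n = len(values)
    if n == 0:
        return []
    buckets = min(buckets, n)  # never create empty buckets
    reduced = []
    for b in range(buckets):
        start = b * n // buckets
        end = (b + 1) * n // buckets
        chunk = values[start:end]
        reduced.append((min(chunk), max(chunk)))
    return reduced


if __name__ == "__main__":
    # A million points reduced to roughly one pair per screen pixel column.
    series = [float(i % 1000) for i in range(1_000_000)]
    print(len(minmax_downsample(series, 1920)))
```

The reduction runs in a single pass over the data, so it can be repeated after each filter or threshold change, leaving the renderer with a fixed, small number of segments to draw.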
Hence the need for evaluating, piloting and implementing visualization frameworks based on customized graphical libraries that leverage fast rendering techniques in a browser environment. As our experiments with multiple fast-visualization techniques showed, a customized visualization framework for bio-informatics is the sole solution that matches the user’s speed of analysis and provides an enhanced time-to-insights experience.
In conclusion, a bio-informatics visualization framework needs to be highly interactive and lightning fast to handle data sets in the millions. Further, from the bioinformatics pipeline provider’s perspective, scalability for a large number of concurrent users and security of data are the other key attributes to be satisfied by the visualization framework, as they are for the other modules in the pipeline, such as the data transformation and analytics modules.
Big data as a technology has passed through various stages of evolution during the last few years, which still keeps it hot in the list of tech buzzwords! Starting with handling the 3 V’s of data – Volume (of data to be handled), Velocity (of data generated) and Variety (of data generated), it has spread its wings to more V’s – Veracity (to ensure data integrity and reliability), Vulnerability (to address privacy and confidentiality concerns) and Value (of information)!
As Google showed the way, collection and collation of huge volumes of data and applying the right analytics to gain valuable insights into the business and optimization possibilities is the key to extracting the full potential of the data-driven industry. Today Chief Data Officers are building strategies to organize their data and to derive business intelligence from it to drive radical transformation of businesses in many sectors such as industrial, retail, logistics, healthcare etc.
BDaaS (Big Data-as-a-Service) is gaining momentum, enabling external experts to take the company’s customer data to the cloud and to provide analytical insights for decision making. Offered as a managed service, it frees up the customer from substantial initial investment and helps offer RoI-driven spending. This article focusses on BDaaS, describing the potential that enables our customers to conceptualize and launch new business models.
Large corporations with structured and centralized ERP systems wouldn’t benefit as much from BDaaS as unorganized sectors comprising diverse players, each with their own fragmented IT infrastructure. For instance, unorganized retail is a heterogeneous sector with a geographically distributed supply chain that spans medium and small players with considerable differences in their levels of process maturity. Stand-alone islands of software applications are frequently encountered, and so are ad-hoc (or legacy) structures for data storage and archival. B2B companies providing services to geographically spread out customers in many traditional supply chains, like chemicals/reagents for laboratory use, petrochemical (non-fuel) derivatives and medical drugs, could benefit from the transformational potential of BDaaS.
Suppose you are a B2B player in one of these or similar sectors. Let us take a closer look at your business and customer data! Could your expertise in the industry be leveraged to identify a new data-driven model by “integrating your customer data” to offer new intelligence gleaned from it? This integration gives you the data at the ‘sector level’ rather than the ‘individual’ customer level. You will be able to identify sector level intelligence and provide it to all your customers, which will be mutually beneficial for all.
In order to accomplish this outcome, you will most often need external expertise in big data to work collaboratively with you (or your domain consultants) in order to build a BDaaS platform to offer your customers. The value of business intelligence that the platform brings helps them win in their businesses and their patronage in turn helps your business model succeed.
So has been our experience working with a world leader in the pharmacy supply chain across North America. Besides supplying medicines and medical equipment to their customers, they also provide inventory and patient management software to their customers. The software installed in each of the numerous hospitals gathers transactional data over time. We worked closely with the customer’s consultants on the feasibility of data integration and created a centralized control center using big data technologies such as Spark and Kafka. Hosted on the cloud, the platform captures streaming data from different hospitals and pushes them to the centralized system that offers a metered BDaaS service to end-customers, the analytics insights helping them to optimize their businesses.
The path to big data implementation, however, was filled with several challenges, a few of which are:
With regulatory requirements concerning medical information such as the HIPAA standards, compliance is mandatory. Only non-sensitive data at a lower level of granularity is collected, which respects the individual hospitals’ concerns about exposing their patients’ sensitive information. This is the key factor in the success of the project, from both the customer buy-in and regulatory compliance points of view. The collected data is pushed to the cloud securely using transport layer security.
Variety of data
Heterogeneous and scattered data is the foremost challenge while implementing big data solutions. Even though most hospitals use our customer’s software, a few others use their own legacy software. Data could be isolated even across departments in the same organization! We built data collector modules which can be customized easily to collect data from various sources and push it to the cloud. Rationalizing the relevant data fields from these diversified sources and integrating them provides a lot of insight into the possibilities for analytics.
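As a rough illustration of collecting data “at a lower level of granularity”, the sketch below rolls transaction rows up to daily totals per hospital and drug, so that patient-level fields never leave the site. All field names (`hospital_id`, `patient_id`, `drug`, `quantity`, `timestamp`) are hypothetical, and this is not the project’s actual collector; it also does not by itself make a system compliant with any regulation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_dispensing(records: Iterable[dict]) -> Dict[Tuple[str, str, str], int]:
    """Roll patient-level rows up to (hospital, drug, day) totals.

    Only the three grouping keys and the summed quantity are kept;
    patient identifiers are read by nobody and are simply not carried over.
    """
    totals: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for rec in records:
        day = rec["timestamp"][:10]  # assumes ISO 8601 strings, "YYYY-MM-DD..."
        totals[(rec["hospital_id"], rec["drug"], day)] += rec["quantity"]
    return dict(totals)


if __name__ == "__main__":
    rows = [
        {"hospital_id": "H1", "patient_id": "P1", "drug": "oseltamivir",
         "quantity": 2, "timestamp": "2024-01-05T09:00:00"},
        {"hospital_id": "H1", "patient_id": "P2", "drug": "oseltamivir",
         "quantity": 3, "timestamp": "2024-01-05T17:30:00"},
    ]
    print(aggregate_dispensing(rows))
```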
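The idea of a customizable collector that rationalizes fields from different sources can be sketched as a per-source mapping onto one common record shape. The source names and field names below are invented for illustration and are not taken from the actual project.

```python
# Hypothetical mapping from each source's field names to a common schema.
FIELD_MAPS = {
    "vendor_app": {"item_code": "drug_code", "qty": "quantity", "site": "hospital_id"},
    "legacy_csv": {"DRUG": "drug_code", "COUNT": "quantity", "HOSP": "hospital_id"},
}


def normalize(source: str, row: dict) -> dict:
    """Translate one source-specific row into the common record shape.

    Fields that are not listed in the mapping are dropped, so supporting a
    new source only means adding one more entry to FIELD_MAPS.
    """
    mapping = FIELD_MAPS[source]
    record = {target: row[src] for src, target in mapping.items() if src in row}
    if "quantity" in record:
        record["quantity"] = int(record["quantity"])  # CSV sources deliver strings
    record["source"] = source  # keep provenance for later debugging
    return record


if __name__ == "__main__":
    print(normalize("legacy_csv", {"DRUG": "D1", "COUNT": "7", "HOSP": "H9", "EXTRA": "x"}))
    print(normalize("vendor_app", {"item_code": "D1", "qty": 4, "site": "H2"}))
```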
Time to market and initial investment
Being a metered service, we had to make sure that the customer’s cost stays linear with usage. The Databricks big data platform, with a reliable open source Kafka data injector, gives us a balanced and scalable framework to meet this objective.
After data was made available from all sources centrally for analysis, it was discovered that information on the availability of particular medicines in each hospital, along with demand predictability, has the potential to reduce the associated transportation costs by around 20%. Data-driven drill-down revealed, for instance, that for a particular area with a prevalence of influenza but a shortage of the corresponding medicine, the system can identify the best possible area (the nearest one where there is enough stock but no current demand) from which this medicine can be arranged. Supply chain demand mitigation by coordinating drug supply between customers can save significant inventory and transportation cost for customers. More importantly, it saves precious reaction time for their end-users, which would not have been possible without the magic of BDaaS.
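The “best possible area” lookup described above can be sketched as a simple nearest-neighbour search over areas that have enough stock and no current demand. This is a toy version under stated assumptions: areas are given as planar coordinates, distance is straight-line rather than road distance, and the field names (`pos`, `stock`, `demand`) are hypothetical.

```python
import math
from typing import Dict, Optional


def nearest_surplus_area(areas: Dict[str, dict], needy: str, min_stock: int) -> Optional[str]:
    """Return the closest area with at least `min_stock` units and no demand.

    Returns None when no area qualifies.
    """
    origin = areas[needy]["pos"]
    best, best_dist = None, math.inf
    for name, info in areas.items():
        if name == needy or info["demand"] > 0 or info["stock"] < min_stock:
            continue  # skip the requester, busy areas and under-stocked areas
        dist = math.dist(origin, info["pos"])
        if dist < best_dist:
            best, best_dist = name, dist
    return best


if __name__ == "__main__":
    areas = {
        "A": {"pos": (0.0, 0.0), "stock": 0, "demand": 50},    # shortage here
        "B": {"pos": (1.0, 1.0), "stock": 100, "demand": 0},
        "C": {"pos": (0.5, 0.0), "stock": 100, "demand": 10},  # closer, but has demand
        "D": {"pos": (5.0, 5.0), "stock": 200, "demand": 0},
    }
    print(nearest_surplus_area(areas, "A", min_stock=50))
```

A production system would replace the distance function with real transport cost and take predicted demand into account, but the selection rule stays the same.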
In your own strategy to connect your fragmented customer data centrally to provide mutually beneficial information, the role of an experienced big data partner is indeed crucial. Combine the power of your domain expertise with big data specialists to create new data-driven business models which besides increasing your revenues could make you the hub to all customers thereby increasing the bonding of existing ones and attracting new ones.
‘Platform as a Service’ (PaaS) in the distributed systems arena is gaining wide adoption nowadays as the cloud gains more customer confidence. The latest IDC forecast states “By 2020, IDC forecasts that public cloud spending will reach $203.4 billion worldwide”. They also predict fast growth in the PaaS segment: over the next five years, the Compound Annual Growth Rate (CAGR) is predicted at 32.2%, which is very promising. PaaS Solutions for distributed systems have captured the serious attention of big players like Amazon (AWS EMR), Google (Google Cloud Platform), Microsoft (HDInsight), Databricks (Unified Analytics Platform) etc., and the count is growing by the day. The same is the case for IoT, with platforms from Amazon (AWS IoT), IBM (Bluemix), Cisco (Cloud Connect) etc. being the major ones in the growing list.
The explosive growth of PaaS Solutions is boosted by the complexity of DevOps and administration nightmares encountered in distributed systems; we still remember the Apache Hadoop version upgrades that always led to sleepless nights!
PaaS Solutions absorb a lot of the complexity of distributed systems, which allows us to:
1. Evaluate platforms straight away. You no longer need to wait for cost approvals or deployment completion, as in the case of on-premises or IaaS deployments.
2. IoT-enable a device as quickly as plugging an agent into it.
3. Get automatic version upgrades of open source distributed platforms like Apache Spark, Apache Hadoop, Apache Kaa etc. through mere configuration changes.
4. Enjoy additional features like notebook integration, REST API service support etc. provided by the vendors.
All fine! But are there any hidden factors in PaaS Solutions that need to be considered? From my experience of the past few years, it is a big YES! Especially for IoT and Big Data applications.
A ready-made dress may still need alterations!!!
PaaS solutions allow us to remain focused on the application use case by simplifying the spinning up of any platform with few clicks. Moving to another platform configuration is as easy as changing a few parameters and doing a restart. Major configurations and optimizations inside the platform are completely transparent to the user, which is an advantage most of the time.
However, complete transparency of the system is not always helpful. You may need to adjust platform configurations to tune the application running on top of it, for example by trying a few customized or new plugins that can give extra muscle to your application. As open source incubations are growing rapidly and a lot of innovative new tools for distributed systems are released every month, you need the flexibility to use them on the platform. Debugging or performance benchmarking an application that runs over a totally transparent underlying platform is not good news for system designers. So when a platform is said to be transparent, we should also check the level of control we have over it.
For instance, while working with a major US healthcare player on collecting their large data streams for predictive and descriptive analytics, we were using Kafka for data injection and an Apache Spark Streaming PaaS for data landing and processing. The initial evaluation and selection of the platform went well with standard architectural considerations and we were happy with the platform choice. Once the development of the application’s functionality was over and alpha tests were completed, we started looking to make a few optimizations and tuning changes as part of the refactoring, for which access to the platform cluster nodes became essential. We asked the platform vendor for access to the cluster nodes, but their reply was disappointing. Their customer support said “It’s completely transparent to user and we do not recommend any access or modification of the platform configurations”. We were stuck!!!
In another case, a Smart Battery IoT project, we were pushing status info from the smart device to an IoT PaaS platform for self-tuning. The data was being stored internally in the PaaS system. Things were working great and we were able to view the data using their custom tools and limited REST API based query access. However, our project strategy called for a raw data lake in AWS S3 for future analytics. To our surprise, we found that there was no option for data export! As this is a very basic yet important feature, we contacted the IoT platform’s technical support. Their response was “Yeah, it is a simple feature, but it is not in our ‘Business priority list’ of features. So, it may take us some more time to do it”. How much more time was unknown! We were stuck again, and had to review the raw data lake policy.
In both cases, our development plans were seriously impacted and we were forced to skip/postpone major use cases, or to start looking for an alternative platform to migrate to, even though it was so late in the project. Let’s closely observe the responses from technical support in these two cases for a few interesting facts.
Case 1: “It’s completely transparent to user and we do not recommend any access or modification of the platform configurations”
Transparency of platform complexities is definitely an important motivation to opt for PaaS, as it gives a quick, efficient and cost-effective way of building the distributed system. But it is important to have an insight into the platform internals and, in a few cases, some control as well. As system designers, we don’t like to swallow things as they are!!! After all, “platform limitations” is definitely not the story we want to tell our customers! In this specific case, we were looking to try out external monitoring tools that need a few agents to be installed on the cluster nodes. Eventually, supporting a third-party BI tool took us roughly two months of coordination between the technical teams at the PaaS vendor and the BI vendor. This is simply not acceptable to the customer in terms of time or budget.
Case 2: “Yeah, it is a simple feature, but it is not in our ‘Business priority list’ of features. So, it may take us some more time to do it”
Not just disappointing, this is alarming!!! Technical interoperability for the customer’s data should not be restricted for the sake of business priority. Unfortunately, the so-called “business priority” often loses its focus on retaining customers, which reminds one of a “My way or the Highway” strategy! No customer wants their data stuck in a specific platform. We need the flexibility to move it through multiple platforms, as business data has latent insights which could be extracted through different systems today or in the future.
To sum up, apart from the traditional architectural considerations for choosing between on-premises, IaaS, PaaS or SaaS, we should be vigilant regarding these hidden factors during the selection of distributed platforms, especially for IoT and Big Data applications, where large amounts of data are generated. The hidden factors are tricky in the sense that they may not be visible at first glance.
Some of the architectural considerations that help mitigate these hidden factors are given below.
1. Create a proper migration plan – This may not be a short-term goal. But it becomes very important because as and when the data grows you may end up in a world of restrictions.
2. Make sure you have enough control over the platform internals – Although you want to avoid administration overheads as much as possible, you still need good control of the platform for development, refactoring and analysis. Distributed system usage without platform control is painful in the long run. Telnet or SSH access to the cluster nodes, the privilege to install custom tools and configuration-level flexibility are a few items to be verified in general.
3. Third party integration flexibility – Most of the time, the system that we develop would be part of a pipeline and may need integration with customized systems like monitoring tools, custom logging methods etc. which make the integration hooks critical.
4. Platform vendor’s willingness to provide functionality on demand – Platform vendors should be able to handle custom functionality requests on demand. We cannot wait indefinitely for the platform to support it in due course. Make sure that their quick and efficient response is covered in your Service Level Agreement (SLA).
Distributed Platform as a Service is definitely growing rapidly, and customers will continue to invest heavily for the combo-advantages of reduced Capex cost, reduced time to market and reduced maintenance/administrative complexity. But I hope the quality and competitiveness of PaaS Solutions also matures fast for the benefit of investing customers, like our IoT and Big Data customers at Sequoia AT. Let’s hope a day will come soon when the platform vendors start advertising their respective platforms by throwing an open challenge “Hey, try out our PaaS solution and if you don’t like it, migrate to any other PaaS solution in 24 Hours or 1 Week!!!”
IoT and big data are considered two sides of the same coin. For some industry requirements, you won’t be able to tell whether it is an IoT or a big data solution. At the same time, in some cases your IoT solution may not have any big data use cases. In such cases, it is always a tough call for the IoT strategist to decide on the feasibility of considering big data in their system.
I suggest keeping your data big data friendly.
What does big data friendly mean?
Current big data solutions spend a lot of effort cleaning up large amounts of complex and messy data. This happens simply because the data was not collected in a big data friendly way. We should learn from the mistakes of others!!! Your data may not be “big” now, but it is growing and will soon be large enough to need big data processing. The industry has already started adopting new approaches, moving from descriptive to predictive and prescriptive analytics. We should prepare our data for various kinds of future big data analytics to improve our business value.
How can we make our IoT strategies big data friendly?
There are a few design considerations:
1. Finding ‘target-rich’ data :-
Not all data may be worth saving for the future. Organizations need to focus on the high-value ‘target-rich’ data. Business analysts and data scientists should work on identifying the high-value data which can be an asset for the organization in the near future.
2. Data saving strategies :-
How data is stored is a major decision point for the big data system.
2.1 Data Format
Text formats like JSON won’t be easy to handle when the data becomes huge. We should also consider using more big data friendly formats like Parquet, Avro, ORC etc.
2.2 Dynamic schema for version handling
Data variety increases as the system grows, with different versions having fields added, renamed or removed. Handling schemas manually is always an overhead. Self-describing data formats are a suitable solution. We should consider using data formats with dynamic schema generation capability (like Parquet).
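To show what “self-describing” buys us, here is a toy sketch in which every batch carries its own schema, so a reader can reconcile batches written by different firmware versions without a manually maintained schema registry. This is only a stand-in for the idea; it is not the Parquet or Avro API, and the function and field names are hypothetical.

```python
import json
from typing import List


def write_batch(fields: List[str], rows: List[list]) -> str:
    """Serialize rows together with the schema that describes them."""
    return json.dumps({"schema": fields, "rows": rows})


def read_batch(payload: str, wanted: List[str]) -> List[dict]:
    """Read a batch using its embedded schema.

    Fields the writer did not know about come back as None, so old and new
    batches can be processed by the same code.
    """
    batch = json.loads(payload)
    index = {name: i for i, name in enumerate(batch["schema"])}
    return [
        {name: (row[index[name]] if name in index else None) for name in wanted}
        for row in batch["rows"]
    ]


if __name__ == "__main__":
    v1 = write_batch(["device_id", "voltage"], [["b1", 3.7]])
    v2 = write_batch(["device_id", "voltage", "temperature"], [["b2", 3.6, 25]])
    wanted = ["device_id", "voltage", "temperature"]
    print(read_batch(v1, wanted) + read_batch(v2, wanted))
```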
2.3 Partitioning strategy of data
A data partitioning strategy should be defined so that the data is well segmented for easy processing.
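One widely used convention is to encode the partition keys in the storage path as `key=value` directories, which many big data engines can use to skip irrelevant data. The sketch below builds such a path; partitioning by device type and UTC date is our assumption for illustration, and the right keys depend on how your data will be queried.

```python
from datetime import datetime, timezone


def partition_path(device_type: str, ts: datetime) -> str:
    """Build a key=value partition path for one record.

    `ts` must be timezone-aware; it is converted to UTC so that the same
    instant always lands in the same partition.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    return f"device_type={device_type}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"


if __name__ == "__main__":
    reading_time = datetime(2024, 3, 5, 23, 30, tzinfo=timezone.utc)
    print(partition_path("battery", reading_time))
```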
3. Security considerations.
Data security is one of the major concerns of both big data and IoT systems. We should implement privacy by design. Privacy impact assessment and data anonymization are a few of the things to be considered.
In a multi-tenant environment, you don’t know how fast your data will grow and mature into big data. Preparing your data for the future will enable you to extract maximum value out of it.
“Information is wealth”. Sequoia Applied Technologies will help design solutions to capture it before you lose it!!
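As a small example of building privacy in at collection time, the sketch below replaces identifying fields with keyed hashes and drops fields that analytics does not need. Strictly speaking this is pseudonymization, which is weaker than full anonymization, and it is only one ingredient of privacy by design. The field names and the helper names are hypothetical.

```python
import hashlib
import hmac
from typing import Iterable


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of an identifier.

    The same input and key always give the same token, so records can still
    be joined, but the original value cannot be read back from the token.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict, id_fields: Iterable[str],
                     drop_fields: Iterable[str], secret_key: bytes) -> dict:
    """Drop unneeded fields and tokenize identifying ones."""
    id_fields, drop_fields = set(id_fields), set(drop_fields)
    cleaned = {}
    for key, value in record.items():
        if key in drop_fields:
            continue  # never stored, so it can never leak
        cleaned[key] = pseudonymize(str(value), secret_key) if key in id_fields else value
    return cleaned


if __name__ == "__main__":
    key = b"replace-with-a-secret-from-your-key-store"
    reading = {"device_id": "BAT-001", "owner_name": "Jane", "voltage": 3.7}
    print(anonymize_record(reading, {"device_id"}, {"owner_name"}, key))
```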