By Anu P
With the advent of fast genome sequencing techniques, biological datasets worldwide have exploded to tremendous sizes today. For instance, a single patient’s sample after sequencing and several stages of data processing and analysis could run into over a Terra byte! Raw sequencing data that comes out of the sequencing machine is at an abstract level of potentially useful information, requiring significant processing to be converted into meaningful form to drive genomics research.
Some of the data conversion steps being highly computation intensive and/or requiring specialized bioinformatics algorithms, a large portion of the bio-informatics data processing pipeline is implemented in the cloud today. However, as the data resident in the “genomics cloud” reaches the hands of the researcher, it is only as good for research as the analytics and visualization capabilities.
Visualization is a graphical representation of data intended to provide the user a qualitative understanding of information. Data visualization techniques greatly enhance the user’s understanding and interpretation of these massive data sets. A visualization-integrated bio-informatics pipeline provides researchers with the ability to explore genomics data and enables them to progressively iterate, backtrack or zero-in on their analysis steps, thereby enabling them to infer high-impact conclusions with an improved degree of confidence within a reasonable time.
The two essential attributes of a successful data visualization framework are:
1) High interactivity
2) Performance at the speed of analysis
Interactivity implies the ability to manipulate graphical entities to derive intuitive data representations. Interactive graphics involves the detection, measurement and comparison between points, lines, shapes and images being represented for the effectiveness of user interpretation, accuracy of quantitative evaluation, aesthetics and adaptability. Enhancing data interpretation by varying the views, labelling to retrieve the original data, zooming in to focus the clarity of data, exploring the neighboring points and a user adjustable mapping can create a good data exploration experience to the user.
Consequently, as the user continuously manipulates data (applies filters, adjusts thresholds, tunes parameters like scale and dynamic range of values) to make “research sense” out of the data, the visualization framework should permit
1) Discrete or continuously variable settings with user-friendly controls like text boxes, selection drop-downs, sliders, knobs etc. and
2) Quick redrawing of the updated graphical representation after every change is made in user settings.
General-purpose and traditional analytics software packages that have been adopted in bio-informatics often come with add-on packages for interactive visualization to a basic level of utility for research. With an easy non-programmer model that appeals very much to researchers, these packages provide interactive graphs and plots. Having an in-built web server eliminates the need to install any client applications, all that the user needs is a browser and an URL to point it to.
However, when it comes to enormous datasets that range millions of data points, these in-built/add-on visualization frameworks are found to be incapable of giving the user an acceptable (sub 1-second?) performance each time a user setting is changed. Therefore, guaranteeing an analysis-continuum to the users remains challenging. Besides the rendering stability of these in-built/add-on packages is often found problematic when large data sets are thrown at them, with statistical methods applied on the data. Rendering inaccuracies including gross misrepresentations of data are frequently encountered that expose the limitations of their scalability.
Here comes the need for evaluating, piloting and implementing visualization frameworks based on customized graphical libraries that leverage fast rendering techniques in a browser environment. As was proven by our experiments with multiple fast-visualization techniques, a customized visualization framework for bio-informatics is the sole solution to match the user’s speed of analysis to provide an enhanced time-to-insights experience.
In conclusion, bio-informatics visualization framework needs to be highly interactive and lightning fast to handle data sets in the millions. Further, from the bioinformatics pipeline provider’s perspective, scalability for a large number of concurrent users and security of data are the other key attributes to be satisfied by the visualization framework, as is applicable to the other modules like data transformation and analytics modules in the pipeline.