How to start a successful Big Data Application from scratch

Let’s imagine that you are a Software Developer working in a highly innovative, data-driven start-up delivering a cutting-edge product called “Data Digger Solution”. It gathers raw data from various heterogeneous sources (e.g. social media, websites, CRMs, online sales, servers, emails) and processes it to extract tangible insights using fresh semantics. These insights can be used to provide concrete and profitable interpretations (e.g. in terms of sales and web presence) to clients.

The start-up you work for is growing and signing more and more contracts with major players in various industries (finance, insurance, retail, media, etc.), which is great! Your boss is a visionary man, or maybe he just read the new IDC forecast predicting that the Big Data technology and services market will grow at a 26.4% compound annual growth rate to $41.5 billion by 2018, driven largely by wide adoption across industries (https://www.idc.com/prodserv/4Pillars/bigdata). To avoid becoming the victim of your own success, your boss has asked you to rapidly design and build a prototype of “Data Digger Solution”, aka DDS, using Big Data and Cloud technologies, and to implement it so that it keeps pace with the start-up’s unstoppable business acceleration, especially in terms of performance, reliability and scalability: “Do it fast, cheap, at scale, and don’t lose data!”

Delivering this prototype on time will exempt you from explaining to your boss why you deserve a raise! Thus motivated, you open your favorite search engine, type “build big data application”, get thousands of articles, read some of them, and by the end of the day you are left with a plethora of words such as MapReduce, Hadoop, Spark, Cassandra, Storm, VM, Linux, Cloudify, Zookeeper, Kafka, Akka, Java, Scala and maybe also Lambda Architecture. Since you are a clever guy, you understand that these are not point-and-click technologies. Yet you are puzzled: how to start the project? How to design your Big Data application? How to satisfy all the quality requirements? What architecture to adopt, keeping in mind the future evolution of the system? How to accelerate quality testing for your release?

Answering these questions (and more) is precisely the role of the DICE methodology. It consists of step-by-step, scenario-driven workflows, which can be continuously tested and validated on actual Data-Intensive Applications (DIAs). Netfective Technology leads this activity and shares with the DICE consortium its expertise in managing complex IT projects in order to build a strong, industrial solution.

The DICE project will produce 14 tools, some of which are design-focused, some runtime-oriented, and some have both design and runtime aspects. Nevertheless, all of them relate to the production environment. Indeed, DICE aims to automate Operators’ activities and postulates that Model-Driven Engineering, which has already successfully automated Developers’ tasks, is the way to do so.

DICE tools
Design
  • DICE/UML Profile
  • Simulation
  • Optimization
  • Verification
Runtime
  • Monitoring
  • Quality Testing
  • Fault Injection
Design-to-runtime
  • Delivery
Runtime-to-design
  • Configuration Optimization
  • Anomaly Detection
  • Trace Checking
  • Enhancement
General
  • DICE IDE
Design

Design tools operate on models only. The DICE/UML profile is a UML-based modeling language that allows its users to create models of the production environment at three levels of abstraction: Platform-Independent, Technology-Specific, and Deployment-Specific.

DICE Platform-Independent Models (DPIMs) specify, in a technology-agnostic way, the services on which a Big Data application depends: for example, data sources, communication channels, processing frameworks, and storage systems. Designers can add quality-of-service expectations, such as the performance a service must meet in order to be useful to the application.

DICE Technology-Specific Models (DTSMs) are refinements of DPIMs in which every technological alternative is decided. For instance, Apache Kafka can be selected as a communication channel, Apache Spark as a processing framework, and Apache Cassandra as both a data source and a storage system. DTSMs do not settle infrastructure and platform options; these are resolved in DICE Deployment-Specific Models (DDSMs). DDSMs elucidate deployments of technologies onto infrastructures and platforms: for instance, how Cassandra will be deployed onto a given private/public Cloud.

The simulation tool works with DPIMs or DTSMs and simulates the behavior of the Big Data frameworks therein (e.g., Spark). The optimization tool performs multiple calls to the simulation tool to refactor a DDSM and produce a new one with an optimized deployment scenario. The verification tool explores the possible runs of formal mathematical models of the technologies mentioned in a DTSM (e.g., Storm) and checks whether certain behavioral properties are satisfied.
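To make the DTSM level concrete, here is a minimal sketch of how the technology choices above would surface in application code: a Java producer publishing a raw event to a Kafka topic, from which a Spark job would later read before persisting results to Cassandra. The broker address, topic name and payload are illustrative assumptions, not part of the DICE models.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Minimal sketch: publish one raw event to the Kafka topic chosen in the DTSM.
    // Broker address and topic name ("raw-events") are illustrative assumptions.
    public class RawEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // A Spark streaming job (the DTSM's processing framework) would
                // consume this topic and write its results to Cassandra.
                producer.send(new ProducerRecord<>("raw-events", "user-42", "{\"clicks\": 17}"));
            }
        }
    }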

Runtime

In contrast with design tools, runtime tools examine or modify the production environment directly, not models of it. The monitoring tool collects runtime metrics about the components present in a production environment. The quality testing tool and the fault injection tool inject workloads and faults into the production environment; for instance, shutdowns of computational resources.
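As an illustration of what injecting such a fault can look like, the following sketch stops a service on a randomly chosen worker node. It is an assumption-laden stand-in, not the DICE fault injection tool’s actual interface: the host names, SSH access and systemd service name are all hypothetical.

    import java.util.List;
    import java.util.Random;

    // Fault-injection sketch: shut down one randomly chosen computational
    // resource. Hosts and the stop command are hypothetical assumptions.
    public class RandomWorkerKiller {
        public static void main(String[] args) throws Exception {
            List<String> workers = List.of("worker-1", "worker-2", "worker-3");
            String victim = workers.get(new Random().nextInt(workers.size()));
            // Stop the Spark worker service on the chosen node via ssh.
            new ProcessBuilder("ssh", victim, "sudo", "systemctl", "stop", "spark-worker")
                    .inheritIO().start().waitFor();
            System.out.println("Injected fault: stopped " + victim);
        }
    }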

Design To Runtime

Some tools cannot be classified as design or runtime tools because they have both design and runtime facets. For instance, the delivery tool is a model-to-runtime (M2R) tool that generates a production environment from a DDSM. Conversely, the configuration optimization, anomaly detection, trace checking, and enhancement tools are runtime-to-model (R2M) tools that suggest revisions of models of the production environment based on data gathered by the monitoring tool. The configuration optimization tool finds optimal values for the technological and infrastructural configuration parameters recorded in a DDSM. The anomaly detection, trace checking, and enhancement tools analyze monitoring data in order to:

  • detect anomalous changes in performance across executions of different versions of a Big Data application (a simplified version of this check is sketched after this list);
  • check that logical properties expressed in a DTSM are maintained when the program runs;
  • search for anti-patterns in a DTSM.
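To give an idea of the first analysis, this sketch flags a performance regression when the mean latency of a new version exceeds that of the previous one by more than a tolerance. The sample latencies and the 10% threshold are illustrative assumptions; the actual anomaly detection tool works on real monitoring data with more sophisticated statistics.

    import java.util.Arrays;

    // Sketch of a cross-version performance check: flag a regression when
    // the new version's mean latency degrades beyond a tolerance.
    public class RegressionCheck {
        static double mean(double[] xs) { return Arrays.stream(xs).average().orElse(0); }

        public static void main(String[] args) {
            double[] v1LatenciesMs = {110, 120, 115, 118};  // monitoring data, version 1 (assumed)
            double[] v2LatenciesMs = {150, 160, 155, 158};  // monitoring data, version 2 (assumed)
            double tolerance = 0.10;                        // allow 10% degradation (assumed)
            boolean anomalous = mean(v2LatenciesMs) > mean(v1LatenciesMs) * (1 + tolerance);
            System.out.println(anomalous ? "Possible performance anomaly" : "Within tolerance");
        }
    }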

Application code, models, and monitoring data are saved in a shared repository, and all tools can be invoked from the DICE IDE. The DICE Knowledge Repository (created and managed by Netfective Technology) contains tutorials, videos and documentation for each tool: https://github.com/dice-project/DICE-Knowledge-Repository/wiki/DICE-Knowledge-Repository.

DICE ecosystem

The DICE methodology adapts to the purpose of the user. Although DICE advertises a Model-Driven approach to Big Data software development, its tools can also be employed piecemeal. We consider the employment of DICE tools in three use case scenarios:
Standalone – The user has identified a specific need that can be met using one or more DICE tools. For example, if the user has a running Big Data application and needs to gather runtime metrics, he will use the monitoring tool (a sketch of such a standalone use follows this list). In this scenario, the user only has to follow a tutorial or read the documentation of that tool.
Architecture Verification and Simulation – A development team has to implement a piece of software that must respect a list of requirements. Before starting construction, they want to study simulations of possible production environments in order to predict behavior and performance for different implementation plans.
DevOps – A team of Developers has built a piece of software and wants to automate (1) the creation of a matching production environment, (2) the deployment of their program into it, and (3) the monitoring of its behavior in reaction to the actions performed by their application. Normally, these three tasks are assigned to Operators. By automating them, DICE is “augmenting DevOps with NoOps”, where NoOps stands for “no human operations”.
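As a taste of the standalone scenario, the sketch below polls a monitoring endpoint for the latest metrics of a deployed component. The URL, query parameter and response format are purely hypothetical; the real monitoring tool’s interface is described in its documentation in the DICE Knowledge Repository.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Standalone-scenario sketch: fetch recent metrics over HTTP.
    // The endpoint URL and its parameters are hypothetical, not the
    // actual API of the DICE monitoring tool.
    public class MetricsPoller {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://monitoring.example.com/metrics?component=spark-worker-1"))
                    .GET()
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Latest metrics: " + response.body());
        }
    }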

Each of these scenarios requires actors to perform identified tasks with existing tools. From design to deployment, they are guided and assisted by the DICE IDE, which interacts with the relevant DICE design and runtime tools. The workflows allow iterations between the various steps in order to better meet the designer’s requirements and let users take full advantage of the DevOps capabilities of the DICE toolset.

The DICE philosophy built into the DICE IDE proposes an innovative architecture that keeps the entire environment flexible and extensible. The Eclipse Platform (http://www.eclipse.org/) and the Papyrus Modeling Environment (https://eclipse.org/papyrus/) were chosen deliberately, mainly because of their built-in extension mechanisms, which are widely adopted by developers, i.e., the potential end-users of DICE (you). Extensibility matters even more in DICE, which lets users adapt and enrich the list of supported Big Data technologies, a list that will certainly keep growing. Established solutions such as Spark, Cassandra and Hadoop are already supported, but more and more new solutions will emerge. Thanks to Eclipse and to the DICE extension mechanisms, DICE users will be able to integrate these technologies with little effort and thus benefit from the whole DICE ecosystem. This extensibility is also part of the methodology itself.

To reiterate what was promised at the start, the DICE methodology will (1) guide you through the steps to build an efficient architecture, to test it, to simulate it, to optimize it and to deploy your DIA, and (2) adapt itself to your needs. Before I thank you for reading this post, let me tell you that the astonishing growth in data will profoundly affect businesses, and this fictional story will become, in the near future, an actual challenge for many SMEs. Coming posts and deliverables will give more details about the DICE methodology.


Written by Youssef RIDENE & Joas KINOUANI (Netfective Technology – www.bluage.com).

Mr. Joas KINOUANI joined Netfective Technology in 2015 as an R&D Software Engineer. He is involved in the EU H2020 project www.dice-h2020.eu, mainly to design and implement a validation use case. Joas received his Master’s degree in Computer Science in 2015 from the University of Bordeaux (France).

Netfective Technology joined leading organizations and universities across Europe in DICE (www.dice-h2020.eu), a collaborative research project funded under the ICT theme of the HORIZON 2020 Research Program of the European Union. The DICE consortium, coordinated by Imperial College (UK), also includes ProDevelop (Spain), Flexiant (UK), University of Zaragoza (Spain), Athens Technology Centre (Greece), Xlab Razvoj (Slovenia), Institutul E-AUS (Romania) and Politecnico di Milano (Italy).
