
Stream and Batch Processing Techniques

Stream Processing

One of the major goals of the VaVeL project is to process and analyze heterogeneous urban data streams in real time using an elastic and resilient infrastructure. More specifically, the VaVeL infrastructure needs to effectively handle a varying number of input data streams in order to provide a set of services. For this purpose, novel approaches are required for automatically determining the number of machines needed at each point in time to ensure smooth, real-time processing. In addition, it is essential to avoid wasting computational resources. Under the VaVeL use cases, both for Dublin and Warsaw, there is the need to analyze, in a streaming fashion, massive data that exhibit high variation in their input load and thus pose a number of challenges for real-time processing.

The real-time processing modules for the received data streams will provide valuable aid to the traffic authorities, allowing them to automatically monitor the data streams and extract meaningful events as they occur. These modules will satisfy the following three main requirements of traffic monitoring systems:

  1. Work in real time
  2. Be able to process large amounts of streaming data
  3. Respond quickly under unusual conditions.

Nevertheless, there are several issues that should be considered when building real-time stream processing components, including (i) the varying input load, (ii) the limited available computing resources, (iii) the costs associated with reconfiguring the system, (iv) the massive volume of the input data and (v) the monetary cost. These problems, which appear in the processing of urban data, are described below:

  • Varying Load: The volume of the data retrieved from the urban environment often exhibits a varying input load. Commonly, increased input load is expected during rush hours, while much lower input load is expected during the night. This fact poses challenges in selecting the computing resources to use. Clearly, a dynamic way to select which resources to use at each point in time is required. In D3.2, Section 2.2 we describe an efficient technique that accomplishes this by utilizing powerful forecasting models that are capable of accurately predicting the load in the near future (a minimal forecasting sketch follows this list).
  • Limited Resources: The limited number of available computational resources may hinder the real-time processing of the traffic data. Thus, techniques that efficiently utilize the available resources are necessary. The techniques described in D3.2, Section 2.1 (ZZP+16) and in D3.2, Section 2.3 (NZ17) deal with these issues.
  • Reconfiguration Costs: Identifying the appropriate number of resources to use is required in order to adapt to sudden changes in the input load. However, when deciding whether to move to a new configuration with more resources, the communication and computational costs of moving to that specific configuration should be taken into account. In D3.2, Section 2.1 and in ZZP+16 we propose a method that minimizes the reconfiguration cost by limiting the number of data transmissions between different computing nodes.
  • Massive Volume: In urban environments massive amounts of data are generated by heterogeneous data sources (e.g., vehicles moving in the city, CCTV cameras, citizen reports). Systems are required that can cope with this large volume of received data and guarantee real-time, stable performance. In D3.2, Section 2.3 and in NZ17 we present a technique that achieves this by compressing the data (see the compression sketch after this list).
  • Monetary Costs: A common solution, when dealing with massive amounts of stream data, is to rent computational resources available in the cloud. Such resources should be used thoughtfully, without acquiring more of them than necessary. Renting more resources than required often wastes the available monetary budget. Thus, techniques that automatically determine the exact number of required resources are needed. A technique that automatically decides the number of computational resources to use is described in D3.2, Section 2.2 (the forecasting sketch below also maps the predicted load to a machine count).
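
To make the varying-load and monetary-cost points more concrete, the following sketch shows how a short-term load forecast can drive the number of machines to provision. It is a minimal illustration only: the exponential-smoothing forecaster, the per-machine capacity constant and the input rates are assumptions made for the example, not the actual models of D3.2, Section 2.2.

    import math

    def forecast_load(history, alpha=0.5):
        # Simple exponential smoothing over recent rate measurements
        # (tuples/sec); an illustrative stand-in for the forecasting
        # models of D3.2, Section 2.2.
        level = history[0]
        for observation in history[1:]:
            level = alpha * observation + (1 - alpha) * level
        return level

    def machines_needed(predicted_load, capacity_per_machine=5000.0):
        # Map a predicted input rate to a machine count. The capacity
        # constant is assumed; in practice it would be profiled per
        # operator and per machine type.
        return max(1, math.ceil(predicted_load / capacity_per_machine))

    # Example: rising load during rush hour (assumed rates).
    recent_rates = [3200, 4100, 5600, 7800, 9400]
    prediction = forecast_load(recent_rates)
    print("predicted %.0f tuples/sec -> provision %d machines"
          % (prediction, machines_needed(prediction)))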
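
The massive-volume point can likewise be illustrated by compressing serialized tuple batches before they are shipped between computing nodes. The JSON encoding and the zlib codec below are illustrative choices, not necessarily those of the technique in D3.2, Section 2.3 (NZ17).

    import json
    import zlib

    def compress_batch(tuples, level=6):
        # Serialize and compress a batch of stream tuples before
        # transmission to the downstream operator.
        return zlib.compress(json.dumps(tuples).encode("utf-8"), level)

    def decompress_batch(blob):
        # Inverse of compress_batch, run at the receiving node.
        return json.loads(zlib.decompress(blob).decode("utf-8"))

    # Example: a batch of hypothetical bus GPS readings.
    batch = [{"bus_id": i, "lat": 53.35, "lon": -6.26, "speed": 31.5}
             for i in range(1000)]
    blob = compress_batch(batch)
    print("raw: %d bytes, compressed: %d bytes"
          % (len(json.dumps(batch).encode("utf-8")), len(blob)))
    assert decompress_batch(blob) == batch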

Batch Processing

In the context of the VaVeL project it is necessary to analyze and process a large volume of historical data from the different data sources in order to build prediction models and complex event processing rules that enable us to detect events of interest. More specifically, both for the Dublin (D7.1) and Warsaw (D8.1) use cases, there is a need to analyze a high volume of data, and thus the VaVeL infrastructure utilizes batch processing analytical components.

There are several challenges that should be addressed when running batch processing analytical components, including: (i) determining the number of resources that should be reserved for the jobs, (ii) adjusting the jobs’ configuration parameters and (iii) enabling the cost-efficient scheduling of components across multiple MapReduce clusters. Furthermore, we should be able to provide solutions that quickly search the parameter space to optimize the balance between performance, complexity and cost. These problems, which appear in the batch processing of urban data, are described in more detail below:

  • Jobs’ Resource Allocation: When multiple MapReduce jobs are submitted concurrently to a cluster (as in the VaVeL system, where the analytical components submit MapReduce jobs that process data from the different data sources), it is important to determine the resources (i.e., machines) that should be allocated to each job in order to minimize the total end-to-end execution time. In D3.2, Section 3.1 and ZK16b we describe a novel Pareto-based scheduler that is able to determine the resources that should be allocated to each job so as to minimize the workload’s end-to-end execution time and at the same time minimize the spent budget when jobs execute in public cloud infrastructures (a minimal Pareto-frontier sketch follows this list).
  • Jobs’ Parameters Tuning: It is of utmost importance to efficiently adjust the configuration parameters of MapReduce jobs, as they have a significant impact both on the jobs’ performance and on their monetary cost. In D3.2, Section 3.1 and in ZK16b we provide a technique that enables us to tune the buffer size of map tasks in order to minimize their I/O operations and thus their execution time and required budget (see the buffer-sizing sketch after this list).
  • Scheduling across multiple MapReduce clusters: In VaVeL we receive input from multiple heterogeneous data sources, so we need to execute multiple batch processing jobs which tend to vary in terms of data size and processing requirements. Therefore, assigning all jobs to a single MapReduce cluster may deteriorate their performance. An approach that can be followed is the use of multiple MapReduce clusters in order to achieve job isolation. However, it is not an easy task to determine to which cluster each job should be assigned. In D3.2, Section 3.2 and in ZK16a we provide a novel framework, called ChEsS, that automatically determines the cluster to which each job should be allocated, considering the budget/makespan tradeoff. Our approach exploits the Adaptive Weighted Sum (AWS) technique and is able to significantly reduce the search time compared to state-of-the-art techniques (a simplified weighted-sum sketch is given after this list).
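
As a sketch of the Pareto-based view in ZK16b, the snippet below keeps only the non-dominated (makespan, cost) configurations out of a set of candidate resource allocations. The candidate numbers are invented, and this brute-force filter only illustrates the dominance idea, not the scheduler’s actual search.

    def pareto_frontier(configs):
        # Keep the configurations not dominated in (makespan, cost):
        # a config dominates another if it is no worse in both
        # objectives and strictly better in at least one.
        frontier = []
        for c in configs:
            dominated = any(
                o["makespan"] <= c["makespan"] and o["cost"] <= c["cost"]
                and (o["makespan"] < c["makespan"] or o["cost"] < c["cost"])
                for o in configs)
            if not dominated:
                frontier.append(c)
        return frontier

    # Hypothetical allocations: VM count, estimated makespan (min), cost.
    candidates = [
        {"vms": 4,  "makespan": 120, "cost": 8},
        {"vms": 8,  "makespan": 70,  "cost": 14},
        {"vms": 10, "makespan": 75,  "cost": 16},  # dominated by 8 VMs
        {"vms": 16, "makespan": 64,  "cost": 30},
    ]
    for c in pareto_frontier(candidates):
        print(c)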
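
For the buffer-tuning point, the back-of-the-envelope helper below estimates how many spill-to-disk passes a map task performs for a given sort buffer, and picks the smallest candidate buffer that avoids repeated spills. The cost model is deliberately simplistic and is not the tuning procedure of ZK16b.

    import math

    def spills(map_output_mb, buffer_mb, spill_fraction=0.8):
        # Estimated number of spill-to-disk passes, assuming a spill
        # triggers once the sort buffer is spill_fraction full (a
        # simplification of Hadoop's map-side sort behaviour).
        return max(1, math.ceil(map_output_mb / (buffer_mb * spill_fraction)))

    def pick_buffer(map_output_mb, candidates_mb):
        # Smallest candidate buffer keeping the task at a single spill,
        # i.e. minimal extra I/O without over-allocating memory.
        single = [b for b in candidates_mb if spills(map_output_mb, b) == 1]
        return min(single) if single else max(candidates_mb)

    # Example: 300 MB of map output, assumed candidate buffer sizes.
    for b in (100, 256, 512):
        print("buffer %d MB -> %d spill(s)" % (b, spills(300, b)))
    print("chosen buffer: %d MB" % pick_buffer(300, [100, 256, 512]))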
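
Finally, a minimal scalarization of the budget/makespan tradeoff for assigning a job across clusters. A fixed-weight sum like this is far simpler than the Adaptive Weighted Sum search ChEsS performs (ZK16a); the cluster time/cost models are invented for the example.

    def best_cluster(job, clusters, w_time=0.5, w_cost=0.5):
        # Pick the cluster minimizing a weighted sum of normalized
        # makespan and cost estimates for this job.
        max_t = max(c["est_time"](job) for c in clusters)
        max_c = max(c["est_cost"](job) for c in clusters) or 1.0
        def score(c):
            return (w_time * c["est_time"](job) / max_t +
                    w_cost * c["est_cost"](job) / max_c)
        return min(clusters, key=score)

    # Hypothetical clusters with simple linear time/cost models.
    clusters = [
        {"name": "on-prem", "est_time": lambda j: j["gb"] * 2.0,
                            "est_cost": lambda j: 0.0},
        {"name": "cloud",   "est_time": lambda j: j["gb"] * 0.8,
                            "est_cost": lambda j: j["gb"] * 0.05},
    ]
    job = {"gb": 200}  # 200 GB of input data, assumed
    print("assign to:", best_cluster(job, clusters)["name"])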

References:

ZZP+16 N. Zacheilas, N. Zygouras, N. Panagiotou, V. Kalogeraki, and D. Gunopulos. Dynamic load balancing techniques for distributed complex event processing systems. In Distributed Applications and Interoperable Systems, pages 174–188. Springer, 2016.
NZ17 N. Zacheilas and V. Kalogeraki. Disco: Dynamic data compression in distributed stream processing systems. In Distributed Applications and Interoperable Systems. Springer, 2017.
ZK16a N. Zacheilas and V. Kalogeraki. ChEsS: Cost-effective scheduling across multiple heterogeneous MapReduce clusters. In Autonomic Computing (ICAC), 2016 IEEE International Conference on, pages 65–74. IEEE, 2016.
ZK16b N. Zacheilas and V. Kalogeraki. Pareto-based scheduling of MapReduce workloads. In ISORC 2016, York, UK, May 2016.