Stream and Batch Processing Techniques

Stream Processing

One of the major goals set in the VaVeL project is to process and analyze in real-time heterogeneous urban data streams using an elastic and resilient infrastructure. More specifically, the VaVeL infrastructure would need to effectively handle a varied number of input streaming data in order to provide a set of services. For this purpose, novel approaches are required for automatically determining the number of machines required at each time in order to ensure smooth and real-time processing . It addition, avoiding wasting the computational resources is essential. Under the VaVeL use cases, both for Dublin and Warsaw, there is the need to analyze, in a streaming fashion, massive data that exhibit a high variation in their input load and this way pose a number of challenges for real-time processing.

The real-time processing modules of the received data streams will provide beneficial aid to the traffic authorities, allowing them to monitor automatically the data streams and extract meaningful event when they occur. These modules will satisfy the following three main requirements of the traffic monitoring systems:

Work in real time
Be able to process large amounts of streaming data
Respond quickly in unusual conditions.

Nevertheless there are several issues that should be considered when building real-time stream processing components, including (i) the varying input load, (ii) the limited available computing resources, (iii) the costs associated with reconfiguring the system and (iv) the massive volume of the input data (v) monetary cost. These problems, that appear in the processing of urban data, are described bellow:

Varying Load: The volume of the data retrieved from the urban environment often exhibits varying input load. Commonly, during the rush hours increased input load is expected while during the night much fewer input load is expected. This fact poses challenges is selecting the computing resources to use. Clearly, a dynamic way to select what resources to use at each time is required. In D3.2, Section 2.2 we describe an efficient technique to accomplish that by utilizing powerful forecasting models that are capable to accurately predict the load in the short future.
Limited Resources: The limited number of the available computational resources is possible to hinder the real-time processing of the traffic data. Thus techniques that efficiently utilize the available resources are necessary. The techniques described in D3.2, Section 2.1 (ZZP+16) and in D3.2, Section 2.3 (NZ17) deal with these issues.
Reconfiguration Costs: Identifying the appropriate number of resources to use is required in order to adapt to the sudden changes of the input load. However, when deciding about moving to a new configuration with more resources, the communication and computational costs of moving to that specific configuration should be taken into account. In D3.2, Section 2.1 and in ZZP+16 we propose a method that is able to minimize the reconfiguration cost by limiting the number of data transmissions between different computing nodes.
Massive Volume: In urban environments massive amounts of data are generated by heterogeneous data sources (i.e. vehicles moving at the city, CCTV cameras, citizens reports, etc.). It is required to build systems that are able to cope with this large volume of received data and guarantee real-time and stable performance. In D3.2, Section 2.3 and in NZ17 we present a technique that achieves this by compressing the data.
Monetary Costs: A common solution, when dealing with massive amounts of stream data, is to rent computational resources available in the cloud. The usage of such resources should be very thoughtful, without acquiring more resources than the necessary. Renting more resources than the required often results to the waste of the available monetary budget. Thus techniques that automatically detect the exact number of the required resources are required. A technique that automatically decides the number of the computational resources to use is described in D3.2, Section 2.2.

Batch Processing

In the context of the VaVeL project it is necessary to analyze and process a large volume of historical data from the different data sources in order to build prediction models and complex event processing rules that enable us to detect events of interest. More specifically, both for Dublin (D7.1) and Warsaw (D8.1) use cases, there is a need to analyze a high volume of data so the VaVel infrastructures utilizes batch processing analytical components.

There are several challenges that should be addressed when running batch processing analytical components, including: (i) determining the number of resources that should be reserved by the jobs, (ii) adjusting the jobs’ configuration parameters and (iii) enabling the cost-efficient scheduling of components across multiple MapReduce clusters. Furthermore, we should be able to provide solutions that quickly search the parameters space to optimize the balance between performance, complexity and cost. These problems, that appear in the batch processing of urban data, are described in more details bellow:

Jobs’ Resource Allocation: When multiple MapReduce jobs are submitted concurrently in a cluster (like in the VaVel system where the analytical components submit MapReduce jobs that process data from the different data sources) it is important to determine the resources ( i.e., machines) that should be allocated to each job in order to minimize the total end-to-end execution time. In D3.2, Section 3.1 and ZK16b we describe a novel Pareto-based scheduler that is able to determine the resources that should be allocated to each job to minimize the workload’s end-to-end execution and at the same time minimize the spending budget when jobs execute in public cloud infrastructures.
Jobs’ Parameters Tuning: It is of utmost importance to adjust efficiently the con- figuration parameters of MapReduce jobs as they have significant impact both on the jobs’ performance and their monetary cost. In D3.2, Section 3.1 and in ZK16b we provide a technique that enables us to tune the buffer-size of map tasks in order to miminize their I/O operations and thus their execution time and required budget.
Scheduling across multiple MapReuce clusters: In VaVel we receive input from multiple heterogeneous data sources so we need to execute multiple batch processing jobs which tend to vary in terms of data size and processing requirements. Therefore, assigning all jobs to a single MapReduce cluster may deteriorate their performance. So an approach that can be followed is the use of multiple MapReduce clusters in order to have jobs isolation. However, it is not an easy task to determine in which cluster each job should be assigned. In D3.2, Section 3.2 and in ZK16a we provide a novel framework, called ChEsS , that automatically determines in which cluster each job should be allocated considering the budget/makespan tradeoff. Our approach exploits the Adaptive Weighted Sum (AWS) technique and is able to significantly reduce the search time compared to state of the art techniques.

References:

ZZP+16	N. Zacheilas, N. Zygouras, N. Panagiotou, V. Kalogeraki, and D. Gunopulos. Dynamic load balancing techniques for distributed complex event processing systems. In Distributed Applications and Interoperable Systems , pages 174–188. Springer, 2016.
NZ17	V. K. N. Zacheilas. Disco: Dynamic data compression in distributed stream processing systems. In Distributed Applications and Interoperable Systems. Springer, 2017.
ZK16a	N. Zacheilas and V. Kalogeraki. Chess: Cost-effective scheduling across multiple heterogeneous mapreduce clusters. In Autonomic Computing (ICAC), 2016 IEEE International Conference on, pages 65–74. IEEE, 2016.
ZK16b	Pareto-based Scheduling of MapReduce Workloads, N. Zacheilas, V. Kalogeraki, ISORC 2016, York, UK, May 2016

Log in to post comments

Stream and Batch Processing Techniques

Stream Processing

Batch Processing

References:

Acknowledgement