Evaluating replication for parallel jobs: an efficient approach

Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, req...

Bibliographic Details
Main Authors: Qiu, Zhan; Pérez, Juan F.
Format: Article
Language: English
Published: IEEE 2015
Subjects: Time factors; Reliability; Correlation; Program processors; Computational modeling; Absorption; Servers
Online Access: https://repository.urosario.edu.co/handle/10336/27694
https://doi.org/10.1109/TPDS.2015.2496593
id ir-10336-27694
recordtype dspace
spelling ir-10336-27694 2021-09-23T17:38:12Z
Evaluación de la replicación para trabajos paralelos: un enfoque eficiente
2015-10-30
2020-08-19T14:43:23Z
info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
ISSN: 1045-9219
EISSN: 1558-2183
eng
info:eu-repo/semantics/restrictedAccess
application/pdf
IEEE Transactions on Parallel and Distributed Systems
institution EdocUR - Universidad del Rosario
collection DSpace
language English
topic Time factors
Reliability
Correlation
Program processors
Computational modeling
Absorption
Servers
description Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs.
format Article
author Qiu, Zhan
Pérez, Juan F.
title Evaluating replication for parallel jobs: an efficient approach
publisher IEEE
publishDate 2015
url https://repository.urosario.edu.co/handle/10336/27694
https://doi.org/10.1109/TPDS.2015.2496593
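
The description field in this record summarizes concurrent replication with canceling: two replicas run at once, the first result is kept, and the other replica is canceled. The following is a minimal sketch of that idea, not the authors' stochastic model. It assumes exponentially distributed subtask times with rate 1, no queueing, and no failures; the function name `simulate` and its parameters are hypothetical. It compares the mean response time of a job without replication, with job-level replication (the faster of two whole-job replicas), and with task-level replication (the faster of two replicas for each subtask).

```python
import random


def simulate(n_tasks=20, n_jobs=20000, seed=0):
    """Estimate mean job response time under three replication policies.

    Assumptions (not from the paper): subtask times are i.i.d. exponential
    with rate 1, every subtask starts immediately (no queueing), and no
    replica ever fails.
    """
    rng = random.Random(seed)
    totals = {"none": 0.0, "job": 0.0, "task": 0.0}
    for _ in range(n_jobs):
        # Subtask durations for the two replicas of the same job.
        a = [rng.expovariate(1.0) for _ in range(n_tasks)]
        b = [rng.expovariate(1.0) for _ in range(n_tasks)]
        # No replication: the job ends when its slowest subtask ends.
        totals["none"] += max(a)
        # Job-level replication: keep whichever whole replica ends first.
        totals["job"] += min(max(a), max(b))
        # Task-level replication: keep the faster copy of each subtask.
        totals["task"] += max(min(x, y) for x, y in zip(a, b))
    return {policy: total / n_jobs for policy, total in totals.items()}


if __name__ == "__main__":
    for policy, mean_time in simulate().items():
        print(f"{policy:>4}: {mean_time:.3f}")
```

Under these assumptions, task-level replication is never slower than job-level replication on any sample, because the slowest of the per-subtask winners cannot finish later than either complete replica. The effects of utilization, subtask reliability, and correlated failures that the abstract mentions are outside the scope of this sketch.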