dc.contributor
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors
dc.contributor
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.contributor
Universitat Politècnica de Catalunya. PM - Programming Models
dc.contributor.author
Boné Ribó, Aleix
dc.contributor.author
Aguirre López, Alejandro
dc.contributor.author
Álvarez Robert, David
dc.contributor.author
Martínez Ferrer, Pedro José
dc.contributor.author
Beltran Querol, Vicenç
dc.date.accessioned
2026-03-27T12:54:10Z
dc.date.available
2026-03-27T12:54:10Z
dc.identifier
Boné, A. [et al.]. A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs. «Future generation computer systems», July 2026, vol. 180, article no. 108383.
dc.identifier
https://arxiv.org/abs/2602.21897
dc.identifier
https://hdl.handle.net/2117/459534
dc.identifier
10.1016/j.future.2026.108383
dc.identifier.uri
https://hdl.handle.net/2117/459534
dc.description.abstract
Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton, which are sometimes combined with optimized vendor math libraries such as cuBLAS and oneAPI. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms, so combining them within a single application is error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware libraries (TA-libs) to overcome these limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API. Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime. The methodology is evaluated on a multi-core server and a GPU-accelerated node using two contrasting workloads: the GPT-2 pre-training phase, representative of modern AI pipelines, and the HPCCG conjugate-gradient benchmark, representative of traditional HPC. From a performance standpoint, monolithic-kernel and fork-join executions are comparable, in both execution time and memory footprint, to a coarse-grained task-based formulation on both GPU-accelerated and multi-core systems.
On the latter, unifying all runtimes through nOS-V mitigates interference and delivers performance on par with using a single runtime in isolation. These results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.
dc.description.abstract
This work has received support from multiple funding sources. It is part of the ST4HPC project (PID2023-147979NB-C21), funded by MCIN/AEI/10.13039/501100011033 and by FEDER, UE. This work is also promoted by the Barcelona Zettascale Laboratory, backed by the Ministry for Digital Transformation and of Public Services, within the framework of the Recovery, Transformation, and Resilience Plan, funded by the European Union's NextGenerationEU. This work has received funding from the DARE SGA1 Project, from the European High-Performance Computing Joint Undertaking (JU) under Grant Agreement No 101202459, and from the PCI2024-161687-3 Project funded by MICIU/AEI/10.13039/501100011033 and the European Union's NextGenerationEU/PRTR. Additional support was provided by a Ramón y Cajal fellowship (RYC2019-027592-I), funded by MCIN/AEI/10.13039/501100011033 and ESF/10.13039/501100004895, and by the Severo Ochoa Centre of Excellence programme (CEX2021-001148-S), also funded by MCIN/AEI. Finally, the Programming Models research group at BSC-UPC received support from the Departament de Recerca i Universitats de la Generalitat de Catalunya under grants 2021 SGR 01007 and 2025 STEP 00523.
dc.description.abstract
Peer Reviewed
dc.description.abstract
Postprint (author's final draft)
dc.format
application/pdf
dc.relation
https://www.sciencedirect.com/science/article/abs/pii/S0167739X26000178
dc.rights
http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights
Restricted access - publisher's policy
dc.rights
Attribution-NonCommercial-NoDerivatives 4.0 International
dc.subject
Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors
dc.subject
Heterogeneous computing
dc.subject
Task-based programming
dc.subject
Data-flow execution
dc.subject
Accelerator APIs
dc.subject
OpenMP offload
dc.subject
Task-aware libraries
dc.subject
Runtime interoperability
dc.title
A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs