HPC meets interactive Data Science and Machine Learning
Carme (/ˈkɑːrmiː/ KAR-mee; Greek: Κάρμη) is a moon of Jupiter that also gives its name to a cluster of Jupiter moons (the Carme group).
or in our case…
an open-source framework to manage resources for multiple users running interactive jobs on a cluster of (GPU) compute nodes.
Presentations
Marketing Slides
Core Idea
We combine established open-source ML and DS tools with HPC backends. To do so, we use:
- Singularity containers
- Anaconda environments
- web-based GUI frontends, e.g. Code-Server and JupyterLab
- completely web-frontend based (OS independent, no installation needed on the user side)
- HPC job management and schedulers (SLURM)
- HPC data I/O technologies like Fraunhofer’s BeeGFS
- HPC maintenance and monitoring tools
Job submission scheme
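For illustration, a job under this scheme could be expressed as a SLURM batch script that launches JupyterLab inside a Singularity container. This is a minimal hand-written sketch of the general pattern; the container path, port, and resource values are hypothetical and not Carme's actual generated script.

```shell
#!/bin/sh
# Sketch only: write a SLURM job script that starts JupyterLab inside a
# Singularity container. Image path, port, and limits are illustrative.
cat > carme_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=carme-interactive
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=08:00:00

# --nv makes the host GPU driver available inside the container.
singularity exec --nv /containers/ml-base.sif \
    jupyter lab --no-browser --port=8888
EOF
echo "wrote carme_job.sh; submit with: sbatch carme_job.sh"
```

In Carme, scripts of this kind are created and submitted on the user's behalf; the web frontend then connects to the GUI running in the job.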
Key Features
- Open source
- Carme itself and all components we use are open source and allow commercial usage
- Seamless integration with available HPC tools
- Job scheduling via SLURM
- Native LDAP support for user authentication
- Integrate existing distributed file systems like BeeGFS
- Access via web-interface
- OS independent (only web browser needed)
- requires 2FA
- Full user information (running jobs, cluster usage, news / messages)
- Start/Stop jobs within the web-interface
- Interactive jobs
- Flexible access to accelerators
- Access via web-driven GUIs like Code-Server or JupyterLab
- Job specific monitoring information in the web-interface
(GPU/FPGA/CPU utilization, memory usage, access to TensorBoard)
- Distributed multi-node and/or multi-GPU jobs
- Easy and intuitive job scheduling
- Directly use GPI, GPI-Space, MPI, HP-DLF and Horovod within the jobs
- Full control over accounting and resource management
- Job scheduling according to user/project specific roles
- Compute resources are user/project exclusive
- User maintained, containerized environments
- Singularity containers (run as normal user; GPU, Ethernet and InfiniBand support)
- Anaconda environments (easy updates, project/user-specific environments)
- Built-in matching between GPU driver and ML/DL tools
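As a sketch of what a user-maintained, containerized environment can look like, the following writes a Singularity definition file that installs Miniconda and creates a project-specific Anaconda environment. The base image, environment name, and package list are illustrative assumptions, not a Carme-provided recipe.

```shell
#!/bin/sh
# Sketch only: a Singularity definition combining a CUDA base image with a
# user-managed conda environment. All names and versions are examples.
cat > ml-env.def <<'EOF'
Bootstrap: docker
From: nvidia/cuda:12.2.0-runtime-ubuntu22.04

%post
    apt-get update && apt-get install -y wget
    # Install Miniconda and create a project-specific environment.
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/conda.sh
    bash /tmp/conda.sh -b -p /opt/conda
    /opt/conda/bin/conda create -y -n ml python=3.11 numpy

%runscript
    exec /opt/conda/bin/conda run -n ml "$@"
EOF
echo "build without root privileges with: singularity build --fakeroot ml-env.sif ml-env.def"
```

Because Singularity containers run as the submitting user, images like this can be built and updated without administrator involvement.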
Roadmap
- 04/2024: ISC 2024 release
- Improvements
- Code improvements for multi-type accelerator support
- Planned features
- One-line installation script
- DevOps tools implementation (e.g. MLflow)
- User documentation advanced options (e.g. parallel computation)
- Mail notifications that are in sync with our JS notifications in the frontend
- Batch job support
Releases
- 04/2018: Carme prototype at ITWM
- 03/2019: r0.3.0 (first public release)
- 07/2019: r0.4.0
- 11/2019: r0.5.0
- 12/2019: r0.6.0
- 07/2020: r0.7.0
- 11/2020: r0.8.0
- 08/2021: r0.9.0
- 05/2022: r0.9.5
- 09/2022: r0.9.6
- 08/2023: r0.9.7
- 12/2023: r0.9.8 (latest)
Documentation
Visit our documentation.
Who is behind Carme?
Carme is developed by the Competence Center for High Performance Computing at Fraunhofer ITWM.
Contact
→ christian.ortiz@itwm.fraunhofer.de
Sponsors
The development of Carme is financed by research grants from