basnijholt 12 hours ago

I've developed PipeFunc, a new Python library designed to simplify the creation and execution of DAG-based computational pipelines, specifically targeting scientific computing and data analysis workflows. It's built for speed and ease of use: minimal boilerplate, minimal per-call overhead.

Key features:

• Automatic Dependency Resolution: PipeFunc automatically determines the execution order of functions based on their dependencies, eliminating the need for manual dependency management. You define the relationships, and PipeFunc figures out the order (see the first sketch after this list).

• Ultra-Low Overhead: The library adds roughly 15 µs of overhead per function call, making it suitable for performance-critical applications.

• Effortless Parallelism: PipeFunc automatically parallelizes independent tasks and is compatible with any `concurrent.futures.Executor`, so you can easily use multiple cores or distribute computation across a cluster (e.g., using SLURM); the second sketch after this list passes an executor to a parallel map.

• Built-in Parameter Sweeps: The `mapspec` feature provides a concise way to define and execute N-dimensional parameter sweeps, which is often crucial in scientific experiments, simulations, and hyperparameter optimization. It uses an index-based approach to run sweeps in parallel with minimal overhead (demonstrated in the second sketch below).

• Advanced Caching: Multiple caching options help avoid redundant computations, saving time and resources (see the third sketch after this list).

• Type Safety: PipeFunc leverages Python's type hinting to validate the consistency of data types across the pipeline, reducing the risk of runtime errors.

• Debugging Support: Includes an `ErrorSnapshot` feature that captures detailed error state information, including the function, arguments, traceback, and environment, to simplify debugging and error reproduction.

• Visualization: PipeFunc can generate visualizations of your pipeline to aid in understanding and debugging.
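
To give a feel for the API, here's a minimal sketch of a two-step pipeline (simplified; see the docs for the full API). Dependencies are inferred from argument names and `output_name`s:

```python
from pipefunc import Pipeline, pipefunc

@pipefunc(output_name="c")
def f(a, b):
    return a + b

@pipefunc(output_name="d")
def g(c, x=1):
    return c * x

# Execution order is inferred from argument/output names:
# g needs c, which f produces, so f runs first.
pipeline = Pipeline([f, g])

print(pipeline("d", a=1, b=2))  # c = 3, then d = 3 * 1 -> 3
pipeline.visualize()  # draw the DAG
```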
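And a sketch of an N-dimensional sweep with `mapspec`. The executor is interchangeable; a `ThreadPoolExecutor` keeps the example self-contained, but a process pool or a cluster-backed executor plugs in the same way:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

from pipefunc import Pipeline, pipefunc

# "a[i], b[j] -> y[i, j]" declares a 2D sweep over the cross product of a and b.
@pipefunc(output_name="y", mapspec="a[i], b[j] -> y[i, j]")
def multiply(a, b):
    return a * b

# No mapspec here: this function receives the full 2D array and reduces it.
@pipefunc(output_name="total")
def total_sum(y):
    return np.sum(y)

pipeline = Pipeline([multiply, total_sum])

# Each element of the sweep is an independent task, dispatched to the executor.
results = pipeline.map(
    {"a": [1, 2, 3], "b": [10, 20]},
    executor=ThreadPoolExecutor(),
)
print(results["total"].output)  # (1+2+3) * (10+20) -> 180
```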
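Caching is opt-in per function, roughly like this (a sketch from memory; the exact parameter names and available cache backends are in the docs):

```python
import time

from pipefunc import Pipeline, pipefunc

@pipefunc(output_name="c", cache=True)
def slow_add(a, b):
    time.sleep(2)  # stand-in for an expensive computation
    return a + b

# cache_type selects the backend, e.g. an in-memory LRU cache;
# a disk-based cache persists results across runs.
pipeline = Pipeline([slow_add], cache_type="lru")

pipeline("c", a=1, b=2)  # computed (~2 s)
pipeline("c", a=1, b=2)  # returned from the cache
```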

Comparison with existing tools:

• vs. Dask: PipeFunc provides a higher-level, declarative approach to pipeline construction. It automatically handles task scheduling and execution based on function definitions and `mapspec`s, whereas Dask requires more explicit task definition.

• vs. Luigi/Airflow/Prefect/Kedro: These tools are primarily designed for ETL and event-driven workflows. PipeFunc, in contrast, is optimized for scientific computing and computational workflows that require fine-grained control over execution, resource allocation, and parameter sweeps.

Use Cases:

• Scientific simulations and data analysis

• Machine learning pipelines (preprocessing, training, evaluation)

• High-performance computing (HPC) workflows

• Complex data processing tasks

• Any scenario involving interconnected functions where performance and ease of use are important

I'd appreciate any feedback, especially regarding performance, usability, and potential applications in different scientific domains.

Links:

Documentation: https://pipefunc.readthedocs.io

Source Code: https://github.com/pipefunc/pipefunc