I've developed PipeFunc, a new Python library designed to simplify the creation and execution of DAG-based computational pipelines, specifically targeting scientific computing and data analysis workflows. It's built for speed and ease of use, with a focus on minimizing boilerplate and maximizing performance.
Key features:
• Automatic Dependency Resolution: PipeFunc automatically determines the execution order of functions based on their dependencies, eliminating the need for manual dependency management. You define the relationships, and PipeFunc figures out the order.
• Ultra-Low Overhead: The library introduces minimal overhead, measured at around 15µs per function call. This makes it suitable for performance-critical applications.
• Effortless Parallelism: PipeFunc automatically parallelizes independent tasks, and it's compatible with any `concurrent.futures.Executor`. This allows you to easily leverage multi-core processors or even distribute computation across a cluster (e.g., using SLURM).
• Built-in Parameter Sweeps: The `mapspec` feature provides a concise way to define and execute N-dimensional parameter sweeps, which is often crucial in scientific experiments, simulations, and hyperparameter optimization. It uses an index-based approach to do this in parallel with minimal overhead.
• Advanced Caching: Multiple caching options help avoid redundant computations, saving time and resources.
• Type Safety: PipeFunc leverages Python's type hinting to validate the consistency of data types across the pipeline, reducing the risk of runtime errors.
• Debugging Support: Includes an `ErrorSnapshot` feature that captures detailed error state information, including the function, arguments, traceback, and environment, to simplify debugging and error reproduction.
• Visualization: PipeFunc can generate visualizations of your pipeline to aid in understanding and debugging.
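To make the automatic dependency resolution concrete, here is a minimal sketch of the underlying idea using only the standard library: each function declares what it consumes and produces, and a topological sort derives the execution order. This is a toy illustration of the concept, not PipeFunc's actual API or implementation; the `registry` structure and `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Toy pipeline: each function consumes named inputs and produces one output.
def add(a, b):        # produces "c"
    return a + b

def scale(c, factor): # produces "d"
    return c * factor

# output name -> (function, input names) — a hypothetical registry
registry = {
    "c": (add, ("a", "b")),
    "d": (scale, ("c", "factor")),
}

def run(outputs_wanted, **inputs):
    # Each output depends on its inputs; the sorter yields a valid order.
    graph = {out: set(deps) for out, (_, deps) in registry.items()}
    order = [name for name in TopologicalSorter(graph).static_order()
             if name in registry]  # raw inputs need no computation
    values = dict(inputs)
    for name in order:
        func, deps = registry[name]
        values[name] = func(*(values[d] for d in deps))
    return {o: values[o] for o in outputs_wanted}

print(run(["d"], a=1, b=2, factor=10))  # {'d': 30}
```

In PipeFunc the graph is built for you from the function signatures, so there is no registry to maintain by hand.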
Comparison with existing tools:
• vs. Dask: PipeFunc provides a higher-level, declarative approach to pipeline construction. It automatically handles task scheduling and execution based on function definitions and `mapspec`s, whereas Dask requires more explicit task definition.
• vs. Luigi/Airflow/Prefect/Kedro: These tools are primarily designed for ETL and event-driven workflows. PipeFunc, in contrast, is optimized for scientific computing and computational workflows that require fine-grained control over execution, resource allocation, and parameter sweeps.
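The index-based sweep idea behind `mapspec` can be sketched with the standard library: enumerate the Cartesian product of parameter axes, dispatch each index to an executor, and reshape the flat results back into a grid. Again, this is the concept only, assuming a stand-in `simulate` function, not PipeFunc's implementation.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def simulate(x, y):  # stand-in for an expensive model evaluation
    return x * 10 + y

xs = [1, 2, 3]
ys = [0.0, 0.5]

# Flatten the 2-D sweep into an indexed list of points (3 x 2 = 6)
points = list(itertools.product(xs, ys))

# Each point is independent, so they can run in parallel
with ThreadPoolExecutor() as ex:
    results = list(ex.map(lambda p: simulate(*p), points))

# Recover the (len(xs), len(ys)) grid from the flat index order
grid = [results[i * len(ys):(i + 1) * len(ys)] for i in range(len(xs))]
print(grid)  # [[10.0, 10.5], [20.0, 20.5], [30.0, 30.5]]
```

PipeFunc does this bookkeeping for you from a declarative spec (e.g. mapping `x[i], y[j]` to `out[i, j]`), and can hand the independent tasks to any `concurrent.futures.Executor`.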
Use Cases:
• Scientific simulations and data analysis
• Machine learning pipelines (preprocessing, training, evaluation)
• High-performance computing (HPC) workflows
• Complex data processing tasks
• Any scenario involving interconnected functions where performance and ease of use are important
I'd appreciate any feedback, especially regarding performance, usability, and potential applications in different scientific domains.
Links:
Documentation: https://pipefunc.readthedocs.io
Source Code: https://github.com/pipefunc/pipefunc