<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="https://blog.kubeflow.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.kubeflow.org/" rel="alternate" type="text/html" /><updated>2025-05-05T11:35:53-05:00</updated><id>https://blog.kubeflow.org/feed.xml</id><title type="html">Kubeflow</title><subtitle>The Machine Learning Toolkit for Kubernetes.</subtitle><entry><title type="html">Kubeflow 1.10 Release Announcement</title><link href="https://blog.kubeflow.org/kubeflow-1.10-release/" rel="alternate" type="text/html" title="Kubeflow 1.10 Release Announcement" /><published>2025-03-26T00:00:00-05:00</published><updated>2025-03-26T00:00:00-05:00</updated><id>https://blog.kubeflow.org/kubeflow-1.10-release</id><content type="html" xml:base="https://blog.kubeflow.org/kubeflow-1.10-release/">&lt;p&gt;Kubeflow 1.10.0 delivers essential updates that enhance the flexibility, efficiency, and scalability of machine learning
workflows. The new features span across several components, improving both user experience and system performance.&lt;/p&gt;

&lt;h2 id=&quot;highlight-features&quot;&gt;Highlight features&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Trainer 2.0&lt;/li&gt;
  &lt;li&gt;New UI for Model Registry&lt;/li&gt;
  &lt;li&gt;Spark Operator as a core Kubeflow component&lt;/li&gt;
  &lt;li&gt;Kubernetes and container security (CISO compatibility)&lt;/li&gt;
  &lt;li&gt;Hyperparameter Optimization for LLM Fine-Tuning&lt;/li&gt;
  &lt;li&gt;Loop parallelism in Pipelines&lt;/li&gt;
  &lt;li&gt;New parameter distributions for Katib&lt;/li&gt;
  &lt;li&gt;Deeper Model Registry integrations with KServe&lt;/li&gt;
  &lt;li&gt;New Python SDK, OCI storage, and model caching for KServe&lt;/li&gt;
  &lt;li&gt;New security contexts and rootless Istio-CNI integrations for Spark Operator&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;kubeflow-platform-manifests--security&quot;&gt;Kubeflow Platform (Manifests &amp;amp; Security)&lt;/h2&gt;

&lt;p&gt;The Kubeflow Platform Working Group focuses on simplifying Kubeflow installation, operations, and security. See details below.&lt;/p&gt;

&lt;h3 id=&quot;manifests&quot;&gt;Manifests:&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Spark Operator 2.1.0 is included in the Kubeflow platform, although it is not yet installed by default&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/manifests/blob/master/README.md&quot;&gt;Documentation updates&lt;/a&gt; that make it easier to install,
extend and upgrade Kubeflow&lt;/li&gt;
  &lt;li&gt;For more details and future plans please consult the &lt;a href=&quot;https://github.com/kubeflow/manifests/issues/2763&quot;&gt;1.10.0&lt;/a&gt; and
&lt;a href=&quot;https://github.com/kubeflow/manifests/issues/3038&quot;&gt;1.10.1/1.11.0&lt;/a&gt; milestones&lt;/li&gt;
&lt;/ul&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Notebooks&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Dashboard&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Pipelines&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Katib&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Trainer&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;KServe&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Model Registry&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Spark&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;a href=&quot;https://github.com/kubeflow/kubeflow/issues/7459&quot;&gt;1.10&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;a href=&quot;https://github.com/kubeflow/kubeflow/tags&quot;&gt;1.10&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;a href=&quot;https://github.com/kubeflow/pipelines/releases&quot;&gt;2.4.1&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;a href=&quot;https://github.com/kubeflow/katib/releases&quot;&gt;0.18&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;a href=&quot;https://github.com/kubeflow/trainer/releases&quot;&gt;1.9&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;a href=&quot;https://github.com/kserve/kserve/releases&quot;&gt;0.14&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;a href=&quot;https://github.com/kubeflow/model-registry/releases&quot;&gt;0.2.15&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;&lt;a href=&quot;https://github.com/kubeflow/spark-operator/releases/tag/v2.1.0&quot;&gt;2.1.0&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Kubernetes&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Kind&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Kustomize&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Cert Manager&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Knative&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Istio&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Dex&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Oauth2-proxy&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1.31-1.33&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0.26&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;5.4.3&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1.16.1&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1.16&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1.24&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;2.41&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;7.7&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;security&quot;&gt;Security:&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;CVE reductions - regular scanning with Trivy&lt;/li&gt;
  &lt;li&gt;Kubernetes and container security best practices:
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/manifests/issues/2528&quot;&gt;Rootless containers&lt;/a&gt; / PodSecurityStandards restricted for:
Istio-CNI, Knative, Dex, Oauth2-proxy, Spark&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/manifests/pull/3050&quot;&gt;50% done&lt;/a&gt;: KFP, Notebooks / Workspaces, Katib, Trainer, KServe, …&lt;/li&gt;
      &lt;li&gt;Istio-CNI as default for rootless Kubeflow postponed to &lt;a href=&quot;https://github.com/kubeflow/manifests/milestone/2&quot;&gt;1.10.1&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;OIDC-authservice has been replaced by oauth2-proxy&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/manifests#oauth2-proxy&quot;&gt;Oauth2-proxy&lt;/a&gt; and &lt;a href=&quot;https://github.com/kubeflow/manifests#dex&quot;&gt;Dex&lt;/a&gt;
documentation for external OIDC authentication (Keycloak, and OIDC providers such as Azure, Google etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trivy CVE scans (March 25, 2025):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Working Group&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Images&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Critical CVE&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;High CVE&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Medium CVE&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Low CVE&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Katib&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;17&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;11&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;101&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;417&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;734&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Pipelines&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;15&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;57&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;490&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;4030&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1922&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Workbenches (Notebooks)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;12&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;12&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;59&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;179&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;224&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;KServe&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;16&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;21&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;305&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;6803&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1588&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Manifests&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;14&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;8&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;94&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;52&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Trainer&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Model Registry&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;6&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;13&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;153&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;188&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Spark&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;37&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1640&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;141&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;All Images&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;81&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;115&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1009&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;13275&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;4804&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;pipelines&quot;&gt;Pipelines&lt;/h2&gt;

&lt;h3 id=&quot;support-for-placeholders-in-resource-limits&quot;&gt;Support for Placeholders in Resource Limits&lt;/h3&gt;

&lt;p&gt;Kubeflow Pipelines 2.4.1 introduces support for &lt;a href=&quot;https://github.com/kubeflow/pipelines/pull/11501&quot;&gt;placeholders in resource limits&lt;/a&gt;,
enhancing flexibility in pipeline execution. This update allows users to define dynamic resource limits using
parameterized values, enabling more adaptable and reusable pipeline definitions.&lt;/p&gt;

&lt;h3 id=&quot;support-for-loop-parallelism&quot;&gt;Support for Loop Parallelism&lt;/h3&gt;

&lt;p&gt;Kubeflow Pipelines 2.4.1 introduces a new &lt;a href=&quot;https://github.com/kubeflow/pipelines/issues/8718&quot;&gt;Parallelism Limit for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParallelFor&lt;/code&gt; tasks&lt;/a&gt;,
giving users the ability to run massively parallel inference pipelines, with more control over parallel execution in
their workflows. This feature allows users to specify the maximum number of parallel iterations, preventing resource
overutilization and improving system stability. When running large pipelines with GPUs, capping the number of
concurrent iterations can also reduce compute expenses considerably.&lt;/p&gt;
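
&lt;p&gt;To make the effect of a parallelism limit concrete, here is a minimal sketch in plain Python. It does not use the Kubeflow Pipelines SDK: the helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_with_parallelism_limit&lt;/code&gt; and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fake_inference&lt;/code&gt; task are hypothetical names, and a thread pool stands in for the pipeline backend. It only illustrates the semantics of the cap: every iteration runs, but no more than the configured number run at the same time.&lt;/p&gt;

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_parallelism_limit(items, task, parallelism):
    """Run task(item) for every item with at most `parallelism` running at once.

    Returns (results, peak), where peak is the highest number of tasks
    observed running at the same time.
    """
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def tracked(item):
        # Count the task as running and record the highest concurrency seen.
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        try:
            return task(item)
        finally:
            with lock:
                state["running"] -= 1

    # The pool size is the cap: extra iterations wait for a free worker.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(tracked, items))
    return results, state["peak"]


def fake_inference(batch_id):
    # Hypothetical stand-in for an expensive, GPU-bound inference step.
    time.sleep(0.01)
    return batch_id * 2


results, peak = run_with_parallelism_limit(range(20), fake_inference, parallelism=4)
print(results[:3], "peak concurrency:", peak)
```

&lt;p&gt;All 20 iterations complete and their results come back in input order, while the observed peak never exceeds the limit of 4. In a real pipeline the same idea bounds how many loop iterations, and therefore how many accelerators, are in use at any moment.&lt;/p&gt;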

&lt;h3 id=&quot;implement-subdag-output-resolution&quot;&gt;Implement SubDAG Output Resolution&lt;/h3&gt;

&lt;p&gt;Kubeflow 1.10 ensures that &lt;a href=&quot;https://github.com/kubeflow/pipelines/pull/11196&quot;&gt;pipelines using nested DAGs&lt;/a&gt; work
correctly and reliably when treated as components. Outputs from deeply nested DAGs will now resolve properly, avoiding
broken dependencies.&lt;/p&gt;

&lt;h2 id=&quot;model-registry&quot;&gt;Model Registry&lt;/h2&gt;

&lt;p&gt;Model Registry introduces a new user interface and enhanced model management capabilities.&lt;/p&gt;

&lt;h3 id=&quot;model-registry-ui&quot;&gt;Model Registry UI&lt;/h3&gt;

&lt;p&gt;The new Kubeflow &lt;a href=&quot;https://www.kubeflow.org/docs/components/model-registry/getting-started/#using-the-model-registry-ui&quot;&gt;Model Registry UI&lt;/a&gt;
provides a user-friendly web interface for managing machine learning models within the Kubeflow platform. It centralizes
model metadata, version tracking, and artifact management, streamlining MLOps workflows.&lt;/p&gt;

&lt;p&gt;Key features include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Easy model registration with custom metadata&lt;/li&gt;
  &lt;li&gt;Comprehensive model management with filtering and sorting&lt;/li&gt;
  &lt;li&gt;Archiving capabilities&lt;/li&gt;
  &lt;li&gt;Version control&lt;/li&gt;
  &lt;li&gt;Metadata editing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;../images/2025-03-26-kubeflow-1.10-release/model-registry-ui.png&quot; alt=&quot;Model Registry UI&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The UI interacts with the Model Registry’s REST API, making it accessible to users of all technical backgrounds and
enhancing collaboration across data science, ML engineering, and MLOps teams.&lt;/p&gt;

&lt;p&gt;To get started with the Model Registry UI, which is currently in Alpha, you can follow the instructions
&lt;a href=&quot;https://www.kubeflow.org/docs/components/model-registry/installation/#installing-on-kubeflow-platform&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Kubeflow Model Registry UI Team would like to conduct user research to identify possible enhancements we can contribute in future iterations of the Kubeflow Model Registry UI. If you are interested in participating in this study, please fill out &lt;a href=&quot;https://docs.google.com/forms/d/e/1FAIpQLSeCveL-b0NyUohYa86I3VeTXeynEQLpV5Loj-1HkoUVDwlVCQ/viewform&quot;&gt;this survey&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;custom-storage-initializer&quot;&gt;Custom Storage Initializer&lt;/h3&gt;

&lt;p&gt;The Model Registry Custom Storage Initializer (CSI) is a custom implementation of the KServe ClusterStorageContainer.
This feature allows users to utilize Model Registry metadata to download and deploy models efficiently. With the newest
release of the Model Registry, it is now possible to install and use the CSI.&lt;/p&gt;

&lt;p&gt;You can find detailed installation instructions and a small example in the “Getting Started” section of the Model
Registry component on the Kubeflow website.&lt;/p&gt;

&lt;p&gt;For additional information and future developments towards better integration with KServe, you can refer to the slides
&lt;a href=&quot;https://docs.google.com/presentation/d/1wprxN0n23EMkPRX_PaZZcIzZbn_i8Sh_&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;training-operator-trainer--katib&quot;&gt;Training Operator (Trainer) &amp;amp; Katib&lt;/h2&gt;

&lt;p&gt;Kubeflow 1.10 enhances the Training Operator and Katib, providing new tools and APIs for hyperparameter optimization,
particularly for large language models.&lt;/p&gt;

&lt;p&gt;Moreover, the Kubeflow Training Operator now supports &lt;a href=&quot;https://github.com/kubeflow/trainer/issues/1619&quot;&gt;JAX for distributed training&lt;/a&gt;,
enabling users to leverage JAX’s capabilities for efficient and scalable model training.&lt;/p&gt;

&lt;p&gt;Finally, if you want to get involved with Trainer V2, take a look at this &lt;a href=&quot;https://github.com/kubeflow/trainer/tree/master/docs/proposals/2170-kubeflow-trainer-v2&quot;&gt;KEP&lt;/a&gt;
and &lt;a href=&quot;https://github.com/kubeflow/trainer/issues/2170&quot;&gt;issue&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;hyperparameter-optimization-api-for-llms&quot;&gt;Hyperparameter Optimization API for LLMs&lt;/h3&gt;

&lt;p&gt;Katib introduces a new high-level &lt;a href=&quot;https://github.com/kubeflow/katib/issues/2339&quot;&gt;API for hyperparameter tuning&lt;/a&gt;,
streamlining LLMOps workflows in Kubernetes. This API integrates Katib and the Training Operator to automate
hyperparameter optimization, reducing manual effort for data scientists fine-tuning large language models. For more
information, refer to the &lt;a href=&quot;https://blog.kubeflow.org/gsoc-2024-project-4/&quot;&gt;feature release blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;support-for-various-parameter-distributions&quot;&gt;Support for Various Parameter Distributions&lt;/h3&gt;

&lt;p&gt;Katib adds &lt;a href=&quot;https://github.com/kubeflow/katib/issues/2374&quot;&gt;support for multiple probability distributions&lt;/a&gt;.
Previously limited to uniform distributions, Katib now supports log-uniform, normal, and log-normal distributions,
providing data scientists with greater &lt;a href=&quot;https://youtu.be/4myE0DPp6Ko&quot;&gt;flexibility in tuning hyperparameters&lt;/a&gt;. This is
particularly useful for parameters like learning rates, which benefit from log-uniform sampling, or values expected to
vary around a mean, suited for normal distributions.&lt;/p&gt;
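
&lt;p&gt;The sketch below shows why log-uniform sampling suits learning rates. It uses only the Python standard library and is not Katib's implementation; the function names and the search range of 1e-5 to 1e-1 are assumptions chosen for illustration. A log-uniform draw is a uniform draw in log space mapped back with the exponential function, so each order of magnitude is equally likely to be sampled.&lt;/p&gt;

```python
import math
import random


def sample_uniform(rng, low, high):
    """Draw a value uniformly between low and high."""
    return rng.uniform(low, high)


def sample_log_uniform(rng, low, high):
    """Draw uniformly in log space, then map back with exp."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def share_below(samples, threshold):
    """Fraction of samples that fall below the threshold."""
    return sum(1 for s in samples if threshold > s) / len(samples)


rng = random.Random(0)  # fixed seed so repeated runs draw the same values
low, high = 1e-5, 1e-1  # assumed learning-rate search range

uniform_draws = [sample_uniform(rng, low, high) for _ in range(10000)]
log_uniform_draws = [sample_log_uniform(rng, low, high) for _ in range(10000)]

print("uniform share below 1e-3:", share_below(uniform_draws, 1e-3))
print("log-uniform share below 1e-3:", share_below(log_uniform_draws, 1e-3))
```

&lt;p&gt;With uniform sampling only about 1% of the draws land below 1e-3, so small learning rates are barely explored. With log-uniform sampling roughly half of them do, because 1e-3 is the midpoint of the range in log space.&lt;/p&gt;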

&lt;h3 id=&quot;push-based-metrics-collection&quot;&gt;Push-Based Metrics Collection&lt;/h3&gt;

&lt;p&gt;Katib now allows users to push metrics directly to the Katib DB. The new push-based design provides administrative and
performance improvements over the existing pull-based design. For further details, please refer to the
&lt;a href=&quot;https://blog.kubeflow.org/gsoc-2024-project-6/&quot;&gt;Push-Based Metrics Collection blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;dashboard--notebooks&quot;&gt;Dashboard &amp;amp; Notebooks&lt;/h2&gt;

&lt;p&gt;Kubeflow 1.10 improves the observability and usability of Notebooks, while providing updated
&lt;a href=&quot;https://github.com/kubeflow/kubeflow/pull/7687&quot;&gt;default images&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;prometheus-metrics-for-notebooks&quot;&gt;Prometheus Metrics for Notebooks&lt;/h3&gt;

&lt;p&gt;Both the Notebooks component and CRUD backends now feature Prometheus metrics. Notebooks expose custom metrics using the
prom-client library, and CRUD backends utilize the prometheus_flask_exporter library. This ensures consistent metrics
integration across all backend services.&lt;/p&gt;

&lt;h3 id=&quot;more-descriptive-error-messages&quot;&gt;More Descriptive Error Messages&lt;/h3&gt;

&lt;p&gt;Error messages for notebook creation failures due to resource constraints are now more descriptive. Users can quickly
identify issues such as insufficient resources.&lt;/p&gt;

&lt;h2 id=&quot;spark-operator&quot;&gt;Spark Operator&lt;/h2&gt;

&lt;p&gt;The Spark Operator, now integrated as a core Kubeflow component, includes several key enhancements focusing on
architecture, security, and &lt;a href=&quot;https://blog.kubeflow.org/operators/benchmarking/performance/2025/03/15/kubeflow-spark-operator-benchmarks.html&quot;&gt;performance&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Rebuilt with Controller Runtime (v2.0.0): Modernized core architecture using controller-runtime, aligning with
Kubernetes controller patterns for improved structure, extensibility, and testability.&lt;/li&gt;
  &lt;li&gt;YuniKorn Gang Scheduling Support (v2.0.0): Enables efficient scheduling of Spark driver &amp;amp; executor pods as a group,
ideal for large-scale data pipelines with resource guarantees.&lt;/li&gt;
  &lt;li&gt;Enhanced Security Contexts &amp;amp; SeccompProfile Support (v2.1.1): Adds support for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seccompProfile: RuntimeDefault&lt;/code&gt; &amp;amp; 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;readOnlyRootFilesystem&lt;/code&gt;, aligning with Kubernetes Pod Security Standards and minimizing security risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;kserve&quot;&gt;KServe&lt;/h2&gt;

&lt;p&gt;KServe v0.14.1 introduces several essential features that enhance its capabilities for deploying and managing machine
learning models.&lt;/p&gt;

&lt;h3 id=&quot;new-python-sdk&quot;&gt;New Python SDK&lt;/h3&gt;

&lt;p&gt;The release includes a new Python SDK with both REST and GRPC inference clients, offering asynchronous support and the
ability to handle tensor data in binary format.&lt;/p&gt;

&lt;h3 id=&quot;oci-storage-for-models&quot;&gt;OCI Storage for Models&lt;/h3&gt;

&lt;p&gt;OCI storage for models has also been promoted to a stable feature, with improvements to stability by configuring OCI
models as init containers.&lt;/p&gt;

&lt;h3 id=&quot;model-cache-feature&quot;&gt;Model Cache Feature&lt;/h3&gt;

&lt;p&gt;Additionally, the introduction of the Model Cache feature leverages local node storage to reduce model load times,
especially for large models, enhancing scalability.&lt;/p&gt;

&lt;h3 id=&quot;hugging-face-integration&quot;&gt;Hugging Face Integration&lt;/h3&gt;

&lt;p&gt;KServe v0.14.1 further expands integration with Hugging Face, enabling direct model deployment from the Hugging Face Hub
via a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hf://&lt;/code&gt; URI scheme.&lt;/p&gt;

&lt;h2 id=&quot;what-comes-next&quot;&gt;What comes next?&lt;/h2&gt;

&lt;p&gt;If you want to take a peek into the Kubeflow 1.11 roadmap planning and contribute with your ideas, see
&lt;a href=&quot;https://github.com/kubeflow/kubeflow/issues/7459&quot;&gt;Notebooks&lt;/a&gt;,
&lt;a href=&quot;https://github.com/kubeflow/manifests/milestone/2&quot;&gt;Manifests &amp;amp; Security&lt;/a&gt;, Pipelines, Model Registry, Katib,
Training Operator.&lt;/p&gt;

&lt;h2 id=&quot;how-to-get-started-with-110&quot;&gt;How to get started with 1.10&lt;/h2&gt;

&lt;p&gt;Visit the Kubeflow 1.10 &lt;a href=&quot;https://github.com/kubeflow/manifests/releases&quot;&gt;release page&lt;/a&gt; or head over to the Getting
Started and Support pages.&lt;/p&gt;

&lt;h2 id=&quot;join-the-community&quot;&gt;Join the Community&lt;/h2&gt;

&lt;p&gt;We would like to thank everyone for their contributions to Kubeflow 1.10, especially Ricardo Martinelli De Oliveira for his
work as the v1.10 Release Manager, all the release team and the working group leads, who relentlessly dedicate their
time to this great project.&lt;/p&gt;

&lt;p&gt;Release team members : Ricardo Martinelli De Oliveira, Dimitris Poulopoulos, Matteo Mortari, Julius von Kohout,
Valentina Rodriguez Sosa, Helber Belmiro, Vraj Bhatt, Diego Lovison, Dagvanorov Lkhagvajav, Sailesh Duddupudi,
Manos Vlassis, Tarek Abouzeid, Milos Grubjesic&lt;/p&gt;

&lt;p&gt;Working Group leads : Andrey Velichkevich, Julius von Kohout, Mathew Wicks, …&lt;/p&gt;

&lt;p&gt;Kubeflow Steering Committee : Andrey Velichkevich, Julius von Kohout, Yuan Tang, Johnu George, Francisco Javier Arceo&lt;/p&gt;

&lt;p&gt;Participating Distributions : Charmed Kubeflow (Canonical), Nutanix, OpenShift AI (Red Hat), QBO&lt;/p&gt;

&lt;p&gt;You can find more details about Kubeflow distributions
&lt;a href=&quot;https://www.kubeflow.org/docs/started/installing-kubeflow/#packaged-distributions&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;want-to-help&quot;&gt;Want to help?&lt;/h2&gt;

&lt;p&gt;The Kubeflow community Working Groups hold open meetings and are always looking for more volunteers and users to unlock
the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check
out the resources below. We look forward to working with you!&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Visit our &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/&quot;&gt;Kubeflow website&lt;/a&gt; or Kubeflow GitHub Page.&lt;/li&gt;
  &lt;li&gt;Join the &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/&quot;&gt;Kubeflow Slack channel&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Join the &lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss&quot;&gt;kubeflow-discuss&lt;/a&gt; mailing list.&lt;/li&gt;
  &lt;li&gt;Attend our weekly &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/#kubeflow-community-call&quot;&gt;community meeting&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</content><author><name>Kubeflow 1.10 Release Team, Dimitris Poulopoulos</name></author><category term="release" /><summary type="html">Kubeflow 1.10.0 delivers essential updates that enhance the flexibility, efficiency, and scalability of machine learning workflows. The new features span across several components, improving both user experience and system performance.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.kubeflow.org/images/logo.png" /><media:content medium="image" url="https://blog.kubeflow.org/images/logo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">🚀 Announcing the Kubeflow Spark Operator Benchmarking Results</title><link href="https://blog.kubeflow.org/operators/benchmarking/performance/2025/03/15/kubeflow-spark-operator-benchmarks.html" rel="alternate" type="text/html" title="🚀 Announcing the Kubeflow Spark Operator Benchmarking Results" /><published>2025-03-15T00:00:00-05:00</published><updated>2025-03-15T00:00:00-05:00</updated><id>https://blog.kubeflow.org/operators/benchmarking/performance/2025/03/15/kubeflow-spark-operator-benchmarks</id><content type="html" xml:base="https://blog.kubeflow.org/operators/benchmarking/performance/2025/03/15/kubeflow-spark-operator-benchmarks.html">&lt;p&gt;Kubernetes has become the go-to platform for running large-scale &lt;a href=&quot;https://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; workloads. But as workloads scale, &lt;strong&gt;how do you ensure your Spark jobs run efficiently without hitting bottlenecks?&lt;/strong&gt; Managing thousands of concurrent Spark jobs can introduce &lt;strong&gt;severe performance challenges&lt;/strong&gt;—from &lt;strong&gt;CPU saturation&lt;/strong&gt; in the Spark Operator to &lt;strong&gt;Kubernetes API slowdowns&lt;/strong&gt; and &lt;strong&gt;job scheduling inefficiencies&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To address these challenges, we are excited to introduce the &lt;strong&gt;Kubeflow Spark Operator Benchmarking Results and Toolkit&lt;/strong&gt;—a comprehensive framework to analyze performance, pinpoint bottlenecks, and optimize your Spark on Kubernetes deployments.&lt;/p&gt;

&lt;h2 id=&quot;-whats-included&quot;&gt;🔍 What’s Included?&lt;/h2&gt;
&lt;p&gt;This benchmarking effort provides &lt;strong&gt;three key outcomes&lt;/strong&gt; to help you take full control of your Spark on Kubernetes deployment:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;&lt;a href=&quot;https://www.kubeflow.org/docs/components/spark-operator/performance/benchmarking/&quot;&gt;Benchmarking Results&lt;/a&gt;&lt;/strong&gt; – A detailed evaluation of performance insights and tuning recommendations for large-scale Spark workloads.&lt;br /&gt;
🛠 &lt;strong&gt;&lt;a href=&quot;https://github.com/awslabs/data-on-eks/tree/main/analytics/terraform/spark-k8s-operator/examples/benchmark/spark-operator-benchmark-kit&quot;&gt;Benchmarking Test Toolkit&lt;/a&gt;&lt;/strong&gt; – A fully reproducible test suite to help users evaluate their own Spark Operator performance and validate improvements.&lt;br /&gt;
📊 &lt;strong&gt;&lt;a href=&quot;https://grafana.com/grafana/dashboards/23032-spark-operator-scale-test-dashboard/&quot;&gt;Open-Sourced Grafana Dashboard&lt;/a&gt;&lt;/strong&gt; – A &lt;strong&gt;battle-tested&lt;/strong&gt; visualization tool designed specifically to track large-scale Spark Operator deployments, providing real-time monitoring of job processing efficiency, API latencies, and system health.&lt;/p&gt;

&lt;h2 id=&quot;-the-challenges-why-benchmarking-matters&quot;&gt;❌ The Challenges: Why Benchmarking Matters&lt;/h2&gt;
&lt;p&gt;Running &lt;strong&gt;thousands of Spark jobs&lt;/strong&gt; on Kubernetes at scale uncovers several &lt;strong&gt;performance roadblocks&lt;/strong&gt; that can &lt;strong&gt;cripple efficiency&lt;/strong&gt; if left unresolved:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;🚦 Spark Operator Becomes CPU-Bound&lt;/strong&gt;: When handling thousands of Spark jobs, the controller pod maxes out CPU resources, limiting job submission rates.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;🐢 High API Server Latency&lt;/strong&gt;: As workloads scale, Kubernetes API responsiveness degrades—job status updates slow down, affecting observability and scheduling efficiency.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;🕒 Webhook Overhead Slows Job Starts&lt;/strong&gt;: Using webhooks adds &lt;strong&gt;~60 seconds&lt;/strong&gt; of extra latency per job, reducing throughput in high-concurrency environments.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;💥 Namespace Overload Causes Failures&lt;/strong&gt;: Running &lt;strong&gt;6,000+ SparkApplications in a single namespace&lt;/strong&gt; resulted in &lt;strong&gt;pod failures&lt;/strong&gt; due to excessive environment variables and service object overload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 &lt;strong&gt;So, how do you fix these issues and optimize your Spark Operator deployment?&lt;/strong&gt;&lt;br /&gt;
That’s where our &lt;strong&gt;benchmarking results and toolkit&lt;/strong&gt; come in.&lt;/p&gt;

&lt;h2 id=&quot;-tuning-best-practices-for-spark-operator&quot;&gt;🛠 Tuning Best Practices for Spark Operator&lt;/h2&gt;
&lt;p&gt;Based on our benchmarking findings, we provide &lt;strong&gt;clear, actionable recommendations&lt;/strong&gt; for improving Spark Operator performance at scale.&lt;/p&gt;

&lt;p&gt;If you’re running &lt;strong&gt;thousands of concurrent Spark jobs&lt;/strong&gt;, here’s what you need to do:&lt;/p&gt;

&lt;h3 id=&quot;deploy-multiple-spark-operator-instances&quot;&gt;&lt;strong&gt;Deploy Multiple Spark Operator Instances&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;💡 &lt;strong&gt;Why?&lt;/strong&gt; A single Spark Operator instance struggles to keep up with high job submission rates.&lt;br /&gt;
✅ &lt;strong&gt;Solution&lt;/strong&gt;: When one instance saturates its CPU and job launches slow down, &lt;strong&gt;deploying multiple instances can help&lt;/strong&gt;. Distribute the workload by assigning different namespaces to each instance. For example, one instance can manage &lt;strong&gt;20 namespaces&lt;/strong&gt; while another handles a separate set of &lt;strong&gt;20 namespaces&lt;/strong&gt;. This prevents bottlenecks and ensures efficient Spark job execution.&lt;/p&gt;
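
&lt;p&gt;As a rough illustration of that split, the following Python sketch divides a list of namespaces into non-overlapping groups, one per operator instance. The function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assign_namespaces&lt;/code&gt;, the instance names, and the namespace names are all hypothetical; the sketch only plans the assignment and does not talk to a cluster or to Helm.&lt;/p&gt;

```python
import math


def assign_namespaces(namespaces, instance_count):
    """Split namespaces into non-overlapping groups, one group per operator instance."""
    if not instance_count >= 1:
        raise ValueError("instance_count must be at least 1")
    ordered = sorted(namespaces)
    # Group size rounded up so every namespace is assigned to some instance.
    size = math.ceil(len(ordered) / instance_count)
    groups = {}
    for index in range(instance_count):
        # Hypothetical instance names, used only for this illustration.
        groups[f"spark-operator-{index}"] = ordered[index * size:(index + 1) * size]
    return groups


# 40 hypothetical job namespaces shared between two operator instances.
namespaces = [f"spark-team-{n:02d}" for n in range(40)]
plan = assign_namespaces(namespaces, instance_count=2)
for instance, owned in plan.items():
    print(instance, "manages", len(owned), "namespaces")
```

&lt;p&gt;Each instance ends up with 20 namespaces and no namespace is watched twice. Each group can then be supplied as the namespace list in the configuration of the corresponding operator installation.&lt;/p&gt;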

&lt;h3 id=&quot;disable-webhooks-for-faster-job-starts&quot;&gt;&lt;strong&gt;Disable Webhooks for Faster Job Starts&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;💡 &lt;strong&gt;Why?&lt;/strong&gt; Webhooks introduce &lt;strong&gt;~60 seconds&lt;/strong&gt; of delay per job due to validation and mutation overhead, reducing throughput in large workloads.&lt;br /&gt;
✅ &lt;strong&gt;Solution&lt;/strong&gt;: Instead of using &lt;strong&gt;webhooks&lt;/strong&gt; for volume mounts, node selectors, or taints, define &lt;strong&gt;Spark Pod Templates&lt;/strong&gt; directly within the Spark job definition—no additional files are needed. Disable webhooks by setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;webhook.enable=false&lt;/code&gt; in the Helm chart.&lt;/p&gt;

&lt;h3 id=&quot;increase-controller-workers&quot;&gt;&lt;strong&gt;Increase Controller Workers&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;💡 &lt;strong&gt;Why?&lt;/strong&gt; By default, the operator runs with &lt;strong&gt;10 controller workers&lt;/strong&gt;, but our benchmarks showed increasing this to &lt;strong&gt;20 or 30 workers&lt;/strong&gt; improved job throughput.&lt;br /&gt;
✅ &lt;strong&gt;Solution&lt;/strong&gt;: Set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;controller.workers=20&lt;/code&gt; if your Operator pod runs on a CPU with &lt;strong&gt;36 or more cores&lt;/strong&gt; so that more jobs are processed in parallel. For larger workloads (e.g., 72+ cores), increase to 40+ workers.&lt;/p&gt;

&lt;h3 id=&quot;enable-a-batch-scheduler-volcano--yunikorn&quot;&gt;&lt;strong&gt;Enable a Batch Scheduler (Volcano / YuniKorn)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;💡 &lt;strong&gt;Why?&lt;/strong&gt; Kubernetes’ default scheduler isn’t optimized for batch workloads, leading to &lt;strong&gt;inefficient job placements&lt;/strong&gt;.&lt;br /&gt;
✅ &lt;strong&gt;Solution&lt;/strong&gt;: Enable &lt;strong&gt;Volcano&lt;/strong&gt; or &lt;strong&gt;YuniKorn&lt;/strong&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;batchScheduler.enable=true&lt;/code&gt;) to optimize job scheduling. These schedulers provide &lt;strong&gt;gang scheduling, queue management, and multi-tenant resource sharing&lt;/strong&gt;. Benchmarks show that &lt;strong&gt;Apache YuniKorn&lt;/strong&gt; schedules jobs faster than the default Kubernetes scheduler.&lt;/p&gt;
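&lt;p&gt;The three Helm values mentioned so far can be applied together. The sketch below only assembles and prints a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;helm upgrade --install&lt;/code&gt; command line from them; the release name, chart reference, and namespace are hypothetical placeholders, and a batch scheduler usually needs further chart values that depend on your chart version.&lt;/p&gt;

```python
import shlex

# Values named in the recommendations above; tune the numbers for your cluster.
overrides = {
    "webhook.enable": "false",
    "controller.workers": "20",
    "batchScheduler.enable": "true",
}

# Hypothetical release name, chart reference, and namespace; replace with your own.
command = [
    "helm", "upgrade", "--install", "spark-operator",
    "spark-operator/spark-operator", "--namespace", "spark-operator",
]
for key, value in overrides.items():
    command += ["--set", f"{key}={value}"]

# Print the command for review instead of running it.
print(shlex.join(command))
```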

&lt;h3 id=&quot;optimize-api-server-scaling&quot;&gt;&lt;strong&gt;Optimize API Server Scaling&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;💡 &lt;strong&gt;Why?&lt;/strong&gt; API server latency spikes to &lt;strong&gt;600ms+ under heavy load&lt;/strong&gt;, affecting Spark job responsiveness.&lt;br /&gt;
✅ &lt;strong&gt;Solution&lt;/strong&gt;: Scale API server replicas, allocate more CPU and memory, and optimize event handling. Ensure your &lt;strong&gt;Kubernetes API server and etcd&lt;/strong&gt; auto-scale to handle bursty workloads efficiently. Monitor &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kube-apiserver&lt;/code&gt; metrics and scale &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;etcd&lt;/code&gt; accordingly. If running thousands of Spark pods, consider &lt;strong&gt;manually increasing control plane node sizes&lt;/strong&gt;.&lt;/p&gt;

&lt;h3 id=&quot;distribute-spark-jobs-across-multiple-namespaces&quot;&gt;&lt;strong&gt;Distribute Spark Jobs Across Multiple Namespaces&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;💡 &lt;strong&gt;Why?&lt;/strong&gt; Running too many jobs in a single namespace causes &lt;strong&gt;environment variable overflows&lt;/strong&gt;, leading to pod failures.&lt;br /&gt;
✅ &lt;strong&gt;Solution&lt;/strong&gt;: &lt;strong&gt;Distribute workloads across multiple namespaces&lt;/strong&gt;. When too many pods share a single namespace, listing or modifying resources produces large &lt;strong&gt;API server&lt;/strong&gt; responses; retrieving all pods, for example, can return a very large payload that consumes significant server resources and increases latency. &lt;strong&gt;etcd&lt;/strong&gt;, Kubernetes’ key-value store, can also become a bottleneck when it handles frequent updates from a large number of pods in one namespace, because heavy read and write traffic increases latency and can cause timeouts. Spreading jobs across namespaces improves both performance and stability.&lt;/p&gt;

&lt;h3 id=&quot;monitor--tune-using-the-open-source-grafana-dashboard&quot;&gt;&lt;strong&gt;Monitor &amp;amp; Tune Using the Open-Source Grafana Dashboard&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;💡 &lt;strong&gt;Why?&lt;/strong&gt; Observability is key to identifying performance bottlenecks.&lt;br /&gt;
✅ &lt;strong&gt;Solution&lt;/strong&gt;: Use our &lt;strong&gt;&lt;a href=&quot;https://grafana.com/grafana/dashboards/23032-spark-operator-scale-test-dashboard/&quot;&gt;Spark Operator Scale Test Dashboard&lt;/a&gt;&lt;/strong&gt; to track job submission rates, API latencies, and CPU utilization in real time.&lt;/p&gt;

&lt;h2 id=&quot;-learn-more--get-started&quot;&gt;📖 Learn More &amp;amp; Get Started&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;Kubeflow Spark Operator Benchmarking Results and Toolkit&lt;/strong&gt; provide an in-depth &lt;strong&gt;performance playbook&lt;/strong&gt; for running Spark at scale on Kubernetes. Whether you’re troubleshooting an existing deployment or planning for future growth, this toolkit arms you with &lt;strong&gt;data-driven insights&lt;/strong&gt; and &lt;strong&gt;best practices&lt;/strong&gt; for success.&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;Ready to optimize your Spark workloads?&lt;/strong&gt; Dive into the full results and toolkit below:&lt;br /&gt;
📖 &lt;strong&gt;&lt;a href=&quot;https://www.kubeflow.org/docs/components/spark-operator/performance/benchmarking/&quot;&gt;Kubeflow Spark Operator Benchmarks&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;</content><author><name>&lt;a href='https://www.linkedin.com/in/varaprofile/'&gt;Vara Bonthu&lt;/a&gt;, &lt;a href='https://www.linkedin.com/in/manabumccloskey/'&gt;Manabu McCloskey&lt;/a&gt;, &lt;a href='https://www.linkedin.com/in/ratnopamc/'&gt;Ratnopam Chakrabarti &lt;/a&gt;, &lt;a href='https://www.linkedin.com/in/alanhalcyon/'&gt;Alan Halcyon&lt;/a&gt;</name></author><category term="operators" /><category term="benchmarking" /><category term="performance" /><summary type="html">Kubernetes has become the go-to platform for running large-scale Apache Spark workloads. But as workloads scale, how do you ensure your Spark jobs run efficiently without hitting bottlenecks? Managing thousands of concurrent Spark jobs can introduce severe performance challenges—from CPU saturation in the Spark Operator to Kubernetes API slowdowns and job scheduling inefficiencies.</summary></entry><entry><title type="html">Optimizing RAG Pipelines with Katib: Hyperparameter Tuning for Better Retrieval &amp;amp; Generation</title><link href="https://blog.kubeflow.org/katib/rag/" rel="alternate" type="text/html" title="Optimizing RAG Pipelines with Katib: Hyperparameter Tuning for Better Retrieval &amp;amp; Generation" /><published>2025-02-21T00:00:00-06:00</published><updated>2025-02-21T00:00:00-06:00</updated><id>https://blog.kubeflow.org/katib/katib-rag-optimization</id><content type="html" xml:base="https://blog.kubeflow.org/katib/rag/">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;As artificial intelligence and machine learning models become more
sophisticated, optimizing their performance remains a critical challenge.
Kubeflow provides a robust component, &lt;a href=&quot;https://www.kubeflow.org/docs/components/katib/&quot;&gt;Katib&lt;/a&gt;, designed for
hyperparameter optimization and neural architecture search. As a part of the
Kubeflow ecosystem, Katib enables scalable, automated tuning of underlying
machine learning models, reducing the manual effort required for parameter
selection while improving model performance across diverse ML workflows.&lt;/p&gt;

&lt;p&gt;With Retrieval-Augmented Generation (&lt;a href=&quot;https://en.wikipedia.org/wiki/Retrieval-augmented_generation&quot;&gt;RAG&lt;/a&gt;) becoming an increasingly
popular approach for improving search and retrieval quality, optimizing its
parameters is essential to achieving high-quality results. RAG pipelines involve
multiple hyperparameters that influence retrieval accuracy, hallucination
reduction, and language generation quality. In this blog, we will explore how
Katib can be leveraged to fine-tune a RAG pipeline, ensuring optimal performance
by systematically adjusting key hyperparameters.&lt;/p&gt;

&lt;h1 id=&quot;lets-get-started&quot;&gt;Let’s Get Started!&lt;/h1&gt;

&lt;h2 id=&quot;step-1-setup&quot;&gt;STEP 1: Setup&lt;/h2&gt;

&lt;p&gt;Since compute resources are scarcer than a perfectly labeled dataset :), we’ll
use a lightweight &lt;a href=&quot;https://kind.sigs.k8s.io/&quot;&gt;Kind (Kubernetes in Docker)&lt;/a&gt;
cluster to run this example locally. Rest assured, this setup can seamlessly
scale to larger clusters by increasing the dataset size and the number of
hyperparameters to tune.&lt;/p&gt;

&lt;p&gt;To get started, we’ll first install the Katib control plane in our cluster by
following the steps outlined &lt;a href=&quot;https://www.kubeflow.org/docs/components/katib/installation/&quot;&gt;in the documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;step-2-implementing-rag-pipeline&quot;&gt;STEP 2: Implementing RAG pipeline&lt;/h2&gt;

&lt;p&gt;In this implementation, we use a &lt;a href=&quot;https://www.sciencedirect.com/topics/computer-science/retrieval-model&quot;&gt;retriever model&lt;/a&gt; to fetch
relevant documents for a query and a generator model to produce coherent text
responses. The retriever encodes queries and documents into vector
representations and finds the most relevant matches.&lt;/p&gt;

&lt;h3 id=&quot;implementation-details&quot;&gt;Implementation Details:&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Retriever: Sentence Transformer &amp;amp; FAISS (Facebook AI Similarity Search) Index
    &lt;ul&gt;
      &lt;li&gt;A SentenceTransformer model (paraphrase-MiniLM-L6-v2) encodes predefined
documents into vector representations.&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://ai.meta.com/tools/faiss/&quot;&gt;FAISS&lt;/a&gt; is used to index these document embeddings and perform
efficient similarity searches to retrieve the most relevant documents.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Generator: Pre-trained GPT-2 Model
    &lt;ul&gt;
      &lt;li&gt;A Hugging Face GPT-2 text generation pipeline (which can be replaced with
any other model) is used to generate responses based on the retrieved
documents. I chose GPT-2 for this example as it is lightweight enough to
run on my local machine while still generating coherent responses.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Query Processing &amp;amp; Response Generation
    &lt;ul&gt;
      &lt;li&gt;When a query is submitted, the retriever encodes it and searches the FAISS
index for the top-k most similar documents.&lt;/li&gt;
      &lt;li&gt;These retrieved documents are concatenated to form the input context, which
is then passed to the GPT-2 model to generate a response.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Evaluation: &lt;a href=&quot;https://huggingface.co/spaces/evaluate-metric/bleu&quot;&gt;BLEU&lt;/a&gt; (Bilingual Evaluation Understudy) Score Calculation
    &lt;ul&gt;
      &lt;li&gt;To assess the quality of generated responses, we use the BLEU score, a
popular metric for evaluating text generation.&lt;/li&gt;
      &lt;li&gt;The evaluate function takes a query, retrieves documents, generates a
response, and compares it against a ground-truth reference to compute a
BLEU score with smoothing functions from the nltk library.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
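&lt;p&gt;To make the retrieval step concrete, here is a minimal pure-Python sketch of exhaustive top-k search by squared L2 distance, the kind of exact search a flat index performs. It does not use FAISS or a real embedding model: the two-dimensional vectors and document strings are made up for illustration.&lt;/p&gt;

```python
def top_k_l2(query, doc_vectors, k):
    """Return (squared_distance, index) pairs for the k nearest vectors."""
    scored = []
    for idx, vec in enumerate(doc_vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    return sorted(scored)[:k]


# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
documents = ["doc about Paris", "doc about Kubernetes", "doc about the Eiffel Tower"]
doc_vectors = [(0.9, 0.1), (0.0, 1.0), (1.0, 0.0)]
query_vector = (1.0, 0.05)

hits = top_k_l2(query_vector, doc_vectors, k=2)
# Concatenate the retrieved documents into the generator's input context.
context = " ".join(documents[idx] for _, idx in hits)
print(context)
```

&lt;p&gt;A larger k adds more documents to the context, which is why the number of retrieved documents is worth tuning alongside the generation settings.&lt;/p&gt;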

&lt;p&gt;To run Katib, we will use the &lt;a href=&quot;https://www.kubeflow.org/docs/components/katib/installation/#installing-python-sdk&quot;&gt;Katib SDK&lt;/a&gt;, which provides a programmatic interface for defining and running 
hyperparameter tuning experiments in Kubeflow.&lt;/p&gt;

&lt;p&gt;Katib requires an &lt;a href=&quot;https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-experiment/#configuring-the-experiment&quot;&gt;objective&lt;/a&gt; function, which:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Defines what we want to optimize (e.g., BLEU score for text generation quality).&lt;/li&gt;
  &lt;li&gt;Executes the RAG pipeline with different hyperparameter values.&lt;/li&gt;
  &lt;li&gt;Returns an evaluation metric so Katib can compare different hyperparameter configurations.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;objective&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Import dependencies inside the function (required for Katib)
&lt;/span&gt;    &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
    &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;faiss&lt;/span&gt;
    &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sentence_transformers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SentenceTransformer&lt;/span&gt;
    &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;transformers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;
    &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;nltk.translate.bleu_score&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentence_bleu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SmoothingFunction&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Function to fetch documents (Modify as needed)
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fetch_documents&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Returns a predefined list of documents or loads them from a file.&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# OR, to load from a file:
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# with open(&quot;/path/to/documents.json&quot;, &quot;r&quot;) as f:
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;#     return json.load(f)
&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Define the RAG pipeline within the function
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;rag_pipeline_execute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;top_k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Retrieves relevant documents and generates a response using GPT-2.&quot;&quot;&quot;&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Initialize retriever
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;retriever_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SentenceTransformer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;paraphrase-MiniLM-L6-v2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Sample documents
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;documents&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fetch_documents&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Encode documents
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;doc_embeddings&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;retriever_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;documents&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;faiss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IndexFlatL2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doc_embeddings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doc_embeddings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Encode query and retrieve top-k documents
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;query_embedding&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;retriever_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;distances&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;search&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_embedding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;top_k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;retrieved_docs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;documents&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Generate response using GPT-2
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;generator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;text-generation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;gpt2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;gpt2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;retrieved_docs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;generated&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;generator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_new_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_return_sequences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;generated&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;generated_text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# TODO: Provide queries and ground truth directly here or load them dynamically from a file/external volume.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Example: &quot;Tell me about the Eiffel Tower.&quot;
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Example: &quot;The Eiffel Tower is a famous landmark in Paris.&quot;
&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Extract hyperparameters
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;top_k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;top_k&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;temperature&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Generate response
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;response&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rag_pipeline_execute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;top_k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Compute BLEU score
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;reference&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Tokenized reference
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;candidate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;response&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Tokenized candidate response
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;smoothie&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SmoothingFunction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;method1&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bleu_score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentence_bleu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reference&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;candidate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;smoothing_function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;smoothie&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Print BLEU score in Katib-compatible format
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;f&quot;BLEU=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bleu_score&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: Make sure to print the metric in the format &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;metric-name&amp;gt;=&amp;lt;value&amp;gt;&lt;/code&gt;
so that Katib’s metrics collector can parse it. More ways to configure
the output are available in the &lt;a href=&quot;https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/#pull-based-metrics-collector&quot;&gt;Katib Metrics
Collector&lt;/a&gt; guide.&lt;/p&gt;
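&lt;p&gt;To see why the printed format matters, the sketch below scans log text for name=value lines and keeps only the requested metric. It is a simplified illustration and not Katib’s actual collector code: the pattern and the function name are assumptions made for this example.&lt;/p&gt;

```python
import re

# Simplified pattern for "name=value" lines; Katib's real collector and its
# filters are configurable and may differ from this sketch.
METRIC_LINE = re.compile(
    r"^\s*([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*$"
)


def parse_metrics(log_text, wanted):
    """Collect the values of the metrics named in `wanted` from trial logs."""
    found = {name: [] for name in wanted}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line)
        if match and match.group(1) in found:
            found[match.group(1)].append(float(match.group(2)))
    return found


logs = "loading model...\nBLEU=0.4213\nfinished"
print(parse_metrics(logs, ["BLEU"]))
```

&lt;p&gt;A line such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BLEU score: 0.42&lt;/code&gt; would not match this pattern, which is the kind of mismatch that leaves a trial without a reported metric.&lt;/p&gt;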

&lt;h2 id=&quot;step-3-run-a-katib-experiment&quot;&gt;STEP 3: Run a Katib Experiment&lt;/h2&gt;

&lt;p&gt;Once our pipeline is encapsulated within the objective function, we can configure Katib to optimize the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BLEU&lt;/code&gt; score by 
tuning the hyperparameters:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;top_k&lt;/code&gt;: The number of documents retrieved (e.g., between 10 and 20).&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;temperature&lt;/code&gt;: The randomness of text generation (e.g., between 0.5 and 1.0).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;define-hyperparameter-search-space&quot;&gt;Define hyperparameter search space&lt;/h3&gt;
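&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;kubeflow.katib&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;katib&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;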
    &lt;span class=&quot;s&quot;&gt;&quot;top_k&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;katib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;search&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;temperature&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;katib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;search&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;step&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
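&lt;p&gt;It can help to know how large this search space is before choosing a trial budget. The sketch below enumerates the grid in plain Python, without the Katib SDK. It assumes the integer range is walked in steps of 1, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grid_values&lt;/code&gt; is a helper written for this example.&lt;/p&gt;

```python
from itertools import product


def grid_values(low, high, step):
    """Evenly spaced values from low to high inclusive, rounded to avoid float drift."""
    count = int(round((high - low) / step)) + 1
    return [round(low + i * step, 10) for i in range(count)]


top_k_values = list(range(10, 21))  # assumes an integer step of 1
temperature_values = grid_values(0.5, 1.0, 0.1)

# Every (top_k, temperature) pair in the search space.
grid = list(product(top_k_values, temperature_values))
print(len(top_k_values), len(temperature_values), len(grid))
```

&lt;p&gt;Under these assumptions the grid has 66 combinations, so an experiment capped at 10 trials, like the one in this post, covers only part of it.&lt;/p&gt;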

&lt;p&gt;Let’s submit the experiment! We’ll use the &lt;a href=&quot;https://github.com/kubeflow/katib/blob/c18035e1041ca1b87ea7eb7c01cb81b5e2b922b3/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L178&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tune&lt;/code&gt; API&lt;/a&gt;, which runs multiple trials to find the optimal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;top_k&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;temperature&lt;/code&gt; values for our RAG pipeline.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;katib_client&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;katib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KatibClient&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;namespace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;kubeflow&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;rag-tuning-experiment&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;katib_client&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tune&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;objective&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;objective&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;algorithm_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;grid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Grid search for hyperparameter tuning
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;objective_metric_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;BLEU&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;objective_type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;maximize&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;objective_goal&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max_trial_count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Run up to 10 trials
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;parallel_trial_count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Run 2 trials in parallel
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;resources_per_trial&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cpu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;memory&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2Gi&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;base_image&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;python:3.10-slim&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;packages_to_install&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;transformers==4.36.0&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;sentence-transformers==2.2.2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;faiss-cpu==1.7.4&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;numpy==1.23.5&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;huggingface_hub==0.20.0&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;nltk==3.9.1&quot;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once the experiment is submitted, we can see output indicating that Katib has started the trials:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-commandline&quot;&gt;Experiment Trials status: 0 Trials, 0 Pending Trials, 0 Running Trials, 0 Succeeded Trials, 0 Failed Trials, 0 EarlyStopped Trials, 0 MetricsUnavailable Trials
Current Optimal Trial:
 {'best_trial_name': None,
 'observation': {'metrics': None},
 'parameter_assignments': None}
Experiment conditions:
 [{'last_transition_time': datetime.datetime(2025, 3, 13, 19, 40, 32, tzinfo=tzutc()),
 'last_update_time': datetime.datetime(2025, 3, 13, 19, 40, 32, tzinfo=tzutc()),
 'message': 'Experiment is created',
 'reason': 'ExperimentCreated',
 'status': 'True',
 'type': 'Created'}]
Waiting for Experiment: kubeflow/rag-tuning-experiment to reach Succeeded condition

.....

Experiment Trials status: 9 Trials, 0 Pending Trials, 2 Running Trials, 7 Succeeded Trials, 0 Failed Trials, 0 EarlyStopped Trials, 0 MetricsUnavailable Trials
Current Optimal Trial:
 {'best_trial_name': 'rag-tuning-experiment-66tmh9g7',
 'observation': {'metrics': [{'latest': '0.047040418725887996',
                              'max': '0.047040418725887996',
                              'min': '0.047040418725887996',
                              'name': 'BLEU'}]},
 'parameter_assignments': [{'name': 'top_k', 'value': '10'},
                           {'name': 'temperature', 'value': '0.6'}]}
Experiment conditions:
 [{'last_transition_time': datetime.datetime(2025, 3, 13, 19, 40, 32, tzinfo=tzutc()),
 'last_update_time': datetime.datetime(2025, 3, 13, 19, 40, 32, tzinfo=tzutc()),
 'message': 'Experiment is created',
 'reason': 'ExperimentCreated',
 'status': 'True',
 'type': 'Created'}, {'last_transition_time': datetime.datetime(2025, 3, 13, 19, 40, 52, tzinfo=tzutc()),
 'last_update_time': datetime.datetime(2025, 3, 13, 19, 40, 52, tzinfo=tzutc()),
 'message': 'Experiment is running',
 'reason': 'ExperimentRunning',
 'status': 'True',
 'type': 'Running'}]
Waiting for Experiment: kubeflow/rag-tuning-experiment to reach Succeeded condition
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can also use kubectl to list the experiment and the trials that are running to search for the best parameters:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-commandline&quot;&gt;kubectl get experiments.kubeflow.org -n kubeflow
NAME                    TYPE      STATUS   AGE
rag-tuning-experiment   Running   True     10m
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&quot;language-commandline&quot;&gt;kubectl get trials --all-namespaces
NAMESPACE   NAME                             TYPE      STATUS   AGE
kubeflow    rag-tuning-experiment-7wskq9b9   Running   True     10m
kubeflow    rag-tuning-experiment-cll6bt4z   Running   True     10m
kubeflow    rag-tuning-experiment-hzxrzq2t   Running   True     10m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The completed trials and their results are shown in the Katib UI, as in the screenshots
below. Steps to access the Katib UI are available &lt;a href=&quot;https://www.kubeflow.org/docs/components/katib/user-guides/katib-ui/&quot;&gt;in the documentation&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2025-02-21-katib-rag-optimization/katib_experiment_run.jpeg&quot; alt=&quot;completed_runs&quot; /&gt;
&lt;img src=&quot;/images/2025-02-21-katib-rag-optimization/katib_ui.jpeg&quot; alt=&quot;trial details&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In this experiment, we leveraged Kubeflow Katib to optimize a
Retrieval-Augmented Generation (RAG) pipeline, systematically tuning key
hyperparameters like top_k and temperature to enhance retrieval precision and
generative response quality.&lt;/p&gt;

&lt;p&gt;For anyone working with RAG systems or hyperparameter optimization, Katib is a
powerful tool—enabling scalable, efficient, and intelligent tuning of machine
learning models! We hope this tutorial helps you streamline hyperparameter
tuning and unlock new efficiencies in your ML workflows!&lt;/p&gt;</content><author><name>Varsha Prasad Narsing (@varshaprasad96)</name></author><category term="katib" /><summary type="html">Introduction</summary></entry><entry><title type="html">Synthetic Data Generation with Kubeflow Pipelines</title><link href="https://blog.kubeflow.org/kfp/2025/02/16/synthetic-data-using-kfp.html" rel="alternate" type="text/html" title="Synthetic Data Generation with Kubeflow Pipelines" /><published>2025-02-16T00:00:00-06:00</published><updated>2025-02-16T00:00:00-06:00</updated><id>https://blog.kubeflow.org/kfp/2025/02/16/synthetic-data-using-kfp</id><content type="html" xml:base="https://blog.kubeflow.org/kfp/2025/02/16/synthetic-data-using-kfp.html">&lt;h3 id=&quot;synthetic-data-generation---why-and-how&quot;&gt;Synthetic Data Generation - Why and How?&lt;/h3&gt;

&lt;p&gt;When creating insights, decisions, and actions from data, the best results come from real data. But accessing real data often requires lengthy security and legal processes. The data may also be incomplete, biased, or too small, and during early exploration, we may not even know if it’s worth pursuing. While real data is essential for proper evaluation, gaps or limited access frequently hinder progress until the formal process is complete.&lt;/p&gt;

&lt;p&gt;To address these challenges, synthetic data provides an alternative. It mimics real data’s statistical properties while preserving privacy and accessibility. Synthetic data generators (synthesizers) are models trained on real data to generate new datasets that follow the same statistical distributions and relationships but do not contain real records. This allows for accelerated development, improved data availability, and enhanced privacy.&lt;/p&gt;

&lt;p&gt;Depending on the technique used, synthetic data not only mirrors the basic statistical properties of real data but also preserves correlations between features. Synthesizers such as those based on Gaussian Copulas, Generative Adversarial Networks (GANs), and Variational Autoencoders (VAEs) enable the creation of high-fidelity synthetic datasets. These techniques are described in more detail below.&lt;/p&gt;

&lt;h3 id=&quot;key-benefits-of-using-synthetic-data&quot;&gt;Key Benefits of Using Synthetic Data&lt;/h3&gt;

&lt;p&gt;While the above focuses on speed of development in general, and on augmenting data to improve the performance of analytical models, there are more motivations for &lt;em&gt;creating&lt;/em&gt; (synthetic) data:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Enhanced Privacy and Security&lt;/strong&gt;&lt;br /&gt;
Mimics real datasets without containing sensitive or personally identifiable information, mitigating privacy risks and ensuring compliance with regulations like GDPR.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Improved Data Availability&lt;/strong&gt;&lt;br /&gt;
Enables testing and training of models without requiring extensive real-world data collection.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Innovation and Experimentation&lt;/strong&gt;&lt;br /&gt;
Allows safe experimentation with new algorithms and models without exposing sensitive data, fostering rapid prototyping in a secure environment.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Ethical and Responsible AI Development&lt;/strong&gt;&lt;br /&gt;
Can reduce the biases present in real-world datasets when the generation is designed for it, promoting fairer AI systems.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Accelerated Testing and Deployment&lt;/strong&gt;&lt;br /&gt;
Supports testing of new products, services, and systems in a controlled yet realistic setting, ensuring they are robust, scalable, and ready for real-world use.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt;&lt;br /&gt;
Reduces expenses related to data collection, storage, and compliance by eliminating the need for large-scale real-world data acquisition.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Regulatory Compliance Simplification&lt;/strong&gt;&lt;br /&gt;
Helps organizations navigate complex data regulations by offering a compliant alternative to real-world datasets, easing cross-border data transfers.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Balanced and Augmented Datasets&lt;/strong&gt;&lt;br /&gt;
Supplements real-world data by balancing underrepresented classes, improving model performance, and reducing biases in AI training.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Resilience Against Data Scarcity&lt;/strong&gt;&lt;br /&gt;
Enables AI development in domains where real-world data is limited, expensive, or difficult to obtain—such as healthcare and cybersecurity—by generating high-quality alternative datasets.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To realize these benefits, we need effective tools for generating synthetic data. Different frameworks exist for this purpose, ranging from cloud-based platforms to open-source solutions. In this post, we focus on &lt;strong&gt;open-source synthetic data generation frameworks&lt;/strong&gt; that provide control, flexibility, and on-premise deployment options.&lt;/p&gt;

&lt;h3 id=&quot;frameworks-for-creating-synthetic-data&quot;&gt;Frameworks for Creating Synthetic Data&lt;/h3&gt;

&lt;p&gt;This post focuses exclusively on open source frameworks.
Some data cannot be sent to the cloud, so cloud-based synthetic data generation solutions are not always a good fit.
For data that is already in the cloud, cloud-based frameworks can be used to generate synthetic data.&lt;/p&gt;

&lt;p&gt;Synthesizers are motivated by multiple factors, but in this context, our focus remains on generating synthetic data for on-premise use.&lt;/p&gt;

&lt;p&gt;So, what framework did we (initially) choose? Currently, we are using the open source version of &lt;a href=&quot;https://sdv.dev/&quot;&gt;SDV&lt;/a&gt;,
an easy-to-use framework with a strong community and many useful features out of the box (e.g. built-in evaluators and many modeling techniques).
The field of synthetic data is evolving rapidly. While we do not aim to cover the latest advancements exhaustively, the use of foundation models is certainly an area of interest.&lt;/p&gt;

&lt;p&gt;One of the most widely used open-source libraries for synthetic data generation is &lt;strong&gt;Synthetic Data Vault (SDV)&lt;/strong&gt;. It provides multiple synthesizers, each tailored for different types of data and statistical properties.&lt;/p&gt;

&lt;h3 id=&quot;the-synthetic-data-vault-sdv&quot;&gt;The Synthetic Data Vault (SDV)&lt;/h3&gt;

&lt;p&gt;When you initialize and fit a synthesizer (like GaussianCopulaSynthesizer, CTGANSynthesizer, etc. - see below), it trains a model based on 
the dataset you provide. This model learns the distribution of the data, capturing the relationships and dependencies between 
different features in the dataset. The synthesizer doesn’t memorize individual records from the dataset. Instead, it tries to learn the underlying statistical patterns, correlations, and distributions present in the data.&lt;/p&gt;

&lt;p&gt;Below are the (free) synthesizers provided by SDV that we evaluated on each use case. Each synthesizer does this differently:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;GaussianCopulaSynthesizer:&lt;/strong&gt; Uses statistical copula functions to model relationships between features, ensuring accurate marginal distributions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CTGANSynthesizer:&lt;/strong&gt; Uses Generative Adversarial Networks (GANs) to learn complex data distributions, particularly effective for categorical and mixed-type data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;TVAESynthesizer:&lt;/strong&gt; Leverages Variational Autoencoders (VAEs) to capture latent representations, useful for continuous and structured data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CopulaGANSynthesizer:&lt;/strong&gt; Combines Copula-based statistical modeling with GANs to generate data with complex dependencies.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PARSynthesizer:&lt;/strong&gt; Uses autoregressive models to generate sequential data while preserving temporal dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;There are more synthesizers, also from SDV, but not all are open source.&lt;/em&gt; We used the first four when evaluating which synthesizer was best for each of our use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generators - generating new data - on demand&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Synthesizers are statistical or (more often) AI models trained to mimic the real data. Once trained, a synthesizer can generate as much synthetic data as is useful for your use case. The generated data follows the same statistical properties and distributions as the original dataset without directly copying any real data points. Need more data? Just call the generator.&lt;/p&gt;
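
&lt;p&gt;As a rough illustration of the train-once, generate-on-demand idea, the sketch below fits one of the synthesizers listed above and later reloads it to sample new rows. It assumes the open-source SDV 1.x single-table API and a pandas DataFrame as input; newer SDV releases may prefer a different metadata class. The function names, file name, and row count are placeholders for illustration and are not taken from our pipelines.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Sketch, assuming the open-source SDV 1.x single-table API and pandas.
# Function names, file name, and row count are illustrative placeholders.


def train_and_save(real_df, path='synthesizer.pkl'):
    # Imports are local so this module can be loaded without SDV installed.
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import GaussianCopulaSynthesizer

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_df)  # infer column types from the data
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real_df)  # learn distributions and correlations
    synthesizer.save(path)  # persist the trained generator
    return synthesizer


def generate(path='synthesizer.pkl', num_rows=1000):
    from sdv.single_table import GaussianCopulaSynthesizer

    # Reload the trained generator and draw new synthetic rows on demand.
    synthesizer = GaussianCopulaSynthesizer.load(path)
    return synthesizer.sample(num_rows=num_rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the saved file holds only the trained model, it can be copied to another environment and sampled there without access to the original records.&lt;/p&gt;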

&lt;h3 id=&quot;evaluation-criteria-for-synthetic-data&quot;&gt;Evaluation Criteria for Synthetic Data&lt;/h3&gt;

&lt;p&gt;But how good is synthetic data, and how do we evaluate it?&lt;/p&gt;

&lt;p&gt;There are many aspects to consider when making use of synthetic data, and it is important to evaluate which synthetic data generation technique (synthesizer) is best for our specific dataset and use case.&lt;/p&gt;

&lt;p&gt;We need to ensure a good balance between:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Usability – How useful is the synthetic data for the intended use case?&lt;/li&gt;
  &lt;li&gt;Fidelity – How well does the synthetic data preserve statistical properties of the real data?&lt;/li&gt;
  &lt;li&gt;Privacy – Does the generated data ensure an acceptable level of privacy for the given use case?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For now, we are focusing only on usability and fidelity, using framework-provided measurements for fidelity and workflows described below to assess usability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comments on privacy and privacy preserving techniques&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ensuring privacy in synthetic data is a non-trivial problem. Although there are techniques that provide certain levels of privacy, it remains an active area of research.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Privacy problems, in synthetic data?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While synthetic data enhances privacy by removing personally identifiable information, it is not inherently risk-free. Some key challenges include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Overfitting and Memorization: If a synthesizer is overfitted, it may generate synthetic records that closely resemble real data, leading to privacy leakage.&lt;/li&gt;
  &lt;li&gt;Anomaly Exposure: Unique individuals or rare events in the dataset (e.g., a very wealthy individual or a rare disease) may be unintentionally replicated in synthetic data, creating a risk of re-identification.&lt;/li&gt;
  &lt;li&gt;Re-identification Attacks: Even if synthetic data is statistically different from real data, attackers may use background knowledge to infer sensitive details about individuals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An additional complication is that the anomalies may be exactly what we are looking for. We are currently experimenting with various differential privacy strategies, but it is still early days, and we do not focus on them in the examples below.&lt;/p&gt;

&lt;h3 id=&quot;our-on-premise-analytics-platform-arcus&quot;&gt;Our On-Premise Analytics Platform: ARCUS&lt;/h3&gt;

&lt;div style=&quot;display: flex; align-items: center; gap: 20px;&quot;&gt;
  &lt;p&gt;
    ARCUS is Telia’s advanced on-premise analytics platform, designed to support a wide range of use cases.
    The platform provides a Kubeflow-based MLOps environment for descriptive, predictive, generative, and (ongoing) agentic AI. 
    Built entirely on open source, ARCUS integrates a comprehensive stack of components into a unified platform, with Kubernetes as the cornerstone.
  &lt;/p&gt;
&lt;/div&gt;

&lt;h3 id=&quot;needed-environment-to-create-synthetic-data&quot;&gt;Needed environment to create synthetic data&lt;/h3&gt;

&lt;p&gt;For an efficient, automated selection of the best synthesizer, we need several things from the underlying platform, including GPUs and MLOps tooling (Kubeflow):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Kubeflow pipelines&lt;/li&gt;
  &lt;li&gt;GPU capabilities (for performance and efficiency)&lt;/li&gt;
  &lt;li&gt;Development (IDE) environment (for framework building and running)&lt;/li&gt;
  &lt;li&gt;Modern data platform (MinIO, Airflow) for automating the generation of synthetic datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;parallelism-needed&quot;&gt;Parallelism needed&lt;/h4&gt;

&lt;p&gt;In the (Kube)flows below, we run evaluations in parallel, one for each synthesizer, followed by a comparison of usability and fidelity scores to select the ‘winner’.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; In earlier versions of Kubeflow, we noticed that parallelism did not behave as expected: the pipeline waited for all parallel threads to complete before moving to the next step. We had to create a temporary workaround for this; the issue is solved in Kubeflow Pipelines 2.3.0.&lt;/p&gt;
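
&lt;p&gt;The fan-out and fan-in logic described above can be sketched in plain Python. This is only an illustration of the selection step, not our pipeline code: the scores are made-up placeholders rather than measured results, the equal weighting of fidelity and usability is an assumption, and a thread pool stands in for the parallel pipeline branches.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Sketch: pick the 'winning' synthesizer from per-synthesizer evaluation scores.
# The numbers below are illustrative placeholders, not measured results.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scores between 0 and 1; in a real pipeline each pair would come
# from one parallel evaluation branch.
PLACEHOLDER_SCORES = {
    'GaussianCopulaSynthesizer': {'fidelity': 0.80, 'usability': 0.70},
    'CTGANSynthesizer': {'fidelity': 0.75, 'usability': 0.85},
    'TVAESynthesizer': {'fidelity': 0.78, 'usability': 0.74},
    'CopulaGANSynthesizer': {'fidelity': 0.72, 'usability': 0.73},
}


def evaluate(name):
    # Stand-in for one evaluation branch: returns (name, combined score).
    scores = PLACEHOLDER_SCORES[name]
    # Equal weighting is an assumption; adjust the weights to the use case.
    combined = 0.5 * scores['fidelity'] + 0.5 * scores['usability']
    return name, combined


def select_winner(names):
    # Fan out one evaluation per synthesizer, then fan in and keep the best.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        results = list(pool.map(evaluate, names))
    return max(results, key=lambda item: item[1])


winner, score = select_winner(list(PLACEHOLDER_SCORES))
print(winner, round(score, 3))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In a pipeline, each call to the evaluation function would correspond to one parallel task, and the final comparison would run only after all of those tasks have reported their scores.&lt;/p&gt;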

&lt;p&gt;Below, we briefly describe the base flow for selecting a synthesizer, followed by one use case where we use the resulting data generator for ML development in the cloud.&lt;/p&gt;

&lt;h2 id=&quot;exploring-the-creation-and-usefulness-of-synthetic-data&quot;&gt;Exploring the Creation and Usefulness of Synthetic Data&lt;/h2&gt;

&lt;p&gt;This is our starting point: we have a use case, the supporting data, and a developed ML model. We want to answer two questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;How similar is the synthetic data compared to the real data (interesting for e.g. visualization use cases)?&lt;/li&gt;
  &lt;li&gt;How well do the ML models based on synthetic data keep up with ML models based on real data?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Validation of synthetic data techniques&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create the synthetic data and save the best synthetic data generator. In this step, similarity measures are produced out of the box by the SDV framework.&lt;/li&gt;
  &lt;li&gt;Create the ML model (in our case, a classifier) both on the real data and on the synthetic data. Compare the performance of both models against the same real test set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/images/2025-02-16-synthetic-data-using-kfp/image-2.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The flow above shows an example where the final synthesizer is collected and saved. This step is used in the example below, which exports the resulting synthetic data generator to the cloud.&lt;/p&gt;

&lt;h2 id=&quot;using-synthetic-data-generators-to-enable-multiple-environments-without-data-transfer&quot;&gt;Using Synthetic Data Generators to Enable Multiple Environments without Data Transfer&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Below is a use case where we need to make use of both on-premise and cloud environments, without moving data to the cloud.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem statement:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Our data cannot be moved from on-premise to cloud.&lt;/li&gt;
  &lt;li&gt;We need extra compute power, in our public cloud environment, to create an ML model for use on-premise.&lt;/li&gt;
  &lt;li&gt;The ML model is to be used on-premise, on new incoming data streams (that cannot be moved to cloud)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create synthetic data for our on-premise use cases and, as a side product, save
the synthetic data generator (the pickled model used to create synthetic data).&lt;/li&gt;
  &lt;li&gt;Copy the synthetic data generator to cloud&lt;/li&gt;
  &lt;li&gt;Use the synthetic data generator in the cloud, creating synthetic data for training of an ML model&lt;/li&gt;
  &lt;li&gt;Copy the ML model on-premise, and use it for new incoming data&lt;/li&gt;
  &lt;li&gt;Evaluate: Compare the on-premise AI model with the model created in the cloud - against the same test data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;/images/2025-02-16-synthetic-data-using-kfp/image.png&quot; width=&quot;700&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Division of work: what is done on-premise with Kubeflow, and what is done in the cloud (AWS SageMaker)?&lt;/strong&gt;&lt;/p&gt;

&lt;h5 id=&quot;on-premise&quot;&gt;On-premise&lt;/h5&gt;

&lt;p&gt;See the above &lt;em&gt;Validation of synthetic data techniques&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Develop the model on real data – for the comparison later with the cloud model.&lt;/li&gt;
  &lt;li&gt;Create synthetic generators, evaluate the generators, and export the best generator to AWS.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;cloud&quot;&gt;Cloud&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Use the imported synthetic generator (from on-premise)&lt;/li&gt;
  &lt;li&gt;Create synthetic data using the synthetic data generator&lt;/li&gt;
  &lt;li&gt;Develop the model and determine which synthetic generator is the best&lt;/li&gt;
  &lt;li&gt;Increase the amount of synthetic data to see whether more synthetic data improves model performance (this is not guaranteed; see the comment below)&lt;/li&gt;
  &lt;li&gt;Export model to on-premise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/images/2025-02-16-synthetic-data-using-kfp/image-3.png&quot; width=&quot;500&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The flow is shown in more detail below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2025-02-16-synthetic-data-using-kfp/image-4.png&quot; width=&quot;700&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;on-premise-1&quot;&gt;On-premise&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Compare real data model against synthetic data model – using real test data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;In the current examples we see near-equivalent performance of the ML models (a few percentage points lower for models created using synthetic data). We experimented with increasing the size of the synthetic dataset and saw minor improvements. Augmenting the training data is expected (not tested here) to have a larger effect when using deep learning algorithms.&lt;/em&gt;&lt;/p&gt;

&lt;h1 id=&quot;summary&quot;&gt;Summary&lt;/h1&gt;

&lt;p&gt;Clearly, the above workflows would be very cumbersome to build and maintain without Kubeflow. 
Our solution is entirely open source, Kubernetes based, and uses Kubeflow and SDV to give us the scalability, robustness, and detailed control that is required.&lt;/p&gt;

&lt;p&gt;The area of synthetic data generation is moving fast, along with the overall AI field.
Reports from &lt;a href=&quot;https://www.ibm.com/think/topics/synthetic-data&quot;&gt;IBM&lt;/a&gt; and others on the increased use of synthetic data, e.g. for LLM training, are frequent, but the application areas are much broader.
We also expect more capable synthesizers and, hopefully, privacy-preserving techniques that keep up with the innovation in this area.
Our original main motivator was faster innovation and experimentation and, overall, speed to market, which is often a key pain point for our teams.&lt;/p&gt;

&lt;p&gt;Looking ahead, we are exploring the development of a synthesizer catalog — ideally integrated into our overall data catalog — to enable users to rapidly experiment with ideas and get started more efficiently.&lt;/p&gt;</content><author><name>&lt;a href='https://www.linkedin.com/in/aaked'&gt;Åke Edlund&lt;/a&gt;, &lt;a href='https://www.linkedin.com/in/tarekabouzeid91'&gt;Tarek Abouzeid&lt;/a&gt;</name></author><category term="kfp" /><summary type="html">Synthetic Data Generation - Why and How?</summary></entry><entry><title type="html">Kubeflow and Me: A Story Started with Push-based Metrics Collection</title><link href="https://blog.kubeflow.org/gsoc-2024-project-6/" rel="alternate" type="text/html" title="Kubeflow and Me: A Story Started with Push-based Metrics Collection" /><published>2024-09-28T00:00:00-05:00</published><updated>2024-09-28T00:00:00-05:00</updated><id>https://blog.kubeflow.org/gsoc-2024-summary-push-basd-metrics-collection</id><content type="html" xml:base="https://blog.kubeflow.org/gsoc-2024-project-6/">&lt;p&gt;This summer, I gained a precious opportunity to participate in the Google Summer of Code(GSoC), in which I would contribute to Katib and fulfill a project named &lt;a href=&quot;https://www.kubeflow.org/events/gsoc-2024/#project-6-push-based-metrics-collection-for-katib&quot;&gt;“Push-based Metrics Collection in Katib”&lt;/a&gt; within 12 weeks. 
I first learned about GSoC and Kubeflow through a recommendation on the personal blog of Ce Gao (gaocegege), a former active maintainer. Deeply impressed by the idea of cloud native AI toolkits, I decided to dive into this area and learn skills to enhance my career and future.
In this blog post, I’ll share my personal insights into Katib for those who are interested in cloud native, AI, and hyperparameter tuning.&lt;/p&gt;

&lt;h2 id=&quot;problem&quot;&gt;Problem&lt;/h2&gt;

&lt;p&gt;The project aims to provide a Python SDK API interface for users to push metrics to Katib DB directly.&lt;/p&gt;

&lt;p&gt;The current implementation of the Metrics Collector is pull-based. This raises design problems, such as determining how frequently to scrape the metrics; performance issues, such as the overhead caused by too many sidecar containers; and restrictions on development environments, which must support sidecar containers and admission webhooks. In addition, data scientists need to pay attention to the format of the metrics printed in their training scripts, which is error-prone and where mistakes may be hard to recognize.&lt;/p&gt;

&lt;h2 id=&quot;solution&quot;&gt;Solution&lt;/h2&gt;

&lt;p&gt;We decided to implement a new API for the Katib Python SDK that offers users a push-based way to store metrics directly in the Katib DB and resolves the issues raised by pull-based metrics collection.&lt;/p&gt;

&lt;p&gt;In the new design, users just need to set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;metrics_collector_config={&quot;kind&quot;: &quot;Push&quot;}&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tune()&lt;/code&gt; function and call the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;report_metrics()&lt;/code&gt; API in their objective function to push metrics to Katib DB directly. Sidecar containers and restricted metric log formats are no longer needed. After that, the Trial Controller continuously collects metrics from Katib DB and updates the status of the Trial, just as in pull-based metrics collection.&lt;/p&gt;

&lt;p&gt;If you are interested in it, please refer to this &lt;a href=&quot;https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/#push-based-metrics-collector&quot;&gt;doc&lt;/a&gt; and &lt;a href=&quot;https://github.com/kubeflow/katib/blob/master/examples/v1beta1/sdk/mnist-with-push-metrics-collection.ipynb&quot;&gt;example&lt;/a&gt; for more details.&lt;/p&gt;
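
&lt;p&gt;To make the two changes described above concrete, here is a minimal sketch of how they might fit together. It assumes a Katib Python SDK version that ships &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;report_metrics()&lt;/code&gt;. The experiment name, the toy objective, the search ranges, and the unpinned package entry are illustrative assumptions; the linked example remains the authoritative version.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Sketch of push-based metrics collection with the Katib Python SDK.
# Experiment name, metric name, and search ranges are illustrative assumptions.


def objective(parameters):
    # tune() packages this function into the trial, so imports live inside it.
    import kubeflow.katib as katib

    # Toy objective; a real one would train and evaluate a model.
    result = 4 * int(parameters['a']) - float(parameters['b']) ** 2

    # Push the metric to Katib DB instead of printing it for a sidecar to scrape.
    katib.report_metrics({'result': result})


def submit_experiment(name='push-metrics-example'):
    import kubeflow.katib as katib

    client = katib.KatibClient(namespace='kubeflow')
    client.tune(
        name=name,
        objective=objective,
        parameters={
            'a': katib.search.int(min=10, max=20),
            'b': katib.search.double(min=0.1, max=0.2),
        },
        objective_metric_name='result',
        max_trial_count=4,
        # Assumption: the trial image needs the Katib SDK to call report_metrics().
        packages_to_install=['kubeflow-katib'],
        metrics_collector_config={'kind': 'Push'},  # enable push-based collection
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that the objective function never prints the metric in a special format; the only contract with Katib is the dictionary passed to the reporting call.&lt;/p&gt;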

&lt;p&gt;&lt;img src=&quot;../images/2024-09-28-gsoc-2024-summary-push-based-metrics-collection/push-based-metrics-collection.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;my-contributions-during-the-gsoc&quot;&gt;My Contributions during the GSoC&lt;/h2&gt;

&lt;p&gt;I raised numerous PRs for the Katib and Training-Operator projects. Some of them are related to my GSoC project, and others contribute to the completeness of the unit tests (UTs), the simplicity of dependency management, and the compatibility of the UI component.&lt;/p&gt;

&lt;p&gt;For reference, the coding period can be roughly divided into three stages:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Convert the proposal to a KEP and discuss the architecture, API design, etc. with the mentors. (~4 weeks)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Develop a push-based metrics collection interface according to the KEP. (~8 weeks)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Write some examples and documentation, and present my work to the Kubeflow Community.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I also raised some issues, both to describe the problems and bugs I ran into during the coding period and to suggest future enhancements for Katib and the Training-Operator.&lt;/p&gt;

&lt;p&gt;A &lt;a href=&quot;https://github.com/kubeflow/katib/issues/2340&quot;&gt;GitHub issue&lt;/a&gt; tracks the progress of developing push-based metrics collection for Katib during the GSoC coding phase. If you are interested in my work or Katib, please check this issue for more details.&lt;/p&gt;

&lt;h2 id=&quot;lessons-learned&quot;&gt;Lessons Learned&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Think Twice, Code Once&lt;/strong&gt;: Andrey taught me that we should think through the API specification and all the related details before coding. This can significantly reduce the workload of the coding period and avoid big refactors of the project. Meanwhile, my understanding of Katib gradually became clearer over repeated rounds of rethinking and redesigning the architecture.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Dive into the Source Code&lt;/strong&gt;: Engineering projects nowadays are extremely complex and take considerable effort to understand. The best way to get familiar with a project is to dive into the source code and run several examples.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Communication&lt;/strong&gt;: Communication is the most important thing when collaborating with others. Expressing your idea precisely and making others understand you easily are significant skills not only in the open source community but also in various scenarios such as at a company and in group work.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;in-the-end&quot;&gt;In the End&lt;/h2&gt;

&lt;p&gt;Special Thanks:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;To my mentors &lt;a href=&quot;https://github.com/andreyvelich&quot;&gt;@andreyvelich&lt;/a&gt; &lt;a href=&quot;https://github.com/johnugeorge&quot;&gt;@johnugeorge&lt;/a&gt; &lt;a href=&quot;https://github.com/tenzen-y&quot;&gt;@tenzen-y&lt;/a&gt;, especially to Andrey. Your great knowledge about the code base and the industry impressed me a lot. Thanks for your timely responses to my PRs and for always attending the weekly meetings to solve my pending problems, from which I benefited a lot. What’s more, I well remember the night you explained the usage of Kubeflow in the industry to me with great patience and encouraged me not to doubt myself, but to just do it, explore more, and contribute more. You ignited the flame of my desire to contribute to cloud native AI.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To &lt;a href=&quot;https://github.com/gaocegege&quot;&gt;@gaocegege&lt;/a&gt;. You recommended me to the Kubeflow Community. Thanks for your patient answers to my endless silly questions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To Google. Thanks for offering such a precious opportunity for me to begin my journey in the open source world!&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I hold a firm belief that every small step counts, and that everybody in the community is unique and of great significance. There is no doubt that our joint efforts will contribute to the flourishing of our Kubeflow Community, make it the world’s best community for managing the AI lifecycle on Kubernetes, and attract much more attention from the industry. Then, more and more newcomers will pour in and work along with us.&lt;/p&gt;

&lt;p&gt;Again, I’ll continue to contribute to Kubeflow.&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;p&gt;For more details about Kubeflow and the upcoming GSoC’25 event, please check:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.kubeflow.org&quot;&gt;What is Kubeflow?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.kubeflow.org/events/gsoc-2025/&quot;&gt;Kubeflow GSoC’25 Event&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name>Shao Wang(Electronic-Waste)</name></author><category term="gsoc" /><summary type="html">This summer, I gained a precious opportunity to participate in the Google Summer of Code(GSoC), in which I would contribute to Katib and fulfill a project named “Push-based Metrics Collection in Katib” within 12 weeks. Firstly, I got to know about GSoC and Kubeflow with the recommendation from the former active maintainer Ce Gao(gaocegege)’s personal blog. And I was deeply impressed by the idea of cloud native AI toolkits, I decided to dive into this area and learn some skills to enhance my career and future. In the blog, I’ll provide my personal insight into Katib, for those who are interested in cloud native, AI, and hyperparameters tuning.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.kubeflow.org/images/logo.png" /><media:content medium="image" url="https://blog.kubeflow.org/images/logo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">LLM Hyperparameter Optimization API: My Google Summer of Code Journey with Kubeflow</title><link href="https://blog.kubeflow.org/gsoc-2024-project-4/" rel="alternate" type="text/html" title="LLM Hyperparameter Optimization API: My Google Summer of Code Journey with Kubeflow" /><published>2024-09-19T00:00:00-05:00</published><updated>2024-09-19T00:00:00-05:00</updated><id>https://blog.kubeflow.org/gsoc-2024-summary-llm-hyperparameter-optimization-api</id><content type="html" xml:base="https://blog.kubeflow.org/gsoc-2024-project-4/">&lt;p&gt;This summer, I had the opportunity to participate in the Google Summer of Code (GSoC) program, where I contributed to Kubeflow, an open-source machine learning toolkit. My project focused on developing a high-level API for optimizing hyperparameters in Large Language Models (LLMs) within Katib, Kubeflow’s automated hyperparameter tuning system. 
I’d like to share insights from this experience with others interested in Kubeflow, GSoC, or optimizing LLMs.&lt;/p&gt;

&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;/h2&gt;

&lt;p&gt;The rapid advancements and rising popularity of LLMs, such as GPT and BERT, have created a growing demand for efficient LLMOps in Kubernetes. To address this, we have developed a &lt;a href=&quot;https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/&quot;&gt;train API&lt;/a&gt; within the Training Python SDK, simplifying the process of fine-tuning LLMs using distributed PyTorchJob workers. However, hyperparameter optimization remains a crucial yet labor-intensive task for enhancing model performance.&lt;/p&gt;

&lt;h2 id=&quot;goal&quot;&gt;Goal&lt;/h2&gt;

&lt;p&gt;Hyperparameter optimization is essential but time-consuming, especially for LLMs with billions of parameters. This API simplifies the process by handling Kubernetes infrastructure, allowing data scientists to focus on model performance rather than system configuration.&lt;/p&gt;

&lt;p&gt;With this API, users can import pretrained models and datasets from Hugging Face and Amazon S3 and define parameters including the hyperparameter search space, optimization objective, and resource configuration. The API then automates the creation of an Experiment, which contains multiple Trials with different hyperparameter settings, each using PyTorch distributed training. It then collects and analyzes the metrics from each Trial to identify the optimal hyperparameter configuration.&lt;/p&gt;
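
&lt;p&gt;To make the Experiment and Trial relationship concrete, here is a conceptual, plain-Python sketch of a random search that minimizes an objective. It does not call Katib or train a model: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_trial()&lt;/code&gt; is a made-up stand-in for a fine-tuning run, and the search ranges, function names, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;train_loss&lt;/code&gt; metric are assumptions chosen for illustration.&lt;/p&gt;

```python
import random

# Search space: each hyperparameter maps to a (min, max) range.
# The ranges are illustrative, not recommendations.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 5e-5),
    "num_train_epochs": (1, 5),
}


def sample_trial(rng):
    """Draw one hyperparameter assignment, as a random-search algorithm would."""
    lr_low, lr_high = SEARCH_SPACE["learning_rate"]
    ep_low, ep_high = SEARCH_SPACE["num_train_epochs"]
    return {
        "learning_rate": rng.uniform(lr_low, lr_high),
        "num_train_epochs": rng.randint(ep_low, ep_high),
    }


def run_trial(params):
    """Stand-in for a distributed fine-tuning run; returns a made-up loss."""
    return abs(params["learning_rate"] - 3e-5) * 1e4 + 1.0 / params["num_train_epochs"]


def run_experiment(max_trial_count=10, seed=0):
    """Run several Trials and return the best one plus the full history."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_trial_count):
        params = sample_trial(rng)
        trials.append({"params": params, "train_loss": run_trial(params)})
    # Objective type "minimize": the Trial with the lowest metric wins.
    best = min(trials, key=lambda t: t["train_loss"])
    return best, trials


if __name__ == "__main__":
    best_trial, all_trials = run_experiment()
    print("trials run:", len(all_trials))
    print("best params:", best_trial["params"])
```

&lt;p&gt;In the real API, Katib runs each Trial as a training job on the cluster and handles the sampling and metric collection, so the user only declares the search space and the objective.&lt;/p&gt;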

&lt;p&gt;For detailed instructions on using the API, please refer to this &lt;a href=&quot;https://github.com/kubeflow/website/blob/b253c9402be94e7c5c044a0b1d2d9d86fd473149/content/en/docs/components/katib/user-guides/llm-hp-optimization.md&quot;&gt;guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/2024-09-19-gsoc-2024-llm-hyperparameter-optimization-api/design_tune_api.png&quot; alt=&quot;Design of API&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;my-contributions-to-the-gsoc-project&quot;&gt;My Contributions to the GSoC Project&lt;/h2&gt;

&lt;p&gt;My work on the project can be broadly divided into four stages:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Stage 1&lt;/strong&gt;: Designing the API, drafting the project proposal, and refining it into a Kubeflow Enhancement Proposal (KEP).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Stage 2&lt;/strong&gt;: Developing and implementing the high-level API.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Stage 3&lt;/strong&gt;: Implementing unit tests and end-to-end tests for the API.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Stage 4&lt;/strong&gt;: Creating documentation and presenting the work to the Kubeflow community.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, I addressed several critical bugs in previous Katib and Training Operator releases and contributed new features, such as writing end-to-end tests for the train API.&lt;/p&gt;

&lt;p&gt;For those interested, here is a &lt;a href=&quot;https://github.com/kubeflow/katib/issues/2339&quot;&gt;detailed summary&lt;/a&gt; of all the pull requests I submitted during this process.&lt;/p&gt;

&lt;h2 id=&quot;lessons-learned&quot;&gt;Lessons Learned&lt;/h2&gt;

&lt;p&gt;This is my first experience contributing to an open source project, and I gained extensive technical knowledge throughout this project, including Docker, Kubernetes, and Kubeflow itself. Before developing and implementing the API, I invested significant time onboarding and familiarizing myself with Kubeflow. The &lt;a href=&quot;https://www.kubeflow.org/docs/&quot;&gt;official documentation&lt;/a&gt; and &lt;a href=&quot;https://github.com/kubeflow&quot;&gt;GitHub repository&lt;/a&gt; were invaluable resources during this process.&lt;/p&gt;

&lt;p&gt;Beyond these technical skills, I also learned several key lessons that extend into broader personal and professional growth.&lt;/p&gt;

&lt;h3 id=&quot;think-from-the-users-perspective&quot;&gt;Think from the User’s Perspective&lt;/h3&gt;

&lt;p&gt;One key lesson was the importance of considering the user’s needs. Discussing API design with my mentors taught me to focus on what functionalities users need and how they prefer to use them. Listening to users’ feedback is crucial for effective product design.&lt;/p&gt;

&lt;h3 id=&quot;dont-fear-bugs&quot;&gt;Don’t Fear Bugs&lt;/h3&gt;

&lt;p&gt;I used to feel overwhelmed by bugs and unsure how to tackle them. When a bug caused a container failure during a Katib trial, my mentor guided me through the debugging process, teaching me how to systematically trace and understand the issue. The key is to approach debugging methodically and think through each step of the problem.&lt;/p&gt;

&lt;h3 id=&quot;communication-is-important&quot;&gt;Communication is Important&lt;/h3&gt;

&lt;p&gt;Communication is important in collaboration, especially in open source projects. There are various ways of communicating in open-source projects, such as GitHub issues or PRs, Slack, and community meetings. And I’m grateful to my mentor for discussing my challenges during our weekly meetings and providing invaluable guidance.&lt;/p&gt;

&lt;h3 id=&quot;every-contribution-counts&quot;&gt;Every Contribution Counts&lt;/h3&gt;

&lt;p&gt;Initially, I thought contributing to open source was complex. I learned that every contribution, no matter how small, is valuable and appreciated. For example, contributing to documentation is crucial, especially for newcomers.&lt;/p&gt;

&lt;h2 id=&quot;in-the-end&quot;&gt;In The End&lt;/h2&gt;

&lt;p&gt;I am deeply grateful to everyone who supported me throughout this project. Your suggestions, advice, and encouragement were invaluable in helping me complete my work.&lt;/p&gt;

&lt;p&gt;I especially want to extend my heartfelt thanks to my mentor, Andrey Velichkevich. His deep knowledge of both the project and the industry, combined with his willingness to help, has been incredibly inspiring. I greatly appreciate the time and effort he dedicated to guiding me, from the high-level design of the API to the finer details like code formatting. I have learned so much from his mentorship.&lt;/p&gt;

&lt;p&gt;Looking ahead, I am excited to continue contributing to Kubeflow. I also look forward to helping future contributors by improving documentation and sharing my experiences with newcomers in the community.&lt;/p&gt;

&lt;p&gt;If you’re interested in open-source and want to be part of Kubeflow, GSoC 2025 applications are now open! Check out the details &lt;a href=&quot;https://www.kubeflow.org/events/gsoc-2025/&quot;&gt;here&lt;/a&gt;—we’d love to have you join us!&lt;/p&gt;</content><author><name>Hezhi(Helen) Xie</name></author><category term="gsoc" /><summary type="html">This summer, I had the opportunity to participate in the Google Summer of Code (GSoC) program, where I contributed to Kubeflow, an open-source machine learning toolkit. My project focused on developing a high-level API for optimizing hyperparameters in Large Language Models (LLMs) within Katib, Kubeflow’s automated hyperparameter tuning system. I’d like to share insights from this experience with others interested in Kubeflow, GSoC, or optimizing LLMs.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.kubeflow.org/images/logo.png" /><media:content medium="image" url="https://blog.kubeflow.org/images/logo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Kubeflow 1.9: New Tools for Model Management and Training Optimization</title><link href="https://blog.kubeflow.org/kubeflow-1.9-release/" rel="alternate" type="text/html" title="Kubeflow 1.9: New Tools for Model Management and Training Optimization" /><published>2024-07-22T00:00:00-05:00</published><updated>2024-07-22T00:00:00-05:00</updated><id>https://blog.kubeflow.org/kubeflow-1.9-release</id><content type="html" xml:base="https://blog.kubeflow.org/kubeflow-1.9-release/">&lt;p&gt;Kubeflow 1.9 significantly simplifies the development, tuning and management of secure machine learning models and LLMs. Highlights include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Model Registry&lt;/strong&gt;: Centralized management for ML models, versions, and artifacts.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fine-Tune APIs for LLMs&lt;/strong&gt;: Simplifies fine-tuning of LLMs with custom datasets.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pipelines&lt;/strong&gt;: Consolidation of Tekton and Argo Workflows backends for improved flexibility.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Security Enhancements&lt;/strong&gt;: Network policies, Oauth2-proxy, and CVE scanning.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Integration Upgrades&lt;/strong&gt;: Improved integrations with Ray, Seldon, BentoML, and KServe for LLM GPU optimizations.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Installation and Documentation&lt;/strong&gt;: Streamlined installation, updated platform dependencies, and enhanced documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These updates aim to simplify workflows, improve integration dependencies, and provide Kubernetes-native operational efficiencies for enterprise scale, security, and isolation.&lt;/p&gt;

&lt;h2 id=&quot;model-registry&quot;&gt;Model Registry&lt;/h2&gt;

&lt;p&gt;A model registry provides a central catalog for ML model developers to index and manage models, versions, and ML artifacts metadata. It fills a gap between model experimentation and production activities. It provides a central interface for all stakeholders in the ML lifecycle to collaborate on ML models. A model registry has been &lt;a href=&quot;https://blog.kubeflow.org/kubeflow-user-survey-2023/#:~:text=lifecycle%2C%20followed%20by-,model%20registry%20(44%25),-and%20initial%20setup&quot;&gt;requested by the community&lt;/a&gt; for a long time, and we are delighted to introduce it to the Kubeflow ecosystem.&lt;/p&gt;

&lt;p&gt;This initial release includes REST APIs and a Python SDK to track model artifacts and model metadata with a standardized format that can be reused across Kubeflow components, such as to deploy Inference Servers. You can get started by following the &lt;a href=&quot;https://www.kubeflow.org/docs/components/model-registry/overview/&quot;&gt;Model Registry tutorial on the Kubeflow website&lt;/a&gt;, or see a short &lt;a href=&quot;https://www.youtube.com/watch?v=JVxUTkAKsMU&quot;&gt;demo video&lt;/a&gt; of the Model Registry in action.&lt;/p&gt;

&lt;p&gt;We are just getting started. This is an Alpha version and we look forward to feedback. The &lt;a href=&quot;https://docs.google.com/document/d/1DmMhcae081SItH19gSqBpFtPfbkr9dFhSMCgs-JKzNo/edit&quot;&gt;model registry working group&lt;/a&gt; meets biweekly: you can provide feedback by joining the meeting or directly on the &lt;a href=&quot;https://github.com/kubeflow/model-registry/issues&quot;&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;fine-tune-apis-for-llms&quot;&gt;Fine-Tune APIs for LLMs&lt;/h2&gt;

&lt;p&gt;In the rapidly evolving ML/AI landscape, the ability to fine-tune pre-trained models represents a significant leap towards achieving custom solutions with less effort and time. Fine-tuning with custom datasets allows practitioners to adapt large language models (LLMs) to their specific needs.&lt;/p&gt;

&lt;p&gt;However, fine-tuning tasks often require extensive manual intervention, including the configuration of training environments and the distribution of data across nodes. The new &lt;a href=&quot;https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/&quot;&gt;Fine-Tune API&lt;/a&gt; aims to simplify this process, offering an easy-to-use Python interface that abstracts away the complexity involved in setting up and executing fine-tuning tasks on distributed systems.&lt;/p&gt;

&lt;p&gt;By providing this API, Training Operator not only simplifies the user experience for ML practitioners but also leverages its existing infrastructure for distributed training. You can take advantage of Kubernetes’ ability to dynamically schedule GPUs, thus saving on compute resources and cost. Training Operator also gives you fault-tolerance guarantees that protect your training procedures from cluster node failures.&lt;/p&gt;

&lt;h2 id=&quot;pipelines&quot;&gt;Pipelines&lt;/h2&gt;

&lt;h3 id=&quot;v1-feature-parity&quot;&gt;v1 Feature Parity&lt;/h3&gt;

&lt;p&gt;We made significant progress towards KFPv1 feature parity by adding more Kubernetes resources to the Pipelines code with the new kfp-kubernetes 1.2.0 Python package. We encourage every KFP user to test the new V2 functionality and plan their migration from V1 to V2. We still have some outstanding features that need to be ported over to V2. Please help us identify what’s missing by opening a new issue in the KFP repository.&lt;/p&gt;

&lt;h3 id=&quot;argo-workflows-and-tekton-backends-consolidation&quot;&gt;Argo Workflows and Tekton Backends Consolidation&lt;/h3&gt;

&lt;p&gt;The Pipelines Tekton backend has been &lt;a href=&quot;https://github.com/kubeflow/pipelines/pull/10678&quot;&gt;merged&lt;/a&gt; into the main Kubeflow Pipelines repository. You can now choose what workflow engine to use from the same Pipelines version. This proves the extensibility and flexibility of the KFP v2 architecture, which encourages other contributors to bring support for other workflow engines.&lt;/p&gt;

&lt;p&gt;Both Argo Workflows and Tekton provide unique advantages. Argo Workflows is known for its simplicity and ease of use, making it a popular choice for many users. Tekton offers extensive customization options with its pipeline definitions and reusable components, which can be advantageous for integrating into various CI/CD systems. Depending on your specific requirements and preferences, you can leverage the strengths of either Argo Workflows or Tekton to optimize your machine learning workflows.&lt;/p&gt;

&lt;p&gt;In this &lt;a href=&quot;https://developer.ibm.com/blogs/awb-tekton-optimizations-for-kubeflow-pipelines-2-0/&quot;&gt;blog post&lt;/a&gt;, you can find more details about the benefits of running KFP with either Tekton or Argo Workflows.&lt;/p&gt;

&lt;h3 id=&quot;argo-workflows-upgrade&quot;&gt;Argo Workflows Upgrade&lt;/h3&gt;

&lt;p&gt;Kubeflow Pipelines’s Argo Workflows backend is &lt;a href=&quot;https://github.com/kubeflow/pipelines/issues/10469&quot;&gt;upgraded to 3.4.16&lt;/a&gt;. This upgrade moves the supported version closer to the latest upstream version and brings lots of CVE resolutions. The previous minor version was no longer being patched by the Argo community, so lots of security issues had accumulated over time.&lt;/p&gt;

&lt;h2 id=&quot;katib&quot;&gt;Katib&lt;/h2&gt;

&lt;p&gt;Kubeflow 1.9 ships with Katib 0.17, which brings &lt;a href=&quot;https://github.com/kubeflow/katib/pull/2315&quot;&gt;official support&lt;/a&gt; for ARM64, getting us one step closer to full ARM64 coverage.&lt;/p&gt;

&lt;p&gt;Data Scientists who submit training jobs with the Python SDK can now set the &lt;a href=&quot;https://github.com/kubeflow/katib/pull/2227&quot;&gt;algorithm settings&lt;/a&gt; and &lt;a href=&quot;https://github.com/kubeflow/katib/pull/2235&quot;&gt;environment variables&lt;/a&gt; from the tune method. Previously, you had to rely directly on Kubernetes CRD submission for these. You can also take advantage of the latest features from TensorFlow 2.16 and PyTorch 2.2. The team also worked to resolve &lt;a href=&quot;https://github.com/kubeflow/katib/issues/2346&quot;&gt;environmental conflicts&lt;/a&gt; that prevented the Katib Python SDK from being installed alongside the Kubeflow Python SDK.&lt;/p&gt;

&lt;p&gt;There are tons of additional improvements and bug fixes. Check out the full changelog &lt;a href=&quot;https://github.com/kubeflow/katib/blob/master/CHANGELOG.md&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;central-dashboard&quot;&gt;Central Dashboard&lt;/h2&gt;

&lt;p&gt;This release brings several improvements to the Kubeflow Central Dashboard, including:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Styling improvements to the sidebar, including &lt;a href=&quot;https://github.com/kubeflow/kubeflow/pull/7583&quot;&gt;grouping&lt;/a&gt; all Kubeflow Pipelines links to reduce clutter&lt;/li&gt;
  &lt;li&gt;Significant &lt;a href=&quot;https://github.com/kubeflow/kubeflow/pull/7582&quot;&gt;improvements&lt;/a&gt; to the “manage contributors” page, including the ability to manage contributors for all profiles that you are the owner of, and see which profiles you have access to, even when you are not the owner&lt;/li&gt;
  &lt;li&gt;Allow external services to &lt;a href=&quot;https://github.com/kubeflow/kubeflow/pull/7138&quot;&gt;parse&lt;/a&gt; the current profile (namespace) by sending the namespace selector value to non-iframed applications&lt;/li&gt;
  &lt;li&gt;Significant &lt;a href=&quot;https://github.com/kubeflow/kubeflow/pull/7578&quot;&gt;updates&lt;/a&gt; to dependencies to reduce CVEs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;../images/2024-07-22-kubeflow-1.9-release/dashboard.png&quot; alt=&quot;Kubeflow notebook images&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;notebooks&quot;&gt;Notebooks&lt;/h2&gt;

&lt;p&gt;With this release, we provide &lt;a href=&quot;https://github.com/kubeflow/kubeflow/pull/7590&quot;&gt;significant updates&lt;/a&gt; to all example notebook images including PyTorch 2.3.0, TensorFlow 2.15.1 and many other library updates. While you can continue to use the old images, we recommend updating to use the latest and greatest ML libraries.&lt;/p&gt;

&lt;p&gt;Additionally, notebook images now run with a &lt;a href=&quot;https://github.com/kubeflow/kubeflow/pull/7622&quot;&gt;non-root SecurityContext&lt;/a&gt;, which improves security.&lt;/p&gt;

&lt;p&gt;Take a look at the &lt;a href=&quot;https://github.com/kubeflow/kubeflow/releases/tag/v1.9.0&quot;&gt;changelog&lt;/a&gt; for a full list of bug fixes and improvements.&lt;/p&gt;

&lt;p&gt;While this release was light on new Notebooks features, the Working Group is hard at work on an exciting new project: we are actively developing Notebooks V2, with contributions from various companies, in the &lt;a href=&quot;https://github.com/kubeflow/notebooks/tree/notebooks-v2&quot;&gt;new repository&lt;/a&gt;. Take a look &lt;a href=&quot;https://github.com/kubeflow/kubeflow/issues/7156&quot;&gt;here&lt;/a&gt; and join our Working Group meetings to get involved!&lt;/p&gt;

&lt;h2 id=&quot;kubeflow-platform-security-and-manifests&quot;&gt;Kubeflow Platform (Security and Manifests)&lt;/h2&gt;

&lt;h3 id=&quot;security&quot;&gt;Security&lt;/h3&gt;

&lt;h4 id=&quot;network-policies&quot;&gt;Network Policies&lt;/h4&gt;

&lt;p&gt;Network policies are enabled for the Kubeflow core services as a second layer of defense before Istio authorization policies. This gives administrators a better network overview and segmentation while also enforcing common enterprise security guidelines.
You can read more about the current implementation and architecture &lt;a href=&quot;https://github.com/kubeflow/manifests/tree/master/common/networkpolicies&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;authentication&quot;&gt;Authentication&lt;/h4&gt;

&lt;p&gt;Oauth2-proxy replaces oidc-authservice, which brings improved token-based authentication. Machine Learning engineers can now use tokens instead of insecure passwords for CI/CD automation of Kubeflow deployment and maintenance (e.g. using GitHub Actions). 
You can read more about the current implementation and architecture &lt;a href=&quot;https://github.com/kubeflow/manifests/tree/master/common/oidc-client/oauth2-proxy&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;cve-scanning&quot;&gt;CVE Scanning&lt;/h4&gt;

&lt;p&gt;With this release we are introducing &lt;a href=&quot;https://github.com/kubeflow/manifests/blob/master/hack/trivy_scan.py&quot;&gt;automated CVE scanning&lt;/a&gt; with &lt;a href=&quot;https://github.com/aquasecurity/trivy&quot;&gt;Trivy&lt;/a&gt; on the manifests &lt;a href=&quot;https://github.com/kubeflow/manifests/blob/master/.github/workflows/trivy.yaml&quot;&gt;master branch&lt;/a&gt;. We appreciate contributions to reduce the number of CVEs; the Security Working Group needs help to build a more secure platform. You can find more details about our security scanning process and disclosure policy here. Here is a summary from June 25th:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/2024-07-22-kubeflow-1.9-release/CVE_table.png&quot; alt=&quot;Kubeflow notebook images&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can find a detailed Security WG roadmap &lt;a href=&quot;https://github.com/kubeflow/manifests/issues/2598&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;manifests&quot;&gt;Manifests&lt;/h3&gt;

&lt;h4 id=&quot;installation-and-documentation-improvements&quot;&gt;Installation and documentation improvements&lt;/h4&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/kubeflow/manifests?tab=readme-ov-file#upgrading-and-extending&quot;&gt;documentation&lt;/a&gt; has been improved and now contains guidelines for upgrading and extending the Kubeflow Platform for administrators. New users can now install Kubeflow on their laptop in just a few minutes.&lt;/p&gt;

&lt;p&gt;Platform dependencies updates:&lt;/p&gt;

&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;&lt;strong&gt;Component&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;&lt;a href=&quot;https://kubernetes.io/releases/&quot;&gt;Kubernetes&lt;/a&gt;
   &lt;/td&gt;
   &lt;td&gt;&lt;a href=&quot;https://github.com/kubernetes-sigs/kustomize/releases&quot;&gt;Kustomize&lt;/a&gt;
   &lt;/td&gt;
   &lt;td&gt;&lt;a href=&quot;https://istio.io/latest/news/releases/&quot;&gt;Istio&lt;/a&gt;
   &lt;/td&gt;
   &lt;td&gt;&lt;a href=&quot;https://github.com/dexidp/dex/releases&quot;&gt;Dex&lt;/a&gt;
   &lt;/td&gt;
   &lt;td&gt;&lt;a href=&quot;https://cert-manager.io/docs/installation/supported-releases/&quot;&gt;Cert-Manager&lt;/a&gt;
   &lt;/td&gt;
   &lt;td&gt;&lt;a href=&quot;https://knative.dev/docs/reference/relnotes/&quot;&gt;Knative&lt;/a&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;&lt;strong&gt;KF 1.9 Version&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;1.27 - 1.29
   &lt;/td&gt;
   &lt;td&gt;5.2.1+
   &lt;/td&gt;
   &lt;td&gt;1.22.1
   &lt;/td&gt;
   &lt;td&gt;2.39.1
   &lt;/td&gt;
   &lt;td&gt;1.14.5
   &lt;/td&gt;
   &lt;td&gt;1.12.4
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt; Kubernetes 1.30+ is also expected to work, but was not officially tested.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt; Kustomize 5.2.1+ is supported and produces far fewer warnings, so platform engineers have a modern and supported installation toolchain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4 id=&quot;integration-with-third-party-ml-tools&quot;&gt;Integration with third-party ML tools&lt;/h4&gt;

&lt;p&gt;We have updated the third-party components in /contrib to provide integration with the broader ML ecosystem.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/manifests/tree/master/contrib/bentoml&quot;&gt;BentoML&lt;/a&gt; 1.2.28 and 1.1.21&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/manifests/tree/master/contrib/seldon&quot;&gt;Seldon&lt;/a&gt; 1.18.1&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/manifests/tree/master/contrib/ray&quot;&gt;Ray&lt;/a&gt; 2.23 and Kuberay 1.1.1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Find a detailed Manifests WG roadmap &lt;a href=&quot;https://github.com/kubeflow/manifests/issues/2592&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;kserve&quot;&gt;KServe&lt;/h2&gt;

&lt;p&gt;We upgraded to &lt;a href=&quot;https://kserve.github.io/website/0.13/blog/articles/2024-05-15-KServe-0.13-release/&quot;&gt;KServe 0.13&lt;/a&gt;. This release includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Enhanced Hugging Face Runtime Support&lt;/strong&gt;: Hugging Face models are supported out-of-the-box, implementing a &lt;a href=&quot;https://github.com/kserve/kserve/tree/master/python/huggingfaceserver&quot;&gt;KServe Hugging Face Serving Runtime&lt;/a&gt;. Currently supported tasks include sequence classification, token classification, fill mask, text generation, and text to text generation.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;vLLM Support&lt;/strong&gt;: Dedicated runtime support for &lt;a href=&quot;https://docs.vllm.ai/en/latest/&quot;&gt;vLLM&lt;/a&gt; is now included, streamlining the deployment process for LLMs.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;OpenAI Schema Integration&lt;/strong&gt;: KServe now supports endpoints for generative transformer models, following the OpenAI protocol.  This enables KServe to be used directly with OpenAI’s client libraries or third-party tools like LangChain and LlamaIndex.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;documentation&quot;&gt;Documentation&lt;/h2&gt;

&lt;p&gt;MLOps is a complex subject and users have asked for clear, up-to-date and comprehensive documentation. We are happy to announce that we started a restructuring process to better align the various components’ docs to have a similar structure. We are revamping our docs to better align with user expectations and how you expect technical docs to be organized. We will continue to improve the quality and completeness of our docs, by adding new user guides, tutorials, and reference architecture topics.&lt;/p&gt;

&lt;p&gt;We are looking for new members who can help us craft complete and high quality documentation. Please get involved by &lt;a href=&quot;https://github.com/kubeflow/website/pulls&quot;&gt;opening and reviewing PRs&lt;/a&gt; in the Kubeflow website.&lt;/p&gt;

&lt;h2 id=&quot;honorable-mentions&quot;&gt;Honorable Mentions&lt;/h2&gt;

&lt;h3 id=&quot;google-spark-operator-migration-to-kubeflow&quot;&gt;Google Spark Operator migration to Kubeflow&lt;/h3&gt;

&lt;p&gt;We’re excited to announce the migration of Google’s Spark Operator to the Kubeflow Spark Operator, marking the launch of a significant addition to the Kubeflow ecosystem. The Kubeflow Spark Operator simplifies the deployment and management of Apache Spark applications on Kubernetes. This announcement isn’t just about a new piece of technology, it’s about building a stronger, open-governed, and more collaborative community around Spark on Kubernetes.&lt;/p&gt;

&lt;p&gt;Kubeflow Spark Operator is not yet officially included in the Kubeflow release, but you can install it by following the instructions &lt;a href=&quot;https://www.kubeflow.org/docs/components/spark-operator/getting-started/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Read more about the Kubeflow Spark Operator in the announcement &lt;a href=&quot;https://blog.kubeflow.org/operators/2024/04/15/kubeflow-spark-operator.html&quot;&gt;blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;google-summer-of-code&quot;&gt;Google Summer of Code&lt;/h3&gt;

&lt;p&gt;This year, Kubeflow was excited to participate in Google Summer of Code (GSoC), attracting a wave of enthusiastic students! Over 250 students joined our Slack channel, eager to learn about contributing to the Kubeflow community and crafting impactful proposals.&lt;/p&gt;

&lt;p&gt;We were also fortunate to have a dedicated group of 20 mentors ready to guide these talented individuals. From a pool of nearly 70 proposals, we selected 10, and Google awarded us 8 outstanding students. They are now actively contributing features and making a real difference across various Kubeflow components.&lt;/p&gt;

&lt;p&gt;We’ll be following their progress and sharing their accomplishments through a series of blog posts in the future, so stay tuned! A big thank you to all the mentors and students who are making Kubeflow’s GSoC 2024 a huge success!&lt;/p&gt;

&lt;p&gt;Thanks to all the students: Adem Baccara, Biswajit Pattnaik, Hansini Karunarathne, Hezhi Xie, Sandipan Panda, Shao Wang, Shashank Mittal, SIVASUBRAMANIAM L. Visit &lt;a href=&quot;https://summerofcode.withgoogle.com/programs/2024/organizations/kubeflow&quot;&gt;this page&lt;/a&gt; for more details about each project and respective mentors.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next&lt;/h2&gt;

&lt;p&gt;The community continues to grow, especially with the ever-growing interest from the CloudNative community in AI topics. We have recently elected the first Kubeflow Steering Committee. This is the first step towards a more mature governance structure and a democratic and open community.&lt;/p&gt;

&lt;p&gt;If you want to take a peek into the Kubeflow 1.10 roadmap planning and contribute with your ideas, see &lt;a href=&quot;https://github.com/kubeflow/kubeflow/issues/7459&quot;&gt;Notebooks&lt;/a&gt;, &lt;a href=&quot;https://github.com/kubeflow/manifests/issues/2763&quot;&gt;Manifests &amp;amp; Security&lt;/a&gt;, &lt;a href=&quot;https://github.com/kubeflow/pipelines/discussions/10908&quot;&gt;Pipelines&lt;/a&gt;, &lt;a href=&quot;https://github.com/kubeflow/model-registry/issues/175&quot;&gt;Model Registry&lt;/a&gt;, &lt;a href=&quot;https://github.com/kubeflow/katib/issues/2386&quot;&gt;Katib&lt;/a&gt;, &lt;a href=&quot;https://github.com/kubeflow/training-operator/issues/2169&quot;&gt;Training Operator&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;how-to-get-started-with-19&quot;&gt;How to get started with 1.9&lt;/h2&gt;

&lt;p&gt;Visit the Kubeflow 1.9 &lt;a href=&quot;https://www.kubeflow.org/docs/releases/kubeflow-1.9/&quot;&gt;release page&lt;/a&gt; or head over to the &lt;a href=&quot;https://www.kubeflow.org/docs/started/&quot;&gt;Getting Started&lt;/a&gt; section to learn more about installation, architecture and quick start examples.&lt;/p&gt;

&lt;h2 id=&quot;join-the-community&quot;&gt;Join the Community&lt;/h2&gt;

&lt;p&gt;We would like to thank everyone for their contributions to Kubeflow 1.9, especially Ricardo Martinelli De Oliveira for his work as the v1.9 Release Manager, and all the release team members and working group leads, who relentlessly dedicate their time to this great project.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Release team members&lt;/em&gt;: Ricardo Martinelli De Oliveira, Stefano Fioravanzo, Helber Belmiro, Diego Lovison, Ajay Nagar, Mathew Wicks, Steven Irvin, Milos Grubjesic, Andrew Scribner, Julius von Kohout.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Working Group leads&lt;/em&gt;: Andrey Velichkevich, Ce Gao, Chaoran Yu, Chen Sun, Christian Kadner, Ilias Katsakioris, James Liu, James Wu, Johnu George, Julius von Kohout, Kimonas Sotirchos, Mathew Wicks, Matteo Mortari, Ramesh Reddy, Stefano Fioravanzo, Tommy Li, Vara Bonthu, Yannis Zarkadas, Yuan Tang, Yuki Iwai.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Kubeflow Steering Committee&lt;/em&gt;: Andrey Velichkevich, Johnu George, Josh Bottum, James Wu, Yuan Tang.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Participating Distributions&lt;/em&gt;: Charmed Kubeflow (Canonical), IBM IKS, Nutanix, OpenShift AI (Red Hat), Oracle Cloud Infrastructure, DeployKF, VMware, QBO. You can find more details about Kubeflow distributions &lt;a href=&quot;https://www.kubeflow.org/docs/started/installing-kubeflow/#packaged-distributions&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;want-to-help&quot;&gt;Want to help?&lt;/h3&gt;

&lt;p&gt;The Kubeflow community Working Groups hold open meetings and are always looking for more volunteers and users to unlock the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check out the resources below. We look forward to working with you!&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Visit our &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/&quot;&gt;Kubeflow website&lt;/a&gt; or Kubeflow GitHub Page&lt;/li&gt;
  &lt;li&gt;Join the &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/&quot;&gt;Kubeflow Slack channel&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Join the &lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss&quot;&gt;kubeflow-discuss&lt;/a&gt; mailing list&lt;/li&gt;
  &lt;li&gt;Attend our weekly &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/#kubeflow-community-call&quot;&gt;community meeting&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name>Kubeflow 1.9 Release Team, Stefano Fioravanzo</name></author><category term="release" /><summary type="html">Kubeflow 1.9 significantly simplifies the development, tuning and management of secure machine learning models and LLMs. Highlights include:</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.kubeflow.org/images/logo.png" /><media:content medium="image" url="https://blog.kubeflow.org/images/logo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing the Kubeflow Spark Operator: Building a Stronger Spark on Kubernetes Community</title><link href="https://blog.kubeflow.org/operators/2024/04/15/kubeflow-spark-operator.html" rel="alternate" type="text/html" title="Announcing the Kubeflow Spark Operator: Building a Stronger Spark on Kubernetes Community" /><published>2024-04-15T00:00:00-05:00</published><updated>2024-04-15T00:00:00-05:00</updated><id>https://blog.kubeflow.org/operators/2024/04/15/kubeflow-spark-operator</id><content type="html" xml:base="https://blog.kubeflow.org/operators/2024/04/15/kubeflow-spark-operator.html">&lt;p&gt;We’re excited to announce the migration of Google’s Spark Operator to
the &lt;a href=&quot;https://github.com/kubeflow/spark-operator&quot;&gt;Kubeflow Spark Operator&lt;/a&gt;,
marking the launch of a significant addition to the &lt;a href=&quot;https://www.kubeflow.org/&quot;&gt;Kubeflow&lt;/a&gt; ecosystem. The
Kubeflow Spark Operator simplifies the deployment and management of
&lt;a href=&quot;https://spark.apache.org/docs/latest/index.html&quot;&gt;Apache
Spark&lt;/a&gt;
applications on &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt;. This
announcement isn’t just about a new piece of technology, it’s about
building a stronger, open-governed, and more collaborative community
around Spark on Kubernetes.&lt;/p&gt;

&lt;h2 id=&quot;the-journey-to-kubeflow-spark-operator&quot;&gt;The Journey to Kubeflow Spark Operator&lt;/h2&gt;

&lt;p&gt;The journey of the Kubeflow Spark Operator began with Google Cloud
Platform’s &lt;a href=&quot;https://cloud.google.com/blog/products/data-analytics/data-analytics-meet-containers-kubernetes-operator-for-apache-spark-now-in-beta&quot;&gt;Spark on Kubernetes Operator&lt;/a&gt;.
With over 2.3k stars and 1.3k forks on GitHub, this project laid the
foundation for a robust Spark on Kubernetes experience, enabling users
to deploy Spark workloads seamlessly across Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;Growth and innovation require not just code but also community.
Acknowledging the resource and time limitations faced by Google Cloud’s
original maintainers, Kubeflow has taken up the mantle. This transition
is not merely administrative but a strategic move towards fostering a
vibrant, diverse, and more actively engaged community.&lt;/p&gt;

&lt;h2 id=&quot;why-kubeflow&quot;&gt;Why Kubeflow?&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Enhanced Community Engagement:&lt;/strong&gt; Transitioning to Kubeflow opens
the door to a broader developer base, encouraging contributions and
collaboration. Since Kubeflow is a CNCF incubating project, this
transition will help bring the Cloud Native and Spark communities
closer together to build robust infrastructure for running Spark
applications on Kubernetes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Stronger Governance&lt;/strong&gt;: Kubeflow’s governance model provides a
structured environment for decision-making and project management,
ensuring sustainable growth for the Spark Operator.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Unified Ecosystem&lt;/strong&gt;: By bringing the Spark Operator under the
Kubeflow umbrella, we’re not just merging projects; we’re building
a cohesive ecosystem that enhances the Spark on Kubernetes
experience.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Integration with AI/ML:&lt;/strong&gt; Kubeflow provides several components to
address many stages of the AI/ML lifecycle. The Spark distributed
data processing capabilities are a natural expansion, allowing the
Spark community to closely collaborate and better integrate within
the end-to-end ML lifecycle.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next?&lt;/h2&gt;

&lt;p&gt;We are dedicated to not just maintaining but enhancing the Kubeflow
Spark Operator for the long term. Here’s what you can look forward to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Upcoming roadmap&lt;/strong&gt;: As part of the first release, we aim to update
the documentation with references to Kubeflow, address GitHub
workflow issues, switch the release workflows to Kubeflow’s container
registry, and resolve any other critical issues.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Ongoing Support and Enhancements&lt;/strong&gt;: At the time of migration to
the Kubeflow repository, the project had 450+ issues and
60+ pull requests. We kindly ask contributors to rebase their
code and comment on the PR to confirm that it is still
relevant. Open issues will be considered for
resolution as the broader community and contributors engage in
upcoming releases. The operator will continue to evolve,
incorporating new features and improvements to stay at the forefront
of Kubernetes deployments.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Rich Community Resources&lt;/strong&gt;: From detailed documentation to
hands-on tutorials, we’re crafting resources to help you succeed
with the Spark Operator. We are planning to host regular Spark
Operator calls to discuss user issues, questions, and future
roadmaps.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Open Doors for Contributions&lt;/strong&gt;: This is a call to arms for
developers, writers, and enthusiasts! Your contributions are the
lifeblood of this project, and there’s a place for everyone to make
a mark.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Kubeflow Working Group Data:&lt;/strong&gt; To consolidate efforts around new
data tools in the Kubeflow ecosystem, such as Spark Operator and
Model Registry, the new Working Group Data will be formalized soon.
Feel free to review &lt;a href=&quot;https://github.com/kubeflow/community/pull/673&quot;&gt;this PR&lt;/a&gt; to
get involved and provide your feedback on the charter.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;join-the-movement&quot;&gt;Join the Movement&lt;/h2&gt;

&lt;p&gt;The Kubeflow Spark Operator is more than just software. It’s a
community endeavor. Here’s how you can be a part of this journey:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Dive In&lt;/strong&gt;: Visit our &lt;a href=&quot;https://github.com/kubeflow/spark-operator&quot;&gt;GitHub repository&lt;/a&gt;
to start your journey with the Kubeflow Spark Operator.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Contribute&lt;/strong&gt;: Every code snippet, documentation update, and piece
of feedback counts. Find out how you can contribute on GitHub.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Be Part of the Community&lt;/strong&gt;: Join the &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels&quot;&gt;CNCF Slack Workspace&lt;/a&gt; 
and then join the conversation in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#kubeflow-spark-operator&lt;/code&gt; channel. 
Whether you’re seeking advice, sharing insights, or just listening in, 
your presence enriches us. Follow &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/&quot;&gt;this guide&lt;/a&gt;
to learn more about the Kubeflow community.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Kubeflow Spark Operator Community Call&lt;/strong&gt;: We’re excited to announce Spark Operator Community Monthly Meetings for Open Source Contributors starting &lt;strong&gt;May 17th, 2024 (10-11 AM PST)&lt;/strong&gt;. These meetings, held every third Friday, are your chance to discuss project updates, share ideas, and collaborate with the community. You can find the Zoom call details and meeting notes in this &lt;a href=&quot;https://docs.google.com/document/d/1AnG6ptKLBY7O6ddyNm4SVsEbfu6jiyVyN3hDDgDUnxQ/edit#heading=h.pgrbsx5c3qqo&quot;&gt;Google Doc&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the spirit of collaboration fostered on platforms like Slack, and
with the generous support of the Google Cloud team, we’re set to sail
into a promising future. The Kubeflow Spark Operator isn’t just a tool,
it’s our collective step towards harnessing the true potential of Spark
on Kubernetes. Together, let’s shape the future of cloud-native big
data processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Reference Issues&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/kubeflow/spark-operator/issues/1928#issue-2066490838&quot;&gt;Action items for adoption of Spark Kubernetes Operator in Kubeflow&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/kubeflow/community/pull/673&quot;&gt;WG Data (name provisional) proposal&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/kubeflow/spark-operator/issues/1929&quot;&gt;Update Documentation: Redirect Helm Chart Installation Links to Kubeflow Repository&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/kubeflow/spark-operator/issues/1930&quot;&gt;Update Release Workflows: Change Container Registry to Kubeflow’s ghcr.io&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;</content><author><name>&lt;a href='https://www.linkedin.com/in/varaprofile/'&gt;Vara Bonthu&lt;/a&gt;, &lt;a href='https://www.linkedin.com/in/yuchaoran/'&gt;Chaoran Yu&lt;/a&gt;, &lt;a href='https://www.linkedin.com/in/andrey-velichkevich/'&gt;Andrey Velichkevich&lt;/a&gt;, &lt;a href='https://www.linkedin.com/in/wielgusmarcin/'&gt;Marcin Wielgus&lt;/a&gt;</name></author><category term="operators" /><summary type="html">We’re excited to announce the migration of Google’s Spark Operator to the Kubeflow Spark Operator, marking the launch of a significant addition to the Kubeflow ecosystem. The Kubeflow Spark Operator simplifies the deployment and management of Apache Spark applications on Kubernetes. This announcement isn’t just about a new piece of technology, it’s about building a stronger, open-governed, and more collaborative community around Spark on Kubernetes.</summary></entry><entry><title type="html">Kubeflow Project Steering Committee Announced</title><link href="https://blog.kubeflow.org/election/2024/01/31/kubeflow-project-steering-committee-announced.html" rel="alternate" type="text/html" title="Kubeflow Project Steering Committee Announced" /><published>2024-01-31T00:00:00-06:00</published><updated>2024-01-31T00:00:00-06:00</updated><id>https://blog.kubeflow.org/election/2024/01/31/kubeflow-project-steering-committee-announced</id><content type="html" xml:base="https://blog.kubeflow.org/election/2024/01/31/kubeflow-project-steering-committee-announced.html">&lt;p&gt;We’re thrilled to &lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/IiwFd-Eoc_Y/m/ig9pMvvtAAAJ&quot;&gt;announce the results&lt;/a&gt; of the &lt;a href=&quot;https://github.com/kubeflow/community/blob/b52b8dbc020fa69731d31d6df618ab87e38f822e/elections/kubeflow-steering-committee-elections-2023.md&quot;&gt;2023 Kubeflow Steering Committee (KSC) election&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Kubeflow community has shown a strong commitment to the project’s future by casting their votes for the new leadership. Please welcome &lt;a href=&quot;https://www.linkedin.com/in/terrytangyuan/&quot;&gt;Yuan (Terry) Tang&lt;/a&gt; (Red Hat), &lt;a href=&quot;https://www.linkedin.com/in/andrey-velichkevich/&quot;&gt;Andrey Velichkevich&lt;/a&gt; (Apple), and &lt;a href=&quot;https://www.linkedin.com/in/johnu-george-83036610/&quot;&gt;Johnu George&lt;/a&gt; (Nutanix), who will be joining &lt;a href=&quot;https://www.linkedin.com/in/joshbottum/&quot;&gt;Josh Bottum&lt;/a&gt; (Consultant) and &lt;a href=&quot;https://www.linkedin.com/in/jawks/&quot;&gt;James Wu&lt;/a&gt; (Google) as KSC members. The three new members will serve a two-year term, beginning immediately and ending in 2025.&lt;/p&gt;

&lt;p&gt;The election saw a turnout of 72.88%, with 43 out of 59 eligible voters participating. The three nominees were chosen from a pool of candidates, which also included &lt;a href=&quot;https://www.linkedin.com/in/kimonas-sotirchos-1ba45b155/&quot;&gt;Kimonas Sotirchos&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/in/juliusvonkohout/&quot;&gt;Julius von Kohout&lt;/a&gt;, and &lt;a href=&quot;https://www.linkedin.com/in/mathewwicks/&quot;&gt;Mathew Wicks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This election represents a significant step forward for the Kubeflow project. We extend our deepest gratitude to the interim KSC members and the election officials for their service and to all those who were nominated and participated in this election. We eagerly anticipate the contributions of our new leadership and what we can accomplish as a project moving forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Quotes:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“Excited to start my two-year term on the Kubeflow Steering Committee on behalf of Red Hat! Thanks to everyone who’s supported me along the way, and congratulations to all the new committee members. Johnu, Andrey, and I have collaborated on various Kubeflow subprojects for many years. I look forward to working with them and the rest of the committee more closely. Kubeflow is an umbrella of projects that provide an excellent foundation for AI/ML applications in the cloud-native world. Its contributor velocity, community momentum, and industry adoption have proliferated. Together with the Kubeflow community, ecosystem projects, supporting organizations, and partners, I am confident that we will steer the project towards a successful CNCF journey and continue thriving.” - &lt;em&gt;Yuan (Terry) Tang&lt;/em&gt; (Red Hat)&lt;/p&gt;

&lt;p&gt;“I am thrilled to join the Kubeflow Steering Committee for the two-year term. Being an active member of this community for almost six years, it was great to see how the project evolves towards open governance and widespread user adoption. Thanks to everyone for your support, collaboration, and contributions throughout these years. Kubeflow stands as the foundational framework to run AI/ML workloads on Kubernetes. It bridges ML, Big Data, and Cloud Native ecosystems to facilitate a new generation of AI applications. Kubeflow components empower users at every stage of ML lifecycle from model development and training to fine-tuning and deployment. I am excited about the future of CNCF and Kubeflow together, to build an open CloudNative AI/ML platform accessible to everyone.” - &lt;em&gt;Andrey Velichkevich&lt;/em&gt; (Apple)&lt;/p&gt;

&lt;p&gt;“I am really excited to join the first Kubeflow Steering Committee formed post CNCF incubation. Kubeflow is one of the most popular enterprise-ready MLOps platforms used in production at various companies. It has become the de facto platform to run complex ML pipelines, managing model lifecycles on a large scale. Having been part of Kubeflow leadership since its inception, it is delightful to see its growing ecosystem proving its relevance in the current times. Thanks to the entire Kubeflow user and developer community, who have immensely contributed to its success over the years. I am excited to drive the future of this vibrant community and look forward to the next phase of the Kubeflow journey in the CNCF ecosystem.” - &lt;em&gt;Johnu George&lt;/em&gt; (Nutanix)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Additional Information:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/community/blob/master/elections/kubeflow-steering-committee-elections-2023.md&quot;&gt;Kubeflow Project Steering Committee Elections 2023&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/community/blob/master/elections/eligible-candidates-and-voters-2023-KSC.md&quot;&gt;List of Eligible Voters and Candidates 2023&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/IiwFd-Eoc_Y/m/ig9pMvvtAAAJ&quot;&gt;[Election] - 2023 KSC Member Election Results&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/00il0lYjaOI/m/2AJbpc9EAgAJ&quot;&gt;[Elections] Election Phase - Kubeflow Project Steering Committee Elections - Begins today&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/LWxRO6ADUgM/m/GZ5RGfv0AwAJ&quot;&gt;[Election] Testimonial Phase Kubeflow Steering Committee Election - Now Open&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/dEs1aGSd_X4/m/n-6pMaCnAQAJ&quot;&gt;[Elections] Nominations Phase - 2023 Kubeflow Project Steering Committee Elections - Opens Today&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/aIkzJVgsSp4/m/crpQv1EZAgAJ&quot;&gt;[Elections] 2023 Kubeflow Project Steering Committee Elections - Eligible Voters and Candidates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/54PGJ-ypqc8/m/iRC2UXcvAQAJ&quot;&gt;[Elections] 2023 Kubeflow Project Steering Committee Elections - Timeline and Update&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Get Involved&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Getting involved in the &lt;a href=&quot;https://www.kubeflow.org/&quot;&gt;Kubeflow &lt;/a&gt;Community offers numerous opportunities for learning, networking, and contributing to open-source development. Participating in the &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/&quot;&gt;Kubeflow community&lt;/a&gt; you’ll be joining a vibrant ecosystem of developers, data scientists, AI enthusiasts and others who are pushing the boundaries of machine learning operations with Kubeflow. By actively participating and sharing your ideas, you can make a meaningful impact and be part of this dynamic community.  Check out the &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/#kubeflow-community-call&quot;&gt;weekly community call&lt;/a&gt;, get involved in discussions on the [mailing list]&lt;/p&gt;</content><author><name>Amber Graner, David Cardozo</name></author><category term="election" /><summary type="html">We’re thrilled to announce the results of the 2023 Kubeflow Steering Committee (KSC) election.</summary></entry><entry><title type="html">Kubeflow Community Holds First Election for Kubeflow Steering Committee</title><link href="https://blog.kubeflow.org/election/2023/12/12/kubeflow-community-holds-first-election-for-kubeflow-steering-committee.html" rel="alternate" type="text/html" title="Kubeflow Community Holds First Election for Kubeflow Steering Committee" /><published>2023-12-12T00:00:00-06:00</published><updated>2023-12-12T00:00:00-06:00</updated><id>https://blog.kubeflow.org/election/2023/12/12/kubeflow-community-holds-first-election-for-kubeflow-steering-committee</id><content type="html" xml:base="https://blog.kubeflow.org/election/2023/12/12/kubeflow-community-holds-first-election-for-kubeflow-steering-committee.html">&lt;p&gt;The Kubeflow Community, known for its dedication to making machine learning workflows on Kubernetes simple, portable, and scalable, &lt;a 
href=&quot;https://groups.google.com/g/kubeflow-discuss/c/54PGJ-ypqc8&quot;&gt;has recently announced&lt;/a&gt; a significant milestone in its journey. For the first time, they are holding elections for the Kubeflow Steering Committee (KSC)!&lt;/p&gt;

&lt;p&gt;The KSC will be crucial in guiding the project’s direction, ensuring it continues to meet the needs of its growing user base. Candidates for this committee are drawn from the community’s diverse members, embodying the spirit of open-source collaboration that Kubeflow cherishes.
This election marks a new chapter in Kubeflow’s history, reflecting the community’s commitment to a democratic project governance model. It serves as a testament to the community’s growth, maturity, and dedication to inclusivity and shared decision-making.&lt;/p&gt;

&lt;p&gt;The election process for the KSC is structured to ensure that every phase is transparent and fair. As per &lt;a href=&quot;https://github.com/kubeflow/community/blob/master/elections/kubeflow-steering-committee-elections-2023.md&quot;&gt;the announced timeline&lt;/a&gt;, we are currently in &lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/dEs1aGSd_X4/m/n-6pMaCnAQAJ&quot;&gt;the nomination phase&lt;/a&gt;, where &lt;a href=&quot;https://github.com/kubeflow/community/blob/master/elections/eligible-candidates-and-voters-2023-KSC.md&quot;&gt;eligible community members&lt;/a&gt; are encouraged to nominate themselves or others for a position on the Committee. This stage will determine the group of candidates who will move forward to the testimonial phase and then the voting phase. It’s an exciting time for the community, as members step up to the challenge and show their readiness to drive the project’s future direction.&lt;/p&gt;

&lt;p&gt;Elections like this in the open source ecosystem highlight the collaborative spirit that drives these initiatives. They provide the opportunity for anyone, regardless of their role in the project, to step up and help guide its future. Let’s all join in congratulating the &lt;a href=&quot;https://www.kubeflow.org/&quot;&gt;Kubeflow Community&lt;/a&gt; on this milestone and look forward to the outcomes of this exciting election!&lt;/p&gt;

&lt;h3 id=&quot;get-involved&quot;&gt;&lt;em&gt;Get Involved:&lt;/em&gt;&lt;/h3&gt;
&lt;p&gt;We are an open and welcoming &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/&quot;&gt;community&lt;/a&gt; of software developers, data scientists, AI enthusiasts, organizations and more!  Getting involved in the Kubeflow Community is an exciting journey that offers numerous opportunities for learning, networking, and contributing to open-source development. By actively participating and sharing your ideas, you can make a meaningful impact and be part of this dynamic community. Check out the &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/#kubeflow-community-call&quot;&gt;weekly community call&lt;/a&gt;, get involved in discussions on &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/#kubeflow-mailing-list&quot;&gt;the mailing list&lt;/a&gt; or chat with others on the &lt;a href=&quot;https://www.kubeflow.org/docs/about/community/#kubeflow-slack&quot;&gt;Slack Workspace&lt;/a&gt;!&lt;/p&gt;

&lt;h3 id=&quot;important-dates&quot;&gt;&lt;em&gt;Important Dates:&lt;/em&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Exceptions Phase:&lt;/strong&gt; 4 December 2023 at 9:00am Pacific Time (Starts) - 10 December 2023 at 12:00pm Pacific Time (Ends)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Nomination Phase:&lt;/strong&gt; 11 December 2023 at 9:00am Pacific Time (Starts) - 24 December 2023 at 12:00pm Pacific Time (Ends)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Testimonial Phase:&lt;/strong&gt; 25 December 2023 at 9:00am Pacific Time (Starts) - 7 January 2024 at 12:00pm Pacific Time (Ends)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Voting Phase:&lt;/strong&gt; 8 January 2024 at 9:00am Pacific Time (Starts) - 29 January 2024 at 12:00pm Pacific Time (Ends)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Announcement&lt;/strong&gt; of Election Results: 30 January 2024&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;additional-information&quot;&gt;&lt;em&gt;Additional Information:&lt;/em&gt;&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/community/blob/master/elections/kubeflow-steering-committee-elections-2023.md&quot;&gt;Kubeflow Project Steering Committee Elections 2023&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/kubeflow/community/blob/master/elections/eligible-candidates-and-voters-2023-KSC.md&quot;&gt;List of Eligible Voters and Candidates 2023&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/dEs1aGSd_X4/m/n-6pMaCnAQAJ&quot;&gt;[Elections] Nominations Phase - 2023 Kubeflow Project Steering Committee Elections - Opens Today&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/aIkzJVgsSp4/m/crpQv1EZAgAJ&quot;&gt;[Elections] 2023 Kubeflow Project Steering Committee Elections - Eligible Voters and Candidates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://groups.google.com/g/kubeflow-discuss/c/54PGJ-ypqc8/m/iRC2UXcvAQAJ&quot;&gt;[Elections] 2023 Kubeflow Project Steering Committee Elections - Timeline and Update&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name>Amber Graner</name></author><category term="election" /><summary type="html">The Kubeflow Community, known for its dedication to making machine learning workflows on Kubernetes simple, portable, and scalable, has recently announced a significant milestone in its journey. For the first time, they are holding elections for the Kubeflow Steering Committee (KSC)!</summary></entry></feed>