A complete workflow for automated prompt optimization using Gemini Flash, few-shot selection, and evolutionary instruction search


In this tutorial, we move from traditional prompt crafting to a more systematic, programmable approach by treating prompts as tunable parameters rather than static text. Instead of guessing which instruction or example works best, we build an optimization loop around Gemini 2.0 Flash that experiments, evaluates, and automatically selects the strongest prompt configuration. Along the way, we watch the model improve step by step, demonstrating how much more powerful prompt engineering becomes when we approach it with data-driven discovery rather than intuition.

import google.generativeai as genai
import json
import random
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
import numpy as np
from collections import Counter


def setup_gemini(api_key: str = None):
   if api_key is None:
       api_key = input("Enter your Gemini API key: ").strip()
   genai.configure(api_key=api_key)
   model = genai.GenerativeModel('gemini-2.0-flash-exp')
   print("✓ Gemini 2.0 Flash configured")
   return model


@dataclass
class Example:
   text: str
   sentiment: str
   def to_dict(self):
       return {"text": self.text, "sentiment": self.sentiment}


@dataclass
class Prediction:
   sentiment: str
   reasoning: str = ""
   confidence: float = 1.0

We import all the necessary libraries and define the setup_gemini helper to configure Gemini 2.0 Flash. We also create the Example and Prediction data classes to represent dataset entries and model outputs in a clean, structured way.
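As a quick sanity check, here is a small illustrative sketch (the sample text and values are made up) showing how these two data classes behave on their own:

# Illustrative usage of the Example and Prediction data classes
example = Example(text="The battery life is superb!", sentiment="positive")
print(example.to_dict())   # {'text': 'The battery life is superb!', 'sentiment': 'positive'}

pred = Prediction(sentiment="positive", reasoning="raw model output", confidence=1.0)
print(pred.sentiment, pred.confidence)   # positive 1.0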

def create_dataset() -> Tuple[List[Example], List[Example]]:
   train_data = [
       Example("This movie was absolutely fantastic! Best film of the year.", "positive"),
       Example("Terrible experience, waste of time and money.", "negative"),
       Example("The product works as expected, nothing special.", "neutral"),
       Example("I'm blown away by the quality and attention to detail!", "positive"),
       Example("Disappointing and overpriced. Would not recommend.", "negative"),
       Example("It's okay, does the job but could be better.", "neutral"),
       Example("Incredible customer service and amazing results!", "positive"),
       Example("Complete garbage, broke after one use.", "negative"),
       Example("Average product, met my basic expectations.", "neutral"),
       Example("Revolutionary! This changed everything for me.", "positive"),
       Example("Frustrating bugs and poor design choices.", "negative"),
       Example("Decent quality for the price point.", "neutral"),
       Example("Exceeded all my expectations, truly remarkable!", "positive"),
       Example("Worst purchase I've ever made, avoid at all costs.", "negative"),
       Example("It's fine, nothing to complain about really.", "neutral"),
       Example("Absolutely stellar performance, 5 stars!", "positive"),
       Example("Broken and unusable, total disaster.", "negative"),
       Example("Meets requirements, standard quality.", "neutral"),
   ]
   val_data = [
       Example("Absolutely love it, couldn't be happier!", "positive"),
       Example("Broken on arrival, very upset.", "negative"),
       Example("Works fine, no major issues.", "neutral"),
       Example("Outstanding performance and great value!", "positive"),
       Example("Regret buying this, total letdown.", "negative"),
       Example("Adequate for basic use.", "neutral"),
   ]
   return train_data, val_data


class PromptTemplate:
   def __init__(self, instruction: str = "", examples: Optional[List[Example]] = None):
       self.instruction = instruction
       self.examples = examples or []
   def format(self, text: str) -> str:
       prompt_parts = []
       if self.instruction:
           prompt_parts.append(self.instruction)
       if self.examples:
           prompt_parts.append("\nExamples:")
           for ex in self.examples:
               prompt_parts.append(f"\nText: {ex.text}")
               prompt_parts.append(f"Sentiment: {ex.sentiment}")
       prompt_parts.append(f"\nText: {text}")
       prompt_parts.append("Sentiment:")
       return "\n".join(prompt_parts)
   def clone(self):
       return PromptTemplate(self.instruction, self.examples.copy())

We prepare a small but diverse sentiment dataset for training and validation using the create_dataset function. We then define a PromptTemplate, which lets us assemble an instruction, few-shot examples, and the current query into a single prompt string. We treat the template as a programmable object so that we can swap out instructions and examples during optimization.
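To make the template concrete, a minimal sketch like the one below (using one illustrative example) prints the exact string that would be sent to the model:

# Illustrative check of the prompt string assembled by PromptTemplate.format
demo_template = PromptTemplate(
    instruction="Classify the sentiment: positive, negative, or neutral.",
    examples=[Example("Great value for money.", "positive")],
)
print(demo_template.format("The packaging was damaged."))
# Prints the instruction, the few-shot block, and the new query,
# ending with a trailing "Sentiment:" for the model to complete.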

class SentimentModel:
   def __init__(self, model, prompt_template: PromptTemplate):
       self.model = model
       self.prompt_template = prompt_template


   def predict(self, text: str) -> Prediction:
       prompt = self.prompt_template.format(text)
       try:
           response = self.model.generate_content(prompt)
           result = response.text.strip().lower()
           for sentiment in ('positive', 'negative', 'neutral'):
               if sentiment in result:
                   return Prediction(sentiment=sentiment, reasoning=result)
           return Prediction(sentiment="neutral", reasoning=result)
       except Exception as e:
           return Prediction(sentiment="neutral", reasoning=str(e))


   def evaluate(self, dataset: List[Example]) -> float:
       correct = 0
       for example in dataset:
           pred = self.predict(example.text)
           if pred.sentiment == example.sentiment:
               correct += 1
       return (correct / len(dataset)) * 100

We wrap Gemini in the SentimentModel class so that we can call it like a regular classifier. We format the prompts via the template, call generate_content, and post-process the text to extract one of the three sentiment labels. We also add an evaluate method so we can measure accuracy on any dataset with a single call.
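As a minimal usage sketch, assuming the model object returned by setup_gemini is available, the classifier can be exercised on a single sentence or a whole dataset (each call hits the Gemini API):

# Hypothetical quick test of the SentimentModel wrapper
clf = SentimentModel(model, PromptTemplate(instruction="Classify sentiment as positive, negative, or neutral."))
print(clf.predict("The support team was wonderful.").sentiment)   # expected: positive

train_data, val_data = create_dataset()
print(f"Validation accuracy: {clf.evaluate(val_data):.1f}%")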

class PromptOptimizer:
   def __init__(self, model):
       self.model = model
       self.instruction_candidates = [
           "Analyze the sentiment of the following text. Classify as positive, negative, or neutral.",
           "Classify the sentiment: positive, negative, or neutral.",
           "Determine if this text expresses positive, negative, or neutral sentiment.",
           "What is the emotional tone? Answer: positive, negative, or neutral.",
           "Sentiment classification (positive/negative/neutral):",
           "Evaluate sentiment and respond with exactly one word: positive, negative, or neutral.",
       ]


   def select_best_examples(self, train_data: List[Example], val_data: List[Example], n_examples: int = 3) -> List[Example]:
       best_examples = None
       best_score = 0
       for _ in range(10):
           examples_by_sentiment = {
               'positive': [e for e in train_data if e.sentiment == 'positive'],
               'negative': [e for e in train_data if e.sentiment == 'negative'],
               'neutral': [e for e in train_data if e.sentiment == 'neutral']
           }
           selected = []
           for sentiment in ['positive', 'negative', 'neutral']:
               if examples_by_sentiment[sentiment]:
                   selected.append(random.choice(examples_by_sentiment[sentiment]))
           remaining = [e for e in train_data if e not in selected]
           while len(selected) < n_examples and remaining:
               selected.append(random.choice(remaining))
               remaining.remove(selected[-1])
           template = PromptTemplate(instruction=self.instruction_candidates[0], examples=selected)
           test_model = SentimentModel(self.model, template)
           score = test_model.evaluate(val_data[:3])
           if score > best_score:
               best_score = score
               best_examples = selected
       return best_examples


   def optimize_instruction(self, examples: List[Example], val_data: List[Example]) -> str:
       best_instruction = self.instruction_candidates[0]
       best_score = 0
       for instruction in self.instruction_candidates:
           template = PromptTemplate(instruction=instruction, examples=examples)
           test_model = SentimentModel(self.model, template)
           score = test_model.evaluate(val_data)
           if score > best_score:
               best_score = score
               best_instruction = instruction
       return best_instruction

We introduce the PromptOptimizer class and define a pool of candidate instructions to test. We use select_best_examples to search for a small, diverse set of few-shot examples and optimize_instruction to score each instruction variant on the validation data. We are effectively turning prompt design into a lightweight search problem over examples and instructions.
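If we want to inspect the two search stages separately, a sketch like this (again assuming a configured model) runs them one at a time before we wire them together in the compile method below:

# Hypothetical step-by-step run of the two search stages
optimizer = PromptOptimizer(model)
train_data, val_data = create_dataset()

examples = optimizer.select_best_examples(train_data, val_data, n_examples=3)
instruction = optimizer.optimize_instruction(examples, val_data)

print("Chosen instruction:", instruction)
for ex in examples:
    print(f"  [{ex.sentiment}] {ex.text}")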

   def compile(self, train_data: List[Example], val_data: List[Example], n_examples: int = 3) -> PromptTemplate:
       best_examples = self.select_best_examples(train_data, val_data, n_examples)
       best_instruction = self.optimize_instruction(best_examples, val_data)
       optimized_template = PromptTemplate(instruction=best_instruction, examples=best_examples)
       return optimized_template


def main():
   print("="*70)
   print("Prompt Optimization Tutorial")
   print("Stop Writing Prompts, Start Programming Them!")
   print("="*70)


   model = setup_gemini()
   train_data, val_data = create_dataset()
   print(f"✓ {len(train_data)} training examples, {len(val_data)} validation examples")


   baseline_template = PromptTemplate(
       instruction="Classify sentiment as positive, negative, or neutral.",
       examples=[]
   )
   baseline_model = SentimentModel(model, baseline_template)
   baseline_score = baseline_model.evaluate(val_data)


   manual_examples = train_data[:3]
   manual_template = PromptTemplate(
       instruction="Classify sentiment as positive, negative, or neutral.",
       examples=manual_examples
   )
   manual_model = SentimentModel(model, manual_template)
   manual_score = manual_model.evaluate(val_data)


   optimizer = PromptOptimizer(model)
   optimized_template = optimizer.compile(train_data, val_data, n_examples=4)

We add a compile method that combines the best examples and the best instruction into the final optimized prompt template. Inside main, we configure Gemini, create the datasets, and evaluate both the zero-shot baseline and a simple manual few-shot prompt. We then call the optimizer to produce our compiled, optimized prompt for sentiment analysis.

   optimized_model = SentimentModel(model, optimized_template)
   optimized_score = optimized_model.evaluate(val_data)


   print(f"Baseline (zero-shot):     {baseline_score:.1f}%")
   print(f"Manual few-shot:          {manual_score:.1f}%")
   print(f"Optimized (compiled):     {optimized_score:.1f}%")


   print(f"nInstruction: {optimized_template.instruction}")
   print(f"nSelected Examples ({len(optimized_template.examples)}):")
   for i, ex in enumerate(optimized_template.examples, 1):
       print(f"n{i}. Text: {ex.text}")
       print(f"   Sentiment: {ex.sentiment}")


   test_cases = [
       "This is absolutely amazing, I love it!",
       "Completely broken and unusable.",
       "It works as advertised, no complaints."
   ]


   for test_text in test_cases:
       print(f"nInput: {test_text}")
       pred = optimized_model.predict(test_text)
       print(f"Predicted: {pred.sentiment}")


   print("✓ Tutorial Complete!")


if __name__ == "__main__":
   main()

We evaluate the optimized model and compare its accuracy with the baseline and the manual few-shot setup. We print out the selected instruction and examples so that we can observe what the optimizer found, and then we run a few live test sentences to see the predictions in action. We conclude by summarizing the improvements and reinforcing the idea that prompts can be tuned programmatically rather than written by hand.
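Since json is already imported, one optional follow-up (not part of the original script) is to persist the compiled template so it can be reloaded later without re-running the search; the helpers below are a hypothetical sketch:

# Hypothetical helpers to save and reload an optimized PromptTemplate
def save_template(template: PromptTemplate, path: str = "optimized_prompt.json"):
    payload = {
        "instruction": template.instruction,
        "examples": [ex.to_dict() for ex in template.examples],
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)


def load_template(path: str = "optimized_prompt.json") -> PromptTemplate:
    with open(path) as f:
        payload = json.load(f)
    return PromptTemplate(
        instruction=payload["instruction"],
        examples=[Example(**ex) for ex in payload["examples"]],
    )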

In conclusion, we demonstrated how programmatic prompt optimization provides a repeatable, evidence-driven workflow for designing high-performing prompts. We started with a simple zero-shot baseline, then iteratively tested instructions, selected diverse examples, and compiled an optimized template that outperforms the manual efforts. This process shows that we no longer need to rely on trial-and-error prompting; instead, we run a controlled optimization cycle. We can also extend this pipeline to new tasks, richer datasets, and more advanced scoring methods, allowing us to engineer prompts with accuracy, confidence, and scalability.




