In this tutorial, we explore property-based testing with Hypothesis and build a rigorous testing pipeline that goes far beyond traditional unit testing. We apply invariant testing, differential testing, metamorphic testing, targeted exploration, and stateful testing to validate both the functional correctness and behavioral guarantees of our systems. Instead of manually generating edge cases, we let Hypothesis generate structured inputs, shrink failures to minimal counter-examples, and systematically uncover hidden bugs. Additionally, we demonstrate how modern testing practices can be directly integrated into experimental and research-driven workflows.
import sys, textwrap, subprocess, os, re, math
!{sys.executable} -m pip -q install hypothesis pytest
test_code = r'''
import re, math
import pytest
from hypothesis import (
given, assume, example, settings, note, target,
HealthCheck, Phase
)
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant, initialize, precondition
def clamp(x: int, lo: int, hi: int) -> int:
    """Return x limited to the closed range [lo, hi] (lower bound checked first)."""
    if x < lo:
        return lo
    return hi if x > hi else x
def normalize_whitespace(s: str) -> str:
    """Collapse every run of whitespace in *s* to a single space and strip the ends."""
    tokens = s.split()
    return " ".join(tokens)
def is_sorted_non_decreasing(xs):
    """Return True if xs is sorted in non-decreasing order (vacuously True for 0/1 items).

    Fix: the original used call syntax ``xs(i)`` where subscripting ``xs[i]``
    was intended (extraction artifact), which raised TypeError on any list
    with two or more elements.
    """
    return all(a <= b for a, b in zip(xs, xs[1:]))
def merge_sorted(a, b):
    """Merge two already-sorted sequences into one sorted list.

    Stable two-pointer merge, O(len(a) + len(b)).

    Fixes: ``out`` was initialized to a tuple ``()`` (tuples have no
    ``append``), and every index/slice used call syntax ``a(i)`` / ``a(i:)``
    instead of ``a[i]`` / ``a[i:]``.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        # <= keeps the merge stable: ties take from `a` first.
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # At most one of these extends with a non-empty remainder.
    out.extend(a[i:])
    out.extend(b[j:])
    return out
def merge_sorted_reference(a, b):
    """Oracle for merge_sorted: concatenate both inputs, then sort."""
    combined = [*a, *b]
    combined.sort()
    return combined
We set up the environment by installing Hypothesis and pytest and importing all the necessary modules. We start building the full test suite by defining core utility functions like clamp, normalize_whitespace, and merge_sorted. We establish the functional foundation that our property-based tests will rigorously validate in subsequent snippets.
def safe_parse_int(s: str):
    """Parse a (possibly signed) decimal integer string, regex-gated.

    Returns (True, value) on success, or (False, reason) where reason is
    "not_an_int" (shape mismatch), "too_big" (more than 2000 digits), or
    "parse_error" (defensive catch-all).

    Fixes: the regex lost its escapes in extraction (``(+-)?d+`` should be
    ``[+-]?\\d+``) and the success return was truncated (``int`` was never
    called on ``t``).
    """
    t = s.strip()
    if re.fullmatch(r"[+-]?\d+", t) is None:
        return (False, "not_an_int")
    # Guard against pathologically long digit strings before int() conversion.
    if len(t.lstrip("+-")) > 2000:
        return (False, "too_big")
    try:
        return (True, int(t))
    except Exception:
        return (False, "parse_error")
def safe_parse_int_alt(s: str):
    """Independent re-implementation of safe_parse_int using a manual digit loop.

    Exists so the two parsers can be tested differentially against each other.
    Returns the same (ok, value_or_reason) tuples as safe_parse_int.

    Fixes: string indexing/slicing used call syntax (``t(0)``, ``t(1:)``),
    and the length guard was truncated — reconstructed as ``len(t) > 2000``
    (the sign is already stripped here, matching safe_parse_int's
    ``lstrip("+-")`` count).
    """
    t = s.strip()
    if not t:
        return (False, "not_an_int")
    sign = 1
    if t[0] == "+":
        t = t[1:]
    elif t[0] == "-":
        sign = -1
        t = t[1:]
    # Only ASCII digits are accepted (stricter than \d, which is fine for
    # the generated test inputs, all codepoints 48-57).
    if not t or any(ch < "0" or ch > "9" for ch in t):
        return (False, "not_an_int")
    if len(t) > 2000:
        return (False, "too_big")
    val = 0
    for ch in t:
        val = val * 10 + (ord(ch) - 48)
    return (True, sign * val)
# Strategy producing (lo, hi) pairs with lo <= hi, for clamp tests.
# Fix: tuple indexing used call syntax t(0)/t(1) instead of t[0]/t[1].
bounds = st.tuples(st.integers(-10_000, 10_000), st.integers(-10_000, 10_000)).map(
    lambda t: (t[0], t[1]) if t[0] <= t[1] else (t[1], t[0])
)
@st.composite
def int_like_strings(draw):
    """Generate strings shaped like integers: optional sign, 1-300 ASCII digits,
    and up to 5 characters of surrounding whitespace on each side.

    Fix: the whitespace alphabets lost their backslashes in extraction —
    ``(" ", "t", "n")`` would have generated literal letters 't' and 'n';
    restored to space, tab, and newline.
    """
    sign = draw(st.sampled_from(["", "+", "-"]))
    # Codepoints 48-57 are exactly the ASCII digits '0'-'9'.
    digits = draw(st.text(alphabet=st.characters(min_codepoint=48, max_codepoint=57), min_size=1, max_size=300))
    left_ws = draw(st.text(alphabet=" \t\n", min_size=0, max_size=5))
    right_ws = draw(st.text(alphabet=" \t\n", min_size=0, max_size=5))
    return f"{left_ws}{sign}{digits}{right_ws}"
# Pre-sorted integer lists, used by the differential merge tests below.
sorted_lists = st.lists(st.integers(-10_000, 10_000), min_size=0, max_size=200).map(sorted)
We implement parsing logic and define structured strategies that generate constrained, meaningful test inputs. We create composite strategies like int_like_strings to precisely control the input space for property validation. We design sorted-list generators and bounds strategies that enable differential and invariant-based testing.
@settings(max_examples=300, suppress_health_check=[HealthCheck.too_slow])
@given(x=st.integers(-50_000, 50_000), b=bounds)
def test_clamp_within_bounds(x, b):
    """Invariant: clamp's result always lies inside [lo, hi].

    Fix: suppress_health_check must be a collection; ``(HealthCheck.too_slow)``
    is just the bare enum member, not a tuple.
    """
    lo, hi = b
    y = clamp(x, lo, hi)
    assert lo <= y <= hi
@settings(max_examples=300, suppress_health_check=[HealthCheck.too_slow])
@given(x=st.integers(-50_000, 50_000), b=bounds)
def test_clamp_idempotent(x, b):
    """Invariant: clamping an already-clamped value is a no-op.

    Fix: suppress_health_check must be a collection, not a bare enum member.
    """
    lo, hi = b
    y = clamp(x, lo, hi)
    assert clamp(y, lo, hi) == y
@settings(max_examples=250)
@given(s=st.text())
@example(" a\t\tb \n c ")
def test_normalize_whitespace_is_idempotent(s):
    """Idempotence, plus insensitivity to extra surrounding whitespace.

    Fixes: the first assertion was truncated (normalize_whitespace was never
    applied to ``t``), and the whitespace literals lost their backslash
    escapes in extraction.
    """
    t = normalize_whitespace(s)
    assert normalize_whitespace(t) == t
    # Padding the input with whitespace must not change the normal form.
    assert normalize_whitespace(" \n\t " + s + " \t") == normalize_whitespace(s)
@settings(max_examples=250, suppress_health_check=[HealthCheck.too_slow])
@given(a=sorted_lists, b=sorted_lists)
def test_merge_sorted_matches_reference(a, b):
    """Differential test: merge_sorted agrees with the sorted() oracle and
    its output is itself sorted.

    Fix: suppress_health_check must be a collection, not a bare enum member.
    """
    out = merge_sorted(a, b)
    ref = merge_sorted_reference(a, b)
    assert out == ref
    assert is_sorted_non_decreasing(out)
We define key property tests that validate correctness and idempotence across multiple functions. We use the Hypothesis @given decorator to automatically explore edge cases and verify behavioral guarantees such as boundary constraints and deterministic normalization. We also implement differential testing to ensure that our merge implementation matches a reference implementation.
@settings(max_examples=250, deadline=200, suppress_health_check=[HealthCheck.too_slow])
@given(s=int_like_strings())
def test_two_parsers_agree_on_int_like_strings(s):
    """Differential test: both parsers accept every generated int-like string
    and produce the same value.

    Fix: suppress_health_check must be a collection, not a bare enum member.
    """
    ok1, v1 = safe_parse_int(s)
    ok2, v2 = safe_parse_int_alt(s)
    assert ok1 and ok2
    assert v1 == v2
@settings(max_examples=250)
@given(s=st.text(min_size=0, max_size=200))
def test_safe_parse_int_rejects_non_ints(s):
    """safe_parse_int accepts exactly the strings its gating regex matches,
    and classifies over-long digit runs as "too_big".

    Fix: the regex lost its escapes in extraction — restored to ``[+-]?\\d+``
    so it mirrors the pattern inside safe_parse_int.
    """
    t = s.strip()
    m = re.fullmatch(r"[+-]?\d+", t)
    ok, val = safe_parse_int(s)
    if m is None:
        assert ok is False
    else:
        if len(t.lstrip("+-")) > 2000:
            assert ok is False and val == "too_big"
        else:
            assert ok is True and isinstance(val, int)
def variance(xs):
    """Sample variance with the n-1 (Bessel) denominator; 0.0 below two points."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)
@settings(max_examples=250, phases=[Phase.generate, Phase.shrink])
@given(xs=st.lists(st.integers(-1000, 1000), min_size=0, max_size=80))
def test_statistics_sanity(xs):
    """Metamorphic properties of variance: non-negative and shift-invariant.

    Fix: the shifted data was passed as a generator expression, but
    variance() calls len() (and iterates twice), so it must receive a list.
    """
    # Steer Hypothesis toward high-variance inputs (targeted exploration).
    target(variance(xs))
    if len(xs) == 0:
        assert variance(xs) == 0.0
    elif len(xs) == 1:
        assert variance(xs) == 0.0
    else:
        v = variance(xs)
        assert v >= 0.0
        k = 7
        # Adding a constant to every point must not change the variance.
        assert math.isclose(variance([x + k for x in xs]), v, rel_tol=1e-12, abs_tol=1e-12)
We extend our validation to parsing robustness and statistical correctness using targeted exploration. We verify that two independent integer parsers agree on structured inputs and that rejection rules apply to invalid strings. We further apply metamorphic testing by validating variance invariants under translation.
class Bank:
    """Minimal account model: a balance plus an append-only operation ledger.

    Fix: the ledger was initialized to a tuple ``()`` (tuples have no
    ``append``); restored to a list.
    """

    def __init__(self):
        self.balance = 0
        self.ledger = []  # list of ("dep" | "wd", amount) tuples

    def deposit(self, amt: int):
        """Add a strictly positive amount to the balance.

        Raises ValueError for non-positive amounts.
        """
        if amt <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amt
        self.ledger.append(("dep", amt))

    def withdraw(self, amt: int):
        """Remove a strictly positive amount not exceeding the balance.

        Raises ValueError for non-positive amounts or insufficient funds.
        """
        if amt <= 0:
            raise ValueError("withdraw must be positive")
        if amt > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amt
        self.ledger.append(("wd", amt))

    def replay_balance(self):
        """Recompute the balance from the ledger — the consistency oracle
        checked by the state-machine invariant."""
        bal = 0
        for typ, amt in self.ledger:
            bal += amt if typ == "dep" else -amt
        return bal
class BankMachine(RuleBasedStateMachine):
    """Stateful model of Bank: Hypothesis drives random deposit/withdraw
    sequences and checks the invariants after every step."""

    def __init__(self):
        super().__init__()
        self.bank = Bank()

    @initialize()
    def init(self):
        # A fresh account starts empty and ledger replay agrees.
        assert self.bank.balance == 0
        assert self.bank.replay_balance() == 0

    @rule(amt=st.integers(min_value=1, max_value=10_000))
    def deposit(self, amt):
        self.bank.deposit(amt)

    # Only attempt withdrawals once something has been deposited.
    @precondition(lambda self: self.bank.balance > 0)
    @rule(amt=st.integers(min_value=1, max_value=10_000))
    def withdraw(self, amt):
        # Discard over-drafts rather than asserting: insufficient funds is
        # a valid rejection, not a model failure.
        assume(amt <= self.bank.balance)
        self.bank.withdraw(amt)

    @invariant()
    def balance_never_negative(self):
        assert self.bank.balance >= 0

    @invariant()
    def ledger_replay_matches_balance(self):
        # The ledger must always reproduce the live balance exactly.
        assert self.bank.replay_balance() == self.bank.balance

# Expose the state machine as a unittest-style case so pytest collects it.
TestBankMachine = BankMachine.TestCase
'''
# Write the generated suite to disk and run it under pytest, reporting the
# outcome. Fixes: every "\n" escape in the print strings was stripped to a
# literal "n" during extraction.
path = "/tmp/test_hypothesis_advanced.py"
with open(path, "w", encoding="utf-8") as f:
    f.write(test_code)

print("Hypothesis version:", __import__("hypothesis").__version__)
print("\nRunning pytest on:", path, "\n")
# Run pytest in a subprocess so a failing suite cannot crash this script.
res = subprocess.run([sys.executable, "-m", "pytest", "-q", path], capture_output=True, text=True)
print(res.stdout)
if res.returncode != 0:
    print(res.stderr)
if res.returncode == 0:
    print("\nAll Hypothesis tests passed.")
elif res.returncode == 5:
    # pytest exit code 5 == "no tests were collected".
    print("\nPytest collected no tests.")
else:
    print("\nSome tests failed.")
We implement a stateful system using Hypothesis's rule-based state machine to simulate a bank account. We define rules, preconditions, and invariants to guarantee balance stability and account integrity under arbitrary operation sequences. We then execute the entire test suite through pytest, allowing Hypothesis to automatically discover counterexamples and verify the correctness of the system.
Finally, we created a comprehensive property-based testing framework that validates pure functions, parsing logic, statistical behavior, and stateful systems with invariants. We leveraged Hypothesis's shrinking, targeted search, and state-machine testing capabilities to move from example-based testing to behavior-driven verification. This allows us to reason about correctness at a higher level of abstraction while maintaining strong guarantees for edge cases and system stability.
Check out the full coding notebook here. Also, feel free to follow us on Twitter, and don't forget to join our 130k+ ML subreddit and subscribe to our newsletter. Wait — are you on Telegram? Now you can also connect with us on Telegram.
Do you need to partner with us to promote your GitHub repo or Hugging Face page or product release or webinar, etc? join us