Author(s): Mohammed Bedhiafi
Originally published on Towards AI.
Moving beyond brittle fallback rules toward a mathematically principled, learnable healing system.
Your test suite passes on Friday. On Monday, a developer renames a CSS class and tweaks the form layout – and suddenly 40 automated tests are failing. Not because the app is broken, but because your XPaths did.
This is the quiet crisis of large-scale test automation. And the industry-standard answer – self-healing – sounds smarter than it actually is.
Why does heuristic healing fail?
Most self-healing tools work like this: when an XPath fails, run through a ranked list of fallback strategies.
1. Try by ID
2. Try by name attribute
3. Try by visible text
4. Try by CSS class
...
This seems reasonable. In practice, it’s a house of cards – and the failure modes are structural, not accidental.
Static priority is context-blind. An ID match seems like a slam dunk – unless your frontend auto-generates IDs like input_3847291. Class names seem stable until a design-system refactor renames btn-primary to button--cta. No fixed heuristic knows which signals are reliable in your specific application.
Strategies are evaluated in isolation. A good healing decision combines several cues: the text looks right and the original tag matches and the depth is the same. Heuristic ladders evaluate signals one at a time and stop at the first hit – they cannot combine evidence.
There is no learning. Every healing attempt is stateless. The system makes the same mistakes over and over, never improving from its growing history of successes and failures.
Confidence is binary. Either the fallback finds something or it doesn’t. There are no probability scores, no uncertainty estimates, no thresholds to tune.
The basic problem: heuristic healing is a lookup table pretending to be intelligent.
Formally Redefining the Problem
Let’s state the problem precisely, because precision is where the solution lies.
At time t−1, an XPath correctly targets a test element e* in the DOM D_{t−1}. At time t, the DOM has evolved into D_t, and the XPath no longer resolves.
Formally, we want:
ê = argmax P(e = e* | D_t, φ*)
e ∈ D_t
Here φ* is a feature representation extracted from the original element – its attributes, visible text, structural position, and surrounding context. This is not a lookup problem. It is a ranking problem over a probability distribution on the current DOM.
Step 1 – Candidate Set Creation
Before doing any scoring, we prune the search space. Evaluating every element in a modern DOM (often 1,000+ nodes) is wasteful and creates noise. We define a candidate filter:
C = { e ∈ D_t | candidate_filter(e, φ*) = 1 }
The filter is intentionally cheap and permissive – same tag, matching type or role, any text overlap. Its job is elimination, not selection. Selection is handled by the probabilistic model.
def build_candidate_set(dom, target):
    # Cheap, permissive filter: same tag plus any attribute or text overlap
    return (
        el for el in dom.find_all()
        if el.name == target.tag and (
            any(el.get(a) == target.attrs.get(a)
                for a in ('type', 'role', 'name') if target.attrs.get(a))
            or any(w in el.get_text() for w in target.visible_text.split()[:3])
        )
    )
Step 2 – Feature Representation
The heart of the system is a function ψ(e, φ*) that maps each candidate element, together with the snapshot of the original element, into a numerical feature vector:
x_e = ψ(e, φ*) ∈ ℝᵈ
We decompose it into four groups of signals.
A – attribute similarity
We compare element attributes between the candidate and the original target. Key features:
ID exact match:
x_id = 𝟙(id_e = id*)
Class Jaccard similarity – measures the set overlap between class lists:
x_class = |C_e ∩ C*| / |C_e ∪ C*|
Fuzzy attribute matching for names, placeholders, and aria-labels using sequence similarity. Together these form the attribute sub-vector:
x_attr = (x_id, x_class, x_name, x_type, x_placeholder, ...)
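As a minimal sketch of this attribute sub-vector (the helper name attribute_features and the plain-dict input format are illustrative assumptions, not the article's actual extractor):

```python
from difflib import SequenceMatcher

def attribute_features(cand_attrs, target_attrs):
    # cand_attrs / target_attrs: dicts like {'id': ..., 'class': [...], 'name': ...}
    # x_id = 1(id_e = id*): exact ID match
    x_id = float(cand_attrs.get('id') is not None
                 and cand_attrs.get('id') == target_attrs.get('id'))

    # x_class = |C_e ∩ C*| / |C_e ∪ C*|: Jaccard over class lists
    c_e = set(cand_attrs.get('class') or [])
    c_t = set(target_attrs.get('class') or [])
    union = c_e | c_t
    x_class = len(c_e & c_t) / len(union) if union else 0.0

    # Fuzzy matching for name-like attributes (difflib ratio in [0, 1])
    def fuzzy(a, b):
        return SequenceMatcher(None, a or '', b or '').ratio()

    x_name = fuzzy(cand_attrs.get('name'), target_attrs.get('name'))
    x_placeholder = fuzzy(cand_attrs.get('placeholder'), target_attrs.get('placeholder'))
    return [x_id, x_class, x_name, x_placeholder]
```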
B – text similarity
Visible text is often the most stable signal – labels survive refactors because users read them. We use two measures:
exact match:
x_text_exact = 𝟙(T_e = T*)
Semantic cosine similarity via sentence embeddings – captures meaning even when the wording changes slightly:
x_text_cos = v(T_e) · v(T*) / (‖v(T_e)‖ ‖v(T*)‖)
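For illustration, here is a self-contained stand-in: exact match plus a bag-of-words cosine. In the real system v(T) would come from a sentence-transformer embedding; text_features and cosine_sim are sketch names only.

```python
import math
from collections import Counter

def cosine_sim(vec_a, vec_b):
    # Cosine similarity between two sparse vectors (token -> count dicts)
    dot = sum(vec_a[t] * vec_b.get(t, 0) for t in vec_a)
    na = math.sqrt(sum(v * v for v in vec_a.values()))
    nb = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def text_features(cand_text, target_text):
    # x_text_exact = 1(T_e = T*)
    x_exact = float(cand_text.strip() == target_text.strip())
    # Bag-of-words stand-in for the sentence embedding v(T)
    x_cos = cosine_sim(Counter(cand_text.lower().split()),
                       Counter(target_text.lower().split()))
    return [x_exact, x_cos]
```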
C – structural similarity
Where an element lives in the DOM tree is often more stable than its attributes. We model the DOM as a graph G = (V, E) and extract:
Depth difference – normalized distance from the root:
x_depth = |depth(e) - depth(e*)|
Parent tag match:
x_parent = 𝟙(tag(parent(e)) = tag(parent(e*)))
Tree edit distance over local subtrees, capturing structural neighborhoods:
x_tree = TreeEditDistance(subtree(e), subtree(e*))
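A minimal sketch of these structural features, using a toy Node class in place of real DOM elements (bs4 Tags in practice). A full tree edit distance is expensive, so a cheap Jaccard over child tag names stands in for the article's TreeEditDistance – all names here are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Minimal stand-in for a DOM element
    tag: str
    parent: 'Node | None' = None
    children: list = field(default_factory=list)

def depth(node):
    d = 0
    while node.parent is not None:
        node, d = node.parent, d + 1
    return d

def structural_features(cand, target, max_depth=50):
    # x_depth = |depth(e) - depth(e*)|, normalized by an assumed max depth
    x_depth = abs(depth(cand) - depth(target)) / max_depth
    # x_parent = 1(tag(parent(e)) = tag(parent(e*)))
    x_parent = float(cand.parent is not None and target.parent is not None
                     and cand.parent.tag == target.parent.tag)
    # Cheap stand-in for tree edit distance: Jaccard over child tag names
    a = {c.tag for c in cand.children}
    b = {c.tag for c in target.children}
    x_tree = len(a & b) / len(a | b) if (a | b) else 1.0
    return [x_depth, x_parent, x_tree]
```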
D – context similarity
The label next to a button, the title above a form – nearby text is often the most durable anchor of all. We define the k-neighbourhood N_k(e) as sibling and parent text, then compute:
x_context = cos(v(context_e), v(context*))
The full feature vector is the combination of all four groups:
x_e = ( x_attr | x_text | x_struct | x_context )
Step 3 – Probabilistic Ranking Model
This is where the approach fundamentally diverges from the heuristic one.
Instead of scoring candidates independently, we rank them against each other using a softmax ranking model (also known as the Plackett–Luce model). For a healing event with candidate set C = {e₁, e₂, …, eₙ}, the probability that eᵢ is the correct match is:
P(eᵢ | C) = exp(wᵀ x_eᵢ) / Σⱼ exp(wᵀ x_eⱼ)
This formulation has an important property: the scores compete within the candidate set. The probabilities always sum to 1 and reflect relative confidence – not absolute similarity against some fixed threshold. The model knows that picking e₃ means not picking e₁ or e₂.
The decision rule is then clear and interpretable:
ê = argmax wᵀ x_e (selection)
e ∈ C
Confidence = P(ê | C) (uncertainty quantification)
And we heal conditionally:
Heal only if Confidence > τ
The threshold τ becomes a tunable dial between precision and recall – something heuristic systems cannot easily offer.
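The selection, confidence, and threshold rules above can be sketched in a few lines (rank_candidates is an assumed helper name; it takes the precomputed linear scores s_e = wᵀx_e):

```python
import math

def rank_candidates(scores, tau=0.8):
    # Softmax over per-candidate scores: P(e_i | C) = exp(s_i) / sum_j exp(s_j)
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Decision rule: heal only if the winner's probability clears tau
    best = max(range(len(scores)), key=probs.__getitem__)
    decision = 'heal' if probs[best] > tau else 'escalate'
    return best, probs[best], decision
```

Note that with two near-identical scores the winner's probability approaches 0.5, so the system escalates rather than guessing – exactly the behavior a first-hit heuristic ladder cannot express.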
Step 4 – Training the Model
We learn the weight vector w from historical healing events. Each event contributes one correct element (label = 1) and several incorrect candidates (label = 0).
The training objective is the cross-entropy loss on the correct element within each candidate set:
L = -Σᵢ log P(e_correct | Cᵢ)
With L2 regularization to prevent overfitting:
L_total = -Σᵢ log P(e_correct | Cᵢ) + λ ‖w‖²
In practice this can be implemented as logistic regression on the feature matrix:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # X: (n_candidates, d) feature matrix
model = LogisticRegression(C=1.0 / lam, class_weight='balanced')  # lam = regularization λ
model.fit(X_scaled, y)  # y: binary labels
After training, inspect model.coef_ to see what the model actually learned. If x_class_jaccard carries a large weight, your app uses stable class names. If x_context_cosine dominates, surrounding labels are your most reliable anchors. The learned weights are a diagnostic – they reveal real facts about the stability structure of your frontend.
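A tiny helper makes this audit concrete (weight_report and the feature names below are illustrative; in practice the names come from your ψ implementation and the coefficients from model.coef_[0]):

```python
def weight_report(feature_names, coefs, top_k=3):
    # Pair each feature with its learned weight, sorted by influence (|weight|)
    ranked = sorted(zip(feature_names, coefs), key=lambda p: -abs(p[1]))
    return ranked[:top_k]
```

Running it on a trained model gives a one-glance answer to "which signals does my frontend actually keep stable?".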
Step 5 – Bayesian Prior Integration
One final refinement. Not all element types are equally stable. Buttons survive refactors better than auto-generated list items. Form inputs last longer than decorative divs.
We encode this as a prior over element types – historical healing success rates – and add it to the score via a log-odds adjustment:
score(e) = wᵀ x_e + log P(e)
Where:
P(e) = (successes_for_tag + α) / (total_for_tag + 2α) (Laplace smoothed)
This is Bayesian inference in its plainest form: the likelihood from our feature model combined with a prior from historical experience. The system doesn’t just learn how to match elements – it learns which types of elements are trustworthy.
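A minimal sketch of the Laplace-smoothed prior (tag_prior_logodds and the history format are assumptions for illustration):

```python
import math

def tag_prior_logodds(tag, history, alpha=1.0):
    # history: dict tag -> (successes, total) from past healing events
    # P(e) = (successes + alpha) / (total + 2*alpha), Laplace smoothed
    successes, total = history.get(tag, (0, 0))
    p = (successes + alpha) / (total + 2 * alpha)
    return math.log(p)  # added to the linear score w·x_e
```

An unseen tag falls back to a neutral prior of 0.5, so new element types are neither favored nor penalized until evidence accumulates.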
The complete pipeline
Putting it all together, the healing system becomes a five-stage pipeline:
Given: failed XPath, previous DOM snapshot, current DOM
1. φ* ← extract_snapshot(failed_xpath, previous_dom)
2. C ← candidate_filter(current_dom, φ*)
3. For each e ∈ C: x_e = ψ(e, φ*)
4. s_e = wᵀ x_e + log P(e)
5. P(e|C) = exp(s_e) / Σⱼ exp(sⱼ)
ê = argmax s_e
Heal if P(ê|C) > τ, else escalate
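The five steps above can be sketched end to end (heal, featurize, and log_prior are assumed names; the candidate set, feature function, learned weights, and prior are injected so the skeleton stays self-contained):

```python
import math

def heal(candidates, featurize, weights, log_prior, tau=0.8):
    # candidates: pre-filtered element list (step 2 output)
    # featurize:  e -> feature vector x_e, i.e. psi(e, phi*) from step 3
    # log_prior:  e -> log P(e), the Bayesian tag prior
    # Step 4: linear score plus log-prior
    scores = [sum(w * x for w, x in zip(weights, featurize(e))) + log_prior(e)
              for e in candidates]
    # Step 5: softmax over the candidate set (max-subtracted for stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [v / z for v in exps]
    best = max(range(len(candidates)), key=probs.__getitem__)
    if probs[best] > tau:
        return candidates[best], probs[best]   # heal
    return None, probs[best]                   # escalate to a human
```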
The entire system is now:
- Learned – it trains on the actual healing history of your application
- Probabilistic – each decision comes with a calibrated confidence score
- Interpretable – feature weights show which signals matter in your codebase
- Adaptive – re-train on new events and the model evolves with your frontend
What this unlocks in practice
Confidence thresholds become meaningful. Set τ = 0.80 in production for high-precision healing. Set τ = 0.50 in CI for higher coverage. The tradeoff is explicit and controllable.
Human-in-the-loop review. When confidence is low, present the top 3 candidates with their probabilities to a reviewer. Uncertainty triggers the right action, not a silent wrong one.
Continuous learning. Every healing event – success or failure – is training data. The model improves without anyone writing new rules.
Feature audit. After training, the weight vector tells you something real: which signals in your application are stable, which are not, and where your test selectors are most fragile.
Closing thoughts
Heuristic self-healing sounds clever until it isn’t. It’s a collection of fragile assumptions about what should be stable – applied identically across all applications, each of which is stable in its own particular way.
The probabilistic formulation restates the question with mathematical honesty: given all the available evidence, how confident are we that this is the correct element? It answers that question with a learnable model, a calibrated probability, and a tunable threshold.
Your DOM is not a static artifact. Your healing system shouldn’t be either.
Implementation stack: Python 3.10+, scikit-learn, sentence-transformers, BeautifulSoup4. The ranking model is a conditional logit (Plackett–Luce family); prior integration follows standard Bayesian log-odds scoring with Laplace smoothing.
