Performance & Production Guide
===============================

This guide is written for **quant traders, risk engineers, and production
system architects** who need to know exactly how fast TensorQuantLib is, where
the bottlenecks are, and how to tune every knob.

All timings were measured on **Apple M1 (8-core, 8 GB RAM), Python 3.11,
NumPy linked to Apple Accelerate**.  Expect similar or better numbers on a
modern x86-64 server with an AVX-512-enabled BLAS.

.. contents:: On this page
   :local:
   :depth: 2


At-a-Glance Latency Table
--------------------------

.. list-table::
   :header-rows: 1
   :widths: 28 18 18 36

   * - Function / Workflow
     - Single-call latency
     - Throughput
     - Notes
   * - ``black_scholes``
     - < 5 µs
     - > 200 K/s
     - Fully analytic, log+exp only
   * - ``bs_greeks``
     - < 5 µs
     - > 200 K/s
     - Analytic Delta/Gamma/Theta/Vega/Rho
   * - ``second_order_greeks``
     - ~1 ms
     - ~1 K/s
     - 4-point finite-diff bump × 2 params
   * - ``implied_vol``
     - < 1 ms
     - ~2 K/s
     - Brent solver, ~10 iterations
   * - ``heston_price`` (CF)
     - ~1 ms
     - ~1 K/s
     - 100-pt Gaussian quadrature
   * - ``heston_price_mc`` (QE)
     - 200–500 ms
     - 2–5/s
     - 100 K paths × 252 steps
   * - ``american_option_lsm``
     - 100–500 ms
     - 2–10/s
     - 50 K paths × 100 steps
   * - ``asian_price_mc``
     - 100–400 ms
     - 2–10/s
     - 50 K paths × 252 steps
   * - ``asian_price_cv``
     - 50–200 ms
     - 5–20/s
     - Control variate; 10× variance reduction
   * - ``barrier_price``
     - < 5 µs
     - > 200 K/s
     - Rubinstein-Reiner closed form
   * - ``HestonCalibrator.fit``
     - 5–15 s
     - —
     - Default: 3 restarts, 500 iterations
   * - ``TTSurrogate`` build (3D, n=15)
     - 2 ms
     - —
     - One-time cost; paid at startup
   * - ``TTSurrogate.evaluate`` (3D)
     - 1.5 µs
     - 650 K/s
     - After build; multi-linear interp on TT cores
   * - ``TTSurrogate.evaluate`` (5D)
     - ~5 µs
     - ~200 K/s
     - After build


Choosing the Right Pricer
--------------------------

.. code-block:: text

   Need a price RIGHT NOW (< 10 µs)?
   ├── Vanilla (single asset)  →  black_scholes / bs_greeks
   ├── Barrier (single asset)  →  barrier_price
   ├── Vasicek / CIR bond      →  vasicek_bond_price / cir_bond_price
   └── FX vanilla              →  garman_kohlhagen

   Need something slightly slower but model-richer?
   ├── Heston (single price)   →  heston_price          (~1 ms)
   ├── Implied vol             →  implied_vol            (~1 ms)
   ├── SABR smile              →  sabr_implied_vol       (< 10 µs)
   └── SVI surface             →  svi_implied_vol        (< 10 µs)

   Monte Carlo is unavoidable?
   ├── American option         →  american_option_lsm   (100 ms)
   ├── Asian (with VR!)        →  asian_price_cv        (50 ms)
   └── Heston stress-test      →  heston_price_mc       (200 ms)

   Many strikes / expiries to price daily?
   └── Build a TTSurrogate once at market open, query at µs latency forever.


.. _batch-pricing:

Batch / Vectorised Pricing
---------------------------

Every scalar pricer accepts plain Python floats and works inside
``np.vectorize`` or a list comprehension with no code change.

**Price a 1 000-strike chain in one shot:**

.. code-block:: python

   import numpy as np
   from tensorquantlib import black_scholes

   strikes = np.linspace(80, 120, 1000)
   S, T, r, sigma = 100.0, 1.0, 0.05, 0.2

   # np.vectorize adds basically zero overhead for C-speed functions
   bs_vec = np.vectorize(black_scholes)
   prices = bs_vec(S, strikes, T, r, sigma, option_type="call")
   # ~5 ms for 1 000 strikes — 200 K/s effective throughput

**Price a full Heston surface (5 strikes × 3 expiries) at once:**

.. code-block:: python

   import numpy as np
   from tensorquantlib import heston_price, HestonParams

   params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)
   K_grid = np.array([90., 95., 100., 105., 110.])
   T_grid = np.array([0.5, 1.0, 2.0])
   S, r = 100.0, 0.05

   # 15 calls × ~1 ms each ≈ 15 ms
   surface = np.array([
       [heston_price(S, K, T, r, params) for T in T_grid]
       for K in K_grid
   ])   # shape (5, 3)

**Even faster: TTSurrogate replaces the surface loop entirely:**

.. code-block:: python

   from tensorquantlib import TTSurrogate, heston_price, HestonParams
   import numpy as np

   params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)

   # Build once at startup (~2 ms for 3D)
   surr = TTSurrogate.from_function(
       fn=lambda s, k, t: heston_price(s, k, t, 0.05, params),
       axes=[
           np.linspace(80, 120, 15),   # S axis
           np.linspace(85, 115, 15),   # K axis
           np.linspace(0.25, 2.0, 15), # T axis
       ],
       eps=1e-3,
       max_rank=30,
   )

   # Now price 10 000 spot/strike/expiry combos in < 15 ms
   pts = np.column_stack([
       np.random.uniform(80, 120, 10_000),
       np.random.uniform(85, 115, 10_000),
       np.random.uniform(0.25, 2.0, 10_000),
   ])
   prices = surr.evaluate(pts)   # vectorised; ~1.5 µs/call = 650 K/s


TT-Rank and ``n_points`` Tuning
---------------------------------

These are the two levers that control the accuracy/speed trade-off in every
``TTSurrogate``.

Grid resolution: ``n_points``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``n_points`` is the number of grid points **per axis**.  The full tensor has
``n_points ** d`` entries; TT compresses this exponentially.

.. list-table::
   :header-rows: 1
   :widths: 15 15 20 20 30

   * - Dimensions (d)
     - Recommended ``n_points``
     - Full grid size
     - Build time (approx)
     - Notes
   * - 2–3
     - 20–30
     - 400–27 K entries
     - < 5 ms
     - Dense enough for smooth Heston/BS surfaces
   * - 4–5
     - 15–20
     - 50 K–3 M entries
     - 10–100 ms (use TT-Cross)
     - TT-Cross recommended; skip full grid construction
   * - 6+
     - 10–15
     - 1 M+ (never built)
     - N/A (TT-Cross only)
     - ``TTSurrogate.from_function`` with ``max_rank=20``

.. tip::

   For 6+ dimensional baskets use ``TTSurrogate.from_function``.
   It calls the pricing function at only
   ``d × r² × n ≈ 6 × 400 × 15 = 36,000`` carefully chosen points instead of
   the ``15^6 = 11 M`` full grid — a **300× speedup in build time**.

SVD tolerance: ``eps``
~~~~~~~~~~~~~~~~~~~~~~~

``eps`` is the *relative Frobenius error* allowed in TT-SVD.  Tighter ``eps``
means higher ranks and more memory.

.. list-table::
   :header-rows: 1
   :widths: 12 15 15 15 43

   * - ``eps``
     - Max rank (3D)
     - TT memory
     - Compression
     - When to use
   * - ``1e-1``
     - 2–3
     - ~2 KB
     - 100×
     - Coarse smile shape only; not for pricing
   * - ``1e-2``
     - 8
     - ~17 KB
     - 13×
     - Pre-trade screening / large portfolio VaR
   * - ``1e-3`` ✅
     - 23
     - ~124 KB
     - 1.7×
     - **Production sweet spot** — 42× compression at 5D, < 0.1% error
   * - ``1e-4``
     - 30
     - ~225 KB
     - ~1×
     - High-fidelity validation; matches CF to machine precision
   * - ``1e-6``
     - Full rank
     - = full grid
     - 1×
     - No benefit; use direct evaluation instead

.. code-block:: python

   # Production-quality 3-asset surrogate
   surr = TTSurrogate.from_basket_analytic(
       S0_ranges=[(80, 120)] * 3,
       K=100, T=1.0, r=0.05,
       sigma=[0.2, 0.25, 0.3],
       weights=[1/3, 1/3, 1/3],
       n_points=20,    # ← sweet spot for 3D
       eps=1e-3,       # ← production tolerance
       max_rank=30,    # ← safety cap; rarely reached at eps=1e-3
   )
   surr.print_summary()
   # Expected: max_rank ≈ 23, memory ≈ 124 KB, compression ≈ 1.7×

Hard-capping ranks: ``max_rank``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Setting ``max_rank`` prevents runaway memory on poorly-conditioned surfaces
(e.g., near-digital payoffs or very short maturities).  Recommended values:

- ``max_rank=20``:  fast / light; suitable for pre-trade screening
- ``max_rank=30``:  production default
- ``max_rank=50``:  near-exact; use for EOD risk reports
- ``max_rank=None``:  unbounded; only for validation

``n_sweeps`` in TT-Cross
~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``TTSurrogate.from_function`` (TT-Cross), ``n_sweeps`` controls
how many alternating left/right passes are made over the dimension chain.

- ``n_sweeps=4``:  sufficient for smooth Black-Scholes surfaces
- ``n_sweeps=6``:  default; good for Heston and jump-diffusion
- ``n_sweeps=8``:  for stiff or highly correlated surfaces


Heston Calibration: From 10 s to < 2 s
-----------------------------------------

The default ``HestonCalibrator.fit`` with ``n_restarts=3, maxiter=500``
takes **5–15 seconds** because it runs hundreds of L-BFGS-B optimizer
steps, and each step calls ``heston_price`` (1 ms) for every
``(K, T)`` pair in the market surface.

**Breakdown for a 5 × 4 surface (20 options):**

- ~1 ms per ``heston_price`` call (100-pt CF quadrature)
- ~15 ms per objective evaluation (15 ``heston_price`` + 15 ``implied_vol``)
- ~200 objective evals per L-BFGS-B run
- × 3 restarts = ~200 × 15 ms × 3 = **9 s**

Speedup strategy 1 — Single restart with warm start
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In production you re-calibrate daily (or tick-by-tick).  Yesterday's
parameters are an excellent starting point:

.. code-block:: python

   import numpy as np
   from tensorquantlib.finance.heston import HestonCalibrator, HestonParams

   # Day T-1 calibrated params (persisted to disk / Redis)
   prev_params = HestonParams(kappa=2.1, theta=0.042, xi=0.31, rho=-0.68, v0=0.038)

   cal = HestonCalibrator(S=100.0, r=0.05)
   cal.params_ = prev_params   # warm start

   K_grid = np.array([90., 95., 100., 105., 110.])
   T_grid = np.array([0.5, 1.0, 2.0])
   iv_mkt = np.full((5, 3), 0.20)   # replace with real market IVs

   # 1 restart instead of 3 → 3× faster
   # 200 maxiter instead of 500 → 2.5× faster
   cal.fit(iv_mkt, K_grid, T_grid, n_restarts=1, maxiter=200)
   print(f"Calibrated RMSE: {cal.rmse_:.6f}")
   # Typical wall-clock: ~1.5 s

Speedup strategy 2 — Parallel restarts with joblib
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Run multiple random-start optimisations simultaneously; keep the best:

.. code-block:: python

   import numpy as np
   from scipy.optimize import minimize
   from joblib import Parallel, delayed
   from tensorquantlib import heston_price, implied_vol, HestonParams

   S, r = 100.0, 0.05
   K_grid = np.array([90., 95., 100., 105., 110.])
   T_grid = np.array([0.5, 1.0, 2.0])
   iv_mkt = np.full((5, 3), 0.20)

   BOUNDS = [(0.1, 10), (0.001, 0.5), (0.1, 2.0), (-0.99, 0.99), (0.001, 0.5)]

   def _objective(x):
       params = HestonParams(*x)
       err = 0.0
       for i, K in enumerate(K_grid):
           for j, T in enumerate(T_grid):
               try:
                   price = heston_price(S, K, T, r, params)
                   iv_m = implied_vol(price, S, K, T, r)
                   err += (iv_m - iv_mkt[i, j]) ** 2
               except Exception:
                   err += 1.0
       return err

   def _one_restart(seed):
       rng = np.random.default_rng(seed)
       x0 = np.array([rng.uniform(lo, hi) for lo, hi in BOUNDS])
       return minimize(_objective, x0, method="L-BFGS-B",
                       bounds=BOUNDS, options={"maxiter": 200})

   # 4 parallel restarts — wall-clock ≈ 1 restart time (not 4×)
   results = Parallel(n_jobs=4)(delayed(_one_restart)(s) for s in range(4))
   best = min(results, key=lambda r: r.fun)
   params = HestonParams(*best.x)
   print(f"Best RMSE: {np.sqrt(best.fun / (len(K_grid)*len(T_grid))):.6f}")

Speedup strategy 3 — Reduce the CF integration cost
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``heston_price`` uses ``n_points=100`` Gaussian quadrature nodes by default.
For calibration purposes 50 nodes are just as accurate and twice as fast:

.. code-block:: python

   from functools import partial
   from tensorquantlib.finance.heston import heston_price as _hp

   # Monkey-patch a faster version for calibration
   fast_heston = partial(_hp, n_points=50)

   # Use fast_heston inside your objective instead of heston_price
   price = fast_heston(S=100, K=100, T=1.0, r=0.05, params=params)

Speedup strategy 4 — Coarsen the IV grid
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Calibrate to fewer points; check residual on the full grid:

.. code-block:: python

   # Coarse calibration grid: 4 strikes × 3 expiries = 12 evaluations/step
   K_calib = np.array([90., 97., 103., 110.])
   T_calib = np.array([0.5, 1.0, 2.0])

   cal.fit(iv_mkt_coarse, K_calib, T_calib, n_restarts=1, maxiter=200)

   # Validate RMSE on full 10×5 grid
   iv_model = np.array([
       [implied_vol(heston_price(S, K, T, r, cal.params_), S, K, T, r)
        for T in T_grid_full]
       for K in K_grid_full
   ])
   print("Full-grid RMSE:", np.sqrt(np.mean((iv_model - iv_mkt_full)**2)))

**Summary of speedups:**

.. list-table::
   :header-rows: 1
   :widths: 35 20 20 25

   * - Strategy
     - Wall-clock
     - Speedup
     - Accuracy impact
   * - Default (n_restarts=3, maxiter=500)
     - ~10 s
     - 1×
     - Baseline
   * - Single restart + warm start
     - ~1.5 s
     - ~7×
     - Negligible if params are stable
   * - Parallel restarts (4 cores) + warm start
     - ~1.0 s
     - ~10×
     - As good as 4 restarts
   * - n_points=50 in CF
     - ~5 s
     - ~2×
     - < 0.1 bp IV error
   * - All combined
     - **< 0.5 s**
     - **~20×**
     - < 0.5 bps RMSE


Memory Profiling
-----------------

Use Python's built-in ``tracemalloc`` to measure memory allocation of any
pricing or surrogate workflow:

.. code-block:: python

   import tracemalloc
   import numpy as np
   from tensorquantlib import TTSurrogate

   tracemalloc.start()

   surr = TTSurrogate.from_basket_analytic(
       S0_ranges=[(80, 120)] * 4,
       K=100, T=1.0, r=0.05,
       sigma=[0.2, 0.22, 0.25, 0.28],
       weights=[0.25] * 4,
       n_points=15,
       eps=1e-3,
   )

   current, peak = tracemalloc.get_traced_memory()
   tracemalloc.stop()

   print(f"Current: {current / 1024:.1f} KB")
   print(f"Peak:    {peak / 1024:.1f} KB")
   # Typical output for 4-asset eps=1e-3:
   #   Current: 91.4 KB
   #   Peak:    ~500 KB  (includes grid construction + SVD temporaries)

**Expected memory footprint (eps=1e-3, n_points=15):**

.. list-table::
   :header-rows: 1
   :widths: 15 18 18 16 33

   * - Assets (d)
     - Full grid (MB)
     - TT size (KB)
     - Peak alloc
     - Notes
   * - 2
     - < 0.01
     - 3 KB
     - ~50 KB
     - Negligible
   * - 3
     - 0.03
     - 28 KB
     - ~200 KB
     - Fits in L1 cache
   * - 4
     - 0.39
     - 91 KB
     - ~500 KB
     - 4× compression
   * - 5
     - 5.79 MB
     - 142 KB
     - ~2 MB
     - **42× compression — TT wins decisively**
   * - 6
     - 92 MB
     - ~200 KB
     - ~4 MB
     - Full grid never materialised (TT-Cross)
   * - 10
     - 576 GB
     - ~400 KB
     - ~8 MB
     - Impossible without TT

The TT surrogate's ``print_summary()`` method reports memory automatically:

.. code-block:: python

   surr.print_summary()
   # ──────────────────────────────────────────────────────────────────
   # TTSurrogate  dims=4  n_points=15  eps=1e-3  max_rank=30
   # Ranks      : [1, 22, 25, 18, 1]
   # TT memory  : 91.4 KB        Full-grid equivalent : 393.8 KB
   # Compression: 4.3×           Max rank : 25
   # Build time : 10.2 ms        Compress time : 2.1 ms
   # ──────────────────────────────────────────────────────────────────


Monte Carlo Variance Reduction: Which Method When
---------------------------------------------------

All MC pricers default to ``n_paths=100_000``.  The table below shows the
variance reduction you get and when to use each method.

.. list-table::
   :header-rows: 1
   :widths: 25 18 18 39

   * - Method
     - Variance reduction
     - ``n_paths`` to match crude MC 100 K
     - Best for
   * - ``bs_price_antithetic``
     - ~2×
     - 50 K
     - Always-on default; free variance reduction
   * - ``bs_price_cv``
     - 10–100×
     - 1 K–10 K
     - Asian options (use geometric Asian as control)
   * - ``bs_price_qmc``
     - O(1/N) vs O(1/√N)
     - 8 K (power of 2)
     - Near-ATM vanilla / mild path dependency
   * - ``bs_price_importance``
     - 10–1000×
     - 100–1 K
     - Deep OTM options (>3 sigma)
   * - ``bs_price_stratified``
     - 2–10×
     - 10 K–50 K
     - General purpose when QMC is unavailable

.. code-block:: python

   from tensorquantlib import (
       asian_price_mc,   # crude MC baseline
       asian_price_cv,   # with geometric-Asian control variate
       bs_price_qmc,     # Sobol QMC
   )

   # Compare methods at equal compute budget (n_paths=10_000)
   price_crude, se_crude = asian_price_mc(
       S=100, K=100, T=1.0, r=0.05, sigma=0.2,
       n_paths=10_000, return_stderr=True
   )
   price_cv, se_cv = asian_price_cv(
       S=100, K=100, T=1.0, r=0.05, sigma=0.2,
       n_paths=10_000, return_stderr=True
   )

   print(f"Crude MC : {price_crude:.4f} ± {se_crude:.4f}")
   print(f"Control V: {price_cv:.4f} ± {se_cv:.4f}")
   # Typical:
   #   Crude MC : 5.7621 ± 0.0451   (large stderr)
   #   Control V: 5.7608 ± 0.0043   (10× tighter)

**QMC: Always pass a power-of-2 ``n_paths``** to get proper Sobol balance:

.. code-block:: python

   # Good (balance maintained)
   price = bs_price_qmc(S=100, K=100, T=1.0, r=0.05, sigma=0.2,
                        n_paths=65_536)   # 2^16

   # Avoid — triggers UserWarning and slight accuracy degradation
   price = bs_price_qmc(..., n_paths=100_000)


Parallelism: Pricing a Large Portfolio
---------------------------------------

TensorQuantLib has **no internal parallelism** by design — all functions are
pure NumPy and release the GIL.  This makes them safe to call from multiple
threads or processes simultaneously.

**Thread-parallel strike chain (I/O-light, GIL-free):**

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor
   import numpy as np
   from tensorquantlib import heston_price, HestonParams

   params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)
   strikes = np.linspace(70, 130, 200)

   def price_one(K):
       return heston_price(S=100, K=K, T=1.0, r=0.05, params=params)

   with ThreadPoolExecutor(max_workers=8) as pool:
       prices = list(pool.map(price_one, strikes))
   # 200 Heston prices × ~1 ms / 8 cores ≈ 25 ms end-to-end

**Process-parallel for calibration restarts** (CPU-bound — use processes):

.. code-block:: python

   from concurrent.futures import ProcessPoolExecutor

   def calibrate_one(seed):
       # ... same as _one_restart() above ...
       pass

   with ProcessPoolExecutor(max_workers=4) as pool:
       results = list(pool.map(calibrate_one, range(4)))
   best = min(results, key=lambda r: r.fun)

.. note::

   **GPU / CUDA**: TensorQuantLib is pure NumPy.  There is no built-in GPU
   support today, but pricing surfaces and TT-SVD operations are compatible
   with ``cupy.ndarray`` as a drop-in replacement for ``numpy.ndarray`` in
   most functions.  Full CuPy integration is on the roadmap (see
   :doc:`limitations`).


Production Configuration Checklist
------------------------------------

Use this checklist before deploying TensorQuantLib in a production trading or
risk system:

**Daily startup**

.. code-block:: bash

   # 1. Verify environment
   python -m tensorquantlib --version

   # 2. Run smoke tests
   python -m pytest tests/ -q --tb=short -x -m "not slow"

**Parameter selection**

.. list-table::
   :header-rows: 1
   :widths: 40 30 30

   * - Setting
     - Development / backtest
     - Production / real-time
   * - ``HestonCalibrator n_restarts``
     - 3–5
     - **1** (with warm start)
   * - ``HestonCalibrator maxiter``
     - 500
     - **200**
   * - ``heston_price n_points``
     - 100
     - **50–100** (50 gives < 0.1 bp error)
   * - ``heston_price_mc n_paths``
     - 100 K
     - N/A (use CF for live pricing)
   * - ``TTSurrogate eps``
     - ``1e-4``
     - **``1e-3``**
   * - ``TTSurrogate n_points``
     - 30 (2-3D)
     - **15–20** (3D), **10–15** (4-5D)
   * - ``TTSurrogate max_rank``
     - None
     - **30**
   * - MC ``n_paths`` (vanilla)
     - 100 K
     - **10 K + antithetic** or **QMC 8 K**
   * - MC ``n_paths`` (Asian with CV)
     - 100 K
     - **10 K** (CV reduces SE 10×)
   * - Heston MC ``scheme``
     - ``"qe"``
     - N/A (use ``heston_price`` CF instead)

**Warm-start workflow (recommended)**

.. code-block:: python

   import json, pathlib
   from tensorquantlib.finance.heston import HestonCalibrator, HestonParams

   PARAMS_FILE = pathlib.Path("heston_params.json")

   def load_params() -> HestonParams:
       if PARAMS_FILE.exists():
           d = json.loads(PARAMS_FILE.read_text())
           return HestonParams(**d)
       return HestonParams()   # default

   def save_params(p: HestonParams) -> None:
       PARAMS_FILE.write_text(json.dumps({
           "kappa": p.kappa, "theta": p.theta,
           "xi": p.xi, "rho": p.rho, "v0": p.v0,
       }))

   # At market open
   cal = HestonCalibrator(S=spot, r=risk_free_rate)
   cal.params_ = load_params()          # warm start
   cal.fit(iv_surface, K_grid, T_grid,
           n_restarts=1, maxiter=200)   # ~1.5 s
   save_params(cal.params_)             # persist for tomorrow

**Error handling in production**

.. code-block:: python

   from tensorquantlib import implied_vol, heston_price, HestonParams
   import numpy as np

   def safe_heston_iv(S, K, T, r, params, fallback=np.nan):
       """Return Heston-implied vol; return fallback on any numerical failure."""
       try:
           price = heston_price(S, K, T, r, params)
           if not np.isfinite(price) or price <= 1e-10:
               return fallback
           return implied_vol(price, S, K, T, r)
       except Exception:
           return fallback

**TT surrogate rebuild policy**

Rebuild the surrogate when any of these change:

- Heston / model parameters (after recalibration)
- Risk-free rate > 5 bps shift
- Spot price moves outside the surrogate's domain ± 10 %
- New expiry or strike enters the book

Otherwise, ``surr.evaluate`` is exact within the built tolerance for the
*existing* parameter set.


Interpreting ``print_summary()`` Output
-----------------------------------------

.. code-block:: text

   TTSurrogate  dims=3  n_points=20  eps=1e-4
   Ranks      :  [1, 15, 22, 1]
   TT memory  :  53.6 KB        Full-grid equivalent : 640.0 KB
   Compression:  11.9×          Max rank : 22
   Build time :  1.8 ms         Compress time : 1.2 ms

- **Ranks**: boundary ranks of each TT core.  High middle ranks (> 40) with
  ``eps=1e-3`` suggest a rough or discontinuous payoff — consider increasing
  ``n_points`` or switching to ``from_function`` (TT-Cross).
- **Compression < 1×**: TT uses *more* memory than the dense grid.  This
  means the surface is not low-rank.  Try increasing ``eps`` or reducing
  ``n_points``.
- **Build time >> Compress time**: Most of the cost is in evaluating the
  pricing function on the grid.  Use analytic approximations
  (``from_basket_analytic``) rather than MC (``from_basket_mc``) wherever
  possible.
- **Compress time >> Build time**: The TT-SVD routine is dominant; consider
  reducing ``n_points``.


Profiling Your Own Workflow
-----------------------------

Quick wall-clock profiling with ``cProfile``:

.. code-block:: bash

   python -m cProfile -s cumtime -m tensorquantlib.__main__ heston \
       --S 100 --K 100 --T 1.0 --r 0.05 2>&1 | head -30

Or instrument inline with ``time.perf_counter``:

.. code-block:: python

   import time
   from tensorquantlib import heston_price, HestonParams

   params = HestonParams()
   t0 = time.perf_counter()
   for _ in range(1000):
       heston_price(S=100, K=100, T=1.0, r=0.05, params=params)
   elapsed = time.perf_counter() - t0
   print(f"{elapsed / 1000 * 1e3:.3f} ms per call")
   # Typical: ~0.9 ms on M1 / ~1.2 ms on Intel Xeon

For memory-intensive workflows, use ``memory_profiler`` (install separately):

.. code-block:: bash

   pip install memory-profiler
   python -m memory_profiler my_surrogate_script.py