Performance & Production Guide
===============================

This guide is written for **quant traders, risk engineers, and production system architects** who need to know exactly how fast TensorQuantLib is, where the bottlenecks are, and how to tune every knob.

All timings were measured on **Apple M1 (8-core, 8 GB RAM), Python 3.11, NumPy linked to Apple Accelerate**. A modern x86-64 server with an AVX-512-enabled BLAS should give similar or better numbers, but re-measure on your own hardware before relying on them.

.. contents:: On this page
   :local:
   :depth: 2

At-a-Glance Latency Table
--------------------------

.. list-table::
   :header-rows: 1
   :widths: 28 18 18 36

   * - Function / Workflow
     - Single-call latency
     - Throughput
     - Notes
   * - ``black_scholes``
     - < 5 µs
     - > 200 K/s
     - Fully analytic, log+exp only
   * - ``bs_greeks``
     - < 5 µs
     - > 200 K/s
     - Analytic Delta/Gamma/Theta/Vega/Rho
   * - ``second_order_greeks``
     - ~1 ms
     - ~1 K/s
     - 4-point finite-diff bump × 2 params
   * - ``implied_vol``
     - < 1 ms
     - ~2 K/s
     - Brent solver, ~10 iterations
   * - ``heston_price`` (CF)
     - ~1 ms
     - ~1 K/s
     - 100-pt Gaussian quadrature
   * - ``heston_price_mc`` (QE)
     - 200–500 ms
     - 2–5/s
     - 100 K paths × 252 steps
   * - ``american_option_lsm``
     - 100–500 ms
     - 2–10/s
     - 50 K paths × 100 steps
   * - ``asian_price_mc``
     - 100–400 ms
     - 2–10/s
     - 50 K paths × 252 steps
   * - ``asian_price_cv``
     - 50–200 ms
     - 5–20/s
     - Control variate; ~10× smaller standard error
   * - ``barrier_price``
     - < 5 µs
     - > 200 K/s
     - Rubinstein-Reiner closed form
   * - ``HestonCalibrator.fit``
     - 5–15 s
     - n/a
     - Default: 3 restarts, 500 iterations
   * - ``TTSurrogate`` build (3D, n=15)
     - 2 ms
     - n/a
     - One-time cost; paid at startup
   * - ``TTSurrogate.evaluate`` (3D)
     - 1.5 µs
     - 650 K/s
     - After build; multi-linear interp on TT cores
   * - ``TTSurrogate.evaluate`` (5D)
     - ~5 µs
     - ~200 K/s
     - After build

Choosing the Right Pricer
--------------------------

.. code-block:: text

   Need a price RIGHT NOW (< 10 µs)?
   ├── Vanilla (single asset)   → black_scholes / bs_greeks
   ├── Barrier (single asset)   → barrier_price
   ├── Vasicek / CIR bond       → vasicek_bond_price / cir_bond_price
   └── FX vanilla               → garman_kohlhagen

   Need something slightly slower but model-richer?
   ├── Heston (single price)    → heston_price      (~1 ms)
   ├── Implied vol              → implied_vol       (~1 ms)
   ├── SABR smile               → sabr_implied_vol  (< 10 µs)
   └── SVI surface              → svi_implied_vol   (< 10 µs)

   Monte Carlo is unavoidable?
   ├── American option          → american_option_lsm (100 ms)
   ├── Asian (with VR!)         → asian_price_cv      (50 ms)
   └── Heston stress-test       → heston_price_mc     (200 ms)

   Many strikes / expiries to price daily?
   └── Build a TTSurrogate once at market open, query at µs latency until the next rebuild.

.. _batch-pricing:

Batch / Vectorised Pricing
---------------------------

Every scalar pricer accepts plain Python floats and works inside ``np.vectorize`` or a list comprehension with no code change.

**Price a 1 000-strike chain in one shot:**

.. code-block:: python

   import numpy as np
   from tensorquantlib import black_scholes

   strikes = np.linspace(80, 120, 1000)
   S, T, r, sigma = 100.0, 1.0, 0.05, 0.2

   # np.vectorize is a convenience wrapper around a Python-level loop,
   # so the total cost is roughly 1 000 × the scalar latency.
   bs_vec = np.vectorize(black_scholes)
   prices = bs_vec(S, strikes, T, r, sigma, option_type="call")
   # ~5 ms for 1 000 strikes, i.e. ~200 K/s effective throughput

**Price a full Heston surface (5 strikes × 3 expiries) at once:**

.. code-block:: python

   import numpy as np
   from tensorquantlib import heston_price, HestonParams

   params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)
   K_grid = np.array([90., 95., 100., 105., 110.])
   T_grid = np.array([0.5, 1.0, 2.0])
   S, r = 100.0, 0.05

   # 15 calls × ~1 ms each ≈ 15 ms
   surface = np.array([
       [heston_price(S, K, T, r, params) for T in T_grid]
       for K in K_grid
   ])  # shape (5, 3)

**Even faster: TTSurrogate replaces the surface loop entirely:**

.. code-block:: python

   from tensorquantlib import TTSurrogate, heston_price, HestonParams
   import numpy as np

   params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)

   # Build once at startup. Every sampled grid point costs one heston_price
   # call (~1 ms), so this build takes longer than the ~2 ms quoted in the
   # latency table for a 3D build.
   surr = TTSurrogate.from_function(
       fn=lambda s, k, t: heston_price(s, k, t, 0.05, params),
       axes=[
           np.linspace(80, 120, 15),    # S axis
           np.linspace(85, 115, 15),    # K axis
           np.linspace(0.25, 2.0, 15),  # T axis
       ],
       eps=1e-3,
       max_rank=30,
   )

   # Now price 10 000 spot/strike/expiry combos in about 15 ms
   rng = np.random.default_rng(0)
   pts = np.column_stack([
       rng.uniform(80, 120, 10_000),
       rng.uniform(85, 115, 10_000),
       rng.uniform(0.25, 2.0, 10_000),
   ])
   prices = surr.evaluate(pts)  # vectorised; ~1.5 µs/call = 650 K/s

TT-Rank and ``n_points`` Tuning
---------------------------------

These are the two levers that control the accuracy/speed trade-off in every ``TTSurrogate``.

Grid resolution: ``n_points``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``n_points`` is the number of grid points **per axis**. The full tensor has ``n_points ** d`` entries, which grows exponentially in ``d``; TT storage grows only linearly in ``d`` (roughly ``d × n_points × r²`` numbers for rank ``r``).

.. list-table::
   :header-rows: 1
   :widths: 15 15 20 20 30

   * - Dimensions (d)
     - Recommended ``n_points``
     - Full grid size
     - Build time (approx)
     - Notes
   * - 2–3
     - 20–30
     - 400–27 K entries
     - < 5 ms
     - Dense enough for smooth Heston/BS surfaces
   * - 4–5
     - 15–20
     - 50 K–3 M entries
     - 10–100 ms (use TT-Cross)
     - TT-Cross recommended; skip full grid construction
   * - 6+
     - 10–15
     - 1 M+ (never built)
     - N/A (TT-Cross only)
     - ``TTSurrogate.from_function`` with ``max_rank=20``

.. tip::

   For 6+ dimensional baskets use ``TTSurrogate.from_function``. With rank ``r = 20`` it calls the pricing function at only ``d × r² × n ≈ 6 × 400 × 15 = 36,000`` carefully chosen points instead of the ``15^6 ≈ 11.4 M`` points of the full grid, which is roughly **300× fewer pricing calls** during the build.

SVD tolerance: ``eps``
~~~~~~~~~~~~~~~~~~~~~~~

``eps`` is the *relative Frobenius error* allowed in TT-SVD. Tighter ``eps`` means higher ranks and more memory.

.. list-table::
   :header-rows: 1
   :widths: 12 15 15 15 43

   * - ``eps``
     - Max rank (3D)
     - TT memory
     - Compression
     - When to use
   * - ``1e-1``
     - 2–3
     - ~2 KB
     - 100×
     - Coarse smile shape only; not for pricing
   * - ``1e-2``
     - 8
     - ~17 KB
     - 13×
     - Pre-trade screening / large portfolio VaR
   * - ``1e-3`` ✅
     - 23
     - ~124 KB
     - 1.7×
     - **Production sweet spot**: 42× compression at 5D, < 0.1% error
   * - ``1e-4``
     - 30
     - ~225 KB
     - ~1×
     - High-fidelity validation; reproduces the grid values to about 1e-4 relative error
   * - ``1e-6``
     - Full rank
     - = full grid
     - 1×
     - No benefit; use direct evaluation instead

.. code-block:: python

   # Production-quality 3-asset surrogate
   surr = TTSurrogate.from_basket_analytic(
       S0_ranges=[(80, 120)] * 3,
       K=100, T=1.0, r=0.05,
       sigma=[0.2, 0.25, 0.3],
       weights=[1/3, 1/3, 1/3],
       n_points=20,   # sweet spot for 3D
       eps=1e-3,      # production tolerance
       max_rank=30,   # safety cap; rarely reached at eps=1e-3
   )
   surr.print_summary()
   # Compare the reported max rank, memory and compression with the eps
   # table above; exact values depend on n_points.

Hard-capping ranks: ``max_rank``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Setting ``max_rank`` prevents runaway memory on poorly conditioned surfaces (e.g., near-digital payoffs or very short maturities). Recommended values:

- ``max_rank=20``: fast / light; suitable for pre-trade screening
- ``max_rank=30``: production default
- ``max_rank=50``: near-exact; use for EOD risk reports
- ``max_rank=None``: unbounded; only for validation

``n_sweeps`` in TT-Cross
~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``TTSurrogate.from_function`` (TT-Cross), ``n_sweeps`` controls how many alternating left/right passes are made over the dimension chain.


- ``n_sweeps=4``: sufficient for smooth Black-Scholes surfaces
- ``n_sweeps=6``: default; good for Heston and jump-diffusion
- ``n_sweeps=8``: for stiff or highly correlated surfaces

Heston Calibration: From 10 s to < 2 s
-----------------------------------------

The default ``HestonCalibrator.fit`` with ``n_restarts=3, maxiter=500`` takes **5–15 seconds** because it runs hundreds of L-BFGS-B optimizer steps, and each step calls ``heston_price`` (~1 ms) for every ``(K, T)`` pair in the market surface.

**Breakdown for a 5 × 3 surface (15 options):**

- ~1 ms per ``heston_price`` call (100-pt CF quadrature)
- ~15 ms per objective evaluation (15 ``heston_price`` calls plus 15 faster ``implied_vol`` inversions)
- ~200 objective evals per L-BFGS-B run
- ~200 × 15 ms × 3 restarts ≈ **9 s**

Speedup strategy 1: Single restart with warm start
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In production you re-calibrate daily (or tick-by-tick). Yesterday's parameters are an excellent starting point:

.. code-block:: python

   import numpy as np
   from tensorquantlib.finance.heston import HestonCalibrator, HestonParams

   # Day T-1 calibrated params (persisted to disk / Redis)
   prev_params = HestonParams(kappa=2.1, theta=0.042, xi=0.31, rho=-0.68, v0=0.038)

   cal = HestonCalibrator(S=100.0, r=0.05)
   cal.params_ = prev_params  # warm start

   K_grid = np.array([90., 95., 100., 105., 110.])
   T_grid = np.array([0.5, 1.0, 2.0])
   iv_mkt = np.full((5, 3), 0.20)  # replace with real market IVs

   # 1 restart instead of 3    → ~3× faster
   # 200 maxiter instead of 500 → up to 2.5× faster
   cal.fit(iv_mkt, K_grid, T_grid, n_restarts=1, maxiter=200)
   print(f"Calibrated RMSE: {cal.rmse_:.6f}")
   # Typical wall-clock: ~1.5 s

Speedup strategy 2: Parallel restarts with joblib
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Run multiple random-start optimisations simultaneously; keep the best:

.. code-block:: python

   import numpy as np
   from scipy.optimize import minimize
   from joblib import Parallel, delayed
   from tensorquantlib import heston_price, implied_vol, HestonParams

   S, r = 100.0, 0.05
   K_grid = np.array([90., 95., 100., 105., 110.])
   T_grid = np.array([0.5, 1.0, 2.0])
   iv_mkt = np.full((5, 3), 0.20)

   # Box constraints in the order kappa, theta, xi, rho, v0
   BOUNDS = [(0.1, 10), (0.001, 0.5), (0.1, 2.0), (-0.99, 0.99), (0.001, 0.5)]

   def _objective(x):
       """Sum of squared IV errors over the (K, T) grid."""
       params = HestonParams(*x)
       err = 0.0
       for i, K in enumerate(K_grid):
           for j, T in enumerate(T_grid):
               try:
                   price = heston_price(S, K, T, r, params)
                   iv_m = implied_vol(price, S, K, T, r)
                   err += (iv_m - iv_mkt[i, j]) ** 2
               except Exception:
                   err += 1.0  # penalise parameter sets that fail numerically
       return err

   def _one_restart(seed):
       rng = np.random.default_rng(seed)
       x0 = np.array([rng.uniform(lo, hi) for lo, hi in BOUNDS])
       return minimize(_objective, x0, method="L-BFGS-B", bounds=BOUNDS,
                       options={"maxiter": 200})

   # 4 parallel restarts: with 4 free cores, wall-clock ≈ 1 restart (not 4×)
   results = Parallel(n_jobs=4)(delayed(_one_restart)(s) for s in range(4))
   best = min(results, key=lambda res: res.fun)
   params = HestonParams(*best.x)
   print(f"Best RMSE: {np.sqrt(best.fun / (len(K_grid) * len(T_grid))):.6f}")

Speedup strategy 3: Reduce the CF integration cost
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``heston_price`` uses ``n_points=100`` Gaussian quadrature nodes by default. For calibration purposes 50 nodes are usually accurate enough (< 0.1 bp IV error) and roughly twice as fast:

.. code-block:: python

   from functools import partial
   from tensorquantlib.finance.heston import heston_price as _hp, HestonParams

   params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)

   # Bind n_points=50 with functools.partial; the library itself is untouched
   fast_heston = partial(_hp, n_points=50)

   # Use fast_heston inside your objective instead of heston_price
   price = fast_heston(S=100, K=100, T=1.0, r=0.05, params=params)

Speedup strategy 4: Coarsen the IV grid
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Calibrate to fewer points; check residual on the full grid:

.. code-block:: python

   import numpy as np
   from tensorquantlib import heston_price, implied_vol

   # Assumes cal, S, r, iv_mkt_coarse, K_grid_full, T_grid_full and
   # iv_mkt_full are already defined from your market data.

   # Coarse calibration grid: 4 strikes × 3 expiries = 12 pricing calls
   # per objective evaluation
   K_calib = np.array([90., 97., 103., 110.])
   T_calib = np.array([0.5, 1.0, 2.0])

   cal.fit(iv_mkt_coarse, K_calib, T_calib, n_restarts=1, maxiter=200)

   # Validate RMSE on full 10×5 grid
   iv_model = np.array([
       [implied_vol(heston_price(S, K, T, r, cal.params_), S, K, T, r)
        for T in T_grid_full]
       for K in K_grid_full
   ])
   print("Full-grid RMSE:", np.sqrt(np.mean((iv_model - iv_mkt_full) ** 2)))

**Summary of speedups:**

.. list-table::
   :header-rows: 1
   :widths: 35 20 20 25

   * - Strategy
     - Wall-clock
     - Speedup
     - Accuracy impact
   * - Default (n_restarts=3, maxiter=500)
     - ~10 s
     - 1×
     - Baseline
   * - Single restart + warm start
     - ~1.5 s
     - ~7×
     - Negligible if params are stable
   * - Parallel restarts (4 cores) + warm start
     - ~1.0 s
     - ~10×
     - As good as 4 restarts
   * - n_points=50 in CF
     - ~5 s
     - ~2×
     - < 0.1 bp IV error
   * - All combined
     - **< 0.5 s**
     - **~20×**
     - < 0.5 bp RMSE

Memory Profiling
-----------------

Use Python's built-in ``tracemalloc`` to measure memory allocation of any pricing or surrogate workflow:

.. code-block:: python

   import tracemalloc
   import numpy as np
   from tensorquantlib import TTSurrogate

   tracemalloc.start()

   surr = TTSurrogate.from_basket_analytic(
       S0_ranges=[(80, 120)] * 4,
       K=100, T=1.0, r=0.05,
       sigma=[0.2, 0.22, 0.25, 0.28],
       weights=[0.25] * 4,
       n_points=15,
       eps=1e-3,
   )

   current, peak = tracemalloc.get_traced_memory()
   tracemalloc.stop()

   print(f"Current: {current / 1024:.1f} KB")
   print(f"Peak:    {peak / 1024:.1f} KB")
   # Typical output for 4-asset eps=1e-3:
   # Current: 91.4 KB
   # Peak:    ~500 KB  (includes grid construction + SVD temporaries)

**Expected memory footprint (eps=1e-3, n_points=15):**

.. list-table::
   :header-rows: 1
   :widths: 15 18 18 16 33

   * - Assets (d)
     - Full grid
     - TT size
     - Peak alloc
     - Notes
   * - 2
     - < 0.01 MB
     - 3 KB
     - ~50 KB
     - Negligible
   * - 3
     - 0.03 MB
     - 28 KB
     - ~200 KB
     - Fits in L1 cache
   * - 4
     - 0.39 MB
     - 91 KB
     - ~500 KB
     - 4× compression
   * - 5
     - 5.79 MB
     - 142 KB
     - ~2 MB
     - **42× compression: TT wins decisively**
   * - 6
     - 92 MB
     - ~200 KB
     - ~4 MB
     - Full grid never materialised (TT-Cross)
   * - 10
     - ~4.6 TB (15^10 ≈ 5.8 × 10^11 float64 entries)
     - ~400 KB
     - ~8 MB
     - Impossible without TT

The TT surrogate's ``print_summary()`` method reports memory automatically:

.. code-block:: python

   surr.print_summary()
   # ──────────────────────────────────────────────────────────────────
   # TTSurrogate  dims=4  n_points=15  eps=1e-3  max_rank=30
   # Ranks      : [1, 22, 25, 18, 1]
   # TT memory  : 91.4 KB     Full-grid equivalent : 393.8 KB
   # Compression: 4.3×        Max rank             : 25
   # Build time : 10.2 ms     Compress time        : 2.1 ms
   # ──────────────────────────────────────────────────────────────────

Monte Carlo Variance Reduction: Which Method When
---------------------------------------------------

All MC pricers default to ``n_paths=100_000``. The table below shows the variance reduction you get and when to use each method.

.. list-table::
   :header-rows: 1
   :widths: 25 18 18 39

   * - Method
     - Variance reduction
     - ``n_paths`` to match crude MC at 100 K
     - Best for
   * - ``bs_price_antithetic``
     - ~2×
     - 50 K
     - Always-on default; free variance reduction
   * - ``bs_price_cv``
     - 10–100×
     - 1 K–10 K
     - Asian options (use geometric Asian as control)
   * - ``bs_price_qmc``
     - Error ~O(1/N) instead of O(1/√N)
     - 8 192 (power of 2)
     - Near-ATM vanilla / mild path dependency
   * - ``bs_price_importance``
     - 10–1000×
     - 100–1 K
     - Deep OTM options (>3 sigma)
   * - ``bs_price_stratified``
     - 2–10×
     - 10 K–50 K
     - General purpose when QMC is unavailable

.. code-block:: python

   from tensorquantlib import (
       asian_price_mc,   # crude MC baseline
       asian_price_cv,   # with geometric-Asian control variate
       bs_price_qmc,     # Sobol QMC (used in the next example)
   )

   # Compare methods at equal compute budget (n_paths=10_000)
   price_crude, se_crude = asian_price_mc(
       S=100, K=100, T=1.0, r=0.05, sigma=0.2,
       n_paths=10_000, return_stderr=True
   )
   price_cv, se_cv = asian_price_cv(
       S=100, K=100, T=1.0, r=0.05, sigma=0.2,
       n_paths=10_000, return_stderr=True
   )
   print(f"Crude MC : {price_crude:.4f} ± {se_crude:.4f}")
   print(f"Control V: {price_cv:.4f} ± {se_cv:.4f}")
   # Typical:
   # Crude MC : 5.7621 ± 0.0451   (large stderr)
   # Control V: 5.7608 ± 0.0043   (~10× smaller stderr, ~100× less variance)

**QMC: always pass a power-of-2 path count** (``n_paths``) to keep the Sobol sequence balanced:

.. code-block:: python

   # Good (balance maintained)
   price = bs_price_qmc(S=100, K=100, T=1.0, r=0.05, sigma=0.2, n_paths=65_536)  # 2^16

   # Avoid: triggers UserWarning and slight accuracy degradation
   price = bs_price_qmc(S=100, K=100, T=1.0, r=0.05, sigma=0.2, n_paths=100_000)

Parallelism: Pricing a Large Portfolio
---------------------------------------

TensorQuantLib has **no internal parallelism** by design. All functions are pure NumPy and safe to call from multiple threads or processes simultaneously. Note that NumPy releases the GIL only inside its compiled array kernels: threads scale well when most of the time is spent in large array operations, and less well for scalar calls dominated by Python-level overhead. Measure the actual speedup before relying on the best-case figures below.

**Thread-parallel strike chain (I/O-light):**

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor
   import numpy as np
   from tensorquantlib import heston_price, HestonParams

   params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)
   strikes = np.linspace(70, 130, 200)

   def price_one(K):
       return heston_price(S=100, K=K, T=1.0, r=0.05, params=params)

   with ThreadPoolExecutor(max_workers=8) as pool:
       prices = list(pool.map(price_one, strikes))
   # Best case: 200 Heston prices × ~1 ms / 8 workers ≈ 25 ms end-to-end;
   # time spent holding the GIL in Python-level code reduces the speedup.

**Process-parallel for calibration restarts** (CPU-bound, so use processes):

.. code-block:: python

   from concurrent.futures import ProcessPoolExecutor

   def calibrate_one(seed):
       # Reuses _one_restart() from "Speedup strategy 2"; it must be defined
       # at module level so worker processes can import it.
       return _one_restart(seed)

   # The guard is required on platforms that spawn workers (macOS, Windows)
   if __name__ == "__main__":
       with ProcessPoolExecutor(max_workers=4) as pool:
           results = list(pool.map(calibrate_one, range(4)))
       best = min(results, key=lambda res: res.fun)

.. note::

   **GPU / CUDA**: TensorQuantLib is pure NumPy. There is no built-in GPU support today, but pricing surfaces and TT-SVD operations are compatible with ``cupy.ndarray`` as a drop-in replacement for ``numpy.ndarray`` in most functions. Full CuPy integration is on the roadmap (see :doc:`limitations`).

Production Configuration Checklist
------------------------------------

Use this checklist before deploying TensorQuantLib in a production trading or risk system:

**Daily startup**

.. code-block:: bash

   # 1. Verify environment
   python -m tensorquantlib --version

   # 2. Run smoke tests
   python -m pytest tests/ -q --tb=short -x -m "not slow"

**Parameter selection**

.. list-table::
   :header-rows: 1
   :widths: 40 30 30

   * - Setting
     - Development / backtest
     - Production / real-time
   * - ``HestonCalibrator n_restarts``
     - 3–5
     - **1** (with warm start)
   * - ``HestonCalibrator maxiter``
     - 500
     - **200**
   * - ``heston_price n_points``
     - 100
     - **50–100** (50 gives < 0.1 bp error)
   * - ``heston_price_mc n_paths``
     - 100 K
     - N/A (use CF for live pricing)
   * - ``TTSurrogate eps``
     - ``1e-4``
     - ``1e-3``
   * - ``TTSurrogate n_points``
     - 30 (2–3D)
     - **15–20** (3D), **10–15** (4–5D)
   * - ``TTSurrogate max_rank``
     - None
     - **30**
   * - MC ``n_paths`` (vanilla)
     - 100 K
     - **10 K + antithetic** or **QMC 8 K**
   * - MC ``n_paths`` (Asian with CV)
     - 100 K
     - **10 K** (CV reduces SE 10×)
   * - Heston MC ``scheme``
     - ``"qe"``
     - N/A (use ``heston_price`` CF instead)

**Warm-start workflow (recommended)**

.. code-block:: python

   import json
   import pathlib
   from tensorquantlib.finance.heston import HestonCalibrator, HestonParams

   PARAMS_FILE = pathlib.Path("heston_params.json")

   def load_params() -> HestonParams:
       if PARAMS_FILE.exists():
           d = json.loads(PARAMS_FILE.read_text())
           return HestonParams(**d)
       return HestonParams()  # default

   def save_params(p: HestonParams) -> None:
       PARAMS_FILE.write_text(json.dumps({
           "kappa": p.kappa, "theta": p.theta,
           "xi": p.xi, "rho": p.rho, "v0": p.v0,
       }))

   # At market open (spot, risk_free_rate, iv_surface, K_grid and T_grid
   # come from your market data feed)
   cal = HestonCalibrator(S=spot, r=risk_free_rate)
   cal.params_ = load_params()  # warm start
   cal.fit(iv_surface, K_grid, T_grid, n_restarts=1, maxiter=200)  # ~1.5 s
   save_params(cal.params_)     # persist for tomorrow

**Error handling in production**

.. code-block:: python

   from tensorquantlib import implied_vol, heston_price, HestonParams
   import numpy as np

   def safe_heston_iv(S, K, T, r, params, fallback=np.nan):
       """Return Heston-implied vol; return fallback on any numerical failure."""
       try:
           price = heston_price(S, K, T, r, params)
           if not np.isfinite(price) or price <= 1e-10:
               return fallback
           return implied_vol(price, S, K, T, r)
       except Exception:
           return fallback

**TT surrogate rebuild policy**

Rebuild the surrogate when any of these change:

- Heston / model parameters (after recalibration)
- Risk-free rate shifts by more than 5 bps
- Spot price moves outside the surrogate's domain ± 10 %
- New expiry or strike enters the book

Otherwise, ``surr.evaluate`` remains accurate to within the build tolerance (``eps``) for the *existing* parameter set.

Interpreting ``print_summary()`` Output
-----------------------------------------

.. code-block:: text

   TTSurrogate  dims=3  n_points=20  eps=1e-4
   Ranks      : [1, 15, 22, 1]
   TT memory  : 53.6 KB     Full-grid equivalent : 640.0 KB
   Compression: 11.9×       Max rank             : 22
   Build time : 1.8 ms      Compress time        : 1.2 ms

- **Ranks**: the ranks linking adjacent TT cores (the first and last are always 1). High middle ranks (> 40) with ``eps=1e-3`` suggest a rough or discontinuous payoff; consider increasing ``n_points`` or switching to ``from_function`` (TT-Cross).
- **Compression < 1×**: TT uses *more* memory than the dense grid. This means the surface is not low-rank. Try increasing ``eps`` or reducing ``n_points``.
- **Build time >> Compress time**: Most of the cost is in evaluating the pricing function on the grid. Use analytic approximations (``from_basket_analytic``) rather than MC (``from_basket_mc``) wherever possible.
- **Compress time >> Build time**: The TT-SVD routine is dominant; consider reducing ``n_points``.

Profiling Your Own Workflow
-----------------------------

Quick function-level profiling with ``cProfile``:

.. code-block:: bash

   python -m cProfile -s cumtime -m tensorquantlib.__main__ heston \
       --S 100 --K 100 --T 1.0 --r 0.05 2>&1 | head -30

Or instrument inline with ``time.perf_counter``:

.. code-block:: python

   import time
   from tensorquantlib import heston_price, HestonParams

   params = HestonParams()
   n_calls = 1000

   t0 = time.perf_counter()
   for _ in range(n_calls):
       heston_price(S=100, K=100, T=1.0, r=0.05, params=params)
   elapsed = time.perf_counter() - t0
   print(f"{elapsed / n_calls * 1e3:.3f} ms per call")
   # Typical: ~0.9 ms on M1 / ~1.2 ms on Intel Xeon

For memory-intensive workflows, use ``memory_profiler`` (install separately) and decorate the functions of interest with ``@profile`` so that it reports line-by-line usage:

.. code-block:: bash

   pip install memory-profiler
   python -m memory_profiler my_surrogate_script.py
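
A single timed loop with ``time.perf_counter`` can be skewed by a one-off stall (garbage collection, a cold cache, another process). The sketch below repeats the timed batch several times and reports the best and median per-call time. It uses only the standard library; ``bench`` is a hypothetical helper name, not part of TensorQuantLib, and the ``sum`` workload is a stand-in for whichever pricer you want to measure.

.. code-block:: python

   import statistics
   import time


   def bench(fn, *args, repeats=5, calls=100, **kwargs):
       """Return (best, median) seconds per call over `repeats` timed batches."""
       per_call = []
       for _ in range(repeats):
           t0 = time.perf_counter()
           for _ in range(calls):
               fn(*args, **kwargs)
           # Average over the batch to amortise the timer overhead
           per_call.append((time.perf_counter() - t0) / calls)
       return min(per_call), statistics.median(per_call)


   if __name__ == "__main__":
       # Stand-in workload; replace with e.g. heston_price(...) arguments.
       best, median = bench(sum, range(1000), repeats=5, calls=200)
       print(f"best {best * 1e6:.2f} µs, median {median * 1e6:.2f} µs per call")

The best time approximates the cost on a quiet machine, while the median is closer to what a busy production host will see; a large gap between the two usually points to interference rather than to the pricer itself.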