Performance & Production Guide¶
This guide is written for quant traders, risk engineers, and production system architects who need to know exactly how fast TensorQuantLib is, where the bottlenecks are, and how to tune every knob.
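All timings were measured on Apple M1 (8-core, 8 GB RAM), Python 3.11, NumPy linked to Apple Accelerate. Numbers on a modern x86-64 server with an AVX-512-enabled BLAS should be of the same order, not necessarily better: the Heston timing at the end of this guide is ~0.9 ms on the M1 versus ~1.2 ms on an Intel Xeon.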
At-a-Glance Latency Table¶
| Function / Workflow | Single-call latency | Throughput | Notes |
|---|---|---|---|
|  | < 5 µs | > 200 K/s | Fully analytic, log+exp only |
|  | < 5 µs | > 200 K/s | Analytic Delta/Gamma/Theta/Vega/Rho |
|  | ~1 ms | ~1 K/s | 4-point finite-diff bump × 2 params |
|  | < 1 ms | ~2 K/s | Brent solver, ~10 iterations |
|  | ~1 ms | ~1 K/s | 100-pt Gaussian quadrature |
|  | 200–500 ms | 2–5/s | 100 K paths × 252 steps |
|  | 100–500 ms | 2–10/s | 50 K paths × 100 steps |
|  | 100–400 ms | 2–10/s | 50 K paths × 252 steps |
|  | 50–200 ms | 5–20/s | Control variate; 10× variance reduction |
|  | < 5 µs | > 200 K/s | Rubinstein-Reiner closed form |
|  | 5–15 s | n/a | Default: 3 restarts, 500 iterations |
|  | 2 ms | n/a | One-time cost; paid at startup |
|  | 1.5 µs | 650 K/s | After build; multi-linear interp on TT cores |
|  | ~5 µs | ~200 K/s | After build |
Choosing the Right Pricer¶
Need a price RIGHT NOW (< 10 µs)?
├── Vanilla (single asset) → black_scholes / bs_greeks
├── Barrier (single asset) → barrier_price
├── Vasicek / CIR bond → vasicek_bond_price / cir_bond_price
└── FX vanilla → garman_kohlhagen
Need something slightly slower but model-richer?
├── Heston (single price) → heston_price (~1 ms)
├── Implied vol → implied_vol (~1 ms)
├── SABR smile → sabr_implied_vol (< 10 µs)
└── SVI surface → svi_implied_vol (< 10 µs)
Monte Carlo is unavoidable?
├── American option → american_option_lsm (100 ms)
├── Asian (with VR!) → asian_price_cv (50 ms)
└── Heston stress-test → heston_price_mc (200 ms)
Many strikes / expiries to price daily?
└── Build a TTSurrogate once at market open, query at µs latency forever.
Batch / Vectorised Pricing¶
Every scalar pricer accepts plain Python floats and works inside
np.vectorize or a list comprehension with no code change.
Price a 1 000-strike chain in one shot:
import numpy as np
from tensorquantlib import black_scholes
strikes = np.linspace(80, 120, 1000)
S, T, r, sigma = 100.0, 1.0, 0.05, 0.2
# np.vectorize is a convenience wrapper around a Python-level loop: it broadcasts
# the arguments but does not make the scalar function any faster
bs_vec = np.vectorize(black_scholes)
prices = bs_vec(S, strikes, T, r, sigma, option_type="call")
# ~5 ms for 1 000 strikes — 200 K/s effective throughput
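The same pattern can be tried without the library installed. The sketch below uses a hypothetical stand-in, bs_call, built on the textbook closed-form Black-Scholes call formula. It is not TensorQuantLib's implementation and its signature is an assumption. It prices a strike chain with a plain list comprehension and times the loop.

```python
import math
import time
from statistics import NormalDist

_N = NormalDist()  # standard normal; .cdf gives the cumulative distribution

def bs_call(S, K, T, r, sigma):
    """Closed-form Black-Scholes price of a European call (hypothetical stand-in)."""
    vol_sqrt_t = sigma * math.sqrt(T)
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    return S * _N.cdf(d1) - K * math.exp(-r * T) * _N.cdf(d2)

# 1 000 evenly spaced strikes from 80 to 120, built without NumPy
strikes = [80.0 + 40.0 * i / 999 for i in range(1000)]

t0 = time.perf_counter()
prices = [bs_call(100.0, K, 1.0, 0.05, 0.2) for K in strikes]
elapsed = time.perf_counter() - t0

print(f"{len(prices)} prices in {elapsed * 1e3:.2f} ms "
      f"({len(prices) / elapsed:,.0f} prices/s)")
```

Throughput measured this way includes the loop overhead, which is the relevant figure when the pricer is called one scalar at a time.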
Price a full Heston surface (5 strikes × 3 expiries) with a nested comprehension:
import numpy as np
from tensorquantlib import heston_price, HestonParams
params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)
K_grid = np.array([90., 95., 100., 105., 110.])
T_grid = np.array([0.5, 1.0, 2.0])
S, r = 100.0, 0.05
# 15 calls × ~1 ms each ≈ 15 ms
surface = np.array([
[heston_price(S, K, T, r, params) for T in T_grid]
for K in K_grid
]) # shape (5, 3)
Even faster: TTSurrogate replaces the surface loop entirely:
from tensorquantlib import TTSurrogate, heston_price, HestonParams
import numpy as np
params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)
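# Build once at startup. Build time is dominated by calls to fn: at ~1 ms per
# heston_price call it will be far longer than the ~2 ms quoted for analytic 3D builds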
surr = TTSurrogate.from_function(
fn=lambda s, k, t: heston_price(s, k, t, 0.05, params),
axes=[
np.linspace(80, 120, 15), # S axis
np.linspace(85, 115, 15), # K axis
np.linspace(0.25, 2.0, 15), # T axis
],
eps=1e-3,
max_rank=30,
)
# Now price 10 000 spot/strike/expiry combos in < 15 ms
pts = np.column_stack([
np.random.uniform(80, 120, 10_000),
np.random.uniform(85, 115, 10_000),
np.random.uniform(0.25, 2.0, 10_000),
])
prices = surr.evaluate(pts) # vectorised; ~1.5 µs/call = 650 K/s
TT-Rank and n_points Tuning¶
These are the two levers that control the accuracy/speed trade-off in every
TTSurrogate.
Grid resolution: n_points¶
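n_points is the number of grid points per axis. The full tensor has
n_points ** d entries, which grows exponentially in d; the TT format stores
roughly d × n_points × r² numbers for TT-rank r, which grows only linearly in d.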
| Dimensions (d) | Recommended n_points | Full grid size | Build time (approx) | Notes |
|---|---|---|---|---|
| 2–3 | 20–30 | 400–27 K entries | < 5 ms | Dense enough for smooth Heston/BS surfaces |
| 4–5 | 15–20 | 50 K–3 M entries | 10–100 ms (use TT-Cross) | TT-Cross recommended; skip full grid construction |
| 6+ | 10–15 | 1 M+ (never built) | N/A (TT-Cross only) |  |
Tip
For baskets with 6 or more dimensions, use TTSurrogate.from_function.
It calls the pricing function at only about
d × r² × n = 6 × 20² × 15 = 36,000 carefully chosen points instead of
the 15^6 ≈ 11.4 M points of the full grid: roughly 300× fewer evaluations, and a correspondingly shorter build.
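The arithmetic in the tip can be reproduced directly. The helper below is only a back-of-envelope sketch: the d × r² × n count is the estimate quoted in the tip, r = 20 follows from r² = 400, and the function names are made up for illustration.

```python
def tt_cross_evaluations(d, rank, n_points):
    """Approximate number of pricing calls made by TT-Cross: d * r^2 * n."""
    return d * rank ** 2 * n_points

def full_grid_evaluations(d, n_points):
    """Number of pricing calls needed to fill the dense grid: n^d."""
    return n_points ** d

d, rank, n_points = 6, 20, 15
cross = tt_cross_evaluations(d, rank, n_points)
full = full_grid_evaluations(d, n_points)

print(f"TT-Cross : {cross:,} evaluations")    # 36,000
print(f"Full grid: {full:,} evaluations")     # 11,390,625
print(f"Ratio    : {full / cross:.0f}x fewer calls")  # 316x
```

Because the TT-Cross count is linear in d while the full grid is exponential, the gap widens quickly as assets are added.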
SVD tolerance: eps¶
eps is the relative Frobenius error allowed in TT-SVD. Tighter eps
means higher ranks and more memory.
|  | Max rank (3D) | TT memory | Compression | When to use |
|---|---|---|---|---|
|  | 2–3 | ~2 KB | 100× | Coarse smile shape only; not for pricing |
|  | 8 | ~17 KB | 13× | Pre-trade screening / large portfolio VaR |
|  | 23 | ~124 KB | 1.7× | Production sweet spot: 42× compression at 5D, < 0.1% error |
|  | 30 | ~225 KB | ~1× | High-fidelity validation; matches CF to machine precision |
|  | Full rank | = full grid | 1× | No benefit; use direct evaluation instead |
# Production-quality 3-asset surrogate
surr = TTSurrogate.from_basket_analytic(
S0_ranges=[(80, 120)] * 3,
K=100, T=1.0, r=0.05,
sigma=[0.2, 0.25, 0.3],
weights=[1/3, 1/3, 1/3],
n_points=20, # ← sweet spot for 3D
eps=1e-3, # ← production tolerance
max_rank=30, # ← safety cap; rarely reached at eps=1e-3
)
surr.print_summary()
# Expected: max_rank ≈ 23, memory ≈ 124 KB, compression ≈ 1.7×
Hard-capping ranks: max_rank¶
Setting max_rank prevents runaway memory on poorly-conditioned surfaces
(e.g., near-digital payoffs or very short maturities). Recommended values:
max_rank=20: fast / light; suitable for pre-trade screening
max_rank=30: production default
max_rank=50: near-exact; use for EOD risk reports
max_rank=None: unbounded; only for validation
n_sweeps in TT-Cross¶
When using TTSurrogate.from_function (TT-Cross), n_sweeps controls
how many alternating left/right passes are made over the dimension chain.
n_sweeps=4: sufficient for smooth Black-Scholes surfaces
n_sweeps=6: default; good for Heston and jump-diffusion
n_sweeps=8: for stiff or highly correlated surfaces
Heston Calibration: From 10 s to < 2 s¶
The default HestonCalibrator.fit with n_restarts=3, maxiter=500
takes 5–15 seconds because it runs hundreds of L-BFGS-B optimizer
steps, and each step calls heston_price (1 ms) for every
(K, T) pair in the market surface.
Breakdown for a 5 × 3 surface (15 options):
~1 ms per heston_price call (100-pt CF quadrature)
~15 ms per objective evaluation (15 heston_price + 15 implied_vol)
~200 objective evals per L-BFGS-B run
× 3 restarts = ~200 × 15 ms × 3 = 9 s
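The estimate is a simple product, which makes it easy to see what each knob buys. The function below is a hypothetical helper, not part of TensorQuantLib; it takes the ~15 ms per objective evaluation and ~200 evaluations per run from the breakdown as given and ignores optimizer overhead.

```python
def calibration_seconds(ms_per_objective, evals_per_run, n_restarts):
    """Rough wall-clock estimate: objective cost x evaluations x restarts."""
    return ms_per_objective * evals_per_run * n_restarts / 1000.0

baseline = calibration_seconds(ms_per_objective=15, evals_per_run=200, n_restarts=3)
single = calibration_seconds(ms_per_objective=15, evals_per_run=200, n_restarts=1)

print(f"3 restarts: {baseline:.1f} s")  # 9.0 s
print(f"1 restart : {single:.1f} s")    # 3.0 s
```

Each factor enters linearly, so halving any one of them halves the estimate.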
Speedup strategy 1 — Single restart with warm start¶
In production you re-calibrate daily (or tick-by-tick). Yesterday’s parameters are an excellent starting point:
import numpy as np
from tensorquantlib.finance.heston import HestonCalibrator, HestonParams
# Day T-1 calibrated params (persisted to disk / Redis)
prev_params = HestonParams(kappa=2.1, theta=0.042, xi=0.31, rho=-0.68, v0=0.038)
cal = HestonCalibrator(S=100.0, r=0.05)
cal.params_ = prev_params # warm start
K_grid = np.array([90., 95., 100., 105., 110.])
T_grid = np.array([0.5, 1.0, 2.0])
iv_mkt = np.full((5, 3), 0.20) # replace with real market IVs
# 1 restart instead of 3 → 3× faster
# 200 maxiter instead of 500 → 2.5× faster
cal.fit(iv_mkt, K_grid, T_grid, n_restarts=1, maxiter=200)
print(f"Calibrated RMSE: {cal.rmse_:.6f}")
# Typical wall-clock: ~1.5 s
Speedup strategy 2 — Parallel restarts with joblib¶
Run multiple random-start optimisations simultaneously; keep the best:
import numpy as np
from scipy.optimize import minimize
from joblib import Parallel, delayed
from tensorquantlib import heston_price, implied_vol, HestonParams
S, r = 100.0, 0.05
K_grid = np.array([90., 95., 100., 105., 110.])
T_grid = np.array([0.5, 1.0, 2.0])
iv_mkt = np.full((5, 3), 0.20)
BOUNDS = [(0.1, 10), (0.001, 0.5), (0.1, 2.0), (-0.99, 0.99), (0.001, 0.5)]
def _objective(x):
    """Sum of squared implied-vol errors over the (K, T) grid."""
    params = HestonParams(*x)
    err = 0.0
    for i, K in enumerate(K_grid):
        for j, T in enumerate(T_grid):
            try:
                price = heston_price(S, K, T, r, params)
                iv_m = implied_vol(price, S, K, T, r)
                err += (iv_m - iv_mkt[i, j]) ** 2
            except Exception:
                err += 1.0  # penalise parameter sets where pricing or inversion fails
    return err
def _one_restart(seed):
    """Run one L-BFGS-B optimisation from a random start inside BOUNDS."""
    rng = np.random.default_rng(seed)
    x0 = np.array([rng.uniform(lo, hi) for lo, hi in BOUNDS])
    return minimize(_objective, x0, method="L-BFGS-B",
                    bounds=BOUNDS, options={"maxiter": 200})
# 4 parallel restarts: wall-clock ≈ 1 restart time (not 4×)
results = Parallel(n_jobs=4)(delayed(_one_restart)(s) for s in range(4))
# "res" rather than "r" so the lambda does not shadow the risk-free rate
best = min(results, key=lambda res: res.fun)
params = HestonParams(*best.x)
print(f"Best RMSE: {np.sqrt(best.fun / (len(K_grid)*len(T_grid))):.6f}")
Speedup strategy 3 — Reduce the CF integration cost¶
heston_price uses n_points=100 Gaussian quadrature nodes by default.
For calibration, 50 nodes are usually accurate enough (the summary table below puts the error under 0.1 bp of implied vol) and roughly halve the cost of each call:
from functools import partial
from tensorquantlib.finance.heston import heston_price as _hp
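# Bind n_points=50 with functools.partial to get a faster callable for calibration
# (this creates a new function; the library's heston_price is left untouched)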
fast_heston = partial(_hp, n_points=50)
# Use fast_heston inside your objective instead of heston_price
price = fast_heston(S=100, K=100, T=1.0, r=0.05, params=params)
Speedup strategy 4 — Coarsen the IV grid¶
Calibrate to fewer points; check residual on the full grid:
# Coarse calibration grid: 4 strikes × 3 expiries = 12 evaluations/step
K_calib = np.array([90., 97., 103., 110.])
T_calib = np.array([0.5, 1.0, 2.0])
cal.fit(iv_mkt_coarse, K_calib, T_calib, n_restarts=1, maxiter=200)
# Validate RMSE on full 10×5 grid
iv_model = np.array([
[implied_vol(heston_price(S, K, T, r, cal.params_), S, K, T, r)
for T in T_grid_full]
for K in K_grid_full
])
print("Full-grid RMSE:", np.sqrt(np.mean((iv_model - iv_mkt_full)**2)))
Summary of speedups:
| Strategy | Wall-clock | Speedup | Accuracy impact |
|---|---|---|---|
| Default (n_restarts=3, maxiter=500) | ~10 s | 1× | Baseline |
| Single restart + warm start | ~1.5 s | ~7× | Negligible if params are stable |
| Parallel restarts (4 cores) + warm start | ~1.0 s | ~10× | As good as 4 restarts |
| n_points=50 in CF | ~5 s | ~2× | < 0.1 bp IV error |
| All combined | < 0.5 s | ~20× | < 0.5 bps RMSE |
Memory Profiling¶
Use Python’s built-in tracemalloc to measure memory allocation of any
pricing or surrogate workflow:
import tracemalloc
import numpy as np
from tensorquantlib import TTSurrogate
tracemalloc.start()
surr = TTSurrogate.from_basket_analytic(
S0_ranges=[(80, 120)] * 4,
K=100, T=1.0, r=0.05,
sigma=[0.2, 0.22, 0.25, 0.28],
weights=[0.25] * 4,
n_points=15,
eps=1e-3,
)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Current: {current / 1024:.1f} KB")
print(f"Peak: {peak / 1024:.1f} KB")
# Typical output for 4-asset eps=1e-3:
# Current: 91.4 KB
# Peak: ~500 KB (includes grid construction + SVD temporaries)
Expected memory footprint (eps=1e-3, n_points=15):
| Assets (d) | Full grid | TT size | Peak alloc | Notes |
|---|---|---|---|---|
| 2 | < 0.01 MB | 3 KB | ~50 KB | Negligible |
| 3 | 0.03 MB | 28 KB | ~200 KB | Fits in L1 cache |
| 4 | 0.39 MB | 91 KB | ~500 KB | 4× compression |
| 5 | 5.79 MB | 142 KB | ~2 MB | 42× compression; TT wins decisively |
| 6 | 92 MB | ~200 KB | ~4 MB | Full grid never materialised (TT-Cross) |
| 10 | > 4 TB (15^10 ≈ 576 G entries × 8 bytes) | ~400 KB | ~8 MB | Impossible without TT |
The TT surrogate’s print_summary() method reports memory automatically:
surr.print_summary()
# ──────────────────────────────────────────────────────────────────
# TTSurrogate dims=4 n_points=15 eps=1e-3 max_rank=30
# Ranks : [1, 22, 25, 18, 1]
# TT memory : 91.4 KB Full-grid equivalent : 393.8 KB
# Compression: 4.3× Max rank : 25
# Build time : 10.2 ms Compress time : 2.1 ms
# ──────────────────────────────────────────────────────────────────
Monte Carlo Variance Reduction: Which Method When¶
All MC pricers default to n_paths=100_000. The table below shows the
variance reduction you get and when to use each method.
| Method | Variance reduction |  | Best for |
|---|---|---|---|
|  | ~2× | 50 K | Always-on default; free variance reduction |
|  | 10–100× | 1 K–10 K | Asian options (use geometric Asian as control) |
|  | O(1/N) vs O(1/√N) | 8 K (power of 2) | Near-ATM vanilla / mild path dependency |
|  | 10–1000× | 100–1 K | Deep OTM options (>3 sigma) |
|  | 2–10× | 10 K–50 K | General purpose when QMC is unavailable |
from tensorquantlib import (
asian_price_mc, # crude MC baseline
asian_price_cv, # with geometric-Asian control variate
bs_price_qmc, # Sobol QMC
)
# Compare methods at equal compute budget (n_paths=10_000)
price_crude, se_crude = asian_price_mc(
S=100, K=100, T=1.0, r=0.05, sigma=0.2,
n_paths=10_000, return_stderr=True
)
price_cv, se_cv = asian_price_cv(
S=100, K=100, T=1.0, r=0.05, sigma=0.2,
n_paths=10_000, return_stderr=True
)
print(f"Crude MC : {price_crude:.4f} ± {se_crude:.4f}")
print(f"Control V: {price_cv:.4f} ± {se_cv:.4f}")
# Typical:
# Crude MC : 5.7621 ± 0.0451 (large stderr)
# Control V: 5.7608 ± 0.0043 (10× tighter)
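For intuition on where a reduction in standard error comes from, the sketch below implements antithetic variates, one of the simplest variance-reduction techniques, for a plain European call using only the standard library. It is an illustration under Black-Scholes dynamics, not the library's implementation; the function names and the seed are arbitrary.

```python
import math
import random
import statistics

def call_payoff(z, S=100.0, K=100.0, T=1.0, r=0.05, sigma=0.2):
    """Discounted call payoff for one standard-normal draw z (terminal GBM value)."""
    s_t = S * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
    return math.exp(-r * T) * max(s_t - K, 0.0)

def crude_mc(n_draws, rng):
    """Plain Monte Carlo: n_draws independent payoffs."""
    samples = [call_payoff(rng.gauss(0.0, 1.0)) for _ in range(n_draws)]
    return statistics.fmean(samples), statistics.stdev(samples) / math.sqrt(n_draws)

def antithetic_mc(n_draws, rng):
    """Antithetic variates: average the payoffs of z and -z over n_draws / 2 pairs."""
    pairs = []
    for _ in range(n_draws // 2):
        z = rng.gauss(0.0, 1.0)
        pairs.append(0.5 * (call_payoff(z) + call_payoff(-z)))
    return statistics.fmean(pairs), statistics.stdev(pairs) / math.sqrt(len(pairs))

rng = random.Random(42)
price_crude, se_crude = crude_mc(40_000, rng)
price_anti, se_anti = antithetic_mc(40_000, rng)  # same number of payoff evaluations

print(f"Crude MC  : {price_crude:.4f} ± {se_crude:.4f}")
print(f"Antithetic: {price_anti:.4f} ± {se_anti:.4f}")
```

The gain comes from the negative correlation between the payoffs at z and -z; it is largest for payoffs that are monotone in the driving normal, such as this call.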
QMC: Always pass a power-of-2 n_paths to get proper Sobol balance:
# Good (balance maintained)
price = bs_price_qmc(S=100, K=100, T=1.0, r=0.05, sigma=0.2,
n_paths=65_536) # 2^16
# Avoid — triggers UserWarning and slight accuracy degradation
price = bs_price_qmc(..., n_paths=100_000)
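When a path budget comes from configuration rather than being hard-coded, it can be snapped to a power of two before it reaches the QMC pricer. The helpers below are a small standard-library sketch; their names are hypothetical and they are not part of TensorQuantLib.

```python
def nearest_powers_of_two(n):
    """Return the largest power of two <= n and the smallest power of two >= n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    lower = 1 << (n.bit_length() - 1)
    upper = lower if lower == n else lower << 1
    return lower, upper

def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0

lower, upper = nearest_powers_of_two(100_000)
print(lower, upper)              # 65536 131072
print(is_power_of_two(65_536))   # True
```

Rounding down keeps the run within the original compute budget; rounding up trades extra time for a lower error.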
Parallelism: Pricing a Large Portfolio¶
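TensorQuantLib has no internal parallelism by design. All functions are pure NumPy and are safe to call from multiple threads or processes simultaneously. Note that NumPy releases the GIL only inside its compiled array kernels; the Python-level code between those calls still holds it, so threads scale well only when each call spends most of its time in large array operations.
Thread-parallel strike chain (no I/O; the speedup depends on how much time is spent in GIL-releasing NumPy kernels):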
from concurrent.futures import ThreadPoolExecutor
import numpy as np
from tensorquantlib import heston_price, HestonParams
params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)
strikes = np.linspace(70, 130, 200)
def price_one(K):
    return heston_price(S=100, K=K, T=1.0, r=0.05, params=params)
with ThreadPoolExecutor(max_workers=8) as pool:
    prices = list(pool.map(price_one, strikes))
# Best case: 200 Heston prices × ~1 ms / 8 workers ≈ 25 ms end-to-end.
# Measure on your machine; GIL contention can erase much of the gain.
Process-parallel for calibration restarts (CPU-bound — use processes):
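from concurrent.futures import ProcessPoolExecutor
def calibrate_one(seed):
    # ... same as _one_restart() above; must return the optimizer result ...
    pass
# The __main__ guard is required on platforms that spawn workers (macOS, Windows)
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(calibrate_one, range(4)))
    best = min(results, key=lambda res: res.fun)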
Note
GPU / CUDA: TensorQuantLib is pure NumPy. There is no built-in GPU
support today, but pricing surfaces and TT-SVD operations are compatible
with cupy.ndarray as a drop-in replacement for numpy.ndarray in
most functions. Full CuPy integration is on the roadmap (see
Known Limitations).
Production Configuration Checklist¶
Use this checklist before deploying TensorQuantLib in a production trading or risk system:
Daily startup
# 1. Verify environment
python -m tensorquantlib --version
# 2. Run smoke tests
python -m pytest tests/ -q --tb=short -x -m "not slow"
Parameter selection
| Setting | Development / backtest | Production / real-time |
|---|---|---|
|  | 3–5 | 1 (with warm start) |
|  | 500 | 200 |
|  | 100 | 50–100 (50 gives < 0.1 bp error) |
|  | 100 K | N/A (use CF for live pricing) |
|  |  | 1e-3 |
|  | 30 (2-3D) | 15–20 (3D), 10–15 (4-5D) |
|  | None | 30 |
| MC | 100 K | 10 K + antithetic or QMC 8 K |
| MC | 100 K | 10 K (CV reduces SE 10×) |
| Heston MC |  | N/A (use |
Warm-start workflow (recommended)
import json, pathlib
from tensorquantlib.finance.heston import HestonCalibrator, HestonParams
PARAMS_FILE = pathlib.Path("heston_params.json")
def load_params() -> HestonParams:
    if PARAMS_FILE.exists():
        d = json.loads(PARAMS_FILE.read_text())
        return HestonParams(**d)
    return HestonParams()  # default
def save_params(p: HestonParams) -> None:
    PARAMS_FILE.write_text(json.dumps({
        "kappa": p.kappa, "theta": p.theta,
        "xi": p.xi, "rho": p.rho, "v0": p.v0,
    }))
# At market open
cal = HestonCalibrator(S=spot, r=risk_free_rate)
cal.params_ = load_params() # warm start
cal.fit(iv_surface, K_grid, T_grid,
n_restarts=1, maxiter=200) # ~1.5 s
save_params(cal.params_) # persist for tomorrow
Error handling in production
from tensorquantlib import implied_vol, heston_price, HestonParams
import numpy as np
def safe_heston_iv(S, K, T, r, params, fallback=np.nan):
    """Return Heston-implied vol; return fallback on any numerical failure."""
    try:
        price = heston_price(S, K, T, r, params)
        if not np.isfinite(price) or price <= 1e-10:
            return fallback
        return implied_vol(price, S, K, T, r)
    except Exception:
        return fallback
TT surrogate rebuild policy
Rebuild the surrogate when any of these change:
Heston / model parameters (after recalibration)
Risk-free rate > 5 bps shift
Spot price moves outside the surrogate’s domain ± 10 %
New expiry or strike enters the book
Otherwise, surr.evaluate stays accurate to within the build tolerance (eps) for the
existing parameter set.
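The rebuild policy can be encoded as a small check that runs before each pricing cycle. The sketch below is one possible reading, not library code: BuildSnapshot and rebuild_reasons are hypothetical names, "domain ± 10 %" is assumed to mean the built spot range widened by 10 % on each side, and a "new" strike or expiry is assumed to be one that falls outside the built axes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BuildSnapshot:
    """State recorded when the surrogate was built (hypothetical record type)."""
    model_params: tuple   # e.g. (kappa, theta, xi, rho, v0)
    rate: float           # risk-free rate used for the build
    spot_range: tuple     # (low, high) of the spot axis
    strike_range: tuple   # (low, high) of the strike axis
    expiry_range: tuple   # (low, high) of the expiry axis

def rebuild_reasons(snap, model_params, rate, spot, strikes, expiries,
                    rate_tol=0.0005, spot_margin=0.10):
    """Return the reasons to rebuild; an empty list means the surrogate can stay."""
    reasons = []
    if tuple(model_params) != snap.model_params:
        reasons.append("model parameters changed")
    if abs(rate - snap.rate) > rate_tol:  # 5 bps = 0.0005
        reasons.append("rate moved more than 5 bps")
    lo, hi = snap.spot_range
    # Assumption: "domain ± 10 %" means the built range widened by 10 % per side.
    if not lo * (1 - spot_margin) <= spot <= hi * (1 + spot_margin):
        reasons.append("spot left the surrogate domain")
    k_lo, k_hi = snap.strike_range
    t_lo, t_hi = snap.expiry_range
    # Assumption: a "new" strike or expiry is one outside the built axes.
    if any(not k_lo <= k <= k_hi for k in strikes) or \
       any(not t_lo <= t <= t_hi for t in expiries):
        reasons.append("strike or expiry outside the built axes")
    return reasons

params = (2.0, 0.04, 0.3, -0.7, 0.04)
snap = BuildSnapshot(params, 0.05, (80, 120), (85, 115), (0.25, 2.0))
print(rebuild_reasons(snap, params, 0.0503, 101.0, [90, 100], [0.5, 1.0]))
print(rebuild_reasons(snap, params, 0.0510, 140.0, [90, 100], [0.5, 3.0]))
```

Returning the reasons rather than a bare boolean makes the decision easy to log alongside the rebuild itself.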
Interpreting print_summary() Output¶
TTSurrogate dims=3 n_points=20 eps=1e-4
Ranks : [1, 15, 22, 1]
TT memory : 53.6 KB Full-grid equivalent : 640.0 KB
Compression: 11.9× Max rank : 22
Build time : 1.8 ms Compress time : 1.2 ms
Ranks: boundary ranks of each TT core. High middle ranks (> 40) with eps=1e-3 suggest a rough or discontinuous payoff; consider increasing n_points or switching to from_function (TT-Cross).
Compression < 1×: TT uses more memory than the dense grid. This means the surface is not low-rank. Try increasing eps or reducing n_points.
Build time >> Compress time: Most of the cost is in evaluating the pricing function on the grid. Use analytic approximations (from_basket_analytic) rather than MC (from_basket_mc) wherever possible.
Compress time >> Build time: The TT-SVD routine is dominant; consider reducing n_points.
Profiling Your Own Workflow¶
Quick wall-clock profiling with cProfile:
python -m cProfile -s cumtime -m tensorquantlib.__main__ heston \
--S 100 --K 100 --T 1.0 --r 0.05 2>&1 | head -30
Or instrument inline with time.perf_counter:
import time
from tensorquantlib import heston_price, HestonParams
params = HestonParams()
t0 = time.perf_counter()
for _ in range(1000):
    heston_price(S=100, K=100, T=1.0, r=0.05, params=params)
elapsed = time.perf_counter() - t0
print(f"{elapsed / 1000 * 1e3:.3f} ms per call")
# Typical: ~0.9 ms on M1 / ~1.2 ms on Intel Xeon
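A mean over 1 000 calls hides tail latency, which can matter more than the average in a real-time system. The sketch below times each call separately and reports the median and 99th percentile. The workload function is a stand-in to be replaced with the pricer under test, and latency_profile is a hypothetical helper rather than a library function.

```python
import math
import statistics
import time

def latency_profile(fn, n_calls=2000, warmup=100):
    """Time each call separately; return median, p99 and max latency in microseconds."""
    for _ in range(warmup):  # let caches and lazy imports settle first
        fn()
    samples = []
    for _ in range(n_calls):
        t0 = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - t0) / 1_000)  # ns -> µs
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99
    cuts = statistics.quantiles(samples, n=100)
    return {"p50_us": cuts[49], "p99_us": cuts[98], "max_us": max(samples)}

def workload():
    """Stand-in for a pricer call; replace with the function you want to measure."""
    return sum(math.exp(-0.5 * (i / 100.0) ** 2) for i in range(200))

profile = latency_profile(workload)
print({k: round(v, 2) for k, v in profile.items()})
```

Per-call timing adds a small fixed overhead from the two clock reads, so it suits calls in the microsecond-to-millisecond range better than nanosecond-scale ones.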
For memory-intensive workflows, use memory_profiler (install separately):
pip install memory-profiler
python -m memory_profiler my_surrogate_script.py