Performance & Production Guide

This guide is written for quant traders, risk engineers, and production system architects who need to know exactly how fast TensorQuantLib is, where the bottlenecks are, and how to tune every knob.

All timings were measured on Apple M1 (8-core, 8 GB RAM), Python 3.11, NumPy linked to Apple Accelerate. Expect comparable numbers on a modern x86-64 server with an AVX-512-enabled BLAS, but re-measure on your own hardware before relying on them.

At-a-Glance Latency Table

Function / Workflow	Single-call latency	Throughput	Notes
`black_scholes`	< 5 µs	> 200 K/s	Fully analytic, log+exp only
`bs_greeks`	< 5 µs	> 200 K/s	Analytic Delta/Gamma/Theta/Vega/Rho
`second_order_greeks`	~1 ms	~1 K/s	4-point finite-diff bump × 2 params
`implied_vol`	< 1 ms	~2 K/s	Brent solver, ~10 iterations
`heston_price` (CF)	~1 ms	~1 K/s	100-pt Gaussian quadrature
`heston_price_mc` (QE)	200–500 ms	2–5/s	100 K paths × 252 steps
`american_option_lsm`	100–500 ms	2–10/s	50 K paths × 100 steps
`asian_price_mc`	100–400 ms	2–10/s	50 K paths × 252 steps
`asian_price_cv`	50–200 ms	5–20/s	Control variate; 10× variance reduction
`barrier_price`	< 5 µs	> 200 K/s	Rubinstein-Reiner closed form
`HestonCalibrator.fit`	5–15 s	—	Default: 3 restarts, 500 iterations
`TTSurrogate` build (3D, n=15)	2 ms	—	One-time cost; paid at startup
`TTSurrogate.evaluate` (3D)	1.5 µs	650 K/s	After build; multi-linear interp on TT cores
`TTSurrogate.evaluate` (5D)	~5 µs	~200 K/s	After build

Choosing the Right Pricer

Need a price RIGHT NOW (< 10 µs)?
├── Vanilla (single asset)  →  black_scholes / bs_greeks
├── Barrier (single asset)  →  barrier_price
├── Vasicek / CIR bond      →  vasicek_bond_price / cir_bond_price
└── FX vanilla              →  garman_kohlhagen

Need something slightly slower but model-richer?
├── Heston (single price)   →  heston_price          (~1 ms)
├── Implied vol             →  implied_vol            (~1 ms)
├── SABR smile              →  sabr_implied_vol       (< 10 µs)
└── SVI surface             →  svi_implied_vol        (< 10 µs)

Monte Carlo is unavoidable?
├── American option         →  american_option_lsm   (100 ms)
├── Asian (with VR!)        →  asian_price_cv        (50 ms)
└── Heston stress-test      →  heston_price_mc       (200 ms)

Many strikes / expiries to price daily?
└── Build a TTSurrogate once at market open, then query at µs latency until a rebuild is needed.

Batch / Vectorised Pricing

Every scalar pricer accepts plain Python floats and works inside np.vectorize or a list comprehension with no code change.

Price a 1 000-strike chain in one shot:

import numpy as np
from tensorquantlib import black_scholes

strikes = np.linspace(80, 120, 1000)
S, T, r, sigma = 100.0, 1.0, 0.05, 0.2

# np.vectorize is a convenience wrapper around a Python-level loop, not a speedup:
# every element still costs one ~5 µs scalar call
bs_vec = np.vectorize(black_scholes)
prices = bs_vec(S, strikes, T, r, sigma, option_type="call")
# ~5 ms for 1 000 strikes — 200 K/s effective throughput
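
For intuition about why the analytic pricers sit in the microsecond range, here is a minimal standard-library sketch of a closed-form Black-Scholes pricer evaluated in a list comprehension. `bs_price_sketch` and `norm_cdf` are hypothetical stand-ins written for this example, not TensorQuantLib's `black_scholes`, and the argument order is an assumption.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_price_sketch(S, K, T, r, sigma, option_type="call"):
    """Closed-form Black-Scholes price of a European option (no dividends)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    if option_type == "call":
        return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)
    return K * math.exp(-r * T) * norm_cdf(-d2) - S * norm_cdf(-d1)

# Price a 1 000-strike chain with a plain list comprehension
strikes = [80.0 + 40.0 * i / 999 for i in range(1000)]
prices = [bs_price_sketch(100.0, K, 1.0, 0.05, 0.2) for K in strikes]
print(f"ATM call: {bs_price_sketch(100.0, 100.0, 1.0, 0.05, 0.2):.4f}")
```

Each price needs only one logarithm, one exponential and two error-function calls, which is why a Python-level loop over strikes stays cheap.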

Price a full Heston surface (5 strikes × 3 expiries) at once:

import numpy as np
from tensorquantlib import heston_price, HestonParams

params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)
K_grid = np.array([90., 95., 100., 105., 110.])
T_grid = np.array([0.5, 1.0, 2.0])
S, r = 100.0, 0.05

# 15 calls × ~1 ms each ≈ 15 ms
surface = np.array([
    [heston_price(S, K, T, r, params) for T in T_grid]
    for K in K_grid
])   # shape (5, 3)

Even faster: TTSurrogate replaces the surface loop entirely:

from tensorquantlib import TTSurrogate, heston_price, HestonParams
import numpy as np

params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)

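# Build once at startup. Each sampled grid point costs one ~1 ms heston_price call,
# so build time here is dominated by function evaluations rather than the ~2 ms
# TT build listed in the latency table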
surr = TTSurrogate.from_function(
    fn=lambda s, k, t: heston_price(s, k, t, 0.05, params),
    axes=[
        np.linspace(80, 120, 15),   # S axis
        np.linspace(85, 115, 15),   # K axis
        np.linspace(0.25, 2.0, 15), # T axis
    ],
    eps=1e-3,
    max_rank=30,
)

# Now price 10 000 spot/strike/expiry combos in < 15 ms
pts = np.column_stack([
    np.random.uniform(80, 120, 10_000),
    np.random.uniform(85, 115, 10_000),
    np.random.uniform(0.25, 2.0, 10_000),
])
prices = surr.evaluate(pts)   # vectorised; ~1.5 µs/call = 650 K/s

TT-Rank and `n_points` Tuning

These are the two levers that control the accuracy/speed trade-off in every TTSurrogate.

Grid resolution: `n_points`

n_points is the number of grid points per axis. The full tensor has n_points ** d entries, which grows exponentially in d; the TT format stores only about d × n_points × r² numbers for TT rank r.

Dimensions (d)	Recommended `n_points`	Full grid size	Build time (approx)	Notes
2–3	20–30	400–27 K entries	< 5 ms	Dense enough for smooth Heston/BS surfaces
4–5	15–20	50 K–3 M entries	10–100 ms (use TT-Cross)	TT-Cross recommended; skip full grid construction
6+	10–15	1 M+ (never built)	N/A (TT-Cross only)	`TTSurrogate.from_function` with `max_rank=20`

Tip

For 6+ dimensional baskets use TTSurrogate.from_function. It calls the pricing function at only d × r² × n ≈ 6 × 400 × 15 = 36,000 carefully chosen points instead of the 15^6 = 11 M full grid — a 300× speedup in build time.
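
The arithmetic in this tip can be reproduced with a few lines of Python. The helper names below are hypothetical, and the d × r² × n count is the rough estimate quoted above; it ignores any multiplier from repeated sweeps.

```python
def full_grid_evals(n_points: int, d: int) -> int:
    """Function evaluations needed to fill the dense grid."""
    return n_points ** d

def tt_cross_evals(d: int, rank: int, n_points: int) -> int:
    """Rough TT-Cross evaluation count, d * r^2 * n."""
    return d * rank**2 * n_points

d, rank, n_points = 6, 20, 15
cross = tt_cross_evals(d, rank, n_points)
full = full_grid_evals(n_points, d)

print(f"TT-Cross evaluations : {cross:,}")        # 36,000
print(f"Full-grid evaluations: {full:,}")         # 11,390,625
print(f"Ratio                : {full / cross:.0f}x")  # 316x
```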

SVD tolerance: `eps`

eps is the relative Frobenius error allowed in TT-SVD. Tighter eps means higher ranks and more memory.

`eps`	Max rank (3D)	TT memory	Compression	When to use
`1e-1`	2–3	~2 KB	100×	Coarse smile shape only; not for pricing
`1e-2`	8	~17 KB	13×	Pre-trade screening / large portfolio VaR
`1e-3` ✅	23	~124 KB	1.7×	Production sweet spot — 42× compression at 5D, < 0.1% error
`1e-4`	30	~225 KB	~1×	High-fidelity validation; matches CF to within the 1e-4 relative tolerance
`1e-6`	Full rank	= full grid	1×	No benefit; use direct evaluation instead

# Production-quality 3-asset surrogate
surr = TTSurrogate.from_basket_analytic(
    S0_ranges=[(80, 120)] * 3,
    K=100, T=1.0, r=0.05,
    sigma=[0.2, 0.25, 0.3],
    weights=[1/3, 1/3, 1/3],
    n_points=20,    # ← sweet spot for 3D
    eps=1e-3,       # ← production tolerance
    max_rank=30,    # ← safety cap; rarely reached at eps=1e-3
)
surr.print_summary()
# For reference, the eps table figures (max_rank ≈ 23, ≈ 124 KB, ≈ 1.7×) imply a
# 30-point grid (30^3 × 8 B ≈ 211 KB); expect a smaller footprint at n_points=20

Hard-capping ranks: `max_rank`

Setting max_rank prevents runaway memory on poorly-conditioned surfaces (e.g., near-digital payoffs or very short maturities). Recommended values:

max_rank=20: fast / light; suitable for pre-trade screening
max_rank=30: production default
max_rank=50: near-exact; use for EOD risk reports
max_rank=None: unbounded; only for validation
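
To see how the cap bounds memory, the sketch below computes a worst-case TT footprint when every internal rank hits max_rank. It assumes float64 storage (8 bytes per entry) and cores of shape (r_{k-1}, n_points, r_k) with boundary ranks of 1; `tt_storage_upper_bound` is a hypothetical helper, not part of the library.

```python
def tt_storage_upper_bound(d: int, n_points: int, max_rank: int,
                           bytes_per_entry: int = 8) -> int:
    """Bytes needed if every internal TT rank equals max_rank."""
    ranks = [1] + [max_rank] * (d - 1) + [1]
    # Core k holds ranks[k] * n_points * ranks[k + 1] numbers
    entries = sum(ranks[k] * n_points * ranks[k + 1] for k in range(d))
    return entries * bytes_per_entry

for cap in (20, 30, 50):
    kb = tt_storage_upper_bound(d=5, n_points=15, max_rank=cap) / 1024
    print(f"max_rank={cap}: at most {kb:.0f} KB")
```

Actual surrogates usually come in well below this bound because most ranks stay under the cap.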

`n_sweeps` in TT-Cross

When using TTSurrogate.from_function (TT-Cross), n_sweeps controls how many alternating left/right passes are made over the dimension chain.

n_sweeps=4: sufficient for smooth Black-Scholes surfaces
n_sweeps=6: default; good for Heston and jump-diffusion
n_sweeps=8: for stiff or highly correlated surfaces

Heston Calibration: From 10 s to < 2 s

The default HestonCalibrator.fit with n_restarts=3, maxiter=500 takes 5–15 seconds because it runs hundreds of L-BFGS-B optimizer steps, and each step calls heston_price (1 ms) for every (K, T) pair in the market surface.

Breakdown for a 5 × 3 surface (15 options):

~1 ms per heston_price call (100-pt CF quadrature)
~15 ms per objective evaluation (15 heston_price + 15 implied_vol)
~200 objective evals per L-BFGS-B run
× 3 restarts = ~200 × 15 ms × 3 = 9 s
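
The breakdown above is a simple product, which makes it easy to see which knob buys what. A minimal sketch follows; `calibration_seconds` is a hypothetical helper, and like the breakdown it counts only the heston_price cost per objective evaluation.

```python
def calibration_seconds(n_options: int, price_ms: float,
                        evals_per_run: int, n_restarts: int) -> float:
    """Estimated wall-clock seconds for a multi-restart calibration."""
    objective_ms = n_options * price_ms          # one objective evaluation
    return objective_ms * evals_per_run * n_restarts / 1000.0

baseline = calibration_seconds(n_options=15, price_ms=1.0,
                               evals_per_run=200, n_restarts=3)
half_cost = calibration_seconds(n_options=15, price_ms=0.5,
                                evals_per_run=200, n_restarts=3)
print(f"baseline: {baseline:.1f} s, halved pricing cost: {half_cost:.1f} s")
# baseline: 9.0 s, halved pricing cost: 4.5 s
```

Because every factor enters linearly, cutting restarts, objective evaluations, surface points or per-price cost each scales the total by the same proportion.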

Speedup strategy 1 — Single restart with warm start

In production you re-calibrate daily (or tick-by-tick). Yesterday’s parameters are an excellent starting point:

import numpy as np
from tensorquantlib.finance.heston import HestonCalibrator, HestonParams

# Day T-1 calibrated params (persisted to disk / Redis)
prev_params = HestonParams(kappa=2.1, theta=0.042, xi=0.31, rho=-0.68, v0=0.038)

cal = HestonCalibrator(S=100.0, r=0.05)
cal.params_ = prev_params   # warm start

K_grid = np.array([90., 95., 100., 105., 110.])
T_grid = np.array([0.5, 1.0, 2.0])
iv_mkt = np.full((5, 3), 0.20)   # replace with real market IVs

# 1 restart instead of 3 → 3× faster
# 200 maxiter instead of 500 → 2.5× faster
cal.fit(iv_mkt, K_grid, T_grid, n_restarts=1, maxiter=200)
print(f"Calibrated RMSE: {cal.rmse_:.6f}")
# Typical wall-clock: ~1.5 s

Speedup strategy 2 — Parallel restarts with joblib

Run multiple random-start optimisations simultaneously; keep the best:

import numpy as np
from scipy.optimize import minimize
from joblib import Parallel, delayed
from tensorquantlib import heston_price, implied_vol, HestonParams

S, r = 100.0, 0.05
K_grid = np.array([90., 95., 100., 105., 110.])
T_grid = np.array([0.5, 1.0, 2.0])
iv_mkt = np.full((5, 3), 0.20)

BOUNDS = [(0.1, 10), (0.001, 0.5), (0.1, 2.0), (-0.99, 0.99), (0.001, 0.5)]

def _objective(x):
    params = HestonParams(*x)
    err = 0.0
    for i, K in enumerate(K_grid):
        for j, T in enumerate(T_grid):
            try:
                price = heston_price(S, K, T, r, params)
                iv_m = implied_vol(price, S, K, T, r)
                err += (iv_m - iv_mkt[i, j]) ** 2
            except Exception:
                err += 1.0
    return err

def _one_restart(seed):
    rng = np.random.default_rng(seed)
    x0 = np.array([rng.uniform(lo, hi) for lo, hi in BOUNDS])
    return minimize(_objective, x0, method="L-BFGS-B",
                    bounds=BOUNDS, options={"maxiter": 200})

# 4 parallel restarts — wall-clock ≈ 1 restart time (not 4×)
results = Parallel(n_jobs=4)(delayed(_one_restart)(s) for s in range(4))
best = min(results, key=lambda r: r.fun)
params = HestonParams(*best.x)
print(f"Best RMSE: {np.sqrt(best.fun / (len(K_grid)*len(T_grid))):.6f}")

Speedup strategy 3 — Reduce the CF integration cost

heston_price uses n_points=100 Gaussian quadrature nodes by default. For calibration purposes 50 nodes are usually accurate enough (< 0.1 bp IV error in the summary table) and roughly twice as fast:

from functools import partial
from tensorquantlib.finance.heston import heston_price as _hp

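# Bind n_points=50 with functools.partial; heston_price itself is left unmodified
fast_heston = partial(_hp, n_points=50)

# Use fast_heston inside your objective instead of heston_price
# (params is a HestonParams instance, e.g. the previous day's calibration)
price = fast_heston(S=100, K=100, T=1.0, r=0.05, params=params)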

Speedup strategy 4 — Coarsen the IV grid

Calibrate to fewer points; check residual on the full grid:

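# Coarse calibration grid: 4 strikes × 3 expiries = 12 evaluations/step
# (iv_mkt_coarse, iv_mkt_full, K_grid_full and T_grid_full are assumed to be
#  loaded from market data; cal, S and r come from the earlier examples)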
K_calib = np.array([90., 97., 103., 110.])
T_calib = np.array([0.5, 1.0, 2.0])

cal.fit(iv_mkt_coarse, K_calib, T_calib, n_restarts=1, maxiter=200)

# Validate RMSE on full 10×5 grid
iv_model = np.array([
    [implied_vol(heston_price(S, K, T, r, cal.params_), S, K, T, r)
     for T in T_grid_full]
    for K in K_grid_full
])
print("Full-grid RMSE:", np.sqrt(np.mean((iv_model - iv_mkt_full)**2)))

Summary of speedups:

Strategy	Wall-clock	Speedup	Accuracy impact
Default (n_restarts=3, maxiter=500)	~10 s	1×	Baseline
Single restart + warm start	~1.5 s	~7×	Negligible if params are stable
Parallel restarts (4 cores) + warm start	~1.0 s	~10×	As good as 4 restarts
n_points=50 in CF	~5 s	~2×	< 0.1 bp IV error
All combined	< 0.5 s	~20×	< 0.5 bps RMSE

Memory Profiling

Use Python’s built-in tracemalloc to measure memory allocation of any pricing or surrogate workflow:

import tracemalloc
import numpy as np
from tensorquantlib import TTSurrogate

tracemalloc.start()

surr = TTSurrogate.from_basket_analytic(
    S0_ranges=[(80, 120)] * 4,
    K=100, T=1.0, r=0.05,
    sigma=[0.2, 0.22, 0.25, 0.28],
    weights=[0.25] * 4,
    n_points=15,
    eps=1e-3,
)

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Current: {current / 1024:.1f} KB")
print(f"Peak:    {peak / 1024:.1f} KB")
# Typical output for 4-asset eps=1e-3:
#   Current: 91.4 KB
#   Peak:    ~500 KB  (includes grid construction + SVD temporaries)

Expected memory footprint (eps=1e-3, n_points=15):

Assets (d)	Full grid	TT size	Peak alloc	Notes
2	< 0.01 MB	3 KB	~50 KB	Negligible
3	0.03 MB	28 KB	~200 KB	Fits in L1 cache
4	0.39 MB	91 KB	~500 KB	4× compression
5	5.79 MB	142 KB	~2 MB	42× compression; TT wins decisively
6	92 MB	~200 KB	~4 MB	Full grid never materialised (TT-Cross)
10	~4.6 TB	~400 KB	~8 MB	Impossible without TT
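
The full-grid column follows directly from n_points ** d entries. The sketch below assumes float64 storage (8 bytes per entry) and reports binary units; both helper names are hypothetical.

```python
def full_grid_bytes(n_points: int, d: int, bytes_per_entry: int = 8) -> int:
    """Size of the dense price tensor in bytes."""
    return n_points**d * bytes_per_entry

def human(nbytes: float) -> str:
    """Format a byte count using binary units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if nbytes < 1024 or unit == "TiB":
            return f"{nbytes:.2f} {unit}"
        nbytes /= 1024

for d in (3, 5, 10):
    print(f"d={d:2d}: {human(full_grid_bytes(15, d))}")
```

At d=10 the dense tensor runs to several terabytes, while a TT representation of moderate rank stays in the hundreds of kilobytes.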

The TT surrogate’s print_summary() method reports memory automatically:

surr.print_summary()
# ──────────────────────────────────────────────────────────────────
# TTSurrogate  dims=4  n_points=15  eps=1e-3  max_rank=30
# Ranks      : [1, 22, 25, 18, 1]
# TT memory  : 91.4 KB        Full-grid equivalent : 393.8 KB
# Compression: 4.3×           Max rank : 25
# Build time : 10.2 ms        Compress time : 2.1 ms
# ──────────────────────────────────────────────────────────────────

Monte Carlo Variance Reduction: Which Method When

All MC pricers default to n_paths=100_000. The table below shows the variance reduction you get and when to use each method.

Method	Variance reduction	`n_paths` to match crude MC 100 K	Best for
`bs_price_antithetic`	~2×	50 K	Always-on default; free variance reduction
`bs_price_cv`	10–100×	1 K–10 K	Asian options (use geometric Asian as control)
`bs_price_qmc`	O(1/N) vs O(1/√N)	8 K (power of 2)	Near-ATM vanilla / mild path dependency
`bs_price_importance`	10–1000×	100–10 K	Deep OTM options (>3 sigma)
`bs_price_stratified`	2–10×	10 K–50 K	General purpose when QMC is unavailable

from tensorquantlib import (
    asian_price_mc,   # crude MC baseline
    asian_price_cv,   # with geometric-Asian control variate
    bs_price_qmc,     # Sobol QMC
)

# Compare methods at equal compute budget (n_paths=10_000)
price_crude, se_crude = asian_price_mc(
    S=100, K=100, T=1.0, r=0.05, sigma=0.2,
    n_paths=10_000, return_stderr=True
)
price_cv, se_cv = asian_price_cv(
    S=100, K=100, T=1.0, r=0.05, sigma=0.2,
    n_paths=10_000, return_stderr=True
)

print(f"Crude MC : {price_crude:.4f} ± {se_crude:.4f}")
print(f"Control V: {price_cv:.4f} ± {se_cv:.4f}")
# Typical:
#   Crude MC : 5.7621 ± 0.0451   (large stderr)
#   Control V: 5.7608 ± 0.0043   (10× smaller stderr, i.e. ~100× lower variance)
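
The antithetic row of the table can be illustrated without the library. The sketch below prices a European call under Black-Scholes dynamics by crude Monte Carlo and by antithetic pairs at the same budget of payoff evaluations. All function names are hypothetical, and this is a standard-library illustration of the technique rather than the implementation behind `bs_price_antithetic`.

```python
import math
import random
import statistics

def discounted_call_payoff(z, S=100.0, K=100.0, T=1.0, r=0.05, sigma=0.2):
    """Discounted payoff of a European call for one standard-normal draw z."""
    ST = S * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
    return math.exp(-r * T) * max(ST - K, 0.0)

def crude_mc(n_paths, seed=0):
    """Plain Monte Carlo: one independent draw per payoff."""
    rng = random.Random(seed)
    samples = [discounted_call_payoff(rng.gauss(0.0, 1.0)) for _ in range(n_paths)]
    return statistics.fmean(samples), statistics.stdev(samples) / math.sqrt(n_paths)

def antithetic_mc(n_pairs, seed=0):
    """Antithetic variates: average each draw with its mirror image."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_pairs):
        z = rng.gauss(0.0, 1.0)
        samples.append(0.5 * (discounted_call_payoff(z) + discounted_call_payoff(-z)))
    return statistics.fmean(samples), statistics.stdev(samples) / math.sqrt(n_pairs)

# Equal budget: 20 000 payoff evaluations each
price_crude, se_crude = crude_mc(20_000)
price_anti, se_anti = antithetic_mc(10_000)
print(f"Crude MC  : {price_crude:.4f} ± {se_crude:.4f}")
print(f"Antithetic: {price_anti:.4f} ± {se_anti:.4f}")
```

The standard error of the antithetic estimate is computed from the pair averages, since the two payoffs within a pair are not independent.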

QMC: Always pass a power-of-2 n_paths to get proper Sobol balance:

# Good (balance maintained)
price = bs_price_qmc(S=100, K=100, T=1.0, r=0.05, sigma=0.2,
                     n_paths=65_536)   # 2^16

# Avoid: a non-power-of-2 count triggers a UserWarning and slight accuracy degradation
price = bs_price_qmc(S=100, K=100, T=1.0, r=0.05, sigma=0.2,
                     n_paths=100_000)
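
If path counts come from configuration, a small helper can snap them to a power of two before they reach the QMC pricer. `sobol_friendly_n_paths` is a hypothetical name used only for this sketch.

```python
def sobol_friendly_n_paths(requested: int, round_up: bool = True) -> int:
    """Nearest power of two at or above (or at or below) the requested count."""
    if requested < 1:
        raise ValueError("requested must be positive")
    upper = 1 << (requested - 1).bit_length()
    if round_up or upper == requested:
        return upper
    return upper >> 1

print(sobol_friendly_n_paths(100_000))                  # 131072 (2^17)
print(sobol_friendly_n_paths(100_000, round_up=False))  # 65536  (2^16)
```

Rounding down keeps the compute budget below the requested figure; rounding up keeps the accuracy at or above it.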

Parallelism: Pricing a Large Portfolio

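TensorQuantLib has no internal parallelism by design: all functions are pure NumPy, and they are safe to call from multiple threads or processes simultaneously. Note that NumPy releases the GIL only inside its compiled array kernels, so thread pools scale best when each call spends most of its time in large array operations; for scalar, Python-heavy calls, prefer processes.

Thread-parallel strike chain (speedup depends on how much time each call spends inside GIL-releasing NumPy kernels):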

from concurrent.futures import ThreadPoolExecutor
import numpy as np
from tensorquantlib import heston_price, HestonParams

params = HestonParams(kappa=2.0, theta=0.04, xi=0.3, rho=-0.7, v0=0.04)
strikes = np.linspace(70, 130, 200)

def price_one(K):
    return heston_price(S=100, K=K, T=1.0, r=0.05, params=params)

with ThreadPoolExecutor(max_workers=8) as pool:
    prices = list(pool.map(price_one, strikes))
# Best case with perfect scaling: 200 Heston prices × ~1 ms / 8 workers ≈ 25 ms end-to-end
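
Whether a thread pool actually scales depends on how much of each call runs outside the interpreter lock, so it is worth measuring before committing to a design. The sketch below times a serial loop against a thread pool for any single-argument function. `cpu_bound_stand_in` is a hypothetical pure-Python workload used here in place of a real pricer.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound_stand_in(K: float) -> float:
    """Hypothetical stand-in for a scalar pricer: pure-Python arithmetic."""
    total = 0.0
    for i in range(1, 2_000):
        total += math.exp(-i / K) / i
    return total

def compare_serial_and_threads(fn, args, max_workers=8):
    """Return (serial seconds, threaded seconds, results identical?)."""
    t0 = time.perf_counter()
    serial = [fn(a) for a in args]
    t_serial = time.perf_counter() - t0

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        threaded = list(pool.map(fn, args))   # pool.map preserves input order
    t_threads = time.perf_counter() - t0
    return t_serial, t_threads, serial == threaded

strikes = [70.0 + 0.3 * i for i in range(200)]
t_serial, t_threads, same = compare_serial_and_threads(cpu_bound_stand_in, strikes)
print(f"serial {t_serial * 1e3:.1f} ms, threads {t_threads * 1e3:.1f} ms, identical={same}")
```

Swapping the stand-in for a real pricing function shows directly whether threads help on a given workload or whether a process pool is the better fit.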

Process-parallel for calibration restarts (CPU-bound — use processes):

from concurrent.futures import ProcessPoolExecutor

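def calibrate_one(seed):
    # Delegates to _one_restart() from the joblib example above, which must
    # be defined at module level so worker processes can import it
    return _one_restart(seed)

if __name__ == "__main__":   # required where workers are spawned (macOS, Windows)
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(calibrate_one, range(4)))
    best = min(results, key=lambda res: res.fun)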

Note

GPU / CUDA: TensorQuantLib is pure NumPy. There is no built-in GPU support today, but pricing surfaces and TT-SVD operations are compatible with cupy.ndarray as a drop-in replacement for numpy.ndarray in most functions. Full CuPy integration is on the roadmap (see Known Limitations).

Production Configuration Checklist

Use this checklist before deploying TensorQuantLib in a production trading or risk system:

Daily startup

# 1. Verify environment
python -m tensorquantlib --version

# 2. Run smoke tests
python -m pytest tests/ -q --tb=short -x -m "not slow"

Parameter selection

Setting	Development / backtest	Production / real-time
`HestonCalibrator n_restarts`	3–5	1 (with warm start)
`HestonCalibrator maxiter`	500	200
`heston_price n_points`	100	50–100 (50 gives < 0.1 bp error)
`heston_price_mc n_paths`	100 K	N/A (use CF for live pricing)
`TTSurrogate eps`	`1e-4`	`1e-3`
`TTSurrogate n_points`	30 (2-3D)	15–20 (3D), 10–15 (4-5D)
`TTSurrogate max_rank`	None	30
MC `n_paths` (vanilla)	100 K	10 K + antithetic or QMC 8 K
MC `n_paths` (Asian with CV)	100 K	10 K (CV reduces SE 10×)
Heston MC `scheme`	`"qe"`	N/A (use `heston_price` CF instead)

Warm-start workflow (recommended)

import json, pathlib
from tensorquantlib.finance.heston import HestonCalibrator, HestonParams

PARAMS_FILE = pathlib.Path("heston_params.json")

def load_params() -> HestonParams:
    if PARAMS_FILE.exists():
        d = json.loads(PARAMS_FILE.read_text())
        return HestonParams(**d)
    return HestonParams()   # default

def save_params(p: HestonParams) -> None:
    PARAMS_FILE.write_text(json.dumps({
        "kappa": p.kappa, "theta": p.theta,
        "xi": p.xi, "rho": p.rho, "v0": p.v0,
    }))

# At market open
cal = HestonCalibrator(S=spot, r=risk_free_rate)
cal.params_ = load_params()          # warm start
cal.fit(iv_surface, K_grid, T_grid,
        n_restarts=1, maxiter=200)   # ~1.5 s
save_params(cal.params_)             # persist for tomorrow

Error handling in production

from tensorquantlib import implied_vol, heston_price, HestonParams
import numpy as np

def safe_heston_iv(S, K, T, r, params, fallback=np.nan):
    """Return Heston-implied vol; return fallback on any numerical failure."""
    try:
        price = heston_price(S, K, T, r, params)
        if not np.isfinite(price) or price <= 1e-10:
            return fallback
        return implied_vol(price, S, K, T, r)
    except Exception:
        return fallback

TT surrogate rebuild policy

Rebuild the surrogate when any of these change:

Heston / model parameters (after recalibration)
Risk-free rate shifts by more than 5 bps
Spot price moves outside the surrogate’s domain ± 10 %
New expiry or strike enters the book

Otherwise, surr.evaluate remains accurate to within the build tolerance (eps) for the existing parameter set.
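
The rebuild triggers above can be collected into a single check that runs before each pricing batch. This is a sketch under stated assumptions: `SurrogateState` and `needs_rebuild` are hypothetical names, the 10 % figure is read as a safety margin inside the spot-axis edges (one possible reading), and a "new" strike or expiry is taken to mean one that falls outside the built axes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SurrogateState:
    """Snapshot of the inputs a surrogate was built with (hypothetical container)."""
    model_params: tuple    # e.g. (kappa, theta, xi, rho, v0)
    rate: float
    spot_domain: tuple     # (low, high) of the S axis
    strike_domain: tuple   # (low, high) of the K axis
    expiry_domain: tuple   # (low, high) of the T axis

def needs_rebuild(state, model_params, rate, spot, strikes, expiries,
                  rate_tol=5e-4, spot_margin=0.10):
    """Return the reasons to rebuild; an empty list means keep the surrogate."""
    reasons = []
    if tuple(model_params) != state.model_params:
        reasons.append("model parameters changed")
    if abs(rate - state.rate) > rate_tol:           # 5 bps = 0.0005
        reasons.append("risk-free rate shifted")
    lo, hi = state.spot_domain
    pad = spot_margin * (hi - lo)                   # assumed safety margin
    if not (lo + pad <= spot <= hi - pad):
        reasons.append("spot near or outside domain edge")
    k_lo, k_hi = state.strike_domain
    t_lo, t_hi = state.expiry_domain
    if (any(not (k_lo <= k <= k_hi) for k in strikes)
            or any(not (t_lo <= t <= t_hi) for t in expiries)):
        reasons.append("strike or expiry outside grid")
    return reasons

state = SurrogateState((2.0, 0.04, 0.3, -0.7, 0.04), 0.05,
                       (80.0, 120.0), (85.0, 115.0), (0.25, 2.0))
print(needs_rebuild(state, (2.0, 0.04, 0.3, -0.7, 0.04), 0.0503,
                    spot=101.0, strikes=[90.0, 110.0], expiries=[0.5, 1.0]))
print(needs_rebuild(state, (2.0, 0.04, 0.3, -0.7, 0.04), 0.0506,
                    spot=119.0, strikes=[90.0], expiries=[3.0]))
```

Returning the reasons rather than a bare boolean makes the decision easy to log alongside each rebuild.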

Interpreting `print_summary()` Output

TTSurrogate  dims=3  n_points=20  eps=1e-4
Ranks      :  [1, 15, 22, 1]
TT memory  :  53.6 KB        Full-grid equivalent : 640.0 KB
Compression:  11.9×          Max rank : 22
Build time :  1.8 ms         Compress time : 1.2 ms

Ranks: boundary ranks of each TT core. High middle ranks (> 40) with eps=1e-3 suggest a rough or discontinuous payoff — consider increasing n_points or switching to from_function (TT-Cross).
Compression < 1×: TT uses more memory than the dense grid. This means the surface is not low-rank. Try increasing eps or reducing n_points.
Build time >> Compress time: Most of the cost is in evaluating the pricing function on the grid. Use analytic approximations (from_basket_analytic) rather than MC (from_basket_mc) wherever possible.
Compress time >> Build time: The TT-SVD routine is dominant; consider reducing n_points.
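
These rules of thumb can be applied mechanically to the numbers in a summary. In the sketch below, `diagnose_summary` is a hypothetical helper that takes the values as plain arguments rather than parsing the printed text, and the 5× threshold standing in for ">>" is an assumption.

```python
def diagnose_summary(tt_kb, full_grid_kb, build_ms, compress_ms,
                     max_rank, eps, dominance=5.0):
    """Return (compression ratio, list of hints) for one surrogate summary."""
    hints = []
    compression = full_grid_kb / tt_kb
    if compression < 1.0:
        hints.append("not low-rank: try a larger eps or fewer n_points")
    if max_rank > 40 and eps >= 1e-3:
        hints.append("high ranks: rough payoff, try more n_points or TT-Cross")
    if build_ms > dominance * compress_ms:
        hints.append("pricing function dominates: prefer analytic builders")
    elif compress_ms > dominance * build_ms:
        hints.append("TT-SVD dominates: consider fewer n_points")
    return compression, hints

compression, hints = diagnose_summary(tt_kb=53.6, full_grid_kb=640.0,
                                      build_ms=1.8, compress_ms=1.2,
                                      max_rank=22, eps=1e-4)
print(f"{compression:.1f}x", hints)   # 11.9x []
```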

Profiling Your Own Workflow

Quick wall-clock profiling with cProfile:

python -m cProfile -s cumtime -m tensorquantlib.__main__ heston \
    --S 100 --K 100 --T 1.0 --r 0.05 2>&1 | head -30

Or instrument inline with time.perf_counter:

import time
from tensorquantlib import heston_price, HestonParams

params = HestonParams()
t0 = time.perf_counter()
for _ in range(1000):
    heston_price(S=100, K=100, T=1.0, r=0.05, params=params)
elapsed = time.perf_counter() - t0
print(f"{elapsed / 1000 * 1e3:.3f} ms per call")
# Typical: ~0.9 ms on M1 / ~1.2 ms on Intel Xeon
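
A single timing loop is sensitive to background load. One common refinement is to repeat the loop with the standard-library `timeit` module and keep the best run. In this sketch, `workload` is a hypothetical stand-in for a pricing call and `best_time_per_call` is a helper name chosen for the example.

```python
import math
import timeit

def workload():
    """Hypothetical stand-in for a pricing call."""
    return math.exp(-0.05) * math.log(1.25)

def best_time_per_call(fn, number=10_000, repeat=5):
    """Best-of-`repeat` wall-clock seconds per call."""
    totals = timeit.repeat(fn, number=number, repeat=repeat)
    return min(totals) / number

per_call = best_time_per_call(workload)
print(f"{per_call * 1e6:.3f} µs per call (best of 5 runs)")
```

Taking the minimum rather than the mean filters out runs that were slowed by other processes.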

For memory-intensive workflows, use memory_profiler (install separately):

pip install memory-profiler
python -m memory_profiler my_surrogate_script.py
