---
name: timeseries-optimization
description: >
  High-performance timeseries processing with NumPy and Polars for
  financial data. Apply when working with OHLCV arrays, timestamp
  lookups, gap detection, or any array/dataframe operations in piker.
user-invocable: false
---

# Timeseries Optimization: NumPy & Polars

Skill for high-performance timeseries processing using NumPy and Polars, with a focus on patterns common in financial/trading applications.

## Core Principle: Vectorization Over Iteration

**Never write Python loops over large arrays.** Always look for vectorized alternatives.

```python
# BAD: Python loop (slow!)
results = []
for i in range(len(array)):
    if array['time'][i] == target_time:
        results.append(array[i])

# GOOD: vectorized boolean indexing (fast!)
results = array[array['time'] == target_time]
```

## Timestamp Lookup Patterns

The most critical optimization in piker timeseries code. Choose the right lookup strategy:

### Linear Scan (O(n)) - Avoid!

```python
# BAD: O(n) scan through entire array
for target_ts in timestamps:  # m iterations
    matches = array[array['time'] == target_ts]
# Total: O(m * n) - catastrophic!
```

**Performance:**
- 1000 lookups x 10k array = 10M comparisons
- Timing: ~50-100ms for 1k lookups

### Binary Search (O(log n)) - Good!
```python
# GOOD: O(m log n) using searchsorted
import numpy as np

time_arr = array['time']  # extract once
ts_array = np.array(timestamps)

# binary search for all timestamps at once
indices = np.searchsorted(time_arr, ts_array)

# bounds check and exact match verification;
# clip before indexing so a "not found" index of
# len(array) can't raise IndexError
safe = np.clip(indices, 0, len(array) - 1)
valid_mask = (
    (indices < len(array))
    & (time_arr[safe] == ts_array)
)
valid_indices = indices[valid_mask]
matched_rows = array[valid_indices]
```

**Requirements for `searchsorted()`:**
- Input array MUST be sorted (ascending)
- Works on any sortable dtype (floats, ints)
- Returns insertion indices (not found = `len(array)`)

**Performance:**
- 1000 lookups x 10k array = ~13k comparisons (1000 x log2(10k))
- Timing: <1ms for 1k lookups
- **~100-1000x faster than linear scan**

### Hash Table (O(1)) - Best for Repeated Lookups!

If you'll do many lookups on the same array, build a dict once:

```python
# build lookup once
time_to_idx = {
    float(array['time'][i]): i
    for i in range(len(array))
}

# O(1) lookups
for target_ts in timestamps:
    idx = time_to_idx.get(target_ts)
    if idx is not None:
        row = array[idx]
```

**When to use:**
- Many repeated lookups on same array
- Array doesn't change between lookups
- Can afford upfront dict building cost

## Performance Checklist

When optimizing timeseries operations:

- [ ] Is the array sorted? (enables binary search)
- [ ] Are you doing repeated lookups? (build hash table)
- [ ] Are struct fields accessed in loops? (extract to plain arrays)
- [ ] Are you using boolean indexing? (vectorized vs loop)
- [ ] Can operations be batched? (minimize round-trips)
- [ ] Is memory being copied unnecessarily? (use views)
- [ ] Are you using the right tool?
  (NumPy vs Polars)

## Common Bottlenecks and Fixes

### Bottleneck: Timestamp Lookups

```python
# BEFORE: O(n*m) - 100ms for 1k lookups
for ts in timestamps:
    matches = array[array['time'] == ts]

# AFTER: O(m log n) - <1ms for 1k lookups
indices = np.searchsorted(
    array['time'],
    timestamps,
)
```

### Bottleneck: Dict Building from Struct Array

```python
# BEFORE: 100ms for 3k rows
result = {
    float(row['time']): {
        'index': float(row['index']),
        'close': float(row['close']),
    }
    for row in matched_rows
}

# AFTER: <5ms for 3k rows
times = matched_rows['time'].astype(float)
indices = matched_rows['index'].astype(float)
closes = matched_rows['close'].astype(float)
result = {
    t: {'index': idx, 'close': cls}
    for t, idx, cls in zip(
        times,
        indices,
        closes,
    )
}
```

### Bottleneck: Repeated Field Access

```python
# BEFORE: 50ms for 1k iterations
for i, spec in enumerate(specs):
    start_row = array[
        array['time'] == spec['start_time']
    ][0]
    end_row = array[
        array['time'] == spec['end_time']
    ][0]
    process(
        start_row['index'],
        end_row['close'],
    )

# AFTER: <5ms for 1k iterations
# 1. Build lookup once
time_to_row = {...}  # via searchsorted

# 2. Extract fields to plain arrays
indices_arr = array['index']
closes_arr = array['close']

# 3. Use lookup + plain array indexing
for spec in specs:
    start_idx = time_to_row[
        spec['start_time']
    ]['array_idx']
    end_idx = time_to_row[
        spec['end_time']
    ]['array_idx']
    process(
        indices_arr[start_idx],
        closes_arr[end_idx],
    )
```

## References

- NumPy structured arrays: https://numpy.org/doc/stable/user/basics.rec.html
- `np.searchsorted`: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html
- Polars: https://pola-rs.github.io/polars/
- `piker.tsp` - timeseries processing utilities
- `piker.data._formatters` - OHLC array handling

See [numpy-patterns.md](numpy-patterns.md) for detailed NumPy structured array patterns and [polars-patterns.md](polars-patterns.md) for Polars integration.
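To tie the lookup patterns above together, here is a minimal, self-contained sketch of the `searchsorted` pattern on a toy structured array. The `time`/`close` dtype and the sample values are illustrative only, not piker's actual OHLCV layout:

```python
import numpy as np

# toy sorted timeseries: 10 one-second bars
array = np.zeros(10, dtype=[('time', 'f8'), ('close', 'f8')])
array['time'] = np.arange(100.0, 110.0)  # sorted ascending - required!
array['close'] = np.arange(10.0)

# target timestamps: two present in the array, one missing
timestamps = [101.0, 108.0, 999.0]

time_arr = array['time']              # extract the field once
ts_array = np.array(timestamps)

# binary search for all targets at once: O(m log n)
indices = np.searchsorted(time_arr, ts_array)

# clip before fancy-indexing so a "not found" index of len(array)
# can't raise IndexError, then verify exact matches
safe = np.clip(indices, 0, len(array) - 1)
valid_mask = (indices < len(array)) & (time_arr[safe] == ts_array)

matched_rows = array[indices[valid_mask]]
print(matched_rows['close'].tolist())  # closes of the two matched bars
```

Note that the missing timestamp (`999.0`) is silently dropped by the mask rather than raising, which is usually the desired behavior for batch lookups.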
---

*Last updated: 2026-01-31*
*Key win: 100ms -> 5ms dict building via field extraction*