
Timeseries Optimization: NumPy & Polars

Skill for high-performance timeseries processing using NumPy and Polars, with a focus on patterns common in financial/trading applications.

Core Principle: Vectorization Over Iteration

Never write Python loops over large arrays. Always look for vectorized alternatives.

# BAD: Python loop (slow!)
results = []
for i in range(len(array)):
    if array['time'][i] == target_time:
        results.append(array[i])

# GOOD: vectorized boolean indexing (fast!)
results = array[array['time'] == target_time]
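To try the comparison above on real data, here is a minimal runnable sketch. The dtype, field names and values are illustrative assumptions, not piker's actual OHLC schema.

```python
import numpy as np

# Hypothetical OHLC-like dtype; field names are illustrative only.
dtype = np.dtype([('index', 'i8'), ('time', 'f8'), ('close', 'f8')])

array = np.zeros(5, dtype=dtype)
array['index'] = np.arange(5)
array['time'] = [100.0, 160.0, 220.0, 280.0, 340.0]
array['close'] = [1.0, 1.5, 1.25, 1.75, 2.0]

target_time = 220.0

# Loop version: one Python-level comparison per row.
loop_rows = [
    array[i]
    for i in range(len(array))
    if array['time'][i] == target_time
]

# Vectorized version: a single comparison over the whole column.
vec_rows = array[array['time'] == target_time]

# Masks combine with & and |, so range queries need no loop either.
in_window = array[(array['time'] >= 160.0) & (array['time'] < 300.0)]
```

Both versions select the same row; the vectorized one does the comparison inside NumPy instead of the interpreter.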

Timestamp Lookup Patterns

Timestamp lookup is the most critical optimization in piker timeseries code. Choose the right lookup strategy:

Linear Scan (O(n)) - Avoid!

# BAD: O(n) scan through entire array
for target_ts in timestamps:  # m iterations
    matches = array[array['time'] == target_ts]
    # Total: O(m * n) - catastrophic!

Performance:
- 1000 lookups x 10k array = 10M comparisons
- Timing: ~50-100ms for 1k lookups

Binary Search (O(log n)) - Good!

# GOOD: O(m log n) using searchsorted
import numpy as np

time_arr = array['time']  # extract once
ts_array = np.array(timestamps)

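# binary search for all timestamps at once
indices = np.searchsorted(time_arr, ts_array)

# clamp first: a target past the last element gets index len(array),
# which would raise IndexError if used to index time_arr directly
safe_indices = np.minimum(indices, len(time_arr) - 1)

# exact match verification
valid_mask = time_arr[safe_indices] == ts_array

valid_indices = safe_indices[valid_mask]
matched_rows = array[valid_indices]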

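Requirements for searchsorted():
- Input array MUST be sorted (ascending)
- Works on any sortable dtype (floats, ints)
- Returns insertion indices, not match indices: a missing value still gets an index (len(array) only when it is greater than every element), so exact matches must be verified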

Performance:
- 1000 lookups x 10k array = ~13k comparisons (about log2(10k) ≈ 13 per lookup)
- Timing: <1ms for 1k lookups
- ~100-1000x faster than linear scan
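To make the insertion-index behaviour concrete, here is a small self-contained sketch. The helper name `lookup_sorted` and the sample values are assumptions for illustration; it assumes the time column is sorted ascending.

```python
import numpy as np


def lookup_sorted(time_arr, targets):
    """Return (positions in targets, indices in time_arr) for exact matches.

    Hypothetical helper; assumes time_arr is sorted ascending.
    """
    targets = np.asarray(targets, dtype=time_arr.dtype)
    if len(time_arr) == 0:
        empty = np.empty(0, dtype=np.intp)
        return empty, empty

    idx = np.searchsorted(time_arr, targets)  # side='left' by default
    # A target beyond the last element yields len(time_arr); clamp it
    # so the equality check below cannot index out of bounds.
    safe = np.minimum(idx, len(time_arr) - 1)
    hit = time_arr[safe] == targets
    return np.nonzero(hit)[0], safe[hit]


time_arr = np.array([100.0, 160.0, 220.0, 280.0])
targets = [160.0, 190.0, 280.0, 999.0]

# Raw insertion points: misses get an index too.
raw = np.searchsorted(time_arr, targets)

# Exact matches only.
which, where = lookup_sorted(time_arr, targets)
```

Here `190.0` is absent but still receives insertion index 2, and only `999.0` maps to `len(time_arr)`. This is why the equality check is needed in addition to the bounds handling.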

Hash Table (O(1)) - Best for Repeated Lookups!

If you’ll do many lookups on the same array, build a dict once:

# build lookup once; .tolist() converts the whole column to
# native Python floats instead of indexing the field per row
time_to_idx = {
    t: i
    for i, t in enumerate(array['time'].tolist())
}

# O(1) lookups
for target_ts in timestamps:
    idx = time_to_idx.get(target_ts)
    if idx is not None:
        row = array[idx]

When to use:
- Many repeated lookups on same array
- Array doesn’t change between lookups
- Can afford upfront dict building cost
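A minimal sketch of the build-once, look-up-many pattern, using made-up sample times. The dict is built from a bulk `.tolist()` conversion so that the construction itself does not index the array row by row.

```python
import numpy as np

# Illustrative data only.
time_arr = np.array([100.0, 160.0, 220.0, 280.0])

# Upfront O(n) cost: one bulk conversion, then a plain dict build.
time_to_idx = dict(zip(time_arr.tolist(), range(len(time_arr))))

# Each lookup is an average-case O(1) dict probe; misses return None.
hits = [time_to_idx.get(ts) for ts in (160.0, 190.0, 280.0)]
```

Two caveats with this approach: float keys only match on exact equality, so timestamps that went through different arithmetic may miss, and if the time column contains duplicates the last occurrence wins.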

Performance Checklist

When optimizing timeseries operations:

Is the array sorted? (enables binary search)
Are you doing repeated lookups? (build hash table)
Are struct fields accessed in loops? (extract to plain arrays)
Are you using boolean indexing? (vectorized vs loop)
Can operations be batched? (minimize round-trips)
Is memory being copied unnecessarily? (use views)
Are you using the right tool? (NumPy vs Polars)
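For the views-versus-copies item in the checklist, the sketch below shows one way to check what an operation returns, using `np.shares_memory`. The dtype and values are illustrative assumptions.

```python
import numpy as np

# Hypothetical two-field dtype for illustration.
dtype = np.dtype([('time', 'f8'), ('close', 'f8')])
array = np.zeros(4, dtype=dtype)
array['time'] = [100.0, 160.0, 220.0, 280.0]

closes = array['close']                 # single-field access: a view
window = array[1:3]                     # basic slice: a view
picked = array[array['time'] > 150.0]   # boolean mask: a copy

closes_is_view = np.shares_memory(closes, array)
window_is_view = np.shares_memory(window, array)
picked_is_view = np.shares_memory(picked, array)

# Writing through a view changes the original array.
closes[0] = 42.0
```

Field access and slicing are cheap because no data moves. Boolean and integer-array indexing allocate a new array, so it is usually worth applying them once rather than inside a loop.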

Common Bottlenecks and Fixes

Bottleneck: Timestamp Lookups

# BEFORE: O(n*m) - 100ms for 1k lookups
for ts in timestamps:
    matches = array[array['time'] == ts]

# AFTER: O(m log n) - <1ms for 1k lookups
indices = np.searchsorted(
    array['time'], timestamps,
)
# these are insertion points: verify exact matches as shown
# under Binary Search before indexing

Bottleneck: Dict Building from Struct Array

# BEFORE: 100ms for 3k rows
result = {
    float(row['time']): {
        'index': float(row['index']),
        'close': float(row['close']),
    }
    for row in matched_rows
}

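# AFTER: <5ms for 3k rows
# one bulk conversion per field; .tolist() yields native Python
# floats, matching the float(...) calls in the BEFORE version
times = matched_rows['time'].astype(float).tolist()
indices = matched_rows['index'].astype(float).tolist()
closes = matched_rows['close'].astype(float).tolist()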

result = {
    t: {'index': idx, 'close': cls}
    for t, idx, cls in zip(
        times, indices, closes,
    )
}
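The timings quoted above come from the original workload and will vary by machine. The sketch below is one way to confirm that both variants produce the same dict and to measure them locally; the synthetic data, the dtype and the function names `build_per_row` and `build_per_field` are assumptions for illustration.

```python
import timeit

import numpy as np

# Synthetic stand-in for matched_rows; schema is illustrative only.
dtype = np.dtype([('index', 'i8'), ('time', 'f8'), ('close', 'f8')])
n = 3_000
matched_rows = np.zeros(n, dtype=dtype)
matched_rows['index'] = np.arange(n)
matched_rows['time'] = 1_700_000_000.0 + 60.0 * np.arange(n)
matched_rows['close'] = np.linspace(1.0, 2.0, n)


def build_per_row(rows):
    # Three field lookups and three float() calls per row.
    return {
        float(row['time']): {
            'index': float(row['index']),
            'close': float(row['close']),
        }
        for row in rows
    }


def build_per_field(rows):
    # One bulk conversion per field, then a plain zip.
    times = rows['time'].tolist()
    indices = rows['index'].astype(float).tolist()
    closes = rows['close'].tolist()
    return {
        t: {'index': i, 'close': c}
        for t, i, c in zip(times, indices, closes)
    }


same = build_per_row(matched_rows) == build_per_field(matched_rows)

runs = 5
t_row = timeit.timeit(lambda: build_per_row(matched_rows), number=runs)
t_field = timeit.timeit(lambda: build_per_field(matched_rows), number=runs)
print(f"per-row:   {t_row / runs * 1e3:.2f} ms")
print(f"per-field: {t_field / runs * 1e3:.2f} ms")
```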

Bottleneck: Repeated Field Access

# BEFORE: 50ms for 1k iterations
for i, spec in enumerate(specs):
    start_row = array[
        array['time'] == spec['start_time']
    ][0]
    end_row = array[
        array['time'] == spec['end_time']
    ][0]
    process(
        start_row['index'],
        end_row['close'],
    )

# AFTER: <5ms for 1k iterations
# 1. Build lookup once
time_to_row = {...}  # via searchsorted

# 2. Extract fields to plain arrays
indices_arr = array['index']
closes_arr = array['close']

# 3. Use lookup + plain array indexing
for spec in specs:
    start_idx = time_to_row[
        spec['start_time']
    ]['array_idx']
    end_idx = time_to_row[
        spec['end_time']
    ]['array_idx']
    process(
        indices_arr[start_idx],
        closes_arr[end_idx],
    )
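The lookup construction is elided in the snippet above. As one possible shape for it, the sketch below resolves all start and end times with two `searchsorted` calls and then indexes the extracted field arrays directly. The helper `exact_indices`, the sample data and the spec layout are assumptions, and it assumes `array['time']` is non-empty and sorted ascending.

```python
import numpy as np

# Illustrative data; not piker's actual schema.
dtype = np.dtype([('index', 'i8'), ('time', 'f8'), ('close', 'f8')])
array = np.zeros(5, dtype=dtype)
array['index'] = np.arange(10, 15)
array['time'] = [100.0, 160.0, 220.0, 280.0, 340.0]
array['close'] = [1.0, 1.5, 1.25, 1.75, 2.0]

# Hypothetical specs, each naming a start and end timestamp.
specs = [
    {'start_time': 100.0, 'end_time': 220.0},
    {'start_time': 160.0, 'end_time': 340.0},
]


def exact_indices(sorted_times, wanted):
    """Map each wanted timestamp to its row index, or raise KeyError."""
    idx = np.searchsorted(sorted_times, wanted)
    idx = np.minimum(idx, len(sorted_times) - 1)  # keep indexable
    if not np.all(sorted_times[idx] == wanted):
        raise KeyError('some requested timestamps are missing')
    return idx


# Extract fields to plain arrays once.
time_arr = array['time']
indices_arr = array['index']
closes_arr = array['close']

# Resolve every spec in two batched binary searches.
start_idx = exact_indices(time_arr, np.array([s['start_time'] for s in specs]))
end_idx = exact_indices(time_arr, np.array([s['end_time'] for s in specs]))

# Plain-array fancy indexing replaces the per-spec row lookups.
pairs = list(zip(indices_arr[start_idx].tolist(), closes_arr[end_idx].tolist()))
```

Each entry in `pairs` holds the arguments that `process()` would receive for the corresponding spec. Raising on a missing timestamp is a design choice for this sketch; returning a mask instead would suit callers that expect gaps.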

References

NumPy structured arrays: https://numpy.org/doc/stable/user/basics.rec.html
np.searchsorted: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html
Polars: https://pola-rs.github.io/polars/
piker.tsp - timeseries processing utilities
piker.data._formatters - OHLC array handling

See numpy-patterns.md for detailed NumPy structured array patterns and polars-patterns.md for Polars integration.

Last updated: 2026-01-31

Key win: 100ms -> 5ms dict building via field extraction
