Timeseries Optimization: NumPy & Polars
Skill for high-performance timeseries processing using NumPy and Polars, with a focus on patterns common in financial/trading applications.
Core Principle: Vectorization Over Iteration
Never write Python loops over large arrays. Always look for vectorized alternatives.
# BAD: Python loop (slow!)
results = []
for i in range(len(array)):
    if array['time'][i] == target_time:
        results.append(array[i])
# GOOD: vectorized boolean indexing (fast!)
results = array[array['time'] == target_time]
Timestamp Lookup Patterns
Timestamp lookup is the most critical optimization in piker timeseries code. Choose the right lookup strategy:
Linear Scan (O(n)) - Avoid!
# BAD: O(n) scan through entire array
for target_ts in timestamps:  # m iterations
    matches = array[array['time'] == target_ts]
# Total: O(m * n) - catastrophic!
Performance:
- 1000 lookups x 10k array = 10M comparisons
- Timing: ~50-100ms for 1k lookups
Binary Search (O(log n)) - Good!
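# GOOD: O(m log n) using searchsorted
import numpy as np
time_arr = array['time']  # extract once
ts_array = np.array(timestamps)
# binary search for all timestamps at once
indices = np.searchsorted(time_arr, ts_array)
# clamp first: a timestamp past the last element maps to len(array),
# and `&` does not short-circuit, so time_arr[indices] would raise IndexError
safe_indices = np.minimum(indices, len(time_arr) - 1)
# bounds check and exact match verification
valid_mask = (
    (indices < len(time_arr))
    &
    (time_arr[safe_indices] == ts_array)
)
valid_indices = indices[valid_mask]
matched_rows = array[valid_indices]
Requirements for searchsorted():
- Input array MUST be sorted (ascending)
- Works on any sortable dtype (floats, ints)
- Returns insertion indices, not matches: a missing timestamp yields the position where it would be inserted (len(array) if it is past the last element), so always verify exact equality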
Performance:
- 1000 lookups x ~14 comparisons each (log2 of 10k) = ~14k comparisons
- Timing: <1ms for 1k lookups
- ~100-1000x faster than linear scan
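A minimal, self-contained sketch of the batched binary-search lookup described above. The sample array, its field names, and the helper name `lookup_rows` are assumptions for illustration, not piker APIs. It shows that both a timestamp missing from the middle of the array and one past the end are rejected by the exact-match check.

```python
import numpy as np

# Hypothetical OHLC-style structured array; field names are assumptions.
array = np.array(
    [(10.0, 0, 1.5), (20.0, 1, 1.6), (30.0, 2, 1.7), (40.0, 3, 1.8)],
    dtype=[('time', 'f8'), ('index', 'i8'), ('close', 'f8')],
)


def lookup_rows(array, timestamps):
    """Return (rows, found_mask) for exact timestamp matches.

    Assumes array['time'] is sorted ascending.
    """
    time_arr = array['time']  # extract the field once
    ts_array = np.asarray(timestamps, dtype=time_arr.dtype)
    if len(time_arr) == 0:
        return array[:0], np.zeros(len(ts_array), dtype=bool)
    # one vectorized binary search for all timestamps
    indices = np.searchsorted(time_arr, ts_array)
    # clamp: a timestamp past the last element maps to len(array)
    safe = np.minimum(indices, len(time_arr) - 1)
    # insertion points are not matches, so verify equality
    found = time_arr[safe] == ts_array
    return array[safe[found]], found


rows, found = lookup_rows(array, [20.0, 25.0, 40.0, 99.0])
```

Here `found` marks which requested timestamps exist, so callers can line results back up with their inputs. Clamping alone is enough to reject out-of-range timestamps in this sketch, because any value past the end is strictly greater than the last element and fails the equality check.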
Hash Table (O(1)) - Best for Repeated Lookups!
If you'll do many lookups on the same array, build the dict once:
# build lookup once: extract the field a single time and
# convert to Python floats in bulk instead of per-element access
time_to_idx = {
    t: i
    for i, t in enumerate(array['time'].astype(float).tolist())
}
# O(1) lookups
for target_ts in timestamps:
    idx = time_to_idx.get(target_ts)
    if idx is not None:
        row = array[idx]
When to use:
- Many repeated lookups on same array
- Array doesn't change between lookups
- Can afford upfront dict building cost
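A small runnable sketch of the dict-based lookup, assuming a hypothetical two-field array and a hypothetical helper `get_close`. Two caveats to keep in mind: float keys only match on exact equality, and if the array contained duplicate timestamps the last occurrence would win.

```python
import numpy as np

# Hypothetical sample data; field names are assumptions.
array = np.array(
    [(10.0, 1.5), (20.0, 1.6), (30.0, 1.7)],
    dtype=[('time', 'f8'), ('close', 'f8')],
)

# Build once: extract the field, convert to Python floats in bulk.
time_to_idx = {t: i for i, t in enumerate(array['time'].tolist())}


def get_close(ts):
    """Return the close for an exact timestamp, or None if absent."""
    idx = time_to_idx.get(float(ts))
    return None if idx is None else float(array['close'][idx])


hit = get_close(20)     # normalized to 20.0 before the lookup
miss = get_close(25.0)  # not present in the array
```

Normalizing the key with `float()` on the lookup side keeps the key type consistent with how the dict was built. Timestamps produced by floating-point arithmetic may still differ from stored values in the last bits and miss.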
Performance Checklist
When optimizing timeseries operations:
- Is the array sorted? (enables binary search)
- Are you doing repeated lookups? (build hash table)
- Are struct fields accessed in loops? (extract to plain arrays)
- Are you using boolean indexing? (vectorized vs loop)
- Can operations be batched? (minimize round-trips)
- Is memory being copied unnecessarily? (use views)
- Are you using the right tool? (NumPy vs Polars)
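For the checklist items on field extraction and unnecessary copies, the sketch below checks which common indexing operations return views and which return copies. The sample array is an assumption; `np.shares_memory` is used only to make the distinction observable.

```python
import numpy as np

# Hypothetical sample data; field names are assumptions.
array = np.zeros(5, dtype=[('time', 'f8'), ('close', 'f8')])
array['time'] = np.arange(5, dtype='f8')

closes = array['close']               # single-field access: a view
window = array[1:4]                   # basic slice: a view
picked = array[array['time'] >= 2.0]  # boolean indexing: a copy

field_is_view = np.shares_memory(closes, array)
slice_is_view = np.shares_memory(window, array)
mask_is_copy = not np.shares_memory(picked, array)

# Writing through the field view is visible in the parent array.
closes[0] = 42.0
parent_sees_write = bool(array['close'][0] == 42.0)
```

Extracting a field once before a loop is therefore cheap, since no data is copied. Boolean-mask results are independent copies, which is safe to mutate but costs memory on large arrays.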
Common Bottlenecks and Fixes
Bottleneck: Timestamp Lookups
# BEFORE: O(n*m) - 100ms for 1k lookups
for ts in timestamps:
    matches = array[array['time'] == ts]
# AFTER: O(m log n) - <1ms for 1k lookups
# (insertion indices only; verify exact matches as in the
# Binary Search pattern above)
indices = np.searchsorted(
    array['time'], timestamps,
)
Bottleneck: Dict Building from Struct Array
# BEFORE: 100ms for 3k rows
result = {
    float(row['time']): {
        'index': float(row['index']),
        'close': float(row['close']),
    }
    for row in matched_rows
}
# AFTER: <5ms for 3k rows
times = matched_rows['time'].astype(float)
indices = matched_rows['index'].astype(float)
closes = matched_rows['close'].astype(float)
result = {
    t: {'index': idx, 'close': cls}
    for t, idx, cls in zip(
        times, indices, closes,
    )
}
Bottleneck: Repeated Field Access
# BEFORE: 50ms for 1k iterations
for i, spec in enumerate(specs):
    start_row = array[
        array['time'] == spec['start_time']
    ][0]
    end_row = array[
        array['time'] == spec['end_time']
    ][0]
    process(
        start_row['index'],
        end_row['close'],
    )
# AFTER: <5ms for 1k iterations
# 1. Build lookup once
time_to_row = {...}  # via searchsorted
# 2. Extract fields to plain arrays
indices_arr = array['index']
closes_arr = array['close']
# 3. Use lookup + plain array indexing
for spec in specs:
    start_idx = time_to_row[
        spec['start_time']
    ]['array_idx']
    end_idx = time_to_row[
        spec['end_time']
    ]['array_idx']
    process(
        indices_arr[start_idx],
        closes_arr[end_idx],
    )
References
- NumPy structured arrays: https://numpy.org/doc/stable/user/basics.rec.html
- np.searchsorted: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html
- Polars: https://pola-rs.github.io/polars/
- piker.tsp - timeseries processing utilities
- piker.data._formatters - OHLC array handling
See numpy-patterns.md for detailed NumPy structured array patterns and polars-patterns.md for Polars integration.
Last updated: 2026-01-31
Key win: 100ms -> 5ms dict building via field extraction