---
name: timeseries-optimization
description: >
  High-performance timeseries processing with NumPy
  and Polars for financial data. Apply when working
  with OHLCV arrays, timestamp lookups, gap
  detection, or any array/dataframe operations in
  piker.
user-invocable: false
---

# Timeseries Optimization: NumPy & Polars

Skill for high-performance timeseries processing
using NumPy and Polars, with a focus on patterns
common in financial/trading applications.

## Core Principle: Vectorization Over Iteration

**Never write Python loops over large arrays.**
Always look for vectorized alternatives.

```python
# BAD: Python loop (slow!)
results = []
for i in range(len(array)):
    if array['time'][i] == target_time:
        results.append(array[i])

# GOOD: vectorized boolean indexing (fast!)
results = array[array['time'] == target_time]
```

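As a self-contained illustration of the pattern
above, the following sketch builds a tiny structured
array and checks that both approaches select the
same row. The dtype, field values, and variable
names are assumptions made for the example, not
piker's actual OHLCV schema.

```python
import numpy as np

# Hypothetical OHLCV-style dtype, for illustration only.
ohlcv_dtype = np.dtype([
    ('index', 'i8'),
    ('time', 'f8'),
    ('close', 'f8'),
])

array = np.zeros(5, dtype=ohlcv_dtype)
array['index'] = np.arange(5)
array['time'] = [60.0, 120.0, 180.0, 240.0, 300.0]
array['close'] = [1.0, 1.5, 1.25, 1.75, 2.0]

target_time = 180.0

# Loop version: one Python-level comparison per row.
loop_results = [
    array[i]
    for i in range(len(array))
    if array['time'][i] == target_time
]

# Vectorized version: one comparison builds a boolean
# mask, and the mask selects the matching rows.
mask = array['time'] == target_time
vec_results = array[mask]
```

Both versions pick out the row with `index == 2`;
the vectorized one does the comparison in compiled
code instead of the Python interpreter.
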
## Timestamp Lookup Patterns

Timestamp lookup is the most critical optimization
in piker timeseries code. Choose the right lookup
strategy:

### Linear Scan (O(n)) - Avoid!

```python
# BAD: O(n) scan through entire array
for target_ts in timestamps:  # m iterations
    matches = array[array['time'] == target_ts]
# Total: O(m * n) - catastrophic!
```

**Performance:**
- 1000 lookups x 10k array = 10M comparisons
- Timing: ~50-100ms for 1k lookups

### Binary Search (O(log n)) - Good!

```python
# GOOD: O(m log n) using searchsorted
import numpy as np

time_arr = array['time']  # extract once
ts_array = np.array(timestamps)

# binary search for all timestamps at once
indices = np.searchsorted(time_arr, ts_array)

# an insertion point can equal len(array), which is
# not a valid index, so clip before indexing
# (assumes a non-empty array)
in_bounds = indices < len(time_arr)
safe_indices = np.minimum(indices, len(time_arr) - 1)

# bounds check and exact match verification
valid_mask = (
    in_bounds
    &
    (time_arr[safe_indices] == ts_array)
)

valid_indices = indices[valid_mask]
matched_rows = array[valid_indices]
```

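**Requirements for `searchsorted()`:**
- Input array MUST be sorted (ascending)
- Works on any sortable dtype (floats, ints)
- Returns insertion indices, not matches: a missing
  value yields the position where it would be
  inserted (`len(array)` only when it is larger than
  every element), so always verify exact matches
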
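To make the insertion-index behaviour concrete, here
is a minimal sketch with made-up timestamps. It
queries one value that exists, one that falls
between entries, one below the range, and one above
it, then verifies matches explicitly. All names and
values are illustrative.

```python
import numpy as np

time_arr = np.array([10.0, 20.0, 30.0, 40.0])  # sorted ascending
queries = np.array([20.0, 25.0, 5.0, 50.0])

# Default side='left': the first position where each
# query could be inserted while keeping the order.
indices = np.searchsorted(time_arr, queries)

# Only the last query lands at len(time_arr); the
# other misses return ordinary in-range positions.
in_bounds = indices < len(time_arr)

# Replace out-of-range positions with 0 so indexing is
# safe; in_bounds masks those slots out again below.
safe = np.where(in_bounds, indices, 0)
found = in_bounds & (time_arr[safe] == queries)
```

Here `indices` is `[1, 2, 0, 4]` and only the first
query is an exact match, which is why the equality
check cannot be skipped.
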
**Performance:**
- 1000 lookups x 10k array = ~13k comparisons
  (1000 x log2(10k))
- Timing: <1ms for 1k lookups
- **~100-1000x faster than linear scan**

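The comparison counts for the two strategies can be
reproduced with a quick back-of-the-envelope
calculation. The sketch below assumes a worst case
of roughly `ceil(log2(n))` probes per binary search;
it counts comparisons only and says nothing about
wall-clock time.

```python
import math

m = 1_000    # number of lookups
n = 10_000   # array length

# Linear scan: every lookup compares against every row.
linear_comparisons = m * n

# Binary search: roughly ceil(log2(n)) probes per lookup.
binary_comparisons = m * math.ceil(math.log2(n))

ratio = linear_comparisons / binary_comparisons
```

This gives 10M comparisons for the scan against 14k
for binary search, which is on the order of 10k and
a ratio of about 700x.
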
### Hash Table (O(1)) - Best for Repeated Lookups!

If you will do many lookups on the same array, build
a dict once:

```python
# build lookup once: extract the field, then convert
# it to Python floats in a single pass
time_to_idx = {
    t: i
    for i, t in enumerate(array['time'].tolist())
}

# O(1) lookups
for target_ts in timestamps:
    idx = time_to_idx.get(target_ts)
    if idx is not None:
        row = array[idx]
```

**When to use:**
- Many repeated lookups on the same array
- Array doesn't change between lookups
- Can afford the upfront dict building cost

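A runnable sketch of the dict-based lookup, including
what happens on a miss. The array contents and the
`hits`/`misses` names are assumptions for the
example only.

```python
import numpy as np

array = np.zeros(4, dtype=[('time', 'f8'), ('close', 'f8')])
array['time'] = [60.0, 120.0, 180.0, 240.0]
array['close'] = [1.0, 1.5, 1.25, 1.75]

# Build the timestamp -> row index mapping once.
time_to_idx = {
    t: i for i, t in enumerate(array['time'].tolist())
}

timestamps = [120.0, 999.0, 240.0]  # 999.0 is not in the array

hits = {}
misses = []
for target_ts in timestamps:
    idx = time_to_idx.get(target_ts)
    if idx is None:
        misses.append(target_ts)
    else:
        hits[target_ts] = float(array['close'][idx])
```

Float keys match only on exact equality, so this
works best when the query timestamps come from the
same source as the array. With duplicate timestamps
the last row wins.
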
## Performance Checklist

When optimizing timeseries operations:

- [ ] Is the array sorted? (enables binary search)
- [ ] Are you doing repeated lookups?
  (build hash table)
- [ ] Are struct fields accessed in loops?
  (extract to plain arrays)
- [ ] Are you using boolean indexing?
  (vectorized vs loop)
- [ ] Can operations be batched?
  (minimize round-trips)
- [ ] Is memory being copied unnecessarily?
  (use views)
- [ ] Are you using the right tool?
  (NumPy vs Polars)

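For the "use views" item, `np.shares_memory` is one
way to check whether an operation copied data. The
sketch below uses a made-up two-field array: field
access and basic slicing return views, while boolean
indexing returns a copy.

```python
import numpy as np

array = np.zeros(4, dtype=[('time', 'f8'), ('close', 'f8')])
array['time'] = [60.0, 120.0, 180.0, 240.0]

time_view = array['time']               # field access: view
tail_view = array[2:]                   # basic slice: view
masked = array[array['time'] > 100.0]   # boolean indexing: copy

field_is_view = np.shares_memory(time_view, array)
slice_is_view = np.shares_memory(tail_view, array)
mask_is_view = np.shares_memory(masked, array)

# Writing through a view modifies the original array.
time_view[0] = 61.0
```

Because `time_view` is a view, extracting a field
once before a loop costs no extra memory.
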
## Common Bottlenecks and Fixes

### Bottleneck: Timestamp Lookups

```python
# BEFORE: O(n*m) - 100ms for 1k lookups
for ts in timestamps:
    matches = array[array['time'] == ts]

# AFTER: O(m log n) - <1ms for 1k lookups
indices = np.searchsorted(
    array['time'], timestamps,
)
```

### Bottleneck: Dict Building from Struct Array

```python
# BEFORE: 100ms for 3k rows
result = {
    float(row['time']): {
        'index': float(row['index']),
        'close': float(row['close']),
    }
    for row in matched_rows
}

# AFTER: <5ms for 3k rows
times = matched_rows['time'].astype(float)
indices = matched_rows['index'].astype(float)
closes = matched_rows['close'].astype(float)

result = {
    t: {'index': idx, 'close': cls}
    for t, idx, cls in zip(
        times, indices, closes,
    )
}
```

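One possible refinement, offered as a sketch rather
than as what piker does: iterating over a NumPy
array yields `np.float64` scalars, whereas the
original loop produced plain Python floats via
`float()`. Calling `.tolist()` on each extracted
field converts the whole column to Python floats in
one step. The sample data below is made up.

```python
import numpy as np

matched_rows = np.zeros(
    3,
    dtype=[('index', 'i8'), ('time', 'f8'), ('close', 'f8')],
)
matched_rows['index'] = [0, 1, 2]
matched_rows['time'] = [60.0, 120.0, 180.0]
matched_rows['close'] = [1.0, 1.5, 1.25]

# Extract each field once; .tolist() yields Python floats.
times = matched_rows['time'].tolist()
indices = matched_rows['index'].astype(float).tolist()
closes = matched_rows['close'].tolist()

result = {
    t: {'index': idx, 'close': cls}
    for t, idx, cls in zip(times, indices, closes)
}
```

The resulting dict has the same shape as before,
with builtin `float` keys and values.
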
### Bottleneck: Repeated Field Access

```python
# BEFORE: 50ms for 1k iterations
for i, spec in enumerate(specs):
    start_row = array[
        array['time'] == spec['start_time']
    ][0]
    end_row = array[
        array['time'] == spec['end_time']
    ][0]
    process(
        start_row['index'],
        end_row['close'],
    )

# AFTER: <5ms for 1k iterations
# 1. Build lookup once
time_to_row = {...}  # via searchsorted

# 2. Extract fields to plain arrays
indices_arr = array['index']
closes_arr = array['close']

# 3. Use lookup + plain array indexing
for spec in specs:
    start_idx = time_to_row[
        spec['start_time']
    ]['array_idx']
    end_idx = time_to_row[
        spec['end_time']
    ]['array_idx']
    process(
        indices_arr[start_idx],
        closes_arr[end_idx],
    )
```

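The snippet above leaves the lookup construction
elided. As one hypothetical way to get the same
effect, the sketch below skips the intermediate dict
and resolves all start and end times with two
batched `np.searchsorted` calls. The array contents
and `specs` are invented for the example, and it
assumes every requested time exists in the array.

```python
import numpy as np

array = np.zeros(
    5,
    dtype=[('index', 'i8'), ('time', 'f8'), ('close', 'f8')],
)
array['index'] = np.arange(5)
array['time'] = [60.0, 120.0, 180.0, 240.0, 300.0]
array['close'] = [1.0, 1.5, 1.25, 1.75, 2.0]

# Hypothetical specs; all times are assumed to be present.
specs = [
    {'start_time': 60.0, 'end_time': 180.0},
    {'start_time': 120.0, 'end_time': 300.0},
]

# Extract fields to plain arrays once.
time_arr = array['time']
indices_arr = array['index']
closes_arr = array['close']

start_times = np.array([s['start_time'] for s in specs])
end_times = np.array([s['end_time'] for s in specs])

# Resolve every spec in two batched binary searches.
start_idx = np.searchsorted(time_arr, start_times)
end_idx = np.searchsorted(time_arr, end_times)

# Sanity check for the "all times exist" assumption.
assert np.array_equal(time_arr[start_idx], start_times)
assert np.array_equal(time_arr[end_idx], end_times)

start_indices = indices_arr[start_idx]
end_closes = closes_arr[end_idx]
```

`start_indices` and `end_closes` then hold the
per-spec inputs and can be passed on pairwise or
processed as whole arrays.
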
## References

- NumPy structured arrays:
  https://numpy.org/doc/stable/user/basics.rec.html
- `np.searchsorted`:
  https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html
- Polars: https://pola-rs.github.io/polars/
- `piker.tsp` - timeseries processing utilities
- `piker.data._formatters` - OHLC array handling

See [numpy-patterns.md](numpy-patterns.md) for
detailed NumPy structured array patterns and
[polars-patterns.md](polars-patterns.md) for
Polars integration.

---

*Last updated: 2026-01-31*
*Key win: 100ms -> 5ms dict building via field
extraction*