# macOS Compatibility Fixes for Piker/Tractor
This guide documents macOS-specific issues encountered when running `piker`, along with their solutions. These fixes address platform differences between Linux and macOS in areas like socket credentials, shared memory naming, and async runtime coordination.
## Table of Contents
1. [Socket Credential Passing](#1-socket-credential-passing)
2. [Shared Memory Name Length Limits](#2-shared-memory-name-length-limits)
3. [Shared Memory Cleanup Race Conditions](#3-shared-memory-cleanup-race-conditions)
4. [Async Runtime (Trio/AsyncIO) Coordination](#4-async-runtime-trioasyncio-coordination)
---
## 1. Socket Credential Passing
### Problem
On Linux, `tractor` uses the `SO_PASSCRED` and `SO_PEERCRED` socket options for credential passing over Unix domain sockets. macOS doesn't define these constants, so importing them raises `AttributeError`.
```python
# Linux code that fails on macOS
from socket import SO_PASSCRED, SO_PEERCRED  # AttributeError on macOS
```
### Error Message
```
AttributeError: module 'socket' has no attribute 'SO_PASSCRED'
```
### Root Cause
- **Linux**: Uses `SO_PASSCRED` (to enable credential passing) and `SO_PEERCRED` (to retrieve peer credentials)
- **macOS**: Uses `LOCAL_PEERCRED` (value `0x0001`) instead, and doesn't require enabling credential passing
### Solution
Make the socket credential imports platform-conditional:
**File**: `tractor/ipc/_uds.py` (or equivalent in `piker` if duplicated)
```python
import struct
import sys
from socket import (
    socket,
    AF_UNIX,
    SOCK_STREAM,
    SOL_SOCKET,
)

# Platform-specific credential-passing constants
if sys.platform == 'linux':
    from socket import SO_PASSCRED, SO_PEERCRED

elif sys.platform == 'darwin':  # macOS
    # macOS uses LOCAL_PEERCRED instead of SO_PEERCRED
    # and doesn't need SO_PASSCRED
    LOCAL_PEERCRED = 0x0001
    SO_PEERCRED = LOCAL_PEERCRED  # alias for compatibility
    SO_PASSCRED = None  # not needed on macOS

else:
    # other platforms may need additional handling
    SO_PASSCRED = None
    SO_PEERCRED = None

# when creating a socket
if SO_PASSCRED is not None:
    sock.setsockopt(SOL_SOCKET, SO_PASSCRED, 1)

# when getting peer credentials
if SO_PEERCRED is not None:
    creds = sock.getsockopt(SOL_SOCKET, SO_PEERCRED, struct.calcsize('3i'))
```
### Implementation Notes
- The `LOCAL_PEERCRED` value `0x0001` is specific to macOS (from `<sys/un.h>`)
- macOS doesn't require explicitly enabling credential passing the way Linux does
- Note that aliasing `SO_PEERCRED = LOCAL_PEERCRED` only fixes the import: on macOS the option is read at the `SOL_LOCAL` level (value `0`) rather than `SOL_SOCKET`, and it returns a `struct xucred` instead of Linux's `struct ucred`
- Consider using `ctypes` or `cffi` for a more robust solution if available
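For reference, here is a minimal sketch of actually reading peer credentials on macOS. The helper name and struct format are assumptions based on the `struct xucred` layout from `<sys/ucred.h>` (`cr_version`, `cr_uid`, `cr_ngroups`, `cr_groups[16]`); nothing below is part of the piker/tractor codebase:
```python
import socket
import struct

SOL_LOCAL = 0            # option level for AF_UNIX sockets on macOS
LOCAL_PEERCRED = 0x0001  # from <sys/un.h>

# native-aligned `struct xucred`: u_int cr_version, uid_t cr_uid,
# short cr_ngroups (+2 bytes padding), gid_t cr_groups[16]
XUCRED_FMT = '@IIh16I'


def get_peer_uid_macos(sock: socket.socket) -> int:
    '''
    Return the effective UID of the peer on a connected `AF_UNIX`
    socket (hypothetical helper, macOS only).
    '''
    raw = sock.getsockopt(
        SOL_LOCAL,
        LOCAL_PEERCRED,
        struct.calcsize(XUCRED_FMT),
    )
    cr_version, cr_uid, _cr_ngroups, *_groups = struct.unpack(XUCRED_FMT, raw)
    assert cr_version == 0  # XUCRED_VERSION
    return cr_uid
```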
---
## 2. Shared Memory Name Length Limits
### Problem
macOS limits POSIX shared memory names to **31 characters** (defined as `PSHMNAMLEN` in `<sys/posix_shm_internal.h>`). Piker generates long descriptive names that exceed this limit, causing `OSError`.
```python
# Long name that works on Linux but fails on macOS
shm_name = "piker_quoter_tsla.nasdaq.ib_hist_1m" # 39 chars - too long!
```
### Error Message
```
OSError: [Errno 63] File name too long: '/piker_quoter_tsla.nasdaq.ib_hist_1m'
```
### Root Cause
- **Linux**: Supports shared memory names up to 255 characters
- **macOS**: Limits to 31 characters (including leading `/`)
### Solution
Implement automatic name shortening for macOS while preserving the original key for lookups:
**File**: `piker/data/_sharedmem.py`
```python
import hashlib
import sys

import numpy as np
from msgspec import Struct  # assumed import; piker's struct types are msgspec-based


def _shorten_key_for_macos(key: str) -> str:
    '''
    macOS has a 31 character limit for POSIX shared memory names.
    Hash long keys to fit within this limit while maintaining uniqueness.

    '''
    # macOS `shm_open()` has a 31 char limit (`PSHMNAMLEN`).
    # Use the format `p_<hash16>` where the hash is the first 16 hex
    # chars of the sha256 of the full key. This gives us:
    # / + p_ + 16 hex chars = 19 chars, well under the limit.
    # The 'p' prefix indicates the segment is from piker.
    if len(key) <= 31:
        return key

    # create a hash of the full key
    key_hash = hashlib.sha256(key.encode()).hexdigest()[:16]
    return f'p_{key_hash}'


class _Token(Struct, frozen=True):
    '''
    Internal representation of a shared memory "token" which can be
    used to key a system wide POSIX shm entry.

    '''
    shm_name: str  # actual OS-level name (may be shortened on macOS)
    shm_first_index_name: str
    shm_last_index_name: str
    dtype_descr: tuple
    size: int  # in struct-array index / row terms
    key: str | None = None  # original descriptive key (for lookup)

    def __eq__(self, other) -> bool:
        '''
        Compare tokens based on shm names and dtype, ignoring the `key`
        field; it is only used for lookups, not for token identity.

        '''
        if not isinstance(other, _Token):
            return False

        return (
            self.shm_name == other.shm_name
            and self.shm_first_index_name == other.shm_first_index_name
            and self.shm_last_index_name == other.shm_last_index_name
            and self.dtype_descr == other.dtype_descr
            and self.size == other.size
        )

    def __hash__(self) -> int:
        '''
        Hash based on the same fields used in `__eq__`.

        '''
        return hash((
            self.shm_name,
            self.shm_first_index_name,
            self.shm_last_index_name,
            self.dtype_descr,
            self.size,
        ))


def _make_token(
    key: str,
    size: int,
    dtype: np.dtype | None = None,
) -> _Token:
    '''
    Create a serializable token that uniquely identifies a shared
    memory segment.

    '''
    if dtype is None:
        dtype = def_iohlcv_fields  # piker's default OHLCV dtype fields

    # on macOS, shorten long keys to fit the 31-char limit
    if sys.platform == 'darwin':
        shm_name = _shorten_key_for_macos(key)
        shm_first = _shorten_key_for_macos(key + '_first')
        shm_last = _shorten_key_for_macos(key + '_last')
    else:
        shm_name = key
        shm_first = key + '_first'
        shm_last = key + '_last'

    return _Token(
        shm_name=shm_name,
        shm_first_index_name=shm_first,
        shm_last_index_name=shm_last,
        dtype_descr=tuple(np.dtype(dtype).descr),
        size=size,
        key=key,  # store the original key for lookup
    )
```
### Key Design Decisions
1. **Hash-based shortening**: Uses a truncated SHA-256 digest, making collisions between distinct keys vanishingly unlikely
2. **Preserve original key**: Store the original descriptive key in the `_Token` for debugging and lookups
3. **Custom equality**: The `__eq__` and `__hash__` methods ignore the `key` field to ensure tokens are compared by their actual shm properties
4. **Platform detection**: Only applies shortening on macOS (`sys.platform == 'darwin'`)
### Edge Cases to Consider
- Token serialization across processes (the `key` field must survive IPC)
- Token lookup in dictionaries and caches
- Debugging output (use `key` field for human-readable names)
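A quick illustrative check of the shortening helper, using the example name from the problem statement above:
```python
key = 'piker_quoter_tsla.nasdaq.ib_hist_1m'  # 35 chars, over the limit
short = _shorten_key_for_macos(key)
print(short)  # -> 'p_<16 hex chars>', 18 chars total

assert len('/' + short) <= 31  # fits shm_open()'s PSHMNAMLEN
assert _shorten_key_for_macos('short_key') == 'short_key'  # short keys pass through
```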
---
## 3. Shared Memory Cleanup Race Conditions
### Problem
During teardown, shared memory segments may be unlinked by one process while another is still trying to clean them up; the resulting `FileNotFoundError`, if uncaught, crashes the application.
### Error Message
```
FileNotFoundError: [Errno 2] No such file or directory: '/p_74c86c7228dd773b'
```
### Root Cause
In multi-process architectures like `tractor`, multiple processes may attempt to clean up shared resources simultaneously. Race conditions during shutdown can cause:
1. Process A unlinks the shared memory
2. Process B tries to unlink the same memory → `FileNotFoundError`
3. Uncaught exception crashes Process B
### Solution
Add defensive error handling to catch and log cleanup races:
**File**: `piker/data/_sharedmem.py`
```python
class ShmArray:
    # ... existing code ...

    def destroy(self) -> None:
        '''
        Destroy the shared memory segment and cleanup OS resources.

        '''
        if _USE_POSIX:
            # we manually unlink to bypass all the "resource tracker"
            # nonsense meant for non-SC systems.
            shm = self._shm
            name = shm.name
            try:
                shm_unlink(name)
            except FileNotFoundError:
                # might be a teardown race where another process
                # already unlinked it - this is fine, just log it
                log.warning(f'Shm for {name} already unlinked?')

        # also cleanup the index counters
        if hasattr(self, '_first'):
            try:
                self._first.destroy()
            except FileNotFoundError:
                log.warning('First-index shm already unlinked?')

        if hasattr(self, '_last'):
            try:
                self._last.destroy()
            except FileNotFoundError:
                log.warning('Last-index shm already unlinked?')


class SharedInt:
    # ... existing code ...

    def destroy(self) -> None:
        if _USE_POSIX:
            # we manually unlink to bypass all the "resource tracker"
            # nonsense meant for non-SC systems.
            name = self._shm.name
            try:
                shm_unlink(name)
            except FileNotFoundError:
                # might be a teardown race here?
                log.warning(f'Shm for {name} already unlinked?')
```
### Implementation Notes
- This fix is platform-agnostic but particularly important on macOS where the shortened names make debugging harder
- The warnings help identify cleanup races during development
- Consider adding metrics/counters if cleanup races become frequent
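The same defensive pattern can be written more compactly with `contextlib.suppress`; here is a generic sketch against the stdlib `multiprocessing.shared_memory` API (the helper name is hypothetical):
```python
from contextlib import suppress
from multiprocessing.shared_memory import SharedMemory


def unlink_quietly(shm: SharedMemory) -> None:
    '''
    Close and unlink a segment, tolerating teardown races where
    another process already unlinked it (hypothetical helper).
    '''
    shm.close()
    with suppress(FileNotFoundError):
        shm.unlink()
```
This is equivalent to the `try`/`except` form above, minus the warning log, and reads as one statement per resource.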
---
## 4. Async Runtime (Trio/AsyncIO) Coordination
### Problem
The `TrioTaskExited` error occurs when trio tasks are cancelled while asyncio tasks are still running, indicating improper coordination between the two async runtimes.
### Error Message
```
tractor._exceptions.TrioTaskExited: but the child `asyncio` task is still running?
>>
|_<Task pending name='Task-2' coro=<wait_on_coro_final_result()> ...>
```
### Root Cause
`tractor` uses "guest mode" to run trio as a guest in asyncio's event loop (or vice versa). The error occurs when:
1. A trio task is cancelled (e.g., user closes the UI)
2. The cancellation propagates to cleanup handlers
3. Cleanup tries to exit while asyncio tasks are still running
4. The `translate_aio_errors` context manager detects this inconsistent state (see the runtime-agnostic sketch below)
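The invariant being violated is runtime-agnostic: a parent scope must cancel *and await* its child tasks before exiting. A plain-`asyncio` sketch of the same rule:
```python
import asyncio


async def parent() -> None:
    child = asyncio.create_task(asyncio.sleep(3600))
    try:
        await asyncio.sleep(1)  # do some work
    finally:
        # always cancel *and await* the child before the parent scope
        # exits; leaving it running is the analogue of `TrioTaskExited`
        child.cancel()
        try:
            await child
        except asyncio.CancelledError:
            pass


asyncio.run(parent())
```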
### Current State
This issue is **partially resolved** by the other fixes (socket credentials and shared memory), which eliminate the underlying errors that trigger premature cancellation. However, it may still occur in edge cases.
### Potential Solutions
#### Option 1: Improve Cancellation Propagation (Tractor-level)
**File**: `tractor/to_asyncio.py`
```python
from contextlib import asynccontextmanager

import trio


@asynccontextmanager
async def translate_aio_errors(
    chan,
    wait_on_aio_task: bool = False,
    suppress_graceful_exits: bool = False,
):
    '''
    Context manager to translate asyncio errors to trio equivalents.

    '''
    try:
        yield
    except trio.Cancelled:
        # when trio is cancelled, ensure asyncio tasks are also cancelled
        if wait_on_aio_task:
            # give asyncio tasks a chance to cleanup
            await trio.lowlevel.checkpoint()

            # `aio_task` stands in for however the wrapped asyncio
            # task handle is tracked; check if it is still running
            if aio_task and not aio_task.done():
                # cancel it gracefully
                aio_task.cancel()

                # wait briefly for the cancellation to complete
                with trio.move_on_after(0.5):  # 500ms timeout
                    await wait_for_aio_task_completion(aio_task)

        raise  # re-raise the cancellation
```
#### Option 2: Proper Shutdown Sequence (Application-level)
**File**: `piker/brokers/ib/api.py` (or similar broker modules)
```python
async def load_clients_for_trio(
    client: Client,
    ...
) -> None:
    '''
    Load asyncio client and keep it running for trio.

    '''
    try:
        # setup client
        await client.connect()

        # keep alive - but make it cancellable
        await trio.sleep_forever()

    except trio.Cancelled:
        # explicit cleanup before propagating cancellation
        log.info('Shutting down asyncio client gracefully')

        # disconnect client
        if client.isConnected():
            await client.disconnect()

        # small delay to let asyncio cleanup
        await trio.sleep(0.1)
        raise  # now safe to propagate
```
#### Option 3: Detection and Warning (Current Approach)
The current code detects the issue and raises a clear error. This is acceptable if:
1. The error is rare (only during abnormal shutdown)
2. It doesn't cause data loss
3. Logs provide enough info for debugging
### Recommended Approach
For **piker**: Implement Option 2 (proper shutdown sequence) in broker modules where asyncio is used.
For **tractor**: Consider Option 1 (improved cancellation propagation) as a library-level enhancement.
### Testing
Test the fix by:
```python
# test graceful shutdown
async def test_asyncio_trio_shutdown():
    async with open_channel_from(...) as (first, chan):
        # do some work
        await chan.send(msg)

        # trigger cancellation
        raise KeyboardInterrupt

    # should cleanup without a `TrioTaskExited` error
```
---
## Summary of Changes
### Files Modified in Piker
1. **`piker/data/_sharedmem.py`**
- Added `_shorten_key_for_macos()` function
- Modified `_Token` class to store original `key`
- Modified `_make_token()` to use shortened names on macOS
- Added `FileNotFoundError` handling in `destroy()` methods
2. **`piker/ui/_display.py`**
- Removed assertion that checked for 'hist' in shm name (incompatible with shortened names)
### Files to Modify in Tractor (Recommended)
1. **`tractor/ipc/_uds.py`**
- Make socket credential imports platform-conditional
- Handle macOS-specific `LOCAL_PEERCRED`
2. **`tractor/to_asyncio.py`** (Optional)
- Improve cancellation propagation between trio and asyncio
- Add graceful shutdown timeout for asyncio tasks
### Platform Detection Pattern
Use this pattern consistently:
```python
import sys

if sys.platform == 'darwin':  # macOS
    # macOS-specific code
    pass

elif sys.platform == 'linux':  # Linux
    # Linux-specific code
    pass

else:
    # other platforms / fallback
    pass
```
### Testing Checklist
- [ ] Test on macOS (Darwin)
- [ ] Test on Linux
- [ ] Test shared memory with names > 31 chars (see the sketch after this checklist)
- [ ] Test multi-process cleanup race conditions
- [ ] Test graceful shutdown (Ctrl+C)
- [ ] Test abnormal shutdown (kill signal)
- [ ] Verify no leaked segments (check `/dev/shm` on Linux; note that macOS has no simple listing for POSIX shm, since `ipcs -m` only shows SysV segments)
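A minimal `pytest` sketch for the name-length item, assuming the `_make_token()` shown in section 2 (the module path and dtype here are illustrative):
```python
import sys

import numpy as np
import pytest

from piker.data._sharedmem import _make_token


@pytest.mark.skipif(
    sys.platform != 'darwin',
    reason='exercises the macOS name limit',
)
def test_long_key_fits_macos_limit():
    key = 'piker_quoter_tsla.nasdaq.ib_hist_1m'  # 35 chars
    token = _make_token(key=key, size=1000, dtype=np.dtype('f8'))

    # every OS-level name (plus the leading '/') must fit `PSHMNAMLEN`
    for name in (
        token.shm_name,
        token.shm_first_index_name,
        token.shm_last_index_name,
    ):
        assert len('/' + name) <= 31

    # the original descriptive key survives for lookups
    assert token.key == key
```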
---
## Additional Resources
- **macOS System Headers**:
- `/usr/include/sys/un.h` - Unix domain socket constants
- `sys/posix_shm_internal.h` (XNU kernel source, not shipped in `/usr/include`) - shared memory name limit (`PSHMNAMLEN`)
- **Python Documentation**:
- [`socket` module](https://docs.python.org/3/library/socket.html)
- [`multiprocessing.shared_memory`](https://docs.python.org/3/library/multiprocessing.shared_memory.html)
- **Trio/AsyncIO**:
- [Trio Guest Mode](https://trio.readthedocs.io/en/stable/reference-lowlevel.html#using-guest-mode-to-run-trio-on-top-of-other-event-loops)
- [Tractor Documentation](https://github.com/goodboy/tractor)
---
## Contributing
When implementing these fixes in your own project:
1. **Test thoroughly** on both macOS and Linux
2. **Add platform guards** to prevent cross-platform breakage
3. **Document platform-specific behavior** in code comments
4. **Consider CI/CD** testing on multiple platforms
5. **Handle edge cases** gracefully with proper logging
If you find additional macOS-specific issues, please contribute to this guide!