piker/docs/macos/compatibility-fixes.md


macOS Compatibility Fixes for Piker/Tractor

This guide documents macOS-specific issues encountered when running piker on macOS and their solutions. These fixes address platform differences between Linux and macOS in areas like socket credentials, shared memory naming, and async runtime coordination.

Table of Contents

  1. Socket Credential Passing
  2. Shared Memory Name Length Limits
  3. Shared Memory Cleanup Race Conditions
  4. Async Runtime (Trio/AsyncIO) Coordination

1. Socket Credential Passing

Problem

On Linux, tractor uses the SO_PASSCRED and SO_PEERCRED socket options for Unix domain socket credential passing. macOS doesn't define these constants, so importing them raises an AttributeError.

# Linux code that fails on macOS
from socket import SO_PASSCRED, SO_PEERCRED  # AttributeError on macOS

Error Message

AttributeError: module 'socket' has no attribute 'SO_PASSCRED'

Root Cause

  • Linux: Uses SO_PASSCRED (to enable credential passing) and SO_PEERCRED (to retrieve peer credentials)
  • macOS: Uses LOCAL_PEERCRED (value 0x0001) instead, read at the SOL_LOCAL option level, and doesn't require enabling credential passing

Solution

Make the socket credential imports platform-conditional:

File: tractor/ipc/_uds.py (or equivalent in piker if duplicated)

import struct
import sys
from socket import (
    socket,
    AF_UNIX,
    SOCK_STREAM,
    SOL_SOCKET,
)

# Platform-specific credential passing constants
if sys.platform == 'linux':
    from socket import SO_PASSCRED, SO_PEERCRED
elif sys.platform == 'darwin':  # macOS
    # macOS uses LOCAL_PEERCRED instead of SO_PEERCRED
    # and doesn't need SO_PASSCRED
    LOCAL_PEERCRED = 0x0001
    SO_PEERCRED = LOCAL_PEERCRED  # Alias for compatibility
    SO_PASSCRED = None  # Not needed on macOS
else:
    # Other platforms - may need additional handling
    SO_PASSCRED = None
    SO_PEERCRED = None

# When creating a socket (Linux only - macOS needs no enable step)
if SO_PASSCRED is not None:
    sock.setsockopt(SOL_SOCKET, SO_PASSCRED, 1)

# When getting peer credentials
if SO_PEERCRED is not None:
    # NOTE: on macOS the option level is SOL_LOCAL (0), not SOL_SOCKET,
    # and the payload is a `struct xucred`, not the Linux pid/uid/gid ints
    level = 0 if sys.platform == 'darwin' else SOL_SOCKET
    creds = sock.getsockopt(level, SO_PEERCRED, struct.calcsize('3i'))

Implementation Notes

  • The LOCAL_PEERCRED value 0x0001 is specific to macOS (from <sys/un.h>) and is read at the SOL_LOCAL (0) option level, returning a struct xucred rather than the Linux pid/uid/gid triple
  • macOS doesn't require explicitly enabling credential passing like Linux does
  • Consider using ctypes or cffi for a more robust solution if available

2. Shared Memory Name Length Limits

Problem

macOS limits POSIX shared memory names to 31 characters (defined as PSHMNAMLEN in <sys/posix_shm_internal.h>). Piker generates long descriptive names that exceed this limit, causing OSError.

# Long name that works on Linux but fails on macOS
shm_name = "piker_quoter_tsla.nasdaq.ib_hist_1m"  # 35 chars - too long!

Error Message

OSError: [Errno 63] File name too long: '/piker_quoter_tsla.nasdaq.ib_hist_1m'

Root Cause

  • Linux: Supports shared memory names up to 255 characters
  • macOS: Limits to 31 characters (including leading /)
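
A quick way to sanity-check candidate names against the macOS limit (the constant value is from the macOS headers; the helper name is illustrative):

```python
PSHMNAMLEN = 31  # macOS POSIX shm name limit, leading '/' included

def fits_macos_shm_limit(key: str) -> bool:
    # shm_open() names carry a leading '/', which counts toward the limit
    return len('/' + key.lstrip('/')) <= PSHMNAMLEN

print(fits_macos_shm_limit('piker_quoter_tsla.nasdaq.ib_hist_1m'))  # False
print(fits_macos_shm_limit('p_74c86c7228dd773b'))                   # True
```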

Solution

Implement automatic name shortening for macOS while preserving the original key for lookups:

File: piker/data/_sharedmem.py

import hashlib
import sys

def _shorten_key_for_macos(key: str) -> str:
    '''
    macOS has a 31 character limit for POSIX shared memory names.
    Hash long keys to fit within this limit while maintaining uniqueness.
    '''
    # macOS shm_open() has a 31 char limit (PSHMNAMLEN)
    # Use format: /p_<hash16> where hash is first 16 hex chars of sha256
    # This gives us: / + p_ + 16 hex chars = 19 chars, well under limit
    # We keep the 'p' prefix to indicate it's from piker
    # NOTE: the OS-level name gets a leading '/', which counts toward the
    # 31-char limit, so only keys up to 30 chars are safe to pass through
    if len(key) <= 30:
        return key

    # Create a hash of the full key
    key_hash = hashlib.sha256(key.encode()).hexdigest()[:16]
    short_key = f'p_{key_hash}'
    return short_key


class _Token(Struct, frozen=True):
    '''
    Internal representation of a shared memory "token"
    which can be used to key a system wide post shm entry.
    '''
    shm_name: str  # actual OS-level name (may be shortened on macOS)
    shm_first_index_name: str
    shm_last_index_name: str
    dtype_descr: tuple
    size: int  # in struct-array index / row terms
    key: str | None = None  # original descriptive key (for lookup)

    def __eq__(self, other) -> bool:
        '''
        Compare tokens based on shm names and dtype, ignoring the key field.
        The key field is only used for lookups, not for token identity.
        '''
        if not isinstance(other, _Token):
            return False
        return (
            self.shm_name == other.shm_name
            and self.shm_first_index_name == other.shm_first_index_name
            and self.shm_last_index_name == other.shm_last_index_name
            and self.dtype_descr == other.dtype_descr
            and self.size == other.size
        )

    def __hash__(self) -> int:
        '''Hash based on the same fields used in __eq__'''
        return hash((
            self.shm_name,
            self.shm_first_index_name,
            self.shm_last_index_name,
            self.dtype_descr,
            self.size,
        ))


def _make_token(
    key: str,
    size: int,
    dtype: np.dtype | None = None,
) -> _Token:
    '''
    Create a serializable token that uniquely identifies a shared memory segment.
    '''
    if dtype is None:
        dtype = def_iohlcv_fields

    # On macOS, shorten long keys to fit the 31-char limit
    if sys.platform == 'darwin':
        shm_name = _shorten_key_for_macos(key)
        shm_first = _shorten_key_for_macos(key + "_first")
        shm_last = _shorten_key_for_macos(key + "_last")
    else:
        shm_name = key
        shm_first = key + "_first"
        shm_last = key + "_last"

    return _Token(
        shm_name=shm_name,
        shm_first_index_name=shm_first,
        shm_last_index_name=shm_last,
        dtype_descr=tuple(np.dtype(dtype).descr),
        size=size,
        key=key,  # Store original key for lookup
    )

Key Design Decisions

  1. Hash-based shortening: Uses SHA256 to ensure uniqueness and avoid collisions
  2. Preserve original key: Store the original descriptive key in the _Token for debugging and lookups
  3. Custom equality: The __eq__ and __hash__ methods ignore the key field to ensure tokens are compared by their actual shm properties
  4. Platform detection: Only applies shortening on macOS (sys.platform == 'darwin')

Edge Cases to Consider

  • Token serialization across processes (the key field must survive IPC)
  • Token lookup in dictionaries and caches
  • Debugging output (use key field for human-readable names)
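
The hash-based branch of the helper is deterministic, so independent processes derive the same OS-level name from the same descriptive key — a property the cross-process lookups above depend on. A minimal reproduction (standalone function name; the real helper is _shorten_key_for_macos()):

```python
import hashlib

def shorten_key(key: str) -> str:
    # hash-based shortening as used by _shorten_key_for_macos():
    # 'p_' prefix + first 16 hex chars of sha256 = 18 chars, under the limit
    return 'p_' + hashlib.sha256(key.encode()).hexdigest()[:16]

long_key = 'piker_quoter_tsla.nasdaq.ib_hist_1m'
print(shorten_key(long_key))                           # 18-char name
print(shorten_key(long_key) == shorten_key(long_key))  # deterministic: True
```

Distinct keys collide only if their sha256 prefixes collide, which is negligible at 64 bits for the number of segments piker creates.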

3. Shared Memory Cleanup Race Conditions

Problem

During teardown, shared memory segments may be unlinked by one process while another is still trying to clean them up, causing FileNotFoundError to crash the application.

Error Message

FileNotFoundError: [Errno 2] No such file or directory: '/p_74c86c7228dd773b'

Root Cause

In multi-process architectures like tractor, multiple processes may attempt to clean up shared resources simultaneously. Race conditions during shutdown can cause:

  1. Process A unlinks the shared memory
  2. Process B tries to unlink the same memory → FileNotFoundError
  3. Uncaught exception crashes Process B

Solution

Add defensive error handling to catch and log cleanup races:

File: piker/data/_sharedmem.py

class ShmArray:
    # ... existing code ...

    def destroy(self) -> None:
        '''
        Destroy the shared memory segment and cleanup OS resources.
        '''
        if _USE_POSIX:
            # We manually unlink to bypass all the "resource tracker"
            # nonsense meant for non-SC systems.
            shm = self._shm
            name = shm.name
            try:
                shm_unlink(name)
            except FileNotFoundError:
                # Might be a teardown race where another process
                # already unlinked it - this is fine, just log it
                log.warning(f'Shm for {name} already unlinked?')

        # Also cleanup the index counters
        if hasattr(self, '_first'):
            try:
                self._first.destroy()
            except FileNotFoundError:
                log.warning('First index shm already unlinked?')

        if hasattr(self, '_last'):
            try:
                self._last.destroy()
            except FileNotFoundError:
                log.warning('Last index shm already unlinked?')


class SharedInt:
    # ... existing code ...

    def destroy(self) -> None:
        if _USE_POSIX:
            # We manually unlink to bypass all the "resource tracker"
            # nonsense meant for non-SC systems.
            name = self._shm.name
            try:
                shm_unlink(name)
            except FileNotFoundError:
                # might be a teardown race here?
                log.warning(f'Shm for {name} already unlinked?')

Implementation Notes

  • This fix is platform-agnostic but particularly important on macOS where the shortened names make debugging harder
  • The warnings help identify cleanup races during development
  • Consider adding metrics/counters if cleanup races become frequent
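
The same tolerate-the-race pattern can be demonstrated with the stdlib alone; this sketch mimics a second process having already unlinked the segment by simply unlinking twice (function name illustrative):

```python
from multiprocessing import shared_memory

def safe_unlink(shm: shared_memory.SharedMemory) -> bool:
    '''
    Unlink a segment, tolerating another process having unlinked it first.
    Returns True if this call performed the unlink.
    '''
    try:
        shm.unlink()
        return True
    except FileNotFoundError:
        # teardown race: someone else already removed it - fine, just report
        return False

shm = shared_memory.SharedMemory(create=True, size=64)
first = safe_unlink(shm)   # True - we removed the OS entry
second = safe_unlink(shm)  # False - already gone, but no crash
shm.close()
print(first, second)       # True False
```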

4. Async Runtime (Trio/AsyncIO) Coordination

Problem

The TrioTaskExited error occurs when trio tasks are cancelled while asyncio tasks are still running, indicating improper coordination between the two async runtimes.

Error Message

tractor._exceptions.TrioTaskExited: but the child `asyncio` task is still running?
>>
 |_<Task pending name='Task-2' coro=<wait_on_coro_final_result()> ...>

Root Cause

tractor uses "guest mode" to run trio as a guest in asyncio's event loop (or vice versa). The error occurs when:

  1. A trio task is cancelled (e.g., user closes the UI)
  2. The cancellation propagates to cleanup handlers
  3. Cleanup tries to exit while asyncio tasks are still running
  4. The translate_aio_errors context manager detects this inconsistent state

Current State

This issue is partially resolved by the other fixes (socket credentials and shared memory), which eliminate the underlying errors that trigger premature cancellation. However, it may still occur in edge cases.

Potential Solutions

Option 1: Improve Cancellation Propagation (Tractor-level)

File: tractor/to_asyncio.py

from contextlib import asynccontextmanager

import trio

@asynccontextmanager
async def translate_aio_errors(
    chan,
    aio_task,  # handle to the wrapped asyncio task (illustrative parameter)
    wait_on_aio_task: bool = False,
    suppress_graceful_exits: bool = False,
):
    '''
    Context manager to translate asyncio errors to trio equivalents.
    '''
    try:
        yield
    except trio.Cancelled:
        # When trio is cancelled, ensure asyncio tasks are also cancelled
        if wait_on_aio_task:
            # Give asyncio tasks a chance to cleanup
            await trio.lowlevel.checkpoint()

            # Check if the asyncio task is still running
            if aio_task and not aio_task.done():
                # Cancel it gracefully
                aio_task.cancel()

                # Wait briefly for cancellation (helper name illustrative)
                with trio.move_on_after(0.5):  # 500ms timeout
                    await wait_for_aio_task_completion(aio_task)

        raise  # Re-raise the cancellation

Option 2: Proper Shutdown Sequence (Application-level)

File: piker/brokers/ib/api.py (or similar broker modules)

async def load_clients_for_trio(
    client: Client,
    ...
) -> None:
    '''
    Load asyncio client and keep it running for trio.
    '''
    try:
        # Setup client
        await client.connect()

        # Keep alive - but make it cancellable
        await trio.sleep_forever()

    except trio.Cancelled:
        # Explicit cleanup before propagating cancellation
        log.info("Shutting down asyncio client gracefully")

        # Disconnect client
        if client.isConnected():
            await client.disconnect()

        # Small delay to let asyncio cleanup
        await trio.sleep(0.1)

        raise  # Now safe to propagate

Option 3: Detection and Warning (Current Approach)

The current code detects the issue and raises a clear error. This is acceptable if:

  1. The error is rare (only during abnormal shutdown)
  2. It doesn't cause data loss
  3. Logs provide enough info for debugging

For piker: Implement Option 2 (proper shutdown sequence) in broker modules where asyncio is used.

For tractor: Consider Option 1 (improved cancellation propagation) as a library-level enhancement.
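
On the asyncio side of Option 1, the "cancel, then wait with a bound" step can be sketched with stdlib asyncio alone — the function names below are illustrative, not tractor's API:

```python
import asyncio

async def cancel_and_drain(task: asyncio.Task, timeout: float = 0.5) -> bool:
    '''
    Cancel an asyncio task and wait (bounded) for it to actually finish.
    Returns True if the task completed within the timeout.
    '''
    task.cancel()
    # asyncio.wait() never raises on timeout; pending tasks are just returned
    await asyncio.wait({task}, timeout=timeout)
    return task.done()

async def main() -> bool:
    async def worker():
        await asyncio.sleep(100)  # stands in for a long-running broker feed

    task = asyncio.create_task(worker())
    await asyncio.sleep(0)  # let the worker start before cancelling
    return await cancel_and_drain(task)

print(asyncio.run(main()))  # True
```

A False return here corresponds to the state TrioTaskExited complains about: the asyncio task outlived its bounded shutdown window.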

Testing

Test the fix by:

# Test graceful shutdown
async def test_asyncio_trio_shutdown():
    async with open_channel_from(...) as (first, chan):
        # Do some work
        await chan.send(msg)

        # Trigger cancellation
        raise KeyboardInterrupt

    # Should cleanup without TrioTaskExited error

Summary of Changes

Files Modified in Piker

  1. piker/data/_sharedmem.py
    • Added _shorten_key_for_macos() function
    • Modified _Token class to store original key
    • Modified _make_token() to use shortened names on macOS
    • Added FileNotFoundError handling in destroy() methods
  2. piker/ui/_display.py
    • Removed assertion that checked for hist in shm name (incompatible with shortened names)

Files Modified in Tractor

  1. tractor/ipc/_uds.py
    • Make socket credential imports platform-conditional
    • Handle macOS-specific LOCAL_PEERCRED
  2. tractor/to_asyncio.py (Optional)
    • Improve cancellation propagation between trio and asyncio
    • Add graceful shutdown timeout for asyncio tasks

Platform Detection Pattern

Use this pattern consistently:

import sys

if sys.platform == 'darwin':  # macOS
    # macOS-specific code
    pass
elif sys.platform == 'linux':  # Linux
    # Linux-specific code
    pass
else:
    # Other platforms / fallback
    pass

Testing Checklist

  • Test on macOS (Darwin)
  • Test on Linux
  • Test shared memory with names > 31 chars
  • Test multi-process cleanup race conditions
  • Test graceful shutdown (Ctrl+C)
  • Test abnormal shutdown (kill signal)
  • Verify no leaked segments (check /dev/shm on Linux; macOS exposes no filesystem view of POSIX shm and ipcs -m lists only SysV segments, so rely on programmatic checks in tests)


Contributing

When implementing these fixes in your own project:

  1. Test thoroughly on both macOS and Linux
  2. Add platform guards to prevent cross-platform breakage
  3. Document platform-specific behavior in code comments
  4. Consider CI/CD testing on multiple platforms
  5. Handle edge cases gracefully with proper logging

If you find additional macOS-specific issues, please contribute to this guide!