ADR-0002: Content-Addressed File Store (CAS)¶

Status: Accepted

Context¶

CALISTA must store user artifacts immutably with strong provenance. We want:

A byte store where identity is the content digest (e.g., sha256:<hex>).
Simple, testable semantics for writing/reading blobs across backends (local FS, object stores, memory).
Clear separation between content identity and human/path aliases.

Historically, coupling path rules into the store complicates atomicity, recovery, and audits.

Decision¶

Adopt a pure CAS file store with the following interface and semantics:

Identity: Blobs are addressed only by digest ("<algorithm>:<hex>", lowercase).
Writes:
open_write(fsync: bool = True) -> Writer returns a staging handle.
Writer.write(bytes) appends to a temp object.
Writer.commit(expected_digest: str | None) -> BlobStat finalizes the blob.
Idempotency: commit() SHOULD be idempotent—if the digest already exists, discard staged data and return the existing blob’s BlobStat.
Leak safety: Exiting a writer context without commit() MUST discard staged data (abort()) and close resources.
Reads:
open_read(digest) -> BinaryIO returns a caller-closed stream.
stat(digest) -> BlobStat returns cheap metadata (MUST NOT read the body; size MAY be None if expensive).
Conveniences (non-abstract helpers):
exists(digest) -> bool
put_path(path, ...) -> BlobStat
put_bytes(data, ...) -> BlobStat
put_stream(stream, ...) -> BlobStat
readall(digest) -> bytes
Durability:
When fsync=True, implementations SHOULD durably install the blob on commit (e.g., fsync file and parent dir on POSIX).
Placement SHOULD be atomic (temp + os.replace or equivalent).
Error taxonomy (module-scoped exceptions):
FileStoreError (base), NotFound, ReadOnlyError, IntegrityError.

Algorithm: The canonical and only supported digest algorithm is SHA-256. All digests are formatted sha256: (lowercase). Changing the canonical algorithm would require a separate ADR and a migration plan.

Aliases/paths are out of scope here. Any mapping of (namespace, relpath) → digest is handled outside the CAS as a DB-backed projection; details are deferred to future ADR if needed.

Rationale¶

Separation of concerns: CAS remains verifiable and minimal; aliasing is a rebuildable read model.
Recoverability: If alias state is lost/corrupt, we can replay events and rescan the CAS.
Determinism & dedup: Content identity prevents accidental duplication and simplifies audits/GC.
Portability: A thin interface makes alternative backends straightforward.

Consequences¶

Callers ingest bytes once to get a digest, then persist only the digest in domain state/events.
Any user-facing or workflow paths are derived via a projection (separate ADR).
Backends must implement safe staging, atomic installs, and cheap stat().

Alternatives Considered¶

Path-addressed store (no CAS): simpler ergonomics; loses dedup/provenance. Rejected.
CAS with built-in aliases: convenient but couples identity to paths and complicates atomicity/rollbacks. Rejected.
Mandate DataLad/fsspec now: useful later; adds complexity and constraints today. Deferred integration.

Implementation Notes¶

Digest format: "<algorithm>:<hex>" (lowercase algorithm & hex). Future multi-algorithm support may add negotiation/migration.
BlobStat = { digest, size?, uri? }; uri is a hint (e.g., file:///…), never identity.
Writers: commit() and abort() MUST be idempotent; abort() MUST be a no-op after a successful commit().

Testing¶

Writer lifecycle: write/commit/abort/context-exit (no leaks; temps cleaned).
Duplicate ingest returns the same digest and does not rewrite bytes.
stat() never reads blob bodies; remote backends map to HEAD/metadata calls.
Durability hint honored where the platform permits.