ADR-0002: Content-Addressed File Store (CAS)
Status: Accepted
Context
CALISTA must store user artifacts immutably with strong provenance. We want:
- A byte store where identity is the content digest (e.g., `sha256:<hex>`).
- Simple, testable semantics for writing/reading blobs across backends (local FS, object stores, memory).
- Clear separation between content identity and human/path aliases.
Historically, coupling path rules into the store complicates atomicity, recovery, and audits.
Decision
Adopt a pure CAS file store with the following interface and semantics:
- Identity: Blobs are addressed only by `digest` (`"<algorithm>:<hex>"`, lowercase).
- Writes:
  - `open_write(fsync: bool = True) -> Writer` returns a staging handle.
  - `Writer.write(bytes)` appends to a temp object.
  - `Writer.commit(expected_digest: str | None) -> BlobStat` finalizes the blob.
  - Idempotency: `commit()` SHOULD be idempotent: if the digest already exists, discard staged data and return the existing blob's `BlobStat`.
  - Leak safety: Exiting a writer context without `commit()` MUST discard staged data (`abort()`) and close resources.
- Reads:
  - `open_read(digest) -> BinaryIO` returns a caller-closed stream.
  - `stat(digest) -> BlobStat` returns cheap metadata (MUST NOT read the body; `size` MAY be `None` if expensive).
- Conveniences (non-abstract helpers):
  - `exists(digest) -> bool`
  - `put_path(path, ...) -> BlobStat`
  - `put_bytes(data, ...) -> BlobStat`
  - `put_stream(stream, ...) -> BlobStat`
  - `readall(digest) -> bytes`
- Durability:
  - When `fsync=True`, implementations SHOULD durably install the blob on commit (e.g., `fsync` file and parent dir on POSIX).
  - Placement SHOULD be atomic (temp + `os.replace` or equivalent).
- Error taxonomy (module-scoped exceptions): `FileStoreError` (base), `NotFound`, `ReadOnlyError`, `IntegrityError`.
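The write and read semantics above can be illustrated with a small in-memory backend. This is a minimal sketch, not the project's implementation: the class name `MemoryStore`, the default `expected_digest=None`, and raising `FileStoreError` when `commit()` follows `abort()` are assumptions made for the example, and the convenience helpers and `ReadOnlyError` are left out.

```python
from __future__ import annotations

import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional


class FileStoreError(Exception):
    """Base class for store errors."""


class NotFound(FileStoreError):
    """No blob with the requested digest."""


class IntegrityError(FileStoreError):
    """Staged bytes do not match the expected digest."""


@dataclass(frozen=True)
class BlobStat:
    digest: str
    size: Optional[int] = None
    uri: Optional[str] = None


class Writer:
    """Staging handle: bytes stay private to the writer until commit()."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        self._blobs = blobs
        self._staged: Optional[bytearray] = bytearray()
        self._result: Optional[BlobStat] = None

    def write(self, data: bytes) -> int:
        if self._staged is None:
            raise FileStoreError("writer is closed")
        self._staged.extend(data)
        return len(data)

    def commit(self, expected_digest: Optional[str] = None) -> BlobStat:
        if self._result is not None:
            return self._result  # repeated commit returns the same stat
        if self._staged is None:
            raise FileStoreError("writer was aborted")  # assumed behaviour
        data = bytes(self._staged)
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        if expected_digest is not None and expected_digest != digest:
            self.abort()
            raise IntegrityError(f"expected {expected_digest}, got {digest}")
        # If the digest already exists, the staged copy is simply dropped.
        self._blobs.setdefault(digest, data)
        self._staged = None
        self._result = BlobStat(digest, len(self._blobs[digest]))
        return self._result

    def abort(self) -> None:
        self._staged = None  # no-op after commit() or an earlier abort()

    def __enter__(self) -> "Writer":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.abort()  # leaving the context without commit() discards staging


class MemoryStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def open_write(self, fsync: bool = True) -> Writer:
        return Writer(self._blobs)  # fsync has no meaning in memory

    def open_read(self, digest: str) -> BinaryIO:
        if digest not in self._blobs:
            raise NotFound(digest)
        return io.BytesIO(self._blobs[digest])

    def stat(self, digest: str) -> BlobStat:
        if digest not in self._blobs:
            raise NotFound(digest)
        return BlobStat(digest, len(self._blobs[digest]))
```

A caller would typically write `with store.open_write() as w: w.write(data); stat = w.commit()` and keep only `stat.digest`; if an exception interrupts the block before `commit()`, the staged bytes are dropped on exit.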
Algorithm: The canonical and only supported digest algorithm is SHA-256. All digests are formatted `sha256:<hex>` (lowercase). Changing the canonical algorithm would require a separate ADR and a migration plan.
Aliases/paths are out of scope here. Any mapping of `(namespace, relpath) → digest` is handled outside the CAS as a DB-backed projection; details are deferred to a future ADR if needed.
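As an illustration of the canonical digest format, the helpers below compute and validate `sha256:<hex>` strings. The function names `digest_of` and `parse_digest` are hypothetical and not part of the interface described in this ADR; rejecting non-canonical input with `ValueError` is likewise an assumption.

```python
from __future__ import annotations

import hashlib
import re

# Lowercase algorithm name, a colon, then exactly 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def digest_of(data: bytes) -> str:
    """Return the canonical digest string for a byte string."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def parse_digest(value: str) -> tuple[str, str]:
    """Split a canonical digest into (algorithm, hex); reject anything else."""
    if _DIGEST_RE.fullmatch(value) is None:
        raise ValueError(f"not a canonical sha256 digest: {value!r}")
    algorithm, _, hex_part = value.partition(":")
    return algorithm, hex_part
```

Validating at the boundary keeps uppercase or truncated digests from ever becoming store keys, so two spellings of one digest cannot refer to different entries.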
Rationale
- Separation of concerns: CAS remains verifiable and minimal; aliasing is a rebuildable read model.
- Recoverability: If alias state is lost/corrupt, we can replay events and rescan the CAS.
- Determinism & dedup: Content identity prevents accidental duplication and simplifies audits/GC.
- Portability: A thin interface makes alternative backends straightforward.
Consequences
- Callers ingest bytes once to get a `digest`, then persist only the digest in domain state/events.
- Any user-facing or workflow paths are derived via a projection (separate ADR).
- Backends must implement safe staging, atomic installs, and cheap `stat()`.
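For a local filesystem backend, safe staging and atomic install could look roughly like the sketch below. The function name `install_blob` and the sharded layout `<root>/<first two hex chars>/<rest>` are assumptions made for illustration; the ADR does not prescribe an on-disk layout.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def install_blob(root: Path, data: bytes, fsync: bool = True) -> Path:
    """Stage data in a temp file, then atomically rename it into place."""
    hexdigest = hashlib.sha256(data).hexdigest()
    # Hypothetical layout: <root>/<first two hex chars>/<remaining hex>.
    final = root / hexdigest[:2] / hexdigest[2:]
    if final.exists():
        return final  # duplicate ingest: nothing is rewritten
    final.parent.mkdir(parents=True, exist_ok=True)
    # Stage in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final.parent, prefix=".staging-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            if fsync:
                os.fsync(tmp.fileno())  # persist file contents
        os.replace(tmp_name, final)  # atomic placement
    except BaseException:
        # Leak safety: do not leave a staged temp behind on failure.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    if fsync and hasattr(os, "O_DIRECTORY"):
        # POSIX only: persist the directory entry created by the rename.
        dir_fd = os.open(final.parent, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    return final
```

With this layout, `stat()` can be answered from `os.stat` on the final path alone, without opening the blob.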
Alternatives Considered
- Path-addressed store (no CAS): simpler ergonomics; loses dedup/provenance. Rejected.
- CAS with built-in aliases: convenient but couples identity to paths and complicates atomicity/rollbacks. Rejected.
- Mandate DataLad/fsspec now: useful later; adds complexity and constraints today. Deferred integration.
Implementation Notes
- Digest format: `"<algorithm>:<hex>"` (lowercase algorithm & hex). Future multi-algorithm support may add negotiation/migration.
- `BlobStat = { digest, size?, uri? }`; `uri` is a hint (e.g., `file:///…`), never identity.
- Writers: `commit()` and `abort()` MUST be idempotent; `abort()` MUST be a no-op after a successful `commit()`.
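One possible way to encode "`uri` is a hint, never identity" is to exclude the optional fields from equality and hashing. The dataclass below is a sketch under that assumption; the ADR fixes only the field names, not how `BlobStat` values compare.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class BlobStat:
    digest: str
    # Assumption: size and uri are excluded from equality and hashing,
    # since size may be unknown and uri is only a location hint.
    size: Optional[int] = field(default=None, compare=False)
    uri: Optional[str] = field(default=None, compare=False)
```

Under this choice, two stats for the same digest compare equal even when one backend reports a `file:///…` URI and another reports none.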
Testing
- Writer lifecycle: write/commit/abort/context-exit (no leaks; temps cleaned).
- Duplicate ingest returns the same digest and does not rewrite bytes.
- `stat()` never reads blob bodies; remote backends map to HEAD/metadata calls.
- Durability hint honored where the platform permits.