ADR-0002: Content-Addressed File Store (CAS)
Status: Accepted
Context
CALISTA must store user artifacts immutably with strong provenance. We want:
- A byte store where identity is the content digest (e.g., `sha256:<hex>`).
- Simple, testable semantics for writing/reading blobs across backends (local FS, object stores, memory).
- Clear separation between content identity and human/path aliases.
Historically, coupling path rules into the store has complicated atomicity, recovery, and audits.
Decision
Adopt a pure CAS file store with the following interface and semantics:
- Identity: Blobs are addressed only by `digest` (`"<algorithm>:<hex>"`, lowercase).
- Writes:
  - `open_write(fsync: bool = True) -> Writer` returns a staging handle.
  - `Writer.write(bytes)` appends to a temp object.
  - `Writer.commit(expected_digest: str | None) -> BlobStat` finalizes the blob.
  - Idempotency: `commit()` SHOULD be idempotent: if the digest already exists, discard staged data and return the existing blob's `BlobStat`.
  - Leak safety: Exiting a writer context without `commit()` MUST discard staged data (`abort()`) and close resources.
- Reads:
  - `open_read(digest) -> BinaryIO` returns a caller-closed stream.
  - `stat(digest) -> BlobStat` returns cheap metadata (MUST NOT read the body; `size` MAY be `None` if expensive).
- Conveniences (non-abstract helpers):
  - `exists(digest) -> bool`
  - `put_path(path, ...) -> BlobStat`
  - `put_bytes(data, ...) -> BlobStat`
  - `put_stream(stream, ...) -> BlobStat`
  - `readall(digest) -> bytes`
- Durability:
  - When `fsync=True`, implementations SHOULD durably install the blob on commit (e.g., `fsync` file and parent dir on POSIX).
  - Placement SHOULD be atomic (temp + `os.replace` or equivalent).
- Error taxonomy (module-scoped exceptions): `FileStoreError` (base), `NotFound`, `ReadOnlyError`, `IntegrityError`.
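
The interface described above might look roughly like this in Python. This is a minimal sketch, not the project's actual module: the use of abstract base classes, the `None` default on `expected_digest`, and the keyword-only `fsync` argument on `put_bytes` are assumptions, and `put_path` / `put_stream` are omitted for brevity.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import BinaryIO


class FileStoreError(Exception):
    """Base class for all store errors."""


class NotFound(FileStoreError):
    """No blob exists for the requested digest."""


class ReadOnlyError(FileStoreError):
    """The backend does not accept writes."""


class IntegrityError(FileStoreError):
    """Staged bytes do not match the expected digest."""


@dataclass(frozen=True)
class BlobStat:
    digest: str               # "sha256:<hex>", the blob's only identity
    size: int | None = None   # may be None when computing it is expensive
    uri: str | None = None    # location hint, never identity


class Writer(ABC):
    """Staging handle returned by FileStore.open_write()."""

    @abstractmethod
    def write(self, data: bytes) -> None: ...

    @abstractmethod
    def commit(self, expected_digest: str | None = None) -> BlobStat: ...

    @abstractmethod
    def abort(self) -> None: ...

    def __enter__(self) -> Writer:
        return self

    def __exit__(self, *exc_info: object) -> None:
        # Leak safety: discard staged data unless commit() already succeeded
        # (abort() is expected to be a no-op in that case).
        self.abort()


class FileStore(ABC):
    @abstractmethod
    def open_write(self, fsync: bool = True) -> Writer: ...

    @abstractmethod
    def open_read(self, digest: str) -> BinaryIO: ...

    @abstractmethod
    def stat(self, digest: str) -> BlobStat: ...

    # Non-abstract conveniences built only on the three primitives above.
    def exists(self, digest: str) -> bool:
        try:
            self.stat(digest)
        except NotFound:
            return False
        return True

    def put_bytes(self, data: bytes, *, fsync: bool = True) -> BlobStat:
        with self.open_write(fsync=fsync) as writer:
            writer.write(data)
            return writer.commit()

    def readall(self, digest: str) -> bytes:
        with self.open_read(digest) as stream:
            return stream.read()
```

Keeping the helpers non-abstract means a new backend only has to supply `open_write`, `open_read`, and `stat`.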
Algorithm: The canonical and only supported digest algorithm is SHA-256. All digests are formatted `sha256:<hex>` (lowercase). Changing the canonical algorithm would require a separate ADR and a migration plan.
Aliases/paths are out of scope here. Any mapping of `(namespace, relpath) → digest` is handled outside the CAS as a DB-backed projection; details are deferred to a future ADR if needed.
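
As an illustration of the digest format, the helpers below compute and validate a `sha256:<hex>` string. The names `compute_digest` and `is_canonical_digest` are hypothetical and do not come from this ADR; the 64-digit length follows from SHA-256 producing a 32-byte hash.

```python
import hashlib
import re

# Hypothetical helper names; the ADR only fixes the "sha256:<hex>" format.
_DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def compute_digest(data: bytes) -> str:
    """Return the canonical digest string for a byte string."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def is_canonical_digest(value: str) -> bool:
    """True only for 'sha256:' followed by exactly 64 lowercase hex digits."""
    return _DIGEST_RE.fullmatch(value) is not None
```

Rejecting uppercase input instead of normalizing it is one possible choice; it keeps digest strings directly comparable as plain strings.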
Rationale
- Separation of concerns: CAS remains verifiable and minimal; aliasing is a rebuildable read model.
- Recoverability: If alias state is lost/corrupt, we can replay events and rescan the CAS.
- Determinism & dedup: Content identity prevents accidental duplication and simplifies audits/GC.
- Portability: A thin interface makes alternative backends straightforward.
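
To make the recoverability point concrete, here is a sketch of rebuilding an alias projection by replaying events and then checking each digest against the CAS. The event shape (`alias_set` / `alias_removed` dictionaries), the function name `rebuild_aliases`, and the `blob_exists` callback are all assumptions for illustration; the real projection is left to a separate ADR. The type aliases assume Python 3.9 or later.

```python
from typing import Callable, Iterable

AliasKey = tuple[str, str]  # (namespace, relpath)


def rebuild_aliases(
    events: Iterable[dict],
    blob_exists: Callable[[str], bool],
) -> tuple[dict[AliasKey, str], list[AliasKey]]:
    """Replay alias events in order, then flag aliases whose blob is missing."""
    aliases: dict[AliasKey, str] = {}
    for event in events:
        key = (event["namespace"], event["relpath"])
        if event["type"] == "alias_set":
            aliases[key] = event["digest"]  # later events win
        elif event["type"] == "alias_removed":
            aliases.pop(key, None)
    # "Rescan": an alias is dangling if the CAS no longer holds its digest.
    dangling = [key for key, digest in aliases.items() if not blob_exists(digest)]
    return aliases, dangling
```

In practice `blob_exists` could simply be the store's `exists(digest)` helper.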
Consequences
- Callers ingest bytes once to get a `digest`, then persist only the digest in domain state/events.
- Any user-facing or workflow paths are derived via a projection (separate ADR).
- Backends must implement safe staging, atomic installs, and cheap `stat()`.
Alternatives Considered
- Path-addressed store (no CAS): simpler ergonomics; loses dedup/provenance. Rejected.
- CAS with built-in aliases: convenient but couples identity to paths and complicates atomicity/rollbacks. Rejected.
- Mandate DataLad/fsspec now: useful later; adds complexity and constraints today. Deferred integration.
Implementation Notes
- Digest format: `"<algorithm>:<hex>"` (lowercase algorithm & hex). Future multi-algorithm support may add negotiation/migration.
- `BlobStat = { digest, size?, uri? }`; `uri` is a hint (e.g., `file:///…`), never identity.
- Writers: `commit()` and `abort()` MUST be idempotent; `abort()` MUST be a no-op after a successful `commit()`.
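
One possible shape for a local-filesystem writer that follows these notes is sketched below. It is not the project's implementation: the class name `LocalWriter`, the flat `<root>/<hex>` layout, and raising `ValueError` when `commit()` follows `abort()` are assumptions made for the example. It relies on `Path.unlink(missing_ok=True)`, so it needs Python 3.8 or later.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


class IntegrityError(Exception):
    """Staged bytes do not match the expected digest."""


@dataclass(frozen=True)
class BlobStat:
    digest: str
    size: Optional[int] = None
    uri: Optional[str] = None


class LocalWriter:
    """Stages bytes in a temp file under `root`, then installs atomically."""

    def __init__(self, root: Path, fsync: bool = True) -> None:
        self._root, self._fsync = Path(root).resolve(), fsync
        self._hash = hashlib.sha256()
        # Stage inside the target directory so os.replace stays on one filesystem.
        fd, tmp = tempfile.mkstemp(dir=self._root, prefix=".staging-")
        self._file, self._tmp = os.fdopen(fd, "wb"), Path(tmp)
        self._result: Optional[BlobStat] = None

    def write(self, data: bytes) -> None:
        self._file.write(data)
        self._hash.update(data)

    def commit(self, expected_digest: Optional[str] = None) -> BlobStat:
        if self._result is not None:
            return self._result  # repeated commit() returns the same BlobStat
        if self._file.closed:
            raise ValueError("writer was aborted")
        digest = "sha256:" + self._hash.hexdigest()
        if expected_digest is not None and expected_digest != digest:
            self.abort()
            raise IntegrityError(f"expected {expected_digest}, got {digest}")
        final = self._root / self._hash.hexdigest()
        if final.exists():
            self.abort()  # duplicate content: discard staged bytes
        else:
            self._file.flush()
            if self._fsync:
                os.fsync(self._file.fileno())
            self._file.close()
            os.replace(self._tmp, final)  # atomic within one filesystem
            if self._fsync and os.name == "posix":
                dir_fd = os.open(self._root, os.O_RDONLY)
                try:
                    os.fsync(dir_fd)  # persist the new directory entry
                finally:
                    os.close(dir_fd)
        self._result = BlobStat(digest, final.stat().st_size, final.as_uri())
        return self._result

    def abort(self) -> None:
        if self._result is None:  # no-op after a successful commit()
            self._file.close()
            self._tmp.unlink(missing_ok=True)

    def __enter__(self) -> "LocalWriter":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.abort()
```

Caching the `BlobStat` in `_result` is what makes both repeated `commit()` calls and a later `abort()` safe.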
Testing
- Writer lifecycle: write/commit/abort/context-exit (no leaks; temps cleaned).
- Duplicate ingest returns the same digest and does not rewrite bytes.
- `stat()` never reads blob bodies; remote backends map to HEAD/metadata calls.
- Durability hint honored where the platform permits.
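
Some of these checks can be written once and run against every backend. The sketch below shows three such backend-agnostic checks. `DictStore` is a throwaway dict-backed stub that exists only so the checks have something to run on (it ignores `fsync` and `expected_digest`); it, the local `NotFound` stand-in, and all function names are assumptions for illustration.

```python
import hashlib
import io
from types import SimpleNamespace


class NotFound(Exception):
    """Stand-in for the store's NotFound error."""


class DictStore:
    """Throwaway in-memory stub; not a real backend."""

    def __init__(self):
        self.blobs = {}

    def open_write(self, fsync=True):
        return _DictWriter(self.blobs)

    def open_read(self, digest):
        return io.BytesIO(self._get(digest))

    def stat(self, digest):
        return SimpleNamespace(digest=digest, size=len(self._get(digest)))

    def _get(self, digest):
        if digest not in self.blobs:
            raise NotFound(digest)
        return self.blobs[digest]


class _DictWriter:
    def __init__(self, blobs):
        self._blobs, self._buf, self._stat = blobs, io.BytesIO(), None

    def write(self, data):
        self._buf.write(data)

    def commit(self, expected_digest=None):
        if self._stat is None:
            data = self._buf.getvalue()
            digest = "sha256:" + hashlib.sha256(data).hexdigest()
            self._blobs.setdefault(digest, data)  # existing bytes are kept
            self._stat = SimpleNamespace(digest=digest, size=len(data))
        return self._stat

    def abort(self):
        self._buf = io.BytesIO()  # drop staged bytes; harmless after commit

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.abort()


def digest_of(data):
    return "sha256:" + hashlib.sha256(data).hexdigest()


def check_commit_is_idempotent(store):
    with store.open_write() as writer:
        writer.write(b"payload")
        first = writer.commit()
        assert writer.commit().digest == first.digest == digest_of(b"payload")
        writer.abort()  # must be a no-op after a successful commit
    with store.open_read(first.digest) as stream:
        assert stream.read() == b"payload"


def check_exit_without_commit_discards(store):
    with store.open_write() as writer:
        writer.write(b"never committed")
    try:
        store.stat(digest_of(b"never committed"))
    except NotFound:
        return
    raise AssertionError("uncommitted bytes became visible")


def check_duplicate_ingest(store):
    stats = []
    for _ in range(2):
        with store.open_write() as writer:
            writer.write(b"same bytes")
            stats.append(writer.commit())
    assert stats[0].digest == stats[1].digest


for check in (check_commit_is_idempotent, check_exit_without_commit_discards,
              check_duplicate_ingest):
    check(DictStore())
```

Whether a duplicate ingest rewrites bytes, and whether temp files are cleaned up, cannot be observed through this interface alone; those checks need a backend-specific probe such as listing the staging directory.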