Architecture
The schema and ingest pipeline have their own route
This page exposes the raw / norm / canon / ops layout plus the latest ingest runs for each source.
Database
Postgres + Alembic
Schema layout: raw / norm / canon / ops
raw
norm
canon
ops
Runtime surfaces
Deployment components
frontend
- runtime: Next.js
- target: Vercel or containerized Node
api
- runtime: FastAPI
- target: Fly.io app
search
- runtime: database-first search
- sync: ops.outbox_event -> search.reindex.requested
auth
- runtime: Firebase
- configured: true
billing
- runtime: Stripe
- configured: true
Scale principles
Pipeline guarantees
- raw.source_record and ops.outbox_event act as ledgers for ingest and for downstream sync requests
- ops.ingest_run stores per-source requested_count, raw_record_count, paper_count, and status
- raw / norm / canon / ops are separated in the SQLAlchemy model and Alembic migration
- canonical papers keep reversible source_slug + paper_key mappings back to raw records
- search stays on a dedicated ConoHa-ready Postgres host, so the full DBLP corpus can be queried without loading every paper into memory
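The ledger guarantee above can be sketched as a single transaction that writes the raw record and its outbox event together. This is a minimal sketch, not the project's implementation: it uses the standard-library sqlite3 module with attached in-memory databases standing in for the Postgres raw and ops schemas, and the payload and topic columns, along with the function names, are assumptions made for illustration.

```python
import json
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """Create in-memory stand-ins for the raw and ops schemas."""
    conn = sqlite3.connect(":memory:")
    # SQLite has no schemas, so attached databases named raw and ops
    # play that role in this sketch.
    conn.execute("ATTACH DATABASE ':memory:' AS raw")
    conn.execute("ATTACH DATABASE ':memory:' AS ops")
    conn.execute(
        "CREATE TABLE raw.source_record ("
        " source_slug TEXT NOT NULL,"
        " paper_key TEXT NOT NULL,"
        " payload TEXT NOT NULL,"  # hypothetical column
        " PRIMARY KEY (source_slug, paper_key))"
    )
    conn.execute(
        "CREATE TABLE ops.outbox_event ("
        " id INTEGER PRIMARY KEY,"
        " topic TEXT NOT NULL,"  # hypothetical column
        " payload TEXT NOT NULL)"  # hypothetical column
    )
    return conn


def record_ingest(conn: sqlite3.Connection, source_slug: str,
                  paper_key: str, payload: dict) -> None:
    """Write the raw record and its reindex request in one transaction."""
    # "with conn" commits if both inserts succeed and rolls back otherwise,
    # so a raw record never exists without its outbox event.
    with conn:
        conn.execute(
            "INSERT INTO raw.source_record (source_slug, paper_key, payload)"
            " VALUES (?, ?, ?)",
            (source_slug, paper_key, json.dumps(payload)),
        )
        conn.execute(
            "INSERT INTO ops.outbox_event (topic, payload) VALUES (?, ?)",
            (
                "search.reindex.requested",
                json.dumps({"source_slug": source_slug, "paper_key": paper_key}),
            ),
        )
```

Because the event carries source_slug and paper_key, a downstream consumer can look the raw record up again without the ingest step having to call the search layer directly.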
Latest ingest status
Per-source run results
These counts come from ops.ingest_run and the canonical paper inventory.
OpenAlex
openalex
0 raw, 0 papers, running
OpenCitations
opencitations
100 raw, 0 papers, completed
Crossref
crossref
100 raw, 100 papers, completed
DataCite
datacite
100 raw, 100 papers, completed
ROR / ORCID
ror_orcid
100 raw, 0 papers, completed
arXiv
arxiv
100 raw, 100 papers, completed
DBLP
dblp
12414064 raw, 12414064 papers, completed
OpenReview
openreview
100 raw, 100 papers, completed
ACL Anthology
acl_anthology
100 raw, 100 papers, completed
PubMed
pubmed
100 raw, 100 papers, completed
PMC Open Access Subset
pmc_open_access
100 raw, 100 papers, completed
DOAJ
doaj
100 raw, 100 papers, completed
CORE
core
100 raw, 100 papers, completed
OpenAIRE Graph
openaire_graph
100 raw, 100 papers, completed
Unpaywall
unpaywall
100 raw, 0 papers, completed
Common Crawl
common_crawl
100 raw, 0 papers, completed
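
The per-source results above can be read programmatically. The sketch below models a run with the fields this page shows (requested_count is left out because the page does not display it) and lists sources whose completed runs stored raw records but yielded no canonical papers. The IngestRun class and the runs_without_papers helper are hypothetical names, and the sample rows are a subset of the results listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestRun:
    """One row of per-source results, limited to the fields shown on this page."""
    source_slug: str
    raw_record_count: int
    paper_count: int
    status: str


def runs_without_papers(runs: list[IngestRun]) -> list[str]:
    """Return slugs of completed runs that stored raw records but no papers."""
    return [
        run.source_slug
        for run in runs
        if run.status == "completed"
        and run.raw_record_count > 0
        and run.paper_count == 0
    ]


# A subset of the results listed on this page.
RUNS = [
    IngestRun("openalex", 0, 0, "running"),
    IngestRun("opencitations", 100, 0, "completed"),
    IngestRun("crossref", 100, 100, "completed"),
    IngestRun("ror_orcid", 100, 0, "completed"),
    IngestRun("dblp", 12414064, 12414064, "completed"),
]

print(runs_without_papers(RUNS))  # ['opencitations', 'ror_orcid']
```

A run that is still in progress, such as openalex here, is excluded because its counts are not final.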
