Operations Monitor
Concept
TATEKANMonitor
TATEKANMonitor is the cross-app operations monitoring layer for TATEKANOS. It detects app outages, security signals, and runtime drift, and coordinates the response to them.
- Purpose: monitor whether each app itself is operating correctly, not business workflow delays inside the app.
- Scope: Cloud Run, Pages, jobs, schedulers, logs, security events, deploy state, and response coordination.
- Source context: Issue #644 TATEKANOS AI-Native Platform and the AI-native operations vision.
0. Boundary
- TATEKANMonitor does not own approval backlog, payment delay, inspection delay, or other app-specific business process checks.
- Toriteki, dMemo, Mitsumori, and other apps remain responsible for their own domain-level alerts and workflow exceptions.
- TATEKANMonitor focuses on whether each app is healthy, secure, deploy-consistent, observable, and recoverable.
1. Monitoring Targets
- Runtime health: Cloud Run revisions, traffic split, 5xx rate, latency, cold-start symptoms, and failed health checks.
- Job health: Cloud Run Jobs, Cloud Scheduler, cleanup jobs, retention jobs, and batch failures.
- Frontend delivery: Cloudflare Pages deploy state, custom-domain reachability, and Access redirect expectations.
- Security signals: abnormal login patterns, IAM / Secret / environment changes, unexpected public exposure, and audit-log anomalies.
- Engineering signals: GitHub Actions failures, failed deploy checks, dependency drift, and config drift against canonical policy.
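As an illustration of the runtime-health target above, the following is a minimal sketch of a deterministic threshold check. The `RuntimeSample` shape, the threshold values, and the function name are assumptions made for this example; they are not taken from TATEKANOS policy or from any Cloud Run API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeSample:
    """One observation window for a service (hypothetical shape)."""
    service: str
    requests: int
    errors_5xx: int
    p95_latency_ms: float
    health_check_ok: bool


# Placeholder thresholds for illustration only; real values would come
# from the canonical policy, which this document does not specify.
MAX_5XX_RATE = 0.01
MAX_P95_LATENCY_MS = 2000.0


def evaluate_runtime_health(sample: RuntimeSample) -> list[str]:
    """Return a list of findings; an empty list means no threshold was crossed."""
    findings: list[str] = []
    if not sample.health_check_ok:
        findings.append("health check failed")
    # Skip the rate check when there was no traffic, to avoid dividing by zero.
    if sample.requests > 0:
        rate = sample.errors_5xx / sample.requests
        if rate > MAX_5XX_RATE:
            findings.append("5xx rate above threshold")
    if sample.p95_latency_ms > MAX_P95_LATENCY_MS:
        findings.append("p95 latency above threshold")
    return findings
```

A caller would build one `RuntimeSample` per service from whatever metrics source is in use and treat any non-empty result as a signal to classify. The same pattern could extend to job and frontend checks, though their inputs would differ.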
2. Response Model
- Detect: collect operational signals from logs, metrics, deploy metadata, CI, and security audit sources.
- Classify: separate outage, degradation, security concern, drift, and informational events.
- Route: assign the incident to the owning app and its runbook from known ownership data, never by guessing.
- Escalate: require PM approval for production-impacting changes, deploys, and changes to IAM, secrets, or public exposure.
- Record: preserve timestamped evidence, operator, target, command or source, and resolution status.
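The classify, route, and record steps above could be sketched roughly as follows. The signal names, the service map entries, the runbook paths, and the `IncidentRecord` fields are all hypothetical placeholders chosen for this example. The point is only that an unknown service stays unrouted instead of being assigned a guessed owner.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class EventClass(Enum):
    OUTAGE = "outage"
    DEGRADATION = "degradation"
    SECURITY = "security concern"
    DRIFT = "drift"
    INFO = "informational"


# Hypothetical service map: service name -> (owning app, runbook path).
SERVICE_MAP = {
    "example-toriteki-api": ("Toriteki", "runbooks/toriteki.md"),
    "example-dmemo-web": ("dMemo", "runbooks/dmemo.md"),
}

# Hypothetical signal kinds, each mapped to a class by an explicit rule.
SIGNAL_CLASS = {
    "health_check_failed": EventClass.OUTAGE,
    "latency_high": EventClass.DEGRADATION,
    "iam_changed": EventClass.SECURITY,
    "config_mismatch": EventClass.DRIFT,
}


@dataclass(frozen=True)
class IncidentRecord:
    timestamp: str
    operator: str
    target: str
    source: str
    event_class: EventClass
    owner: Optional[str]
    runbook: Optional[str]
    status: str


def triage(service: str, signal: str, source: str,
           operator: str = "monitor") -> IncidentRecord:
    """Classify a signal, route it via the service map, and build a record."""
    # Unrecognized signals fall back to informational rather than a guess.
    event_class = SIGNAL_CLASS.get(signal, EventClass.INFO)
    # Unknown services get no owner; they are flagged as unrouted instead.
    owner, runbook = SERVICE_MAP.get(service, (None, None))
    return IncidentRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        operator=operator,
        target=service,
        source=source,
        event_class=event_class,
        owner=owner,
        runbook=runbook,
        status="open" if owner else "unrouted",
    )
```

Escalation is deliberately left out of this sketch, since it depends on an approval process that the section describes only at the policy level.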
3. First Implementation Shape
- Start read-only: dashboards, health summaries, log queries, and GitHub check summaries.
- Prefer deterministic checks before AI interpretation: explicit thresholds, known service maps, and canonical expected states.
- Use AI for summarization, prioritization, and incident briefing, not for changing production without review.
- Keep high-risk operations behind human approval: deploy, rollback, IAM, secret changes, DNS, and access policy changes.
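One way to keep high-risk operations behind human approval is a small gate that every requested operation passes through. The sketch below assumes hypothetical operation identifiers and an `approved_by` field; how approval is actually captured and verified is not specified in this document.

```python
from dataclasses import dataclass
from typing import Optional

# Operation kinds listed above as high-risk; the identifiers are hypothetical.
HIGH_RISK = frozenset(
    {"deploy", "rollback", "iam", "secret", "dns", "access_policy"}
)


class ApprovalRequired(Exception):
    """Raised when a high-risk operation has no recorded human approver."""


@dataclass(frozen=True)
class Operation:
    kind: str
    target: str
    approved_by: Optional[str] = None


def authorize(op: Operation) -> bool:
    """Allow read-only work freely; block high-risk work without an approver."""
    if op.kind in HIGH_RISK and not op.approved_by:
        raise ApprovalRequired(
            f"{op.kind} on {op.target} needs human approval"
        )
    return True
```

Raising an exception instead of returning `False` means a caller cannot proceed by forgetting to check the result, which suits a layer that starts read-only.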