TATEKANMonitor

TATEKANMonitor is the cross-app operations monitoring layer for TATEKANOS. It detects app outages, security signals, and runtime drift, and it coordinates the response to them.

Purpose: monitor the operation of each app itself, not business workflow delays inside each app.
Scope: Cloud Run, Pages, jobs, schedulers, logs, security events, deploy state, and response coordination.
Source context: Issue #644 TATEKANOS AI-Native Platform and the AI-native operations vision.

0. Boundary

TATEKANMonitor does not own approval backlog, payment delay, inspection delay, or other app-specific business process checks.
Toriteki, dMemo, Mitsumori, and other apps remain responsible for their own domain-level alerts and workflow exceptions.
TATEKANMonitor focuses on whether each app is healthy, secure, deploy-consistent, observable, and recoverable.

1. Monitoring Targets

Runtime health: Cloud Run revisions, traffic split, 5xx rate, latency, cold-start symptoms, and failed health checks.
Job health: Cloud Run Jobs, Cloud Scheduler, cleanup jobs, retention jobs, and batch failures.
Frontend delivery: Cloudflare Pages deploy state, custom-domain reachability, and Access redirect expectations.
Security signals: abnormal login patterns, IAM / Secret / environment changes, unexpected public exposure, and audit-log anomalies.
Engineering signals: GitHub Actions failures, failed deploy checks, dependency drift, and config drift against canonical policy.
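
The five target groups above could be captured as a small, explicit target registry. The following is a minimal sketch only: the type names, resource identifiers, and signal names are hypothetical and are not taken from an existing TATEKANMonitor implementation.

```python
from dataclasses import dataclass
from enum import Enum


class TargetCategory(Enum):
    """The five monitoring target groups listed in this section."""
    RUNTIME_HEALTH = "runtime_health"
    JOB_HEALTH = "job_health"
    FRONTEND_DELIVERY = "frontend_delivery"
    SECURITY_SIGNALS = "security_signals"
    ENGINEERING_SIGNALS = "engineering_signals"


@dataclass(frozen=True)
class MonitorTarget:
    app: str                  # owning app, e.g. "Toriteki"
    category: TargetCategory  # which target group the entry belongs to
    resource: str             # hypothetical resource identifier
    signals: tuple[str, ...]  # hypothetical signal names to collect


# Hypothetical entries for illustration; real resource names will differ.
TARGETS = (
    MonitorTarget("Toriteki", TargetCategory.RUNTIME_HEALTH,
                  "cloud-run/toriteki-api",
                  ("5xx_rate", "latency", "failed_health_checks")),
    MonitorTarget("Toriteki", TargetCategory.JOB_HEALTH,
                  "cloud-run-job/toriteki-retention",
                  ("batch_failures",)),
    MonitorTarget("dMemo", TargetCategory.FRONTEND_DELIVERY,
                  "pages/dmemo-web",
                  ("deploy_state", "custom_domain_reachability")),
)


def targets_for(app: str) -> list[MonitorTarget]:
    """Return every registered target owned by the given app."""
    return [t for t in TARGETS if t.app == app]


def targets_in(category: TargetCategory) -> list[MonitorTarget]:
    """Return every registered target in the given target group."""
    return [t for t in TARGETS if t.category is category]
```

Keeping the registry as plain data makes it easy to review and to diff against canonical policy.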

2. Response Model

Detect: collect operational signals from logs, metrics, deploy metadata, CI, and security audit sources.
Classify: separate outage, degradation, security concern, drift, and informational events.
Route: point the incident to the owning app and the right runbook based on a known service map; when ownership is unknown, flag it for triage rather than guessing.
Escalate: require PM approval for production-impacting changes, including deploys and changes to IAM, secrets, or public exposure.
Record: preserve timestamped evidence, operator, target, command or source, and resolution status.
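
The Classify, Route, and Record steps above can be sketched as a small pipeline. This is an illustration under stated assumptions: the signal kinds, the service-to-owner map, the runbook path, and the status values are all hypothetical. Escalate is left out because it concerns approval of changes rather than signal handling.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class EventClass(Enum):
    OUTAGE = "outage"
    DEGRADATION = "degradation"
    SECURITY_CONCERN = "security_concern"
    DRIFT = "drift"
    INFORMATIONAL = "informational"


@dataclass(frozen=True)
class Signal:
    service: str  # service that emitted the signal
    kind: str     # hypothetical signal kind, see KIND_TO_CLASS
    source: str   # where the signal came from (logs, CI, audit log, ...)


# Hypothetical mapping from signal kind to event class.
KIND_TO_CLASS = {
    "health_check_failed": EventClass.OUTAGE,
    "5xx_rate_high": EventClass.DEGRADATION,
    "iam_change": EventClass.SECURITY_CONCERN,
    "config_mismatch": EventClass.DRIFT,
}

# Hypothetical service map: service -> (owning app, runbook path).
SERVICE_OWNERS = {
    "toriteki-api": ("Toriteki", "runbooks/toriteki-api.md"),
}


@dataclass(frozen=True)
class IncidentRecord:
    timestamp: datetime
    operator: str
    target: str
    source: str
    event_class: EventClass
    owner: Optional[str]
    runbook: Optional[str]
    status: str


def classify(signal: Signal) -> EventClass:
    # Assumption: unknown kinds are treated as informational in this sketch.
    return KIND_TO_CLASS.get(signal.kind, EventClass.INFORMATIONAL)


def route(signal: Signal) -> tuple[Optional[str], Optional[str]]:
    # No guessing: an unmapped service yields no owner and no runbook.
    return SERVICE_OWNERS.get(signal.service, (None, None))


def handle(signal: Signal, operator: str) -> IncidentRecord:
    """Classify and route one signal, then build the evidence record."""
    owner, runbook = route(signal)
    return IncidentRecord(
        timestamp=datetime.now(timezone.utc),
        operator=operator,
        target=signal.service,
        source=signal.source,
        event_class=classify(signal),
        owner=owner,
        runbook=runbook,
        status="open" if owner else "needs_triage",
    )
```

For example, `handle(Signal("toriteki-api", "health_check_failed", "cloud-logging"), operator="monitor")` yields an open outage record owned by Toriteki, while a signal from an unmapped service is recorded with status `needs_triage`.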

3. First Implementation Shape

Start read-only: dashboards, health summaries, log queries, and GitHub check summaries.
Prefer deterministic checks before AI interpretation: explicit thresholds, known service maps, and canonical expected states.
Use AI for summarization, prioritization, and incident briefing, never for unattended changes to production.
Keep high-risk operations behind human approval: deploy, rollback, IAM, secret changes, DNS, and access policy changes.
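
A deterministic threshold check and a human-approval gate along these lines might look like the sketch below. The threshold values, metric names, and operation names are assumptions chosen for illustration, not agreed policy.

```python
from dataclasses import dataclass
from typing import Optional

# Read-only operations from the first implementation shape (hypothetical names).
READ_ONLY_OPERATIONS = frozenset({
    "dashboard", "health_summary", "log_query", "github_check_summary",
})

# High-risk operations that stay behind human approval (hypothetical names).
HIGH_RISK_OPERATIONS = frozenset({
    "deploy", "rollback", "iam_change", "secret_change",
    "dns_change", "access_policy_change",
})


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


# Hypothetical thresholds; real values would come from canonical policy.
THRESHOLDS = (
    Threshold("5xx_rate", 0.01),
    Threshold("p95_latency_ms", 1500.0),
)


def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their explicit threshold.

    Metrics missing from the input are skipped here; a real check would
    likely report missing data as its own finding.
    """
    return [
        t.metric
        for t in THRESHOLDS
        if t.metric in metrics and metrics[t.metric] > t.max_value
    ]


def is_allowed(operation: str, approved_by: Optional[str] = None) -> bool:
    """Allow read-only operations; require a named approver for high-risk ones.

    Unknown operations are denied so that new capabilities must be
    classified explicitly before they can run.
    """
    if operation in READ_ONLY_OPERATIONS:
        return True
    if operation in HIGH_RISK_OPERATIONS:
        return bool(approved_by)
    return False
```

Denying unknown operations by default keeps the read-only starting point intact as new capabilities are added.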