Platform Health Monitoring

The Platform Health section in The Tower gives you a real-time overview of every service that powers Mallnline — from the Malet management backend to uChat, uCart, databases, and documentation portals. If something is down, you'll see it here first.

NOTE

Platform Health monitoring is available to platform administrators only. If you don't see The Tower in your navigation, see Managing Platform Admin Access.

Reading the Health Dashboard

When you open The Tower's Overview tab, you'll see the Platform Health widget near the top of the page. Here's what each section means:

Cluster Uptime

The large percentage in the top-left (e.g., 95.2%) shows the overall uptime of the platform: the proportion of monitored services that are healthy right now, not availability measured over a period of time. Green means every service is healthy; orange or red means one or more services have issues.

KPI Cards

Three summary cards sit alongside the uptime:

Card	Meaning
Monitored	Total number of services being tracked (currently 34)
Healthy	Services that are fully operational
Down	Services that are unreachable or failing

Status Badge

The badge in the top-right corner tells you the overall platform status:

🟢 ALL SYSTEMS OPERATIONAL — Everything is healthy
🟡 DEGRADED — Some services have issues but the platform is functional
🔴 ISSUES DETECTED — One or more services are down
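
As a rough illustration of how the three badge states relate to individual service states, here is a minimal sketch. The article does not state the exact rule The Tower uses, so the precedence below (any down service wins over any degraded one) and the names ServiceState and overall_status are assumptions for illustration only.

```python
from enum import Enum


class ServiceState(Enum):
    # Hypothetical per-service states, named after the badge descriptions above.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


def overall_status(states):
    """Map per-service states to a badge label.

    Assumed precedence: a single DOWN service outranks any DEGRADED one.
    """
    states = list(states)
    if any(s is ServiceState.DOWN for s in states):
        return "ISSUES DETECTED"
    if any(s is ServiceState.DEGRADED for s in states):
        return "DEGRADED"
    return "ALL SYSTEMS OPERATIONAL"
```

Under this assumption, one healthy and one degraded service gives DEGRADED, and adding a single down service changes the badge to ISSUES DETECTED.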

Service Categories

Services are organized into collapsible groups. Each group header shows how many services are online (e.g., "16/23 online"):

Group	What it Contains
🔌 Subgraphs	All backend services — Malets, Products, uCart, uChat, Payments, Search, and more
🗄️ Infrastructure	Databases — PostgreSQL, Redis, MongoDB
📡 Observability	Monitoring tools — Meilisearch, MinIO, GlitchTip, Prometheus, Grafana
🖥️ Frontends	The main Mallnline web app
📚 Portals	Developer Portal and Support Center

Click any group header to collapse or expand it.

Understanding the D and F Badges

Each service card displays one or two small colored badges:

D (Direct) — Is the service process itself alive? Checked by contacting the service directly.
F (Federated) — Can the service be reached through the Gateway? Checked by routing a query through the central API.

What the Colors Mean

Badge Color	Meaning
🟢 Green	Service is reachable
🔴 Red	Service is unreachable
🟡 Yellow	Service is degraded

Common Combinations

D	F	What it Means	What to Do
🟢	🟢	Everything is working	Nothing — you're all good
🟢	🔴	Service is alive but not reachable through the API	The API Gateway may need to be restarted to pick up schema changes
🔴	🔴	Service is completely down	Check the service logs for crash details
🔴	🟢	Unusual — stale cache	The Gateway is serving cached data. The service needs to be restarted

TIP

The D: green / F: red state is the most useful diagnostic — it means the service's process is healthy but the Gateway isn't routing traffic to it. This usually resolves with a Gateway restart.
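
The combinations table above can be read as a simple lookup from the two probe results to a diagnosis and a next step. The sketch below encodes it that way; the function name diagnose is hypothetical, and it simplifies each badge to reachable or unreachable, so the yellow (degraded) state and services that have only a D badge are left out.

```python
def diagnose(direct_ok: bool, federated_ok: bool) -> tuple[str, str]:
    """Return (meaning, suggested action) for a D/F badge pair.

    True stands for a green badge, False for a red one.
    """
    table = {
        (True, True): (
            "Everything is working",
            "Nothing to do",
        ),
        (True, False): (
            "Service is alive but not reachable through the API",
            "Restart the API Gateway so it picks up schema changes",
        ),
        (False, False): (
            "Service is completely down",
            "Check the service logs for crash details",
        ),
        (False, True): (
            "Unusual: the Gateway is serving cached data",
            "Restart the service",
        ),
    }
    return table[(direct_ok, federated_ok)]
```

For example, diagnose(True, False) returns the green D / red F entry, whose suggested action is a Gateway restart.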

Viewing Service Details

Click any service card to open a detail modal with more information:

What You'll See

Direct Probe — Status, response time, and the address being checked
Federated Probe — Status, response time, and "via Gateway" label
Diagnosis — A plain-English explanation of the service's state (e.g., "Fully operational" or "Process alive — federation issue")
Last Checked / Last Success — Timestamps showing when the service was last probed and when it last responded successfully
Consecutive Failures — How many probe cycles have failed in a row (0 means healthy)
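
To make the relationship between Last Checked, Last Success, and Consecutive Failures concrete, here is a small sketch of how such a record could be updated after each probe cycle. The class and field names are hypothetical and are not taken from the platform's code; the only behavior assumed is what the list above describes, namely that a successful probe resets the failure count to 0.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ServiceHealth:
    # Hypothetical record mirroring the fields shown in the detail modal.
    name: str
    last_checked: Optional[datetime] = None
    last_success: Optional[datetime] = None
    consecutive_failures: int = 0

    def record(self, ok: bool, at: datetime) -> None:
        """Update the record after one probe cycle."""
        self.last_checked = at
        if ok:
            self.last_success = at
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1


# Example: one success followed by two failed cycles 30 seconds apart.
health = ServiceHealth("auth")  # "auth" is only an example service name
health.record(True, datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc))
health.record(False, datetime(2024, 1, 1, 12, 0, 30, tzinfo=timezone.utc))
health.record(False, datetime(2024, 1, 1, 12, 1, 0, tzinfo=timezone.utc))
```

After these three calls, Last Checked is later than Last Success and the failure count is 2, which is the pattern to look for when a service has recently gone down.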

Documentation Link

At the bottom of each service detail, you'll see a View Documentation → link. This opens the service's technical documentation in the Developer Portal. If you're a Malet Owner curious about how a specific service works, these docs provide the full architectural details.

Gateway Metrics

If you click the Gateway service card, you'll also see these extra metrics:

Total Requests — How many API requests the Gateway has processed
Total Errors — Error count (highlighted in red if non-zero)
Cache Hit Ratio — Percentage of requests served from cache
APQ Hit Ratio — Percentage of queries using Automatic Persisted Queries
Uptime — How long the Gateway has been running since last restart

How Often Is Health Data Updated?

The health monitor runs every 30 seconds by default. The "Last checked" timestamp at the bottom of the widget shows when the most recent probe completed. The data auto-refreshes in the dashboard — you don't need to manually reload the page.
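
For readers who want a mental model of a fixed-interval monitor, the following is a minimal sketch of a polling loop. Only the 30-second default comes from this article; the function run_monitor, its parameters, and the choice to treat a probe that raises an exception as a failure are assumptions, not a description of the platform's actual implementation.

```python
import time
from typing import Callable, Dict

PROBE_INTERVAL_SECONDS = 30  # default interval stated in this article


def run_monitor(
    probes: Dict[str, Callable[[], bool]],
    cycles: int,
    interval: float = PROBE_INTERVAL_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bool]:
    """Run every probe once per cycle and return the latest results."""
    latest: Dict[str, bool] = {}
    for cycle in range(cycles):
        for name, probe in probes.items():
            try:
                latest[name] = bool(probe())
            except Exception:
                latest[name] = False  # assumption: a raising probe counts as a failure
        if cycle < cycles - 1:
            sleep(interval)  # wait before starting the next cycle
    return latest
```

The sleep function is passed in as a parameter so the loop can be exercised without actually waiting 30 seconds between cycles.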

Frequently Asked Questions

How do I access Platform Health monitoring? Navigate to The Tower from the sidebar (under "Platform"), the User Menu dropdown, or the footer. The health widget is on the Overview tab. You must be a platform administrator to see The Tower.

What does "70.6% Cluster Uptime" mean? It means 70.6% of the 34 monitored services are currently healthy. For example, if 24 out of 34 are UP, that's 70.6%. In a fully healthy environment, this number is 100%.
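
The arithmetic in this answer can be written as a short function. This is only a sketch: the name cluster_uptime is hypothetical, and rounding to one decimal place is an assumption chosen to match the values displayed in the widget.

```python
def cluster_uptime(healthy: int, monitored: int) -> float:
    """Return the share of healthy services as a percentage, rounded to one decimal."""
    if monitored <= 0:
        raise ValueError("monitored must be positive")
    if not 0 <= healthy <= monitored:
        raise ValueError("healthy must be between 0 and monitored")
    return round(100 * healthy / monitored, 1)


print(cluster_uptime(24, 34))  # 70.6
```

With all 34 services healthy, the same function returns 100.0.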

Why does a service show green D but red F? The service is running fine on its own, but the API Gateway isn't able to reach it. This usually happens after a code change that modified the service's GraphQL schema. Restarting the Gateway forces it to re-compose all schemas.

Can I see historical uptime data? Not in this widget; the current dashboard shows real-time health status only. For historical metrics, use the Grafana dashboard at localhost:3100, which provides time-series graphs of Gateway performance. See Understanding Analytics Dashboards for more on analytics.

What should I do if a service shows as DOWN? Click the service card to see the diagnosis. The modal will suggest which log file to check (e.g., logs/auth.log). Contact the development team if the issue persists after a service restart.

Why do some services only have a D badge and no F badge? Infrastructure services (PostgreSQL, Redis, MongoDB) and some observability tools don't participate in the federated GraphQL schema — they're checked via direct connection only.

Related Articles

Navigating The Tower — How to find and access The Tower admin workspace
Platform Administration & The Tower — Full overview of Tower capabilities
Monitoring Platform Errors — How to use the error tracking dashboard to investigate crashes
Managing Platform Admin Access — How to grant or revoke Tower access
Beta Platform Availability — Why beta-*.mallnline.com occasionally shows the "Temporarily Offline" page