Platform Health Monitoring
The Platform Health section in The Tower gives you a real-time overview of every service that powers Mallnline, from the Malet management backend to uChat, uCart, databases, and documentation portals. If something is down, you'll see it here first.
NOTE
Platform Health monitoring is available to platform administrators only. If you don't see The Tower in your navigation, see Managing Platform Admin Access.
Reading the Health Dashboard
When you open The Tower's Overview tab, you'll see the Platform Health widget near the top of the page. Here's what each section means:
Cluster Uptime
The large percentage in the top-left (e.g., 95.2%) shows the overall uptime of the platform: the proportion of monitored services that are currently healthy. Green means all systems go; orange or red indicates issues.
KPI Cards
Three summary cards sit alongside the uptime:
| Card | Meaning |
|---|---|
| Monitored | Total number of services being tracked (currently 34) |
| Healthy | Services that are fully operational |
| Down | Services that are unreachable or failing |
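As a rough illustration of how the uptime percentage relates to the Monitored and Healthy cards, the calculation can be sketched as below. The function name `cluster_uptime` and the rounding to one decimal place are assumptions made for this example, not the platform's actual implementation.

```python
def cluster_uptime(healthy: int, monitored: int) -> float:
    """Return the share of healthy services as a percentage (one decimal).

    Hypothetical helper: the name and the rounding are illustrative only.
    """
    if monitored <= 0:
        raise ValueError("monitored must be positive")
    if not 0 <= healthy <= monitored:
        raise ValueError("healthy must be between 0 and monitored")
    return round(healthy / monitored * 100, 1)


print(cluster_uptime(24, 34))  # 24 of 34 services healthy -> 70.6
print(cluster_uptime(34, 34))  # every service healthy -> 100.0
```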
Status Badge
The badge in the top-right corner tells you the overall platform status:
- 🟢 ALL SYSTEMS OPERATIONAL: Everything is healthy
- 🟡 DEGRADED: Some services have issues but the platform is functional
- 🔴 ISSUES DETECTED: One or more services are down
Service Categories
Services are organized into collapsible groups. Each group header shows how many services are online (e.g., "16/23 online"):
| Group | What it Contains |
|---|---|
| Subgraphs | All backend services: Malets, Products, uCart, uChat, Payments, Search, and more |
| Infrastructure | Databases: PostgreSQL, Redis, MongoDB |
| Observability | Monitoring tools: Meilisearch, MinIO, GlitchTip, Prometheus, Grafana |
| Frontends | The main Mallnline web app |
| Portals | Developer Portal and Support Center |
Click any group header to collapse or expand it.
Understanding the D and F Badges
Each service card displays one or two small colored badges:
- D (Direct): Is the service process itself alive? Checked by contacting the service directly.
- F (Federated): Can the service be reached through the Gateway? Checked by routing a query through the central API.
What the Colors Mean
| Badge Color | Meaning |
|---|---|
| 🟢 Green | Service is reachable |
| 🔴 Red | Service is unreachable |
| 🟡 Yellow | Service is degraded |
Common Combinations
| D | F | What it Means | What to Do |
|---|---|---|---|
| 🟢 | 🟢 | Everything is working | Nothing; you're all good |
| 🟢 | 🔴 | Service is alive but not reachable through the API | The API Gateway may need to be restarted to pick up schema changes |
| 🔴 | 🔴 | Service is completely down | Check the service logs for crash details |
| 🔴 | 🟢 | Unusual: stale cache | The Gateway is serving cached data. The service needs to be restarted |
TIP
The green D / red F combination is the most useful diagnostic. It means the service's process is healthy but the Gateway isn't routing traffic to it. This usually resolves with a Gateway restart.
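For readers who think in code, the combinations table can be read as a simple lookup. The sketch below is illustrative only: the `diagnose` function, the `COMBINATIONS` mapping, and the handling of services without an F badge are assumptions for this example, and the yellow (degraded) state is left out for brevity.

```python
from typing import Dict, Optional, Tuple

# Keys are (direct_ok, federated_ok); values are (meaning, suggested action).
# The wording follows the combinations table; the structure is hypothetical.
COMBINATIONS: Dict[Tuple[bool, bool], Tuple[str, str]] = {
    (True, True): ("Everything is working", "Nothing to do"),
    (True, False): (
        "Service is alive but not reachable through the API",
        "Restart the API Gateway so it picks up schema changes",
    ),
    (False, False): (
        "Service is completely down",
        "Check the service logs for crash details",
    ),
    (False, True): (
        "Unusual: stale cache",
        "Restart the service; the Gateway is serving cached data",
    ),
}


def diagnose(direct_ok: bool, federated_ok: Optional[bool] = None) -> Tuple[str, str]:
    """Map D/F probe results to a (meaning, action) pair."""
    if federated_ok is None:
        # Assumption: services with only a D badge are judged on the direct probe.
        if direct_ok:
            return ("Service is reachable", "Nothing to do")
        return ("Service is unreachable", "Check the service logs for crash details")
    return COMBINATIONS[(direct_ok, federated_ok)]


meaning, action = diagnose(direct_ok=True, federated_ok=False)
print(meaning)
print(action)
```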
Viewing Service Details
Click any service card to open a detail modal with more information:
What You'll See
- Direct Probe: Status, response time, and the address being checked
- Federated Probe: Status, response time, and "via Gateway" label
- Diagnosis: A plain-English explanation of the service's state (e.g., "Fully operational" or "Process alive - federation issue")
- Last Checked / Last Success: Timestamps showing when the service was last probed and when it last responded successfully
- Consecutive Failures: How many probe cycles have failed in a row (0 means healthy)
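The relationship between Last Checked, Last Success, and Consecutive Failures can be sketched as a small state update that runs after every probe. The `ProbeState` class and its `record` method are hypothetical names for this example, and the timestamps are made up; the platform's own bookkeeping may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ProbeState:
    """Hypothetical per-service record mirroring the fields in the detail modal."""
    last_checked: Optional[datetime] = None
    last_success: Optional[datetime] = None
    consecutive_failures: int = 0

    def record(self, ok: bool, at: datetime) -> None:
        # Every probe updates "last checked", whether it succeeded or not.
        self.last_checked = at
        if ok:
            # A success resets the failure streak.
            self.last_success = at
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1


state = ProbeState()
start = datetime(2024, 1, 1, 12, 0, 0)  # illustrative timestamp
# One successful probe followed by two failed ones, 30 seconds apart.
for i, ok in enumerate([True, False, False]):
    state.record(ok, start + timedelta(seconds=30 * i))

print(state.consecutive_failures)  # 2
```

After the loop, `last_success` still points at the first probe while `last_checked` has moved on to the third, which is the pattern to look for in the modal when a service has recently gone down.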
Documentation Link
At the bottom of each service detail, you'll see a View Documentation → link. This opens the service's technical documentation in the Developer Portal. If you're a Malet Owner curious about how a specific service works, these docs provide the full architectural details.
Gateway Metrics
If you click the gateway service specifically, you'll also see extra metrics:
- Total Requests: How many API requests the Gateway has processed
- Total Errors: Error count (highlighted in red if non-zero)
- Cache Hit Ratio: Percentage of requests served from cache
- APQ Hit Ratio: Percentage of queries using Automatic Persisted Queries
- Uptime: How long the Gateway has been running since last restart
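The two hit ratios are plain percentages of a request total. A minimal sketch follows, assuming hypothetical counters and a helper named `hit_ratio_percent`; returning 0.0 when no requests have been processed yet is also an assumption, not documented behavior.

```python
def hit_ratio_percent(hits: int, total: int) -> float:
    """Return hits as a percentage of total, rounded to one decimal."""
    if total == 0:
        # Assumption: show 0.0 before any requests have been processed.
        return 0.0
    return round(hits / total * 100, 1)


# Illustrative numbers only: 450 of 600 requests served from cache.
print(hit_ratio_percent(450, 600))  # 75.0
```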
How Often is Health Data Updated?
The health monitor runs every 30 seconds by default. The "Last checked" timestamp at the bottom of the widget shows when the most recent probe completed. The data auto-refreshes in the dashboard, so you don't need to manually reload the page.
Frequently Asked Questions
How do I access Platform Health monitoring?
Navigate to The Tower from the sidebar (under "Platform"), the User Menu dropdown, or the footer. The health widget is on the Overview tab. You must be a platform administrator to see The Tower.
What does "70.6% Cluster Uptime" mean?
It means 70.6% of the 34 monitored services are currently healthy. For example, if 24 out of 34 are UP, that's 70.6%. In a fully healthy environment, this number is 100%.
Why does a service show green D but red F?
The service is running fine on its own, but the API Gateway isn't able to reach it. This usually happens after a code change that modified the service's GraphQL schema. Restarting the Gateway forces it to re-compose all schemas.
Can I see historical uptime data?
The current dashboard shows real-time health status only. For historical metrics, use the Grafana dashboard at localhost:3100, which provides time-series graphs of Gateway performance. See Understanding Analytics Dashboards for more on analytics.
What should I do if a service shows as DOWN?
Click the service card to see the diagnosis. The modal will suggest which log file to check (e.g., logs/auth.log). Contact the development team if the issue persists after a service restart.
Why do some services only have a D badge and no F badge?
Infrastructure services (PostgreSQL, Redis, MongoDB) and some observability tools don't participate in the federated GraphQL schema, so they're checked via direct connection only.
Related
- Navigating The Tower: How to find and access The Tower admin workspace
- Platform Administration & The Tower: Full overview of Tower capabilities
- Monitoring Platform Errors: How to use the error tracking dashboard to investigate crashes
- Managing Platform Admin Access: How to grant or revoke Tower access
NOTE
For Developers: See Tower Health Observability for the full architectural deep dive into the dual-layer probing system, smoke query mapping, and configuration variables.