Support Center

Platform Health Monitoring

The Platform Health section in The Tower gives you a real-time overview of every service that powers Mallnline — from the Malet management backend to uChat, uCart, databases, and documentation portals. If something is down, you'll see it here first.

NOTE

Platform Health monitoring is available to platform administrators only. If you don't see The Tower in your navigation, see Managing Platform Admin Access.


Reading the Health Dashboard

When you open The Tower's Overview tab, you'll see the Platform Health widget near the top of the page. Here's what each section means:

Cluster Uptime

The large percentage in the top-left (e.g., 95.2%) shows the overall uptime of the platform — the proportion of services that are currently healthy. Green means all systems go; orange or red indicates issues.

KPI Cards

Three summary cards sit alongside the uptime:

| Card | Meaning |
|------|---------|
| Monitored | Total number of services being tracked (currently 34) |
| Healthy | Services that are fully operational |
| Down | Services that are unreachable or failing |
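The uptime percentage and the three cards all come from the same per-service status list. The arithmetic can be sketched in a few lines of Python. This is an illustration only: the function name and the "up"/"down" status strings are assumptions, not the platform's actual API.

```python
def summarize(statuses):
    """Return KPI card values for a list of per-service status strings."""
    monitored = len(statuses)
    healthy = sum(1 for s in statuses if s == "up")
    down = sum(1 for s in statuses if s == "down")
    # Uptime is the share of monitored services that are healthy, to one decimal.
    uptime = round(100 * healthy / monitored, 1) if monitored else 0.0
    return {
        "monitored": monitored,
        "healthy": healthy,
        "down": down,
        "uptime_pct": uptime,
    }


# 24 healthy services out of 34 monitored
example = summarize(["up"] * 24 + ["down"] * 10)
print(example["uptime_pct"])  # 70.6
```

With 24 of 34 services up, the sketch reports 70.6%, matching the worked example in the FAQ further down this page.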

Status Badge

The badge in the top-right corner tells you the overall platform status:

  • 🟢 ALL SYSTEMS OPERATIONAL: Everything is healthy
  • 🟡 DEGRADED: Some services have issues but the platform is functional
  • 🔴 ISSUES DETECTED: One or more services are down
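One way to picture how the badge could be chosen is a simple precedence rule. The exact thresholds are not documented on this page, so the sketch below assumes that any down service produces ISSUES DETECTED and that degraded services with nothing down produce DEGRADED. The function name and its parameters are hypothetical.

```python
def overall_badge(down, degraded):
    """Pick the overall status label from counts of down and degraded services.

    Assumed rule (not confirmed by the platform docs): "down" outranks "degraded".
    """
    if down > 0:
        return "ISSUES DETECTED"
    if degraded > 0:
        return "DEGRADED"
    return "ALL SYSTEMS OPERATIONAL"


print(overall_badge(down=0, degraded=2))  # DEGRADED
```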

Service Categories

Services are organized into collapsible groups. Each group header shows how many services are online (e.g., "16/23 online"):

| Group | What it Contains |
|-------|------------------|
| 🔌 Subgraphs | All backend services: Malets, Products, uCart, uChat, Payments, Search, and more |
| 🗄️ Infrastructure | Databases: PostgreSQL, Redis, MongoDB |
| 📡 Observability | Monitoring tools: Meilisearch, MinIO, GlitchTip, Prometheus, Grafana |
| 🖥️ Frontends | The main Mallnline web app |
| 📚 Portals | Developer Portal and Support Center |

Click any group header to collapse or expand it.


Understanding the D and F Badges

Each service card displays one or two small colored badges:

  • D (Direct) — Is the service process itself alive? Checked by contacting the service directly.
  • F (Federated) — Can the service be reached through the Gateway? Checked by routing a query through the central API.

What the Colors Mean

| Badge Color | Meaning |
|-------------|---------|
| 🟢 Green | Service is reachable |
| 🔴 Red | Service is unreachable |
| 🟡 Yellow | Service is degraded |

Common Combinations

| D | F | What it Means | What to Do |
|---|---|---------------|------------|
| 🟢 | 🟢 | Everything is working | Nothing; you're all good |
| 🟢 | 🔴 | Service is alive but not reachable through the API | The API Gateway may need to be restarted to pick up schema changes |
| 🔴 | 🔴 | Service is completely down | Check the service logs for crash details |
| 🔴 | 🟢 | Unusual: stale cache | The Gateway is serving cached data. The service needs to be restarted |

TIP

The D: green / F: red state is the most useful diagnostic — it means the service's process is healthy but the Gateway isn't routing traffic to it. This usually resolves with a Gateway restart.
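The combinations table can be read as a lookup from the two probe results to a diagnosis and a next step. The Python sketch below expresses that lookup. It is not the dashboard's actual logic: the names are hypothetical, yellow (degraded) badges are left out for brevity, and services with only a D badge are modeled by passing None for the federated result.

```python
# (direct_ok, federated_ok) -> (meaning, suggested action), per the table above
COMBINATIONS = {
    (True, True): ("Everything is working", "Nothing to do"),
    (True, False): ("Service is alive but not reachable through the API",
                    "Restart the API Gateway"),
    (False, False): ("Service is completely down",
                     "Check the service logs for crash details"),
    (False, True): ("Unusual: stale cache", "Restart the service"),
}


def diagnose(direct_ok, federated_ok=None):
    """Return (meaning, action) for a service's D and F probe results.

    federated_ok=None models a service that has a D badge only.
    """
    if federated_ok is None:
        if direct_ok:
            return ("Reachable (direct check only)", "Nothing to do")
        return ("Service is down", "Check the service logs for crash details")
    return COMBINATIONS[(direct_ok, federated_ok)]


meaning, action = diagnose(direct_ok=True, federated_ok=False)
print(action)  # Restart the API Gateway
```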


Viewing Service Details

Click any service card to open a detail modal with more information:

What You'll See

  • Direct Probe — Status, response time, and the address being checked
  • Federated Probe — Status, response time, and "via Gateway" label
  • Diagnosis — A plain-English explanation of the service's state (e.g., "Fully operational" or "Process alive — federation issue")
  • Last Checked / Last Success — Timestamps showing when the service was last probed and when it last responded successfully
  • Consecutive Failures — How many probe cycles have failed in a row (0 means healthy)
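The three bookkeeping fields in the modal (Last Checked, Last Success, Consecutive Failures) relate to each other in a simple way: every probe updates the first, and only a successful probe updates the second and resets the third. The Python sketch below illustrates that relationship. The class and field names are assumptions for illustration, not the platform's real data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ProbeHistory:
    """Hypothetical per-service record of probe outcomes."""
    last_checked: Optional[datetime] = None
    last_success: Optional[datetime] = None
    consecutive_failures: int = 0

    def record(self, ok, when=None):
        """Record one probe cycle; `when` defaults to the current UTC time."""
        when = when or datetime.now(timezone.utc)
        self.last_checked = when
        if ok:
            # A success refreshes Last Success and clears the failure streak.
            self.last_success = when
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1


history = ProbeHistory()
history.record(ok=True)
history.record(ok=False)
history.record(ok=False)
print(history.consecutive_failures)  # 2
```

In this model, a large gap between Last Checked and Last Success together with a growing failure count points to a sustained outage rather than a single missed probe.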

At the bottom of each service detail, you'll see a View Documentation → link. This opens the service's technical documentation in the Developer Portal. If you're a Malet Owner curious about how a specific service works, these docs provide the full architectural details.

Gateway Metrics

If you click the Gateway service card, you'll also see extra metrics:

  • Total Requests — How many API requests the Gateway has processed
  • Total Errors — Error count (highlighted in red if non-zero)
  • Cache Hit Ratio — Percentage of requests served from cache
  • APQ Hit Ratio — Percentage of queries using Automatic Persisted Queries
  • Uptime — How long the Gateway has been running since last restart
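Both hit ratios are percentages of a total. As a rough illustration, assuming the Gateway keeps raw counters for hits and total requests (the function and parameter names below are hypothetical), the calculation might look like this in Python:

```python
def hit_ratio_pct(hits, total):
    """Return hits as a percentage of total, or 0.0 when nothing has been counted yet."""
    if total == 0:
        return 0.0
    return round(100 * hits / total, 1)


# 450 of 600 requests served from cache
print(hit_ratio_pct(450, 600))  # 75.0
```

The guard for a zero total matters right after a Gateway restart, when no requests have been counted yet.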

How Often is Health Data Updated?

The health monitor runs every 30 seconds by default. The "Last checked" timestamp at the bottom of the widget shows when the most recent probe completed. The data auto-refreshes in the dashboard — you don't need to manually reload the page.
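A fixed-interval monitor of this kind can be pictured as a loop that probes, waits, and repeats. The Python sketch below shows that pattern under stated assumptions: `probe_all` stands in for whatever checks the real monitor performs, and the `cycles` and `sleep` parameters exist only so the example can be run and tested quickly. It is not the platform's implementation.

```python
import time


def run_monitor(probe_all, interval_seconds=30, cycles=None, sleep=time.sleep):
    """Call probe_all() repeatedly, pausing interval_seconds between cycles.

    cycles=None runs forever; a number stops after that many cycles.
    Returns the number of completed cycles.
    """
    completed = 0
    while cycles is None or completed < cycles:
        probe_all()
        completed += 1
        # Skip the pause after the final cycle.
        if cycles is None or completed < cycles:
            sleep(interval_seconds)
    return completed


# Run two cycles with no real waiting to show the call pattern.
run_monitor(lambda: print("probing all services"), cycles=2, sleep=lambda s: None)
```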


Frequently Asked Questions

How do I access Platform Health monitoring?
Navigate to The Tower from the sidebar (under "Platform"), the User Menu dropdown, or the footer. The health widget is on the Overview tab. You must be a platform administrator to see The Tower.

What does "70.6% Cluster Uptime" mean?
It means 70.6% of the 34 monitored services are currently healthy. For example, if 24 out of 34 are UP, that's 24 / 34 ≈ 70.6%. In a fully healthy environment, this number is 100%.

Why does a service show green D but red F?
The service is running fine on its own, but the API Gateway isn't able to reach it. This usually happens after a code change that modified the service's GraphQL schema. Restarting the Gateway forces it to re-compose all schemas.

Can I see historical uptime data?
The current dashboard shows real-time health status only. For historical metrics, use the Grafana dashboard at localhost:3100, which provides time-series graphs of Gateway performance. See Understanding Analytics Dashboards for more on analytics.

What should I do if a service shows as DOWN?
Click the service card to see the diagnosis. The modal will suggest which log file to check (e.g., logs/auth.log). Contact the development team if the issue persists after a service restart.

Why do some services only have a D badge and no F badge?
Infrastructure services (PostgreSQL, Redis, MongoDB) and some observability tools don't participate in the federated GraphQL schema, so they're checked via direct connection only.


NOTE

For Developers: See Tower Health Observability for the full architectural deep dive into the dual-layer probing system, smoke query mapping, and configuration variables.