Node Health Checks

Cloud Foundry (CF) has a concept of health checks. These can take a few forms, but the preferred approach is HTTP (vs. port- or process-based checks). Applications can expose a /healthcheck endpoint which the platform polls. If an application becomes unhealthy (indicated by a non-200 HTTP status code), CF attempts to restart it.

There is a health check timeout, which I like to think of as the startup timeout. During initial startup of the application, the health check process will wait up to this long (polling every couple of seconds) for the app to become healthy. This value is configurable (60–600 seconds in Pivotal Cloud Foundry), which supports apps with more intensive setup tasks.

There is also an invocation timeout, which controls how long the health checker will wait for a response from a running app (after initial startup has completed and the application has provided its first healthy response). Historically, this was hard-coded to one second.

This kept health checks from bogging down the system, but also meant apps that needed to do intensive health checks of many dependencies would often time out and be judged unhealthy (restart storm!).
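
To make the failure mode concrete, here is a minimal sketch that simulates a poller with a fixed response limit. It is not how Cloud Foundry implements its checker; `withTimeout` and `slowCheck` are hypothetical names, and the durations are scaled down tenfold (300 ms against a 100 ms limit, standing in for a 3 second check against the one second limit) so the demo finishes quickly.

```javascript
// Simulated poller: rejects if the check does not settle within `ms`.
// `withTimeout` and `slowCheck` are hypothetical names for illustration only.
const withTimeout = (promise, ms) => {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};

// A check that takes `delayMs` to answer, like an endpoint that tests
// every dependency inline before responding.
const slowCheck = delayMs =>
  new Promise(resolve => setTimeout(() => resolve('OK'), delayMs));

// The check itself would succeed, but it answers too late for the poller.
withTimeout(slowCheck(300), 100)
  .then(result => console.log('healthy:', result))
  .catch(err => console.log('unhealthy:', err.message));
```

The dependency is fine in this scenario; only the response is late, which is why a slow but healthy app can still be judged unhealthy.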

A patch was provided which makes this configurable, but when I first heard about it I had flashbacks to a prior life… The general problem is nothing CF specific. I recall working around similar issues when Nagios and Cacti were cutting edge (poller timeouts).

This is not a pattern I invented. Colleagues much smarter than I pointed out that the solution was decoupling the response from test execution, rather than running fewer tests (reduced coverage) or less accurate ones (port or process checks).

Code Time!

Clone the repository to follow along

Even with configurable timeouts, decoupling is good… so I wanted to create a simple Node app utilizing promises and async/await to simulate a health check pattern designed to work with Cloud Foundry. The idea is using async/await to properly run and gather results for any number of long-lived tests while keeping your endpoint as responsive and accurate as possible.

Just so the examples make more sense if you haven’t looked at the sample app in its entirety, I create a global object we can use to store test results. We also set a default status code which you’ll see more of later.

// hold test results
global.testResults = { status: 200 };

To simulate potentially slow tests of application dependencies, I used setTimeout to introduce delay. Imagine an overloaded database or your favorite upstream API getting slammed during peak hours.

const databaseTest = () =>
  new Promise(resolve => {
    setTimeout(() => {
      console.log('db test running');
      resolve({
        message: 'OK',
        timestamp: Date.now(),
      });
    }, 3000);
  });
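
If you want several simulated tests without repeating the boilerplate, something like the factory below could work. This is only a sketch: `makeSimulatedTest` is a hypothetical helper that is not part of the sample app, and the network delay and the 'FAIL' message are assumptions chosen for illustration.

```javascript
// Hypothetical helper: builds a simulated dependency test that resolves
// after `delayMs` with the given message. Anything other than 'OK' is
// treated as a failure by the rest of the pattern.
const makeSimulatedTest = (name, delayMs, message = 'OK') => () =>
  new Promise(resolve => {
    setTimeout(() => {
      console.log(`${name} test running`);
      resolve({ message, timestamp: Date.now() });
    }, delayMs);
  });

// Same shape as the hand-written test above.
const databaseTest = makeSimulatedTest('db', 3000);
// Assumed delay and message, used here to simulate a failing dependency.
const networkTest = makeSimulatedTest('network', 2000, 'FAIL');
```

Nothing runs until a test is called, so `networkTest()` returns a promise that resolves to an object with `message` and `timestamp`, just like `databaseTest()`.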

In the sample app I have a couple of these, and a real application might have many… so the next step was wrapping the test suite in an async function which calls each test and awaits the responses. In the simple case it checks our mock results for failure and updates the status code accordingly.

// runs each test in turn; takes no arguments since it is not middleware
const testRunner = async () => {
  testResults.database = await databaseTest();
  if (testResults.database.message !== 'OK') {
    testResults.status = 500;
  }
  testResults.network = await networkTest();
  if (testResults.network.message !== 'OK') {
    testResults.status = 500;
  }
  // etc...
};
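
One property of the runner above is that once the status becomes 500 it is never set back to 200. That may be fine on Cloud Foundry, where an unhealthy app gets restarted anyway. If you would rather have the status reflect only the latest run, a variation along these lines is possible. This is a sketch, not the sample app's code: `runTests` and the stub tests are hypothetical.

```javascript
// Hypothetical variation: takes a map of test functions and returns a
// fresh results object, so the status reflects only the latest run.
const runTests = async tests => {
  const results = { status: 200 };
  for (const [name, test] of Object.entries(tests)) {
    results[name] = await test(); // sequential, like the runner above
    if (results[name].message !== 'OK') {
      results.status = 500;
    }
  }
  return results;
};

// Stub tests that resolve immediately, for demonstration only.
const okTest = async () => ({ message: 'OK', timestamp: Date.now() });
const failTest = async () => ({ message: 'FAIL', timestamp: Date.now() });

runTests({ database: okTest, network: failTest })
  .then(results => console.log(results.status)); // logs 500
```

Because each run starts from a clean object, a dependency that recovers brings the status back to 200 on the next pass.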

The last piece is properly structuring the health check endpoint itself… we want to fire off testRunner, but not as middleware which would cause the response to block.

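app.get('/healthcheck', (req, res) => {
  // not middleware so we don't wait for next(); deliberately not awaited
  testRunner().catch(err => {
    // a test that throws should mark the app unhealthy, not go unhandled
    console.error(err);
    testResults.status = 500;
  });
  res.status(testResults.status).json(testResults);
});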

As health check requests come in, the async function will run off and do all the required checks, updating testResults as it progresses. If something goes wrong, subsequent requests will receive a non-200 status code. In addition, a potentially useful message and timestamp are returned for each test, which could be consumed by third-party monitoring.
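
Since every request kicks off a new run, polls that arrive faster than the tests finish will overlap. Below is a minimal sketch of the same decoupling with a guard against overlapping runs. It leaves Express out so it can run on its own, and `refresh`, `handleHealthcheck`, and the `running` flag are hypothetical names rather than anything from the sample app.

```javascript
// Cached results served to every caller; starts out healthy.
const testResults = { status: 200 };
let running = false; // hypothetical in-flight flag

// `suite` is any async function that resolves to { status, ...details }.
const refresh = async suite => {
  if (running) return false; // a run is already in progress, skip this one
  running = true;
  try {
    // merge the latest results into the cache
    Object.assign(testResults, await suite());
  } catch (err) {
    // a suite that throws marks the app unhealthy
    testResults.status = 500;
    testResults.error = err.message;
  } finally {
    running = false;
  }
  return true;
};

// Stand-in for the Express handler body: start a refresh in the
// background, then answer immediately from whatever is cached.
const handleHealthcheck = suite => {
  refresh(suite); // intentionally not awaited; refresh never rejects
  return { ...testResults };
};
```

In an Express route, the returned snapshot would be sent with `res.status(snapshot.status).json(snapshot)`. The first call answers 200 from the default, and later calls report whatever the most recent completed run found.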

If you have Docker running on your machine, you can simply clone this project then run make build; make run to get Express listening on http://localhost:3000/healthcheck.

The endpoint will respond immediately with the default response code (200). This keeps the health check process happy. If you watch stdout, you’ll see db test running after a few seconds, and network test running a couple seconds later. Refreshing the endpoint will show the test results. You can set message to something other than OK to simulate failure. Failure of any test results in a 500 status code.

❯ http localhost:3000/healthcheck
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 14
Content-Type: application/json; charset=utf-8
Date: Sun, 20 Jan 2019 18:09:30 GMT
ETag: W/"e-QlsUp1vTYvBgYHrHCBYe2n/q268"
X-Powered-By: Express
{
    "status": 200
}

❯ http localhost:3000/healthcheck
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 121
Content-Type: application/json; charset=utf-8
Date: Sun, 20 Jan 2019 18:06:56 GMT
ETag: W/"79-topudR8vULOkkpcpIVCdvk+S1nQ"
X-Powered-By: Express
{
    "database": {
        "message": "OK",
        "timestamp": 1548007610802
    },
    "network": {
        "message": "OK",
        "timestamp": 1548007612803
    },
    "status": 200
}

❯ http localhost:3000/healthcheck
HTTP/1.1 500 Internal Server Error
Connection: keep-alive
Content-Length: 125
Content-Type: application/json; charset=utf-8
Date: Sun, 20 Jan 2019 18:09:43 GMT
ETag: W/"7d-NbOObgl/2uT9jLi9gSpqy8qDyWE"
X-Powered-By: Express
{
    "database": {
        "message": "OK",
        "timestamp": "1548007773994"
    },
    "network": {
        "message": "FAIL",
        "timestamp": 1548007775999
    },
    "status": 500
}

Hopefully this gives you some ideas on ways to create better health checks for your Node apps running on Cloud Foundry!
