
Monitoring the Database w/ Prometheus, doobie and HikariCP

Purpose

The purpose of this post is to explain how to monitor database performance with Prometheus and Grafana, using the metrics exposed by doobie's underlying HikariCP connection pool.

Connection Pooling & Thread Pools

The docs for doobie, the functional-programming JDBC library used nearly ubiquitously at Banno, offer a helpful explanation of connection pooling and thread pools: https://tpolecat.github.io/doobie/docs/14-Managing-Connections.html#about-transactors.

[Diagram: connection pooling and thread pools]
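As a sketch of the two thread pools the doobie docs describe, a HikariTransactor can be built roughly like this (cats-effect 2 era doobie API, matching the ContextShift constraint used in the code later in this post; the driver, URL, credentials, and pool size here are placeholders):

```scala
import cats.effect.{Async, Blocker, ContextShift, Resource}
import doobie.hikari.HikariTransactor
import doobie.util.ExecutionContexts

// Two pools, per the doobie docs: a small fixed pool for awaiting
// connections, and a blocking pool for the JDBC operations themselves.
def transactor[F[_]: Async: ContextShift]: Resource[F, HikariTransactor[F]] =
  for {
    connectEC <- ExecutionContexts.fixedThreadPool[F](10) // await-connection pool
    blocker   <- Blocker[F]                               // blocking JDBC pool
    xa <- HikariTransactor.newHikariTransactor[F](
            "org.postgresql.Driver",                // placeholder driver
            "jdbc:postgresql://localhost:5432/app", // placeholder URL
            "user",
            "password",
            connectEC,
            blocker
          )
  } yield xa
```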

Collecting Metrics via Prometheus

Code

  def createDeviceRegistrationRepositoryConfig[F[_]: ContextShift](
      ...,
      cr: CollectorRegistry,
...
hikariConfig <- Stream.eval(
  F.delay {
    val c = new HikariConfig()
    c.setDriverClassName(connection.dbDriver)
    c.setJdbcUrl(connection.dbServer)
    c.setUsername(creds.username)
    c.setPassword(creds.password)
    // Writes metrics to the CollectorRegistry; Prometheus can then scrape
    // (collect) them for display in Grafana, alerting, etc.
    c.setMetricsTrackerFactory(new PrometheusMetricsTrackerFactory(cr))
    c.setMinimumIdle(connection.minIdle)             // minimum idle connections
    c.setMaximumPoolSize(connection.maximumPoolSize) // max pool size
    c
  }
)

Collecting Metrics for > 1 Database in a Service

If you’re connecting to multiple databases from a single service, you’ll want to share a single PrometheusMetricsTrackerFactory instance across the connection pools. Otherwise you’ll likely run into exceptions when the underlying metrics are initialized (the collectors get registered with the CollectorRegistry more than once), and end up recording metrics for only one of the pools.

Example: https://github.com/Banno/consumer-eulas/blob/1.57.0/src/main/scala/com/banno/eulas/app/ConfigService.scala#L29-L46

Full source: https://github.com/Banno/device-service/blob/1.180.0/src/main/scala/com/banno/device/Config.scala#L87-L137
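A minimal sketch of the sharing pattern (the pool names, URLs, and the poolConfig helper here are hypothetical):

```scala
import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.metrics.prometheus.PrometheusMetricsTrackerFactory
import io.prometheus.client.CollectorRegistry

// One registry and ONE tracker factory, shared by every pool.
val registry       = new CollectorRegistry()
val trackerFactory = new PrometheusMetricsTrackerFactory(registry)

def poolConfig(poolName: String, jdbcUrl: String): HikariConfig = {
  val c = new HikariConfig()
  c.setPoolName(poolName) // becomes the `pool` label on the Hikari metrics
  c.setJdbcUrl(jdbcUrl)
  c.setMetricsTrackerFactory(trackerFactory) // same instance for every pool
  c
}

val usersPool  = poolConfig("users-db",  "jdbc:postgresql://localhost:5432/users")
val ordersPool = poolConfig("orders-db", "jdbc:postgresql://localhost:5432/orders")
```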

Minimum Idle and Max Pool Size

HikariCP (https://github.com/brettwooldridge/HikariCP), the Java connection pool underlying doobie’s HikariTransactor, notes:

Watch a video from the Oracle Real-world Performance group, and learn about why connection pools do not need to be sized as large as they often are. In fact, oversized connection pools have a clear and demonstrable negative impact on performance

Please watch that video for the motivation and the experiments behind using a small pool size.
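For a concrete starting point, the HikariCP wiki's "About Pool Sizing" page offers the rule of thumb connections = (core_count * 2) + effective_spindle_count. A trivial sketch:

```scala
// Rule of thumb from the HikariCP "About Pool Sizing" wiki page:
//   connections = (core_count * 2) + effective_spindle_count
def suggestedPoolSize(coreCount: Int, effectiveSpindleCount: Int): Int =
  (coreCount * 2) + effectiveSpindleCount

// e.g. an 8-core host backed by one disk-equivalent volume
println(suggestedPoolSize(8, 1)) // 17
```

Treat the result as a ceiling to tune down from, not a target to grow toward.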

Hikari Prometheus Metrics

HikariCPCollector.java documents the following Gauges:

  • hikaricp_active_connections - Active connections, i.e. the current number of in-use connections in the pool
  • hikaricp_idle_connections - Idle connections in the pool
  • hikaricp_pending_threads - Pending threads, i.e. threads awaiting a connection from the pool
  • hikaricp_connections - The current total number of connections
  • hikaricp_max_connections - Max connections
  • hikaricp_min_connections - Min connections

Note that hikaricp_max_connections and hikaricp_min_connections are static values, i.e. they’re set when constructing the HikariConfig (see the code above).

There are also a few Summary metrics:

  • hikaricp_connection_acquired_nanos - Connection acquired time (ns), i.e. the amount of time threads calling .getConnection() spend waiting for a connection from the pool.
  • hikaricp_connection_usage_millis - Connection usage (ms), i.e. the amount of time connections are used before being returned to the pool. Also known as the “out of pool” or “in-use” time.
  • hikaricp_connection_creation_millis - Connection creation (ms), i.e. the amount of time required to create a new connection to the database.

Visualizing Metrics in Grafana

Monitoring Database Performance is a template Grafana Dashboard.

Note that the device-service application is selected since it has 10 instances and connects to a database.

It includes nine different graphs.

  • HikariCP Max Pool Size (Set at App Startup)
    • max by (pool)(hikaricp_max_connections{app=~"$service"})
    • shows the max pool size, which is static, i.e. set when the app starts up.
  • HikariCP Min Pool Size (Set at App Startup)
    • max by (pool)(hikaricp_min_connections{app=~"$service"})
    • shows the min pool size, which is static, i.e. set when the app starts up.
  • HikariCP Threads Awaiting Connection
    • sum by (instance, pool)(hikaricp_pending_threads{app=~"$service"})
    • A spike in this value could indicate increased database contention, i.e. multiple threads are trying to get a connection to the DB.
  • HikariCP Active Database Connections
    • sum by (instance, pool)(hikaricp_active_connections{app=~"$service"})
  • HikariCP Idle Connections
    • sum by (instance, pool)(hikaricp_idle_connections{app=~"$service"})
  • HikariCP Total Connections
    • sum by (instance, pool)(hikaricp_connections{app=~"$service"})
  • Max p99 Connection acquired time (ns)
    • max(hikaricp_connection_acquired_nanos{app=~"$service", quantile="0.99"})
  • Max p99 Connection usage (ms)
    • max(hikaricp_connection_usage_millis{app=~"$service", quantile="0.99"})
  • Max p99 Connection creation (ms)
    • max(hikaricp_connection_creation_millis{app=~"$service", quantile="0.99"})

Note that sum is used for the gauges above: it adds the current value of each instance’s gauge, giving a total across the service’s instances for each pool.

banno_all

Multiple applications, such as MDS, consumer-enrollment, banno-users, etc, connect to the banno_all database. Folks from Team Infrastructure and Data Services have improved overall performance based on experience and experimentation.

For questions involving DB performance of banno_all, please consider reaching out in the #temp-mds-pgbouncer-commits Slack channel.

Conclusions

doobie and HikariCP make it simple to collect Prometheus metrics. Strongly consider the advice of the Oracle Real-World Performance group (per the HikariCP link) and set a small pool size. Lastly, closely monitor hikaricp_pending_threads: in the past, that number has spiked for prolonged periods during Espresso incidents.
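Given that, hikaricp_pending_threads is a natural alerting target. A sketch of a Prometheus alerting rule (the alert name, threshold, and durations are hypothetical; tune them per service):

```yaml
groups:
  - name: hikaricp
    rules:
      - alert: HikariPendingThreadsHigh
        expr: sum by (app, pool) (hikaricp_pending_threads) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.app }}/{{ $labels.pool }}: threads have been waiting on a DB connection for 5m"
```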