Shaping and Enhancing our Merchant Communications - Our Splunk Integration with Statuspal
This month’s engineering blog explores how Judopay shaped the way we keep our merchants updated, should any transaction errors arise.
Working collaboratively, Judopay’s Technology and Service Reliability Engineering teams enhanced our Splunk™ system logging so that any faults could be pinpointed more quickly. A custom alert plugin was then built to push the enriched logging from Splunk™ to Statuspal.
Wayne, Judopay’s Head of Technology and Operations, will dive into how we created an automated and complete transaction monitoring and alerting solution within our Splunk™ service.
Houston, we have a problem....
When I started at Judopay way back when, one of the biggest issues faced by our merchants was communication.
Not normal day-to-day communication, but communication when things weren't right or weren't going according to plan, for example when there were transaction processing issues.
It was apparent that when transactions were in error or failing, it took a long time to investigate and ascertain the cause.
Even a simple payment traverses many different systems, so it used to take a long time to work out where the failures were occurring.
As a result, communicating this to our merchants became time dependent. We didn't want to alarm merchants who weren't affected by such interruptions to service until we fully understood where and why the failures were occurring.
Millions of log lines are involved in this process. Bringing it all together to ascertain the failure was initially cumbersome and time-consuming.
This delay cost our merchants critical time when deciding what to do and how to communicate with their consumer base. That left us with two questions:
- How can we ascertain the fault quicker?
- How can we inform merchants sooner?
Finding the Fault Quicker …
After reviewing the steps needed to establish the root cause of any issue, we identified the need to enrich our logging so that we could pinpoint:
- Failures at the Acquirer level
- The fault itself
- The affected merchants
At the time, Judopay’s use of Splunk™ was in its infancy. A great deal of work was undertaken in collaboration with Judopay’s SRE team.
Stuart, Judopay’s Head of Reliability Engineering, reflected on the work involved to further enrich the logging:
“The logging refinement work involved ensuring that a few key pieces of information, for example ApiToken and ReceiptId, were recorded in the logging context on every log line.
Then a single summary line logged on each transaction (successful or failed), that listed the key attributes. That summary line could then be used as the source for the Splunk™ alerts. Additionally, where errors were returned we attempted to include key attributes from the request on that error line. This was to minimise the number of joins with other log lines needed in order to pull together reports of the requests associated with failed calls.”
Furthermore, timings along each part of the payment journey, the full identification of the merchant and the route the transaction was taking, were also included in the refinement.
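To make the idea concrete, here is a minimal sketch in Python of a logging context plus a per-transaction summary line. It is an illustration only, not Judopay's actual implementation: the class name, the `summary_line` helper, the field names other than ApiToken and ReceiptId, and all values are hypothetical. The key=value layout is chosen because Splunk can extract such pairs as fields at search time.

```python
import logging


class TransactionContext(logging.LoggerAdapter):
    """Prefixes every log line with the identifiers held in `extra`."""

    def process(self, msg, kwargs):
        context = " ".join(f"{key}={value}" for key, value in self.extra.items())
        return f"{context} {msg}", kwargs


def summary_line(outcome, acquirer, duration_ms):
    """Build the single key=value summary line for one transaction (hypothetical fields)."""
    return (
        f"event=transaction_summary outcome={outcome} "
        f"acquirer={acquirer} duration_ms={duration_ms}"
    )


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

# Hypothetical identifiers; in practice these would come from the incoming request.
log = TransactionContext(
    logging.getLogger("payments"),
    {"ApiToken": "tok_example", "ReceiptId": "12345"},
)

# One summary line per transaction, successful or failed.
log.info(summary_line("failed", "example-acquirer", 842))
```

Because the identifiers travel with every line, an alert built on the summary line should not need to join against other log lines to find out which merchant or receipt was involved.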
Once this was completed, we were able to create a transaction monitoring solution within our Splunk™ service, enabling us to visualise failures often before Gateways or Acquirers were aware.
We were now well ahead of any potential failure, with the ability to proactively monitor all our connectivity down to every single transaction flow.
How can we inform merchants sooner?
Statuspal was selected as a public facing status page outside of Judopay’s infrastructure, in order to keep merchants up to date. We knew that our status page would be external to our services and as such, available for merchant visibility at all times.
We chose Statuspal for its simple interface, which merchants can easily navigate, and for its great sign-up service for specific service modules, which lets merchants be informed of any changes to the services that directly affect them.
A major driver for choosing Statuspal was its feature-rich API, which ultimately led to our final solution. The full API feature set can be found at https://www.statuspal.io/api-docs/v2
Specifically, we use the API call that creates a new incident.
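As a rough illustration of what creating an incident could look like from Python, the sketch below builds (but does not send) an HTTP request. The base URL, endpoint path, header format and JSON field names are assumptions and should be checked against the API documentation linked above. The subdomain, API key, service ID and timestamp are placeholders.

```python
import json
import urllib.request

# Assumed base URL; confirm against the Statuspal API documentation.
API_BASE = "https://statuspal.io/api/v2"


def build_incident_request(subdomain, api_key, title, incident_type, service_ids, starts_at):
    """Build a POST request that would create an incident (assumed payload shape)."""
    body = {
        "incident": {
            "title": title,
            "type": incident_type,  # assumed values: "minor" or "major"
            "service_ids": service_ids,
            "starts_at": starts_at,
        }
    }
    return urllib.request.Request(
        url=f"{API_BASE}/status_pages/{subdomain}/incidents",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
request = build_incident_request(
    subdomain="example",
    api_key="YOUR_API_KEY",
    title="Elevated transaction failures",
    incident_type="minor",
    service_ids=[42],
    starts_at="2024-01-01T12:00:00",
)

# Sending is left out on purpose; it would be: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```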
Automating the Status Updates
Following Splunk's reference guide, we created our own Splunk™ / Python™ plugin that pushes custom alert actions to Statuspal, covering latency, transaction failures across merchants, and Judopay’s service availability.
The plugin is written so that when we add Gateways, Acquirers or services to our overall solution, we can easily add Splunk™ triggers that update the Statuspal status for the new Gateway, Acquirer or service, without the need for any re-coding.
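One way to get this "no re-coding" property is to keep everything service-specific in the alert's configuration rather than in the script. A Splunk custom alert action script is invoked with `--execute` and receives a JSON payload on standard input, including a `configuration` object holding the parameters set on the alert. The sketch below assumes hypothetical parameter names (`service_id`, `incident_title`); it is not Judopay's plugin.

```python
import json
import sys


def read_alert(raw_payload):
    """Extract what the plugin needs from the alert payload Splunk passes in."""
    payload = json.loads(raw_payload)
    config = payload.get("configuration", {})
    return {
        "search_name": payload.get("search_name", ""),
        # Hypothetical per-alert parameters: each Gateway / Acquirer alert
        # carries its own values, so the script itself never changes.
        "service_id": int(config["service_id"]),
        "title": config.get("incident_title", "Service disruption"),
    }


if __name__ == "__main__":
    # Splunk runs the script with --execute and the payload on stdin.
    if len(sys.argv) > 1 and sys.argv[1] == "--execute":
        alert = read_alert(sys.stdin.read())
        print(f"alert={alert['search_name']} service_id={alert['service_id']}", file=sys.stderr)
```

With this shape, onboarding a new Gateway or Acquirer would mean creating another alert with its own parameter values and leaving the script untouched.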
We also trigger based on a pre-set failure count and latency check, setting the severity to minor or major accordingly.
The plugin's alert text is customisable: it places specific text into minor, major and resolved alerts, depending on the failure observed.
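The severity decision and the alert text can be sketched as follows. The threshold values, the dictionary layout and the message wording are all hypothetical, chosen only to show how a failure count and a latency check might map to minor or major.

```python
# Hypothetical thresholds; real values would be tuned per Gateway or Acquirer.
THRESHOLDS = {
    "minor": {"failures": 5, "latency_ms": 2000},
    "major": {"failures": 20, "latency_ms": 5000},
}

# Hypothetical message templates for each alert state.
MESSAGES = {
    "minor": "Some transactions via {service} are failing or slow. We are investigating.",
    "major": "Transactions via {service} are failing. We are working on a fix.",
    "resolved": "Transactions via {service} are processing normally again.",
}


def classify(failure_count, latency_ms, thresholds=THRESHOLDS):
    """Return "major", "minor", or None when the service looks healthy."""
    # Check the most severe level first so it wins when both are exceeded.
    for severity in ("major", "minor"):
        limits = thresholds[severity]
        if failure_count >= limits["failures"] or latency_ms >= limits["latency_ms"]:
            return severity
    return None


def message_for(state, service):
    """Fill the template for the given state with the affected service name."""
    return MESSAGES[state].format(service=service)


print(classify(7, 800))
print(message_for("resolved", "Example Acquirer"))
```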
Our Splunk™ alerts manage this automatically, pushing at regular 5-minute intervals. Should an incident continue with failure or high latency, then the status page will:
- Raise a major alert
- Update the alert
- Resolve the alert once the failure reason has been corrected
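The raise, update and resolve cycle can be pictured as a small decision made on every scheduled run. The sketch below is a simplified model under stated assumptions: the function name, the action labels and the sequence of check results are hypothetical, and a real plugin would look up the open incident on the status page rather than track it in a local variable.

```python
def next_action(incident_open, severity):
    """Decide what one scheduled run should do.

    incident_open: True if an unresolved incident already exists.
    severity: None when healthy, otherwise "minor" or "major".
    """
    if severity is None:
        return "resolve" if incident_open else "none"
    return "update" if incident_open else "create"


# Hypothetical results of five consecutive 5-minute checks.
checks = [None, "minor", "major", "major", None]

incident_open = False
actions = []
for severity in checks:
    action = next_action(incident_open, severity)
    actions.append(action)
    # An incident stays open after it is created or updated.
    incident_open = action in ("create", "update")

print(actions)
```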
We then signed up our merchants to their specific service options, so that in future they would receive alerts about any failures, or potential failures, within 5 minutes of Judopay noticing them.
The service provides regular updates until normal service has resumed. Some service failures occur at a Gateway or Acquirer level, in which case Judopay works with our partners to proactively manage and eventually resolve them.
Should there be any instance of a failure that has not been included in the automation, Judopay also manually updates the status page to keep our merchants fully updated. However, once we spot any such instances, we will create an automated process for such failures in the future. We are constantly improving the functionality.
You can sign up and subscribe for alerts to our status page here: https://judopay.statuspal.io/
Then click Subscribe to updates. If you are unsure which service you are currently on, contact firstname.lastname@example.org who will be happy to assist.