Dec 17, 2019 9:18:00 AM
Solvemate, like most technology companies, does not just provide a service powered by its own software; it also connects and integrates with a multitude of partners.
We integrate with CRMs like Salesforce, Zendesk and Freshdesk, depend on partners to send emails and text messages, and have partners that help manage and track billing. Our product would simply not offer the same value if it wasn’t for all of these partnerships.
Partnering with 3rd party providers is a great way to quickly move your business forward and focus on what makes your brand unique. Our development team knows they don’t need to understand the intricacies of sending emails in 2019, they instead know they can rely on our partner to handle those problems and to offer a simple API.
Most of the time this works and we don’t need to think about it.
However, software systems are not that simple. No service is available 100% of the time, and any provider who claims otherwise is lying. Our systems can fail, their systems can fail, cloud providers can fail, internet lines can be cut, data centers can catch fire, the moon can be out of alignment…
Software fails, and if you are not building with that in mind you are bound to create an experience which disappoints your customers.
In Solvemate’s case, this is even more important. The users who interact with our software do so as they seek support from one of our customers. They don’t consider that they are interacting with Solvemate, but instead that they are interacting with our customer. Our users are our customer’s customers. If we fail to provide them with a reliable service, it makes our customers look bad. This is why reliability is one of the most important tenets of the software that we build.
For the rest of this article we will focus on one of our most important 3rd party integrations: how we reliably hand off support tickets to our CRM partners.
If a user goes through a Solvemate chat flow and their issue is complex and requires the direct involvement of a support agent, we need to reliably hand over the chat history and user information into our customer's CRM system.
Solvemate integrates with many different CRMs, but the flow is always fairly similar. We need to collect everything we know so far about the user’s conversation, authenticate ourselves with the 3rd party system and make the appropriate set of API calls.
These API calls are typically quite fast, and most of the time nothing goes wrong. The simplest implementation would contact the CRM’s system the moment a user clicks the handover button and wait for a response before continuing – whether that means connecting them to a live chat with a person or bringing the chatbot conversation to a conclusion. However, what normally takes only a couple of milliseconds can take seconds or even hours in the worst case.
An API call can fail. That’s why third party integrations should typically be handled asynchronously.
These operations that involve 3rd party providers can take much longer than expected, can fail and can continue to fail for long periods of time. We need to be able to take our time and retry them when necessary. To handle this, we should put these tasks into a queuing system and deal with them asynchronously.
In our case, as all of our systems run on the Google Cloud Platform, we’ve chosen to enqueue asynchronous tasks onto their Cloud Pub/Sub product. When we receive a handover request from a user, we quickly enqueue it and immediately respond to the user. This keeps the UI fluid and does not force them to wait for the handoff to actually complete.
Once this task is enqueued we use Google’s event driven serverless platform, Cloud Functions, to process the task and make the handoff to one of our partners. Cloud Functions allows us to write small and simple scripts to handle all of our various integrations.
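The enqueue-then-respond pattern can be sketched as follows. This is a minimal illustration, not Solvemate’s actual code: an in-memory `queue.Queue` stands in for Cloud Pub/Sub, and the function names (`handle_handover_request`, `process_handover_task`) are hypothetical.

```python
import json
import queue

# Stand-in for Cloud Pub/Sub: an in-memory queue of serialized tasks.
task_queue = queue.Queue()

def handle_handover_request(conversation_id, user_info):
    """Called when the user clicks the handover button.

    We only enqueue the task and return immediately, so the UI
    never waits on the CRM's API.
    """
    task = {"conversation_id": conversation_id, "user": user_info}
    task_queue.put(json.dumps(task).encode("utf-8"))
    return {"status": "accepted"}

def process_handover_task(message):
    """Runs asynchronously (in production, as a Cloud Function
    triggered by the Pub/Sub message)."""
    task = json.loads(message.decode("utf-8"))
    # ...authenticate with the CRM and create the ticket here...
    return task["conversation_id"]

# The user gets an immediate response...
response = handle_handover_request("conv-42", {"email": "user@example.com"})
# ...while the task is picked up from the queue separately.
processed = process_handover_task(task_queue.get())
```

The key property is that the user-facing call never blocks on the CRM: it only serializes and enqueues, which is fast and local.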
Now that the function is being executed asynchronously we can safely handle different failure scenarios. Our primary tool to handle failures are retries. When an action fails we simply try it again. But we do need to be careful about when and how we retry these actions.
If at first you don’t succeed, dust yourself off and try again – but when the services are run by your partners, be respectful about how you retry.
The Cloud Functions product allows us to configure our task to automatically be retried in the face of failure; however, this retry logic is too simple for our use case. When a function fails it will try again, and again, and again… only stopping after thousands of attempts over multiple days. The CRMs we work with are our partners, and if for whatever reason they cannot handle the API call, slamming them with thousands of retries will not help.
First of all we want to make sure that our retry mechanism is using Exponential Backoff. This means that if we waited 1 second between our first and second attempt, we should wait 2 seconds for the next one, 4 seconds after that and so forth. Every time we see a failure we wait for a longer period of time before we attempt the process again.
We also want to limit the number of attempts we make. If a task has failed 15 times in a row, it’s very unlikely that the 16th attempt will succeed. At this point we need to make sure the task is moved to another durable location.
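Both ideas – exponential backoff and a cap on attempts – can be sketched together. This is an illustrative outline, not the production implementation: the names (`run_with_retries`, `dead_letter_queue`) and the limit of 15 attempts are assumptions drawn from the example above, and `sleep` is injectable so the backoff schedule can be observed.

```python
import time

MAX_ATTEMPTS = 15
BASE_DELAY_SECONDS = 1

# Stand-in for a durable store; in production this would be a
# database or a dedicated dead-letter topic.
dead_letter_queue = []

def backoff_delay(attempt):
    """Exponential backoff: 1s after the 1st failure, 2s after the
    2nd, 4s after the 3rd, and so on."""
    return BASE_DELAY_SECONDS * (2 ** attempt)

def run_with_retries(task, operation, sleep=time.sleep):
    """Retry `operation` with exponential backoff; after MAX_ATTEMPTS
    failures, park the task durably instead of hammering the partner."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return operation(task)
        except Exception:
            if attempt + 1 == MAX_ATTEMPTS:
                dead_letter_queue.append(task)
                return None
            sleep(backoff_delay(attempt))
```

Doubling the delay each time means that even 15 attempts spread out over hours rather than seconds, giving a struggling partner room to recover.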
Where does an event go when it has failed to run?
To the Dead Letter Queue – the term multiple queueing services use to describe the place that failed events go. In Solvemate’s case, we send these events to a database.
When a task has failed enough times to end up in the Dead Letter Queue we notify our on-call developer and give them the necessary context to understand the error. Typically these kinds of failed events are due to longer term downtime with one of our providers.
The developer responsible for these failures can take the time to research them and work out the proper fix. In the case of long-term downtime, the best answer is often to wait and follow the provider’s status page.
Once we’re confident that the problem has been fixed, we have a system in place to give that developer the ability to replay any events in the Dead Letter Queue.
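Replaying could look something like the sketch below. It is a simplified stand-in: the failed events live in a plain list rather than a database, and `replay_dead_letters` and its `publish` callback are hypothetical names.

```python
def replay_dead_letters(dead_letter_store, publish):
    """Re-enqueue every parked event; run by the on-call developer
    once the underlying outage is confirmed as resolved.

    `dead_letter_store` is a list standing in for the database of
    failed events; `publish` pushes an event back onto the normal
    task queue so it flows through the usual retry machinery.
    """
    replayed = []
    while dead_letter_store:
        event = dead_letter_store.pop(0)  # oldest failure first
        publish(event)
        replayed.append(event)
    return replayed
```

Because replayed events re-enter the normal pipeline, they get the same backoff and retry behaviour as any fresh event – there is no special-case code path to maintain.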
It all depends on the error message - not all operations are safe to retry.
Retrying a failed operation may seem like an easy way to make a system more reliable, but you need to be very careful when and how you retry.
Continuing with the ticket handoff example, if the API call to create a ticket in our customer’s CRM succeeded but the task failed for another reason, our retry mechanism could end up creating 15 duplicate tickets in our customer’s CRM. Duplicating events can be a little less problematic than losing an event altogether, but you will quickly lose customer trust if bugs cause you to pollute their systems with duplicates.
Duplicate events are especially bad when sending emails or text message notifications to users.
To handle this case, we ensure that we only mark a function to be retried if we get a very specific error from our partners. We need to ensure that the ticket was not created in the CRM before we attempt a retry. In any other case, we immediately send the event to the Dead Letter Queue and have a developer look into the issue.
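The decision can be reduced to a small classification step. This is a hedged sketch of the idea, not Solvemate’s actual error taxonomy: `RetryableCRMError` and `decide_next_step` are hypothetical names, and which errors count as safe depends entirely on the partner’s API guarantees.

```python
class RetryableCRMError(Exception):
    """Raised only when we know for certain the ticket was NOT
    created -- e.g. a connection timeout before the request was
    ever sent to the CRM."""

def decide_next_step(error):
    """Retry only on errors known to be safe; anything ambiguous
    goes straight to the Dead Letter Queue for a human to inspect,
    since retrying it might create a duplicate ticket."""
    if isinstance(error, RetryableCRMError):
        return "retry"
    return "dead_letter"
```

The conservative default matters: an unknown error is treated as unsafe, so the worst case is one delayed handoff reviewed by a developer, never fifteen duplicate tickets in a customer’s CRM.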
Another cause of duplication may be the guarantees provided by your queuing system. In our case, Cloud Pub/Sub guarantees that an event will be processed “at least once”. This means that in very rare cases our queueing provider may process the same event twice. We handle this case by using a locking system to ensure that each event is only ever processed once.
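One way to implement such deduplication is a consumer that records which event ids it has already handled. The sketch below uses a local lock and an in-memory set for illustration; in production the equivalent state would live in a shared store (for example, a database row per event id), and the class name is an assumption.

```python
import threading

class IdempotentConsumer:
    """Processes each event id at most once, even if the queue
    delivers it more than once ("at least once" semantics).

    A local lock plus an in-memory set illustrates the idea; a real
    deployment would use a shared, durable store so that multiple
    workers see the same processed-id state.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._processed_ids = set()

    def process(self, event_id, handler):
        # Claim the event id atomically before doing any work.
        with self._lock:
            if event_id in self._processed_ids:
                return False  # duplicate delivery, skip it
            self._processed_ids.add(event_id)
        handler(event_id)
        return True
```

The claim-before-work ordering is what prevents two deliveries of the same event from both reaching the handler.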
Customer service automation allows for more personalised service and a better experience for users when they eventually come into contact with your service agents. However, this is only true if the information and context are successfully transferred to the CRM or other partner software. While failures may not happen often, it is important to design systems for reliability so as not to disappoint customer expectations.
Lost requests are among the top reasons end customers think you don’t care – and subsequently, why they churn. At the end of the day, your end customer doesn’t care that a third-party integration failed and that’s why you lost their request. It is therefore important to select a reliable software provider that is aware of its responsibility and plans important processes and workflows so that you always look good – even in difficult cases.
Meet Alex, Platform Engineer at Solvemate by day, music geek by night. He has been building reliable platforms for various startups over the years, fuelled by his passion for databases and data storage tech. In his spare time you can find him attending concerts, walking his dog or skiing the mountains of his homeland, Canada.