Key Takeaways
- Building a resilient asynchronous workflow poses several challenges, including managing state, retries, auditability, and observability.
- A workflow orchestration solution abstracts away state management and provides retries, auditability, and observability out-of-the-box.
- Using a workflow orchestration solution helps greatly to get a simplified view of any data pipeline, asynchronous system, or event-driven system.
- You can adopt and integrate a workflow orchestration solution like Temporal into your existing platform in stages or use it only for parts of your workflow.
Twilio is a customer engagement platform that allows you to engage with your customers in your application through different channels, such as voice, messaging, WhatsApp, email, and video.
One of the big problems with SMS is spam. Additionally, there’s a lot of phishing going on over SMS (smishing).
Due to the increase in spam messages, many consumers have lost trust in SMS as a form of communication. In the US, A2P 10DLC is the standard that carriers have implemented to regulate this communication pathway.
A2P 10DLC improves the end-user experience by making sure consumers can opt in and out of messaging and also know who is sending them messages. It also benefits businesses, offering them higher messaging throughput, brand awareness, and accountability.
To be compliant with A2P messaging, you need to register your application at three different levels. First, you will register your brand or business, which will be manually reviewed and approved. Next, you need to register your campaign, where you will detail what messages you are going to send, for example, two-factor authentication codes for account setup and login, as well as certain kinds of notifications. Here again, your use case will be vetted by a reviewer, and you may be required to provide additional detail about it. Finally, you are required to register the set of phone numbers you will use to send messages.
In this article, we will discuss what problems we had to solve at Twilio to efficiently build a resilient and scalable asynchronous system to handle a complex workflow implementing A2P Messaging Compliance and the advantages we got from adopting a Workflow Orchestration solution.
How Twilio implemented the A2P compliance platform
We originally built the A2P compliance platform using a state machine to orchestrate the different registration processes described above. The system was built on an event-driven architecture: components communicated through queues, and events were processed at a specific rate to adhere to downstream rate limits. Over time, this platform evolved into a complex system that became hard to maintain (state machines, queues, rate limiting, error handling, auditing, etc.). We then started seeing challenges in scaling both our systems and our teams.
Challenges building a resilient asynchronous workflow
These are the challenges we faced building an event-driven system on top of hand-written state machines:
State Management
The first challenge is state management. The problem here is that you need to contemplate many possible combinations of states and events. For example, the "review received" message could come in while the campaign is in the pending state instead of the relevant waiting state, or an out-of-sequence event could arrive from somewhere else. All of those cases need to be handled, even though they are not the most likely sequence of events and states.
As the number of states and messages grows, so does the complexity of ensuring you handle messages in all states accurately. You may want to handle some states differently, or not at all. If you want to add a new intermediate step, you have to look at all your states and add code to ensure that the new message is reliably handled in each of them. The fundamental problem with a state machine is that you cannot configure it as the sequence of steps required to carry a registration through; you can only configure it in terms of events, states, and actions: "I am getting event X, my database is in state Y, so I'm going to perform action Z." This thinking in terms of state machines and handling state transitions became complex and was often very error prone.
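To make the combinatorial problem concrete, here is a minimal sketch of the event-times-state handling described above. The states, events, and transitions are hypothetical, not Twilio's actual code; the point is that every event must be considered against every state, including out-of-sequence arrivals.

```java
// Illustrative sketch: a tiny campaign-registration state machine.
// Every event/state pair needs an explicit decision, and the number of
// pairs grows multiplicatively as steps are added.
class CampaignStateMachine {
    enum State { PENDING, WAITING_REVIEW, APPROVED, FAILED }
    enum Event { SUBMIT, REVIEW_RECEIVED, ERROR }

    // Given the current persisted state and an incoming event, decide the
    // next state. Unexpected combinations (e.g. REVIEW_RECEIVED while still
    // PENDING) must be handled explicitly, even though they are unlikely.
    static State onEvent(State current, Event event) {
        switch (current) {
            case PENDING:
                if (event == Event.SUBMIT) return State.WAITING_REVIEW;
                // Out-of-sequence: the review arrived before we moved
                // to WAITING_REVIEW. Still has to be handled correctly.
                if (event == Event.REVIEW_RECEIVED) return State.APPROVED;
                break;
            case WAITING_REVIEW:
                if (event == Event.REVIEW_RECEIVED) return State.APPROVED;
                break;
            default:
                break;
        }
        if (event == Event.ERROR) return State.FAILED;
        return current; // ignore events that make no sense in this state
    }
}
```

Adding one new intermediate step to a machine like this means revisiting every existing `case` to decide how the new event interacts with it, which is exactly the maintenance burden described above.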
Retry mechanism
Handling retries becomes a task almost as complex as implementing the primary logic, sometimes even more so. You can implement your retry mechanism in different ways: for example, by storing a retry counter in the database and incrementing it on each failed attempt until you either succeed or reach the maximum allowed number of retries. Alternatively, you could embed the retry counter in the queue message itself, so you dequeue a message, process it, and, if it fails, re-enqueue it with an incremented retry count. In both cases, this imposes a huge overhead on developers.
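A hand-rolled retry loop with a counter, as described above, might be sketched like this (illustrative only; in the real system the counter would be persisted in the database or carried inside the queue message, with backoff between attempts):

```java
import java.util.function.Supplier;

// Illustrative sketch of a do-it-yourself retry mechanism: the developer
// owns the counter, the loop, and the failure bookkeeping.
class RetryingCaller {
    // Invoke the task, retrying on failure up to maxAttempts times.
    // Returns the result, or rethrows the last failure once retries
    // are exhausted.
    static <T> T callWithRetry(Supplier<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // real system: persist the counter, apply backoff
            }
        }
        throw last;
    }
}
```

Every call site that talks to a flaky downstream needs this kind of scaffolding, which is the overhead the article refers to.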
Auditability
The first step in building reliable workflows is to ensure good auditability, which is essential for debugging and identifying issues, or assessing the impact during an outage. In complex systems, developers bear the responsibility of correctly logging information in the appropriate places. However, auditability becomes more challenging when trying to debug issues, as going through the logs can be a huge task. The number of hours spent trying to sift through logs and piece together information, especially during customer escalations or third-party API outages, can be extremely burdensome.
Observability
The next crucial element is observability. This introduces substantial overhead for developers, who need to explicitly instrument the application and actively push every metric themselves.
Adopting a Workflow Orchestration solution
To address the challenges discussed above, we evaluated several workflow orchestration solutions, including Temporal, Apache Airflow, AWS Step Functions, and Netflix Conductor, based on the following criteria.
Firstly, we wanted to write workflows as code, because we had complex business use cases with branching logic, and having workflows as code improves readability, writability and flexibility. The possibility of running Java code was a plus for us, since our initial system was already in Java.
Secondly, we wanted to adopt dynamic workflow execution. Dynamic workflow execution ensures that any new updates to the workflow will be picked up by both new workflows and in-flight workflows. This is especially important for long running workflows.
Thirdly, we wanted rate-limited requests to downstreams. We integrate with a lot of third-party APIs, some of them with very low rate limits of one or two requests per second. To ensure that we adhered to those rate limits and did not get throttled by our downstreams, our previous architecture managed queues to enqueue requests, dequeue them at the specified rate, and handle retries. Something that would help us easily rate-limit downstream requests was a huge plus for us.
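The rate limiting our previous architecture managed by hand around queues boils down to something like a token bucket: before dequeuing and sending a downstream request, check whether a permit is available. A minimal sketch (illustrative, not Twilio's production code):

```java
// Minimal token-bucket rate limiter: permits refill continuously at the
// configured rate, up to a cap of one second's worth of permits.
class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    TokenBucket(long permitsPerSecond) {
        this.capacity = permitsPerSecond;
        this.refillPerNano = permitsPerSecond / 1e9;
        this.tokens = permitsPerSecond;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if a downstream request may be sent now; otherwise the
    // caller should leave the message on the queue and try again later.
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Wiring this correctly around queues, workers, and retries for every low-limit downstream is exactly the kind of plumbing we wanted the orchestrator to absorb.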
Lastly, we looked for something with a managed cloud, because we wanted to maximize our time spent on building the product rather than the infrastructure.
We ultimately chose Temporal because it was the best fit for our criteria: it enables defining workflows as code, supports Java and many other languages, and provides dynamic workflow execution and downstream rate limiting.
Another notable aspect of Temporal's architecture is that execution is fully decoupled from the server, or orchestrator, allowing workers to be scaled up independently of the server and vice versa.
Architecture Migration Using Temporal
Temporal allows us to trigger child workflows owned by different teams. For example, if we want to trigger a workflow as a child of a parent workflow owned by a different team, this is simple to do without the two workflows having to share the same code base. This allowed us to easily draw team boundaries and rearchitect our platform without being constrained by our team structure. We sketched out our workflow designs, using parent-child relationships, then split the workflows across the teams.
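With the Temporal Java SDK, starting another team's workflow as a child looks roughly like the following sketch. It assumes the `io.temporal` SDK dependency; the `CampaignRegistrationWorkflow` interface, the `register` method, and the `campaignId`/`campaignRequest` variables are hypothetical names for illustration, since the teams only need to share the workflow interface, not the implementation.

```java
import io.temporal.workflow.ChildWorkflowOptions;
import io.temporal.workflow.Workflow;

// Inside the parent workflow implementation: create a typed stub for the
// child workflow and invoke it. The child's implementation lives in the
// other team's code base and is executed by their workers.
ChildWorkflowOptions childOptions = ChildWorkflowOptions.newBuilder()
    .setWorkflowId("campaign-registration-" + campaignId) // hypothetical id scheme
    .build();

CampaignRegistrationWorkflow child =
    Workflow.newChildWorkflowStub(CampaignRegistrationWorkflow.class, childOptions);

child.register(campaignRequest); // blocks until the child workflow completes
```

This is what let us draw team boundaries along workflow boundaries rather than along a shared code base.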
Immediate Impact
The first immediate impact we experienced was a change in our thought process. We stopped thinking in terms of states and state transitions, about how the addition of one step would impact the other 20 steps, and started thinking in terms of simple sequences of steps.
We noticed that the teams had a short learning curve and adopted Temporal in every initiative. We started small by running it in hybrid mode, but in no time migrated the entire end-to-end workflow to Temporal.
We were also able to scale engineering teams quickly after adopting Temporal and switching to the new architecture, which had been a challenge for us in the past.
Building Resilient Workflows Made Easy
In summary, the adoption of Temporal as our new architecture has helped us in the following ways:
- State management: we no longer have to think in terms of state management, since it is completely abstracted away from us, and we can reason in terms of a sequence of steps.
- Retries: there is no extra overhead to putting a retry mechanism in place. All we have to do is set a simple configuration defining the maximum number of attempts.
- Auditability: Temporal gives us auditability out of the box. Temporal uses the concept of event sourcing and stores every received event in a log, so you can see all the events you received, all the inputs you provided to a certain downstream, and all the outputs. Everything is essentially logged, making our debugging very efficient and easy.
- Observability: all the metrics we have discussed come ready out of the box, with no explicit instrumentation required from developers.
There are some more features we like and use frequently, such as global rate limiting, signaling, and scheduling.
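The "simple configuration" for retries mentioned above is a retry policy attached to the activity options. A configuration sketch with the Temporal Java SDK might look like this (it assumes the `io.temporal` SDK dependency; the timeout and interval values are illustrative, not our production settings):

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import java.time.Duration;

// Declarative retry policy: the Temporal server drives the retries, so no
// counters, queues, or re-enqueue logic live in application code.
ActivityOptions options = ActivityOptions.newBuilder()
    .setStartToCloseTimeout(Duration.ofSeconds(30))
    .setRetryOptions(RetryOptions.newBuilder()
        .setInitialInterval(Duration.ofSeconds(1)) // first retry after 1s
        .setBackoffCoefficient(2.0)                // exponential backoff
        .setMaximumAttempts(5)                     // give up after 5 tries
        .build())
    .build();
```

Compare this to the hand-rolled counter approach described earlier: the whole retry mechanism collapses into a few builder calls.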
How Temporal interacts with your hosts
Temporal works by decoupling the application code from its execution: the server only orchestrates. Using the managed cloud option, our code runs entirely in-house, and Temporal Cloud orchestrates all of it.
This reduces the risk of vendor lock-in. If Temporal Cloud changes its business model, or we want to move away from Temporal Cloud for whatever reason, all we have to do is spin up our own Temporal cluster in-house, without needing to change anything in our application code.
If your host is dead for some time, the orchestrator still knows exactly where the workflow stopped. If your application stopped after executing step five, for example, that is recorded on the server; when your application comes back up, it asks the server, "I don't know where I am. Where do I start executing the workflow again?" Temporal then sends the entire workflow history to the application, allowing it to check which steps have already been executed and resume from step six.
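The recovery idea above can be sketched in a few lines: the worker replays the recorded history against the ordered steps of the workflow and resumes at the first step the server has not seen complete. This is a simplified illustration of the concept, not the SDK's actual replay machinery.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of history-driven recovery: completed steps are
// skipped on restart, and execution resumes at the first unfinished one.
class ReplaySketch {
    // Given the ordered steps of the workflow and the list of steps the
    // server recorded as completed, return the index to resume from.
    static int resumeIndex(List<String> steps, List<String> completedHistory) {
        Set<String> done = new HashSet<>(completedHistory);
        for (int i = 0; i < steps.size(); i++) {
            if (!done.contains(steps.get(i))) {
                return i;
            }
        }
        return steps.size(); // everything already ran
    }
}
```

For a host that died after "registerBrand" and "registerCampaign", replay would resume at index 2 ("registerNumbers") rather than re-executing the already-completed registrations.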
Using hosted cloud solutions for workflow orchestration
Hosted cloud solutions provide a significant advantage for workflow orchestration by eliminating the need for businesses to manage their own infrastructure. Most modern workflow orchestrators, like Temporal, Apache Airflow, and Conductor, offer hosted cloud solutions. The key advantage is:
Infrastructure Management: Hosted cloud solutions take over the responsibility of infrastructure setup, maintenance, and scaling. This means businesses can focus on developing their workflows and application logic without worrying about the underlying hardware, network configurations, or server capacities. Cloud providers handle these aspects, offering a reliable, secure, and scalable environment for the orchestration tools to operate. This shift not only reduces the operational burden but also allows for quicker deployment and adaptability to changing business needs.
Summary
To sum it all up, if we take a step back and look at all the other solutions we have built, we would use a workflow orchestrator like Temporal anytime we need to manage queues.
You could think of this as a very niche use case, but we think it is an everyday one. For any data pipeline, asynchronous system, or event-driven system, you can use an orchestrator to get a simplified view of your system where state management is completely abstracted away, and you get many capabilities ready out of the box, including retries, observability, auditability, and others.