Service interruptions - 2021/03/10
Update August 2021
During the incident, our high-availability architecture handled the event fairly well, and our DNS fallback succeeded. However, DNS propagation was too slow for users in Australia and New Zealand. Consequently, we are strengthening our architecture, automation and notifications. We are working on the following:
Migrate to a new DNS provider with much faster DNS response and propagation times.
Deploy new servers in new locations (Singapore, Australia and more).
Set up a notification system by phone call and SMS.
A final update will follow with full details of the actions taken since the incident.
The Carbone Team.
10 March 2021 was an unusual day for us and for many other companies, especially our cloud provider OVH, which suffered a major fire. The fire affected several major online services, including those of the French government.
Official OVH news: https://www.ovh.com/world/news/press/cpl1787.fire-our-strasbourg-site
News from Octave Klaba, CEO of OVH: https://twitter.com/olesovhcom/status/1369478732247932929
French news: https://www.lefigaro.fr/faits-divers/strasbourg-important-incendie-sur-le-site-de-l-entreprise-ovh-classe-seveso-20210310
Carbone was slightly impacted, mostly Carbone Studio, and our Carbone Render High Availability architecture was put under heavy strain. Some of you may have noticed timeouts on Carbone Render, and we are sorry for that. We learned a lot.
Here is a chronological summary of events:
At 0:47 am (Paris, France) on Wednesday 10 March 2021, a fire broke out in one of our provider's datacenters. This information became public at 3:42 am (Paris, France). This datacenter hosted one of our front-end servers (IP 188.8.131.52) and some Carbone Render servers.
At 2:48 am (Paris, France), we received timeout alerts from one of our worldwide monitoring services. In fact, some packets were still being routed to the destroyed datacenter. We were not yet aware of the fire at that point.
At 5:30 am (Paris, France), we used our IP failover to instantly redirect the IP 184.108.40.206 to another front-end server in another datacenter.
At 9:34 am (Paris, France), we noticed that some network packets from around the world were still being routed to the destroyed datacenter. We think the failover system was affected by the fact that all datacenters near the fire were switched off later in the night. So we changed our DNS records (render, studio and account) to remove the IP 220.127.116.11 completely from the DNS round-robin. DNS propagation took some time. We think the issue was completely resolved by 10:30 am (Paris, France).
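To illustrate the DNS change described above, here is a minimal Python sketch of how one could check whether a retired IP still appears in a DNS round-robin. It is not the tooling used during the incident. The function names are hypothetical, 192.0.2.10 is a placeholder documentation address rather than a real Carbone IP, and any hostname other than render.carbone.io is an assumption.

```python
import socket
from typing import Callable, Iterable, List, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def hosts_still_using(
    retired_ip: str,
    hostnames: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> List[str]:
    """Return the hostnames whose DNS answer still contains retired_ip."""
    stale = []
    for name in hostnames:
        try:
            addresses = resolver(name)
        except OSError:
            # Resolution failed (socket.gaierror is an OSError): skip this name.
            continue
        if retired_ip in addresses:
            stale.append(name)
    return stale


# Example call (performs real DNS lookups, so it is left commented out):
# hosts_still_using("192.0.2.10", ["render.carbone.io"])
```

Note that such a check only reflects the view of the resolver it runs against. Because of caching, resolvers in other regions may keep returning the old record until its TTL expires, which is why propagation has to be verified from several locations.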
We are currently rebuilding our High Availability architecture and some internal services (GitLab and Continuous Integration, backups).
Here is what we will do in the coming days. Most of these actions were already planned for 2021, but we have changed our priorities:
Next week: spread our High Availability architecture around the world: Canada (Beauharnois), Germany (Frankfurt), Australia (Sydney). Clients in these areas will notice a speed-up. We will announce our new public IP before activating these new zones, in case you have firewall rules to update. We will also create new subdomains for clients who need to force their traffic to stay in France or Europe. That said, the default subdomain (render.carbone.io) will be routed to the closest datacenter (DNS anycast).
Copy backups to another, independent provider (another datacenter is not enough).
Understand why the IP failover did not work properly in this specific case, and find solutions.
Improve our alerting and monitoring systems.
FYI, in 2021 you will be able to easily deploy Carbone Render in your own AWS, GCP and Azure cloud. You can already do this yourself today with our On-Premise solution. Feel free to contact us if you are interested.
The Carbone Team.
Updated on: 10/14/2021