I’m excited to finally be able to start showing what I’ve been working on. A few months back, I took one of the servers in our main cluster that serves the NRE Labs site, and built a separate cluster with it. I did this because the next version of the NRE Labs platform requires a few technologies that can’t be enabled at the container level alone, but instead require a full upgrade of a number of components, all the way down to the cluster node OS.
I am going on leave soon, and have been working toward a milestone that leaves the project in a state where everyone can see the current progress and start to play with it. There’s more work to do before this gets near production quality, but a version of NRE Labs running on this new infrastructure can be found here:
A quick note to set expectations - everything I’m about to talk about is still an experiment. No stability guarantees here. If you run into problems, all I can tell you is try again later. If problems persist, you can provide feedback in this thread. This was originally built as a way of experimenting with different options and there’s a lot of work left to do before this is ready to handle the main NRE Labs site. In addition, this is a single worker node, so it’s possible that resources may be exhausted, depending on how many people hit it at once. Again, if you run into issues, try again later.
New JSNAPy Content using cRPD!
The main reason for this work is to enable new content, and in the short term that means content that uses the new cRPD image. I teased this last week, and I’m pleased to announce that the new cluster allows me to show a revamped version of the JSNAPy lesson, which runs on three cRPD instances!
This is much more than simply swapping out the old vQFX endpoints for cRPD. The existing JSNAPy lesson is one of our oldest, and one of the last lessons that still exists in a form that is pretty similar to when we originally launched NRE Labs back in 2018. Most lessons like these were built not as full-blown lessons to teach a subject, but rather as PoCs for the platform and its capabilities. So, as part of this effort, I took the opportunity to re-think this lesson, and rewrote it from scratch. What you’ll see in this new version will be the first chapter in what will likely become a four or five part lesson on using JSNAPy to write unit tests for your network.
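For anyone who hasn’t used JSNAPy before, the core idea the lesson teaches is that tests are just YAML definitions run against device RPC output. A minimal sketch of what such a test file can look like (the test name, RPC, and fields here are illustrative, not taken from the actual lesson content):

```yaml
# Hypothetical JSNAPy test definition - checks that all BGP peers
# are in the Established state. Names are illustrative only.
tests_include:
  - check_bgp_states

check_bgp_states:
  - rpc: get-bgp-neighbor-information
  - iterate:
      xpath: //bgp-peer
      id: './peer-address'
      tests:
        - is-equal: peer-state, Established
          err: "Peer <{{id_0}}> is NOT Established"
          info: "Peer <{{id_0}}> is Established"
```

JSNAPy then snapshots the RPC output and evaluates each test against it, which is what makes it useful as a unit-testing framework for the network.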
Why a New Cluster?
About a month ago, I posted that I was working on adding support for Kata containers to NRE Labs. I’d like to get into a little more detail on that work and why it was necessary.
I teased in that post that a big part of this was using containerd with the CRI plugin instead of Docker, which is what the old cluster uses. Kubernetes with Docker natively doesn’t offer the flexibility to select a different runtime per container. The CRI spec, however, does, and containerd with the CRI plugin seemed to be the easiest way to get that functionality.
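Concretely, containerd’s CRI plugin registers each runtime as a named “handler”, and Kubernetes selects one per pod through a RuntimeClass object. A minimal sketch, with the caveat that the handler name “kata” is an assumption on my part and has to match whatever is configured in containerd’s config.toml on the node:

```yaml
# Sketch of a RuntimeClass that maps pods to the Kata runtime.
# Assumes containerd's CRI plugin has a runtime handler named "kata";
# the name is illustrative.
apiVersion: node.k8s.io/v1beta1   # RuntimeClass was still beta as of Kubernetes 1.18
kind: RuntimeClass
metadata:
  name: kata
handler: kata
```

Any pod that references this RuntimeClass gets launched with the Kata runtime instead of the default runc.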
However, this isn’t a simple matter of just telling Kubernetes to use a different container management system. There are a bunch of new configurations and pieces of software that need to be put into place, including passing totally different flags to the kubelet. Like everything else, I want this automated so I don’t have to fuss with configs manually inside the cluster. In addition, the software running on the infrastructure has gotten a bit old, so while I was reworking my automation playbooks to install the new container runtimes, I took the opportunity to upgrade almost everything. Here’s a quick summary of what’s changed in the new cluster:
- Upgraded from CentOS 7 to CentOS 8
- Upgraded from Kubernetes 1.14 to 1.18
- Using Containerd+CRI instead of Docker (the latter isn’t even installed on the nodes) - this gets us the ability to use alternate runtimes.
- Kata Containers Runtime - this is what’s used for the “untrusted” flavor, which is the vast majority of images in the NRE Labs curriculum.
- Using the latest Antidote code, which includes a slew of changes to support the above, and more. This includes moving to Go modules, and upgrading the vendored kubernetes Go client to support K8s 1.18, as well as implementing the new image flavor feature which allows us to indirectly select the container runtime for a lesson endpoint.
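To make the “untrusted” flavor concrete: once a RuntimeClass is registered on the cluster, a pod opts into it with a single field in its spec. A hedged sketch (the pod and image names are made up, and this assumes a RuntimeClass named “kata” has already been registered on the cluster):

```yaml
# Hypothetical pod spec for a lesson endpoint running under Kata.
# Assumes a RuntimeClass named "kata" exists on the cluster;
# pod name and image are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: lesson-endpoint-example
spec:
  runtimeClassName: kata        # selects the Kata runtime instead of runc
  containers:
    - name: endpoint
      image: utility:latest     # illustrative image only
```

This is the mechanism the new image flavor feature leans on: the platform sets the runtime class indirectly, so curriculum authors never touch pod specs directly.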
There are a few other components I’m looking at upgrading, but all are less disruptive than the above, so I’m taking those slowly. This isn’t a comprehensive list of what’s changing, but it covers the big rocks for now.
Preview Service is Live On New Cluster
Last week, I mentioned that the preview service was still running on the old cluster. This meant that any previews that were created for pull requests to the curriculum would not have access to any of the new features.
However, as of today, the preview service is now running on the new cluster, and all previews are kicked off using our shiny new Github Actions integration. The benefit is that all previews running on this new cluster naturally have access to all of the improvements we’ve made to the infrastructure and the platform. You’ll have access to the cRPD image, and be able to open pull requests to merge into the master branch of the curriculum, which has long since been incompatible with old versions of Antidote. There is one caveat: the new cluster is pretty small right now, so there may be some compute constraints. I’ll cover this further below, so read on.
Test and Provide Feedback!
I’m doing this now so that in the next few months while I’m away, others can test and provide feedback on the new experience. Please see below for some important caveats, and information on how to provide feedback.
I plan on emailing this post to a few folks - please do not provide feedback via email reply. Use the mechanisms described below, so others can see your feedback and respond if they wish.
The best way to provide feedback is to just reply to this post. So if you’re not sure if what you’re seeing is expected, or a bug or whatever, just post here. However, if you’re pretty sure that what you’re seeing isn’t normal and needs to be fixed, a Github issue may be more appropriate.
Please peruse the list of open issues before creating one, in case someone else has seen the same problem and already opened one.
The following list should help you identify the right repository:
- For problems related to look and feel, open an issue in antidote-web
- For problems related to content, open an issue in nrelabs-curriculum
- For bigger “platform-y” problems, such as lessons outright failing to load or configure properly, open an issue in antidote-core
Important Caveats and Known Issues
As mentioned a few times now, there are a few important caveats to go over. Keep these in mind as you test the new experience:
The new “cluster” is composed of a single node. While I don’t foresee it happening often, it’s not infeasible that this node will become saturated. This could have all kinds of strange effects, including lessons failing to launch, lessons “breaking” right in the middle, etc. Note that since the preview service is also running here, this extends to anything spawned by a preview instance as well. Fortunately, everything is designed to clean up old resources automatically, so waiting an hour or two should resolve these kinds of problems.
To help mitigate the aforementioned constraints on compute power, I may reduce the TTL for various things, including lessons and preview instances. I haven’t done this yet, and plan to post to this thread when I do this so you have a heads up. For now, don’t worry about this, but stay tuned.
Note that the cluster is small so it may get clobbered based on usage. If you’re not able to launch a lesson in a preview or on the main site, try again later.
The production NRE Labs site runs on tagged versions of not only the Antidote platform, but also the NRE Labs curriculum. This is to provide a stable experience, since we’re constantly pushing new changes (and “latest” docker images). However, this cluster is running on the bleeding edge of everything; the main reason is that no stable release has any of this functionality yet. So there are likely problems to fix, and this is a big reason I’m seeking feedback before attempting a release.
The lesson “Automating the Troubleshooting Chain” is a bit flaky right now, due to some unpredictability in how IP addresses are configured on those nodes. The commands may not do what you expect.