NRE Labs Community

Adding Kata support to NRE Labs

All -

It’s probably long past time that I updated you on progress towards something that’s been talked about for a while, but that I just didn’t have the cycles to wrap my head around until now: the ability to support alternative runtimes like Kata Containers. One of the biggest drivers for this is so we can replace all (or at least nearly all) of the vQFX endpoints with the much more sustainable and reliable cRPD, which is a containerized version of Juniper’s routing stack. This will be way more lightweight, faster, and more capable. It’s also an actual product (vQFX is not), so overall the UX will be better. However, it is just a container, and like any network device, it needs an operating system with reasonable containment. A simple container would not do; we need a virtualization layer, and that’s where Kata comes in. It looks and feels like a container but is actually a lightweight VM. I think it makes sense to extend this as the default option for all endpoints, so we can have more security out of the box, and thus far it looks like this works great.

A little side note about cRPD as well. The vQFX had some issues that prevented us from publishing the OpenConfig and JET lessons, which have been sitting in “drafts” for over a year, and I’ve felt bad about this every day since then. I am very excited to see if adding a cRPD image will allow these to move forward - some initial testing has shown a lot of promise.

The interesting thing is that from a code perspective, the changes are very minimal. I have already implemented the necessary changes in https://github.com/nre-learning/antidote-core/pull/189. In short, we’ll have two image flavors. “untrusted” executes the container with the Kata runtime. “trusted” will use the default runtime in privileged mode. I am not aware of a use case that requires the regular runtime without privileged mode, so that’s it for now. The good news is that a third flavor can be added later if needed.
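As a rough illustration of the two flavors described above (this is not the code from the PR; the function name, the `kata` RuntimeClass name, and the shape of the returned dict are all assumptions made for this sketch), the mapping from flavor to pod settings could look something like this:

```python
# Illustrative only: the flavor names come from the post. Everything else
# (function name, the "kata" RuntimeClass name, the returned dict layout)
# is a hypothetical stand-in, not antidote-core's real code.

def pod_settings_for_flavor(flavor, kata_runtime_class="kata"):
    """Map an image flavor to the pod-level settings it implies."""
    if flavor == "untrusted":
        # Run under the Kata runtime. Leaving privileged mode off here
        # is an assumption; the post only says Kata is used.
        return {"runtimeClassName": kata_runtime_class, "privileged": False}
    if flavor == "trusted":
        # Default runtime (no runtimeClassName set), privileged mode.
        return {"runtimeClassName": None, "privileged": True}
    raise ValueError(f"unknown image flavor: {flavor!r}")


if __name__ == "__main__":
    for flavor in ("untrusted", "trusted"):
        print(flavor, pod_settings_for_flavor(flavor))
```

In Kubernetes terms, `runtimeClassName` is a pod spec field and `privileged` lives in a container's `securityContext`; the sketch flattens both into one dict purely for readability.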

The vast majority of the work goes into ensuring that the infrastructure can support this, which is what I’ve been doing for a few weeks now: building out a sibling cluster for the main NRE Labs site that runs updated versions of everything and has all of the components required for alternative runtimes (at the moment, this is containerd with the CRI plugin).

It’s tough for me to update in all places right now, so the antidote-core repo is probably the best place for any updates. Like I said though, the changes to the code pale in comparison to what’s necessary behind the scenes, so it won’t be too much.

Eventually selfmedicate will need to be updated, and I’ve been putting a lot of thought into that lately as well. Since it’s no longer the official method for contributing to the curriculum, I think we can return to a place where we’re hyper-opinionated about how it’s run, so that it can be simpler. In doing some early research for the Kata work, I ran selfmedicate with the kvm2 driver, and it worked flawlessly. Since support on that will be minimal anyway, I’m wondering if it’s time to return to our roots there.

At the moment, I’m blocked on what seems like either a bug or a misunderstanding on my part of Kata’s capabilities. In Antidote we use the subPath option when mounting the lesson directory as a volume to each endpoint pod. This is so that each lesson guide doesn’t have to cd through the whole directory structure just to get to that lesson’s files. Not strictly required, but definitely convenient. However, it appears that using subPath doesn’t work on Kata. I opened an issue to explain this in more detail and hopefully get answers.
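For readers unfamiliar with subPath, here is a minimal sketch of the kind of volume mount being described. The `subPath`, `mountPath`, and `name` keys are real fields of a Kubernetes volumeMount; the function name, the volume name `lesson-volume`, the mount path `/antidote`, and the example lesson directory are hypothetical and not taken from Antidote itself.

```python
# Hypothetical helper: builds a Kubernetes volumeMount entry that exposes
# only one lesson's directory inside the endpoint container.

def lesson_volume_mount(lesson_subdir, mount_path="/antidote",
                        volume_name="lesson-volume"):
    """Return a volumeMount dict that mounts just `lesson_subdir`."""
    # Kubernetes expects subPath to be relative to the volume root.
    if lesson_subdir.startswith("/") or ".." in lesson_subdir.split("/"):
        raise ValueError("subPath must be a relative path inside the volume")
    return {
        "name": volume_name,        # must match a volume defined on the pod
        "mountPath": mount_path,    # where the lesson files appear
        "subPath": lesson_subdir,   # directory within the volume to mount
    }


if __name__ == "__main__":
    # Example lesson path is made up for illustration.
    print(lesson_volume_mount("lessons/example-lesson"))
```

Without `subPath`, the whole volume would be mounted at `mountPath`, and each lesson guide would need to cd down to its own directory first.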

I am also updating the curriculum to be compatible with this. In this PR I am minimally updating the JSNAPy lesson to use cRPD as a PoC, while also of course adding the new flavor parameter to all images. There will inevitably be changes for the rest of the curriculum, though: not just removing other instances of vQFX and updating those lessons accordingly (the commands may be slightly different), but also dealing with the fact that running some of the software in Kata may have unintended implications I haven’t foreseen. My plan, obviously, is to test all of these lessons on this new cluster and make fixes/updates as needed. More than likely this will be in a separate PR once I’ve finished with the JSNAPy PoC.

I’m very excited about this, and apologize that it’s been hard to publish updates, as I’ve been heads-down working on this for quite some time. I am hoping that in the next few weeks the actual work will plateau a bit (this is already becoming true on the infra side, as I have a pretty stable, automated cluster build that I’m hoping will come in handy when it comes time to update selfmedicate) and that I’ll have more cycles to spend on updating here and in other media. There are plenty of updates that have been in prod for some time that I would still like to do more outreach on, such as our OpenTracing instrumentation, which is proving itself to be super handy. For now, back to the lab!