RFC: New APIs for delegation of privileged operations

Andrea Bolognani
Nov 29, 2022, 12:05:39 PM
to libvir...@redhat.com, kubevi...@googlegroups.com
Hi,

this is a proposal for introducing a new family of APIs in libvirt,
with the goal of improving integration with management applications.

KubeVirt is intended to be the primary consumer of these APIs.


Background
----------

KubeVirt makes it possible to run VMs on a Kubernetes cluster, side
by side with containers.

It does so by running QEMU and libvirtd themselves inside a
container. The architecture is explained in more detail at

https://kubevirt.io/user-guide/architecture/

but for the purpose of this discussion we only need to keep in mind
two components:

* virt-launcher

- runs in the same container as QEMU and libvirtd
- one instance per VM

* virt-handler

- runs in a separate container
- one instance per node

Conceptually, these two components roughly map to QEMU processes and
libvirtd respectively.

From a security perspective, there is a strong push in Kubernetes to
run workloads under unprivileged user accounts and without additional
capabilities. Again, this is similar to how libvirtd itself runs as
root but the QEMU processes it starts are under the unprivileged
"qemu" account.

KubeVirt has been working towards the goal of running VMs as
completely unprivileged workloads and has made excellent progress so far.

Some of the operations needed for running a VM, however, inherently
require elevated privilege. In KubeVirt, the conundrum is solved by
having virt-handler (a privileged component) take care of those
operations, making it possible for virt-launcher (as well as QEMU and
libvirtd) to run in an unprivileged context.


Examples
--------

Here are a few examples of how KubeVirt has been able to reduce the
privilege required by virt-launcher by selectively handing over
responsibilities to virt-handler:

* Remove SYS_RESOURCE capability from launcher pod
https://github.com/kubevirt/kubevirt/pull/2584

* Drop SYS_RESOURCE capability
https://github.com/kubevirt/kubevirt/pull/5558

* Housekeeping cgroup
https://github.com/kubevirt/kubevirt/pull/8233

* Real time VMs fail to change vCPU scheduler and priority in
non-root deployments
https://github.com/kubevirt/kubevirt/pull/8750

* virt-launcher: Drop SYS_PTRACE capability
https://github.com/kubevirt/kubevirt/pull/8842

The pattern we can see is that, initially, libvirt just assumes that
it can perform a certain privileged operation. This fails in the
context of KubeVirt, where libvirtd runs with significantly reduced
privileges. As a consequence, libvirt is patched to be more resilient
to such lack of privilege: for example, instead of attempting to
create a file and erroring out due to lack of permissions, it will
instead first check whether the file already exists and, if it does,
assume that it has been prepared ahead of time by an external entity.
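
For illustration, the check-before-create pattern amounts to something
like the following sketch (not actual libvirt code; names are made up):

  #include <fcntl.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Sketch only: tolerate a resource prepared ahead of time by an
   * external entity instead of failing with EACCES on creation. */
  static int
  ensureFileExists(const char *path, mode_t mode)
  {
      struct stat sb;
      int fd;

      /* Already prepared by a privileged helper (e.g. virt-handler)?
       * Then there is nothing left for us to do. */
      if (stat(path, &sb) == 0)
          return 0;

      /* Otherwise create it ourselves; this is the step that needs
       * privilege in the original flow. */
      if ((fd = open(path, O_CREAT | O_WRONLY, mode)) < 0)
          return -1;
      close(fd);
      return 0;
  }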


Limitations
-----------

This approach works fine, but only for the privileged operations that
would be performed by libvirt before the VM starts running.

Looking at the "housekeeping cgroup" PR in particular, we notice that
the VM is initially created in paused state: this is necessary in
order to create a point in time in which all the VM threads already
exist but, crucially, none of the vCPUs have started running yet. This
is the only opportunity to move threads across cgroups without
invalidating the expectations of a real time workload.

When it comes to live migration, however, there is no way to create
similar conditions, since the VM is running on the destination host
right out of the gate. As a consequence, live migration has to be
blocked when the housekeeping cgroup is in use, which is an
unfortunate limitation.

Moreover, there's an overall sense of fragility surrounding these
interactions: both KubeVirt and, to some extent, libvirt need to be
acutely aware of what the other component is going to do, but there
is never an explicit handover and the whole thing only works if you
just so happen to do everything with the exact right order and
timing.


Proposal
--------

In order to address the issues outlined above, I propose that we
introduce a new set of APIs in libvirt.

These APIs would expose some of the inner workings of libvirt, and
as such would come with *massively reduced* stability guarantees
compared to the rest of our public API.

The idea is that applications such as KubeVirt, which track libvirt
fairly closely and stay pinned to specific versions, would be able to
adapt to changes in these APIs relatively painlessly. More
traditional management applications such as virt-manager would simply
not opt into using the new APIs and maintain the status quo.

Using memlock as an example, the new API could look like

  typedef int (*virInternalSetMaxMemLockHandler)(pid_t pid,
                                                 unsigned long long bytes);

  int virInternalSetProcessSetMaxMemLockHandler(virConnectPtr conn,
                                                virInternalSetMaxMemLockHandler handler);

The application-provided handler would be responsible for performing
the privileged operation (in this case raising the memlock limit for
a process). For KubeVirt, virt-launcher would have to pass the baton
to virt-handler.

If such a handler is installed, libvirt would invoke it (and likely
go through some sanity checks afterwards); if not, it would attempt
to perform the privileged operation itself, as it does today.
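
To make the handover more concrete, registering such a handler could
look roughly like the sketch below. None of these entry points exist
today, and notifyVirtHandler() is just a stand-in for whatever IPC
virt-launcher would use to reach virt-handler:

  #include <sys/types.h>
  #include <libvirt/libvirt.h>
  #include <libvirt/libvirt-internal.h>   /* hypothetical header */

  /* Stand-in for the IPC asking the privileged virt-handler
   * component to raise the limit on our behalf. */
  extern int notifyVirtHandler(pid_t pid, unsigned long long bytes);

  /* Matches the proposed virInternalSetMaxMemLockHandler signature. */
  static int
  kubevirtSetMaxMemLock(pid_t pid, unsigned long long bytes)
  {
      return notifyVirtHandler(pid, bytes);
  }

  static int
  installHandlers(virConnectPtr conn)
  {
      /* If no handler is registered, libvirt keeps performing the
       * operation itself, preserving today's behavior. */
      return virInternalSetProcessSetMaxMemLockHandler(conn,
                                                       kubevirtSetMaxMemLock);
  }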

This would make the interaction between libvirt and the management
application explicit rather than implicit. Not having to stick to our
usual API stability guarantees would make it possible to be more
liberal in exposing the internals of libvirt as interaction points.


Scope
-----

I think we should initially limit the new APIs to the scenarios that
have already been identified, then gradually expand the scope as
needed. In other words, we shouldn't comb through the codebase
looking for potential adopters.

Since the intended consumers of these APIs are those that can
adopt a new libvirt release fairly quickly, this shouldn't be a
problem.

Once the pattern has been established, we might consider introducing
support for it at the same time as we add new features that could
benefit from it.


Caveats
-------

libvirt is all about API stability, so introducing an API that is
unstable *by design* is completely uncharted territory.

To ensure that the new APIs are 100% opt-in, we could define them in
a separate <libvirt/libvirt-internal.h> header. Furthermore, we could
have a separate libvirt-internal.so shared library for the symbols
and a corresponding libvirt-internal.pc pkg-config file. We could
even go as far as requiring a preprocessor symbol such as

VIR_INTERNAL_UNSTABLE_API_OPT_IN

to be defined before the entry points are visible to the compiler.
Whatever the mechanism, we would need to make sure that it's usable
from language bindings as well.
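
For illustration, opting in from the consumer side might look like
this (all names hypothetical, matching the ideas above):

  /* Explicit opt-in: without this define, the unstable declarations
   * would be hidden from the compiler entirely. */
  #define VIR_INTERNAL_UNSTABLE_API_OPT_IN 1
  #include <libvirt/libvirt-internal.h>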

Internal APIs are expected not only to come and go, but also to change
semantics between versions. We should make sure that such changes are
clearly exposed to the user, for example by requiring them to pass a
version number to the function and erroring out immediately if the
value doesn't match our expectations. KubeVirt has a massive suite of
functional tests, so this kind of change would immediately be spotted
when a new version of libvirt is imported, with no risk of an
incompatibility lingering in the codebase until it affects users.
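
As a sketch of the versioning idea, extending the earlier example with
a hypothetical VIR_INTERNAL_API_VERSION constant baked into the header
at compile time:

  /* The caller passes the version it was compiled against; libvirt
   * fails fast on mismatch instead of misbehaving subtly later on. */
  if (virInternalSetProcessSetMaxMemLockHandler(conn,
                                                VIR_INTERNAL_API_VERSION,
                                                kubevirtSetMaxMemLock) < 0) {
      /* Mismatch reported through the usual virGetLastError() channel;
       * KubeVirt's functional tests would trip over this immediately. */
      return -1;
  }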


Disclaimer
----------

This proposal is intentionally vague on several of the details.
Before attempting to nail those down, I want to gather feedback on
the high-level idea, both from the libvirt and KubeVirt side.


Credits
-------

Thanks to Michal and Martin for helping shape and polish the idea
from its initial rough state.

--
Andrea Bolognani / Red Hat / Virtualization

Andrea Bolognani
Nov 30, 2022, 11:48:38 AM
to libvir...@redhat.com, kubevi...@googlegroups.com, Fabian Deutsch
On Tue, Nov 29, 2022 at 09:05:33AM -0800, Andrea Bolognani wrote:
> Hi,
>
> this is a proposal for introducing a new family of APIs in libvirt,
> with the goal of improving integration with management applications.
>
> KubeVirt is intended to be the primary consumer of these APIs.

This proposal was cross-posted to libvir-list and kubevirt-dev with
the hope that both communities would weigh in on it.

However, I've been notified that kubevirt-dev doesn't accept posts
from non-members so all replies so far, coming from libvir-list
subscribers, have only been delivered to other libvir-list
subscribers.

Is there a chance kubevirt-dev could start accepting posts from
non-members? Possibly with manual approval for first-time posters?
If I'm not mistaken, this is exactly the configuration libvir-list
uses.

Stu Gott
Nov 30, 2022, 3:01:29 PM
to Andrea Bolognani, Andrew Burden, libvir...@redhat.com, kubevi...@googlegroups.com, Fabian Deutsch
+Andrew Burden This sounds like a reasonable request. Who would be in charge of manual moderation if we took this approach?

Andrew Burden
Dec 1, 2022, 5:17:24 AM
to Stu Gott, Andrea Bolognani, libvir...@redhat.com, kubevi...@googlegroups.com, Fabian Deutsch
Hello,
Yeah, that's how the kubevirt-dev mailing list works - or is supposed to work - as well. Emails from non-members are held pending until manually approved. I don't see anything pending, though, so we have an issue there.
Perhaps the mechanism in Google Groups is being thrown off by the inclusion of a second mailing list?
I shall look into it.

Andrew Burden
Dec 2, 2022, 9:08:45 AM
to Daniel P. Berrangé, Stu Gott, kubevi...@googlegroups.com, Fabian Deutsch, Andrea Bolognani, libvir...@redhat.com
This should now be working as intended.
Sorry for the inconvenience, folks.


On Thu, Dec 1, 2022 at 12:36 PM Daniel P. Berrangé <berr...@redhat.com> wrote:
On Thu, Dec 01, 2022 at 11:17:02AM +0100, Andrew Burden wrote:
> Hello,
> Yeah that's how the kubevirt-dev mailing list works - or is supposed to
> work - as well. Emails from non-members are pending until manually
> approved. I don't see anything pending though so we have an issue there.
> Perhaps the mechanism in googlegroups is being thrown off by the inclusion
> of a second mailing list?
> I shall look into it.

Test reply to see if it gets to kubevirt-dev now that config changes
have been made.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Martin Kletzander
Dec 7, 2022, 6:52:44 AM
to Daniel P. Berrangé, Peter Krempa, kubevi...@googlegroups.com, Andrea Bolognani, libvir...@redhat.com
On Thu, Dec 01, 2022 at 10:17:49AM +0000, Daniel P. Berrangé wrote:
>On Wed, Nov 30, 2022 at 09:47:02AM +0100, Peter Krempa wrote:
>> On Tue, Nov 29, 2022 at 09:05:33 -0800, Andrea Bolognani wrote:
>>
>> [...]
>>
>> > Proposal
>> > --------
>> >
>> > In order to address the issues outlined above, I propose that we
>> > introduce a new set of APIs in libvirt.
>> >
>> > These APIs would expose some of the inner workings of libvirt, and
>> > as such would come with *massively reduced* stability guarantees
>> > compared to the rest of our public API.
>> >
>> > The idea is that applications such as KubeVirt, which track libvirt
>> > fairly closely and stay pinned to specific versions, would be able to
>> > adapt to changes in these APIs relatively painlessly. More
>> > traditional management applications such as virt-manager would simply
>> > not opt into using the new APIs and maintain the status quo.
>> >
>> > Using memlock as an example, the new API could look like
>> >
>> > typedef int (*virInternalSetMaxMemLockHandler)(pid_t pid,
>> >                                                unsigned long long bytes);
>> >
>> > int virInternalSetProcessSetMaxMemLockHandler(virConnectPtr conn,
>> >                                               virInternalSetMaxMemLockHandler handler);
>>
>> As proposed, with 'pid'/'bytes' as arguments the callbacks don't
>> really seem to provide any form of privilege restriction to libvirt.
>>
>> Libvirt can choose to invoke the callback on any PID it likes and the
>> application registering the callback doesn't have any way to check that
>> the pid actually belongs to the started VM.
>>
>> I'm afraid that with many of those you'll have the same issue.
>>
>> At the very least the API/callback should carry additional information
>> outlining which object the callback invocation is tied to.
>>
>> (I understand that kubevirt most likely can assume that it's the single
>> VM it's handling inside that specific container, though)
>>
>> By carefully scoping this, though, the management layer can at least in
>> some situations trust the data (e.g. if no untrusted code has run yet).
>> In fact libvirt does the same e.g. when probing vCPU thread pids at
>> startup (while cpus are paused).
>>
>> I simply want to point out that with this mechanism the code inside
>> the "unprivileged" container is not as unprivileged as it seems, since
>> it has backdoors into higher layers to do privileged operations, and
>> the abstraction in the callbacks needs to be carefully chosen so that
>> it is not simply a security charade.
>
>Yep, authorization policy is the single biggest design challenge
>with doing privilege separation and it is way too easy to screw
>it up, even with very simple operations.
>
>In the case of the libvirt polkit access control, polkit uses
>SCM_RIGHTS to identify the calling process, and prompts the user
>for a password. We also provide the VM uuid, name and id alongside.
>The key thing there though is that the uuid, name + id have been
>validated by libvirt, which is the trusted component.
>
>The other end of the
>
> virInternalSetProcessSetMaxMemLockHandler
>
>wouldn't have the ability to validate the VM identity even if we
>passed it, since the master source of VM identity info is
>the unprivileged and untrusted component.
>
>This means it is a big challenge to do more than just a blanket
>allow/deny for the entire 'max mem lock' feature, rather than try
>to finesse it per VM.
>

Exactly what I was afraid of with another approach I discussed with
someone else a while ago. If you start providing ways to do arbitrary
privileged operations, then you are effectively giving away privileged
access.

In this case I think it was an unfortunate choice of an API. If the API
is *designed* to provide the proper identifying information, then the
management application can choose the action properly.
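
For instance (purely illustrative), the callback could carry the
object it is tied to rather than a bare PID:

  /* Sketch: passing the domain object gives the handler something to
   * correlate against instead of blindly trusting a naked PID. */
  typedef int (*virInternalSetMaxMemLockHandler)(virDomainPtr dom,
                                                 pid_t pid,
                                                 unsigned long long bytes);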

>> > The application-provided handler would be responsible for performing
>> > the privileged operation (in this case raising the memlock limit for
>> > a process). For KubeVirt, virt-launcher would have to pass the baton
>> > to virt-handler.
>> >
>> > If such an handler is installed, libvirt would invoke it (and likely
>> > go through some sanity checks afterwards); if not, it would attempt
>> > to perform the privileged operation itself, as it does today.
>>
>> I think we also need to carefully consider how to do the callbacks in
>> our RPC. Currently events don't wait for the client to respond and these
>> would need to do so.
>>
>> Also use of these should e.g. require keepalive to be used.
>
>What you're describing is no longer callbacks, but it is method
>calls operating in the reverse direction to what they do today.
>Furthermore this would tie up the RPC worker dispatch thread
>while waiting for a response. This could easily result in a
>deadlock situation if the client app servicing this request
>was busy doing something else, or had to wait on its own
>locks, but those locks were held by another thread that is
>waiting on libvirt. Then there's the complexity of tying
>this up into the actual RPC implementations of both client
>and server.
>

My idea for avoiding callback failures was to split
e.g. virDomainCreate() similarly to how virDomainMigrate...() is,
i.e. into phases which would be invisible to the caller unless they
want to do "something special" with the domain. Of course this does not
work for actions initiated by guest/libvirt, but to start small this was
one of the ideas.

>This really just reinforces the point that this doesn't belong
>in the public API / RPC layer at all. It needs to all work via
>a plugin loaded in-process. That plugin can call out to whatever
>service it wants to use, whether this uses DBus or REST or
>something else is up to the plugin author.
>

One counterargument to this was that it would require mgmt app
developers to write an .so, or at least a C-compatible plugin. While I
don't see that as an issue (apart from it possibly hindering adoption)
I can offer an idea which came up before we even discussed this
particular topic. It actually started from the other end, but let me
give you some idea of how it might be a more usable approach for mgmt
apps like kubevirt.

If we provide a way to initialise this "plugin" at runtime, then there
is no need to mess around with reloading the daemon. If we provide a
way to do this over the already existing connection, then it might be
even easier to use. However that would require a way to transfer a
coded algorithm and to run it (ideally with reduced access to the
system for security purposes). A format that could be pretty usable
for this is WASM. And before anyone rolls their eyes give me a minute
to mention a few points about it:

- there is no limit to the source code language used

- it can be built specifically for the libvirt daemon dynamically (if
someone wants that)

- it can be run safely by giving it access only to a limited set of
predefined APIs

- libvirt can run it itself, the runtime can be built in

I'm not saying this is the solution to the points in question, just an
idea I had a few months ago which led nowhere because I did not have
enough time to make a simple bytecode run in a C program.

Of course most of the above points can be solved without utilising WASM,
e.g. by allowing plugins to be handled by admin APIs, running them in a
sandbox, etc.

Martin

Daniel P. Berrangé
Dec 7, 2022, 7:25:40 AM
to Martin Kletzander, Peter Krempa, kubevi...@googlegroups.com, Andrea Bolognani, libvir...@redhat.com
On Wed, Dec 07, 2022 at 12:42:06PM +0100, Martin Kletzander wrote:
> On Thu, Dec 01, 2022 at 10:17:49AM +0000, Daniel P. Berrangé wrote:
> > The other end of the
> >
> > virInternalSetProcessSetMaxMemLockHandler
> >
> > wouldn't have the ability to validate the VM identity even if we
> > passed it, since the master source of VM identity info is
> > the unprivileged and untrusted component.
> >
> > This means it is a big challenge to do more than just a blanket
> > allow/deny for the entire 'max mem lock' feature, rather than try
> > to finesse it per VM.
> >
>
> Exactly what I was afraid of with another approach I discussed with
> someone else a while ago. If you start providing ways to do arbitrary
> privileged operations, then you are effectively giving away privileged
> access.
>
> In this case I think it was an unfortunate choice of an API. If the API
> is *designed* to provide the proper identifying information, then the
> management application can choose the action properly.

I think it is challenging no matter what because the privileged component
is placing trust in the unprivileged component to supply honest identifying
info. This is a key reason why polkit ACL checks are done based on
process ID + permission name. Process ID can't be faked, and you're asking
the user to confirm the honesty of the permission name.

Overall, I think if you're going to allow "mem lock" to an unprivileged
VM that's fine, but the expectation should be that we're allowing this
for *any* VM, without being able to offer per-VM access control on that
usage.

I think the experience with migration would strongly bias me against
decomposing the startup process. We had no choice with migrate but
to decompose because we needed to co-ordinate across multiple hosts,
but it has been a horrible burden, requiring us to re-design the
internal migration steps many many times. I don't want us to get
into that trap for other APIs.

> > This really just reinforces the point that this doesn't belong
> > in the public API / RPC layer at all. It needs to all work via
> > a plugin loaded in-process. That plugin can call out to whatever
> > service it wants to use, whether this uses DBus or REST or
> > something else is up to the plugin author.
> >
>
> One counterargument to this was that it would require mgmt app
> developers to write an .so, or at least a C-compatible plugin. While I
> don't see that as an issue (apart from it possibly hindering adoption)
> I can offer an idea which came up before we even discussed this
> particular topic. It actually started from the other end, but let me
> give you some idea of how it might be a more usable approach for mgmt
> apps like kubevirt.

I don't see the need to write an .so as an unreasonable burden.
The plugin does not need to contain any significant amount of
logic. If desired it could be nothing more than a shim which talks
to a separate $LANG daemon over an RPC system. Even if wanting a
self-contained plugin, it's possible to write shared libraries in
common system languages like C, Go, Rust, as they can expose shim
functions with C linkage semantics. If you really wanted, it would
be possible to write a C shim that loads the python/perl/ruby/wasm/
etc runtime, and then write your plugins in dynamic languages too.
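
To sketch how thin such a shim could be (the plugin entry point and
the forwarding function below are both made-up names):

  #include <sys/types.h>

  /* All real logic lives in the mgmt app's own daemon, reached over
   * DBus/REST/whatever the plugin author prefers. */
  extern int forwardToHandlerDaemon(pid_t pid, unsigned long long bytes);

  /* Hypothetical entry point that libvirt would resolve after
   * dlopen()ing the plugin .so. */
  int
  virPluginSetMaxMemLock(pid_t pid, unsigned long long bytes)
  {
      return forwardToHandlerDaemon(pid, bytes);
  }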

> If we provide a way to initialise this "plugin" at runtime, then there
> is no need to mess around with reloading the daemon. If we provide a
> way to do this over the already existing connection, then it might be
> even easier to use. However that would require a way to transfer a
> coded algorithm and to run it (ideally with reduced access to the
> system for security purposes). A format that could be pretty usable
> for this is WASM. And before anyone rolls their eyes give me a minute
> to mention a few points about it:
>
> - there is no limit to the source code language used
>
> - it can be built specifically for the libvirt daemon dynamically (if
> someone wants that)
>
> - it can be run safely by giving it access only to a limited set of
> predefined APIs
>
> - libvirt can run it itself, the runtime can be built in
>
> I'm not saying this is the solution to the points in question, just an
> idea I had a few months ago which led nowhere because I did not have
> enough time to make a simple bytecode run in a C program.
>
> Of course most of the above points can be solved without utilising WASM,
> e.g. by allowing plugins to be handled by admin APIs, running them in a
> sandbox, etc.

I feel like the type of functionality we need plugins to support is
not something that we need to, nor should, make dynamically configurable
at runtime. If we need to have matched logic at startup and shutdown of
a VM, or plug and unplug of a device to a running VM, it creates an
unnecessary problem for us to allow this to be dynamically changed. It
should be more than sufficient to configure this at host deployment
time prior to starting libvirtd. For kubevirt usage this is certainly
trivial, since they deploy and start a new libvirt at VM startup.

If we had static plugins, someone could write a loadable .so library
that contained the WASM runtime, and exposed an RPC API for injecting
WASM code into a running libvirt, allowing the things you describe
above, without having to tie libvirt to that approach.

Martin Kletzander
Dec 8, 2022, 4:12:31 AM
to Daniel P. Berrangé, Peter Krempa, kubevi...@googlegroups.com, Andrea Bolognani, libvir...@redhat.com
What I meant was something more along the lines of "place_vm_in_cgroup" where we
would offload the cgroup placement to kubevirt. It already has to trust us with
the information we provide now (which threads are placed in which cgroups). The
benefit of having the callback per-connection and calling it on the connection
that is starting the VM would kind of make the argument easier.
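
A rough signature for that idea, purely as a sketch (the name and
parameters are made up):

  /* Sketch: per-connection callback invoked on the connection that is
   * starting the VM, asking the mgmt app to do the cgroup placement. */
  typedef int (*virInternalPlaceVMInCgroupHandler)(virDomainPtr dom,
                                                   pid_t task,
                                                   const char *cgroupPath);
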
Fair point.

>> > This really just reinforces the point that this doesn't belong
>> > in the public API / RPC layer at all. It needs to all work via
>> > a plugin loaded in-process. That plugin can call out to whatever
>> > service it wants to use, whether this uses DBus or REST or
>> > something else is up to the plugin author.
>> >
>>
>> One counterargument to this was that it would require mgmt app
>> developers to write an .so, or at least a C-compatible plugin. While I
>> don't see that as an issue (apart from it possibly hindering adoption)
>> I can offer an idea which came up before we even discussed this
>> particular topic. It actually started from the other end, but let me
>> give you some idea of how it might be a more usable approach for mgmt
>> apps like kubevirt.
>
>I don't see the need to write an .so as an unreasonable burden.

Me neither. I suggested that if someone wants to mend the behaviour of libvirt
in a specific way (e.g. requesting that the daemon does not call some function),
that can already be solved (or at least tried and tested) by LD_PRELOAD-ing a
library. The first time it was said in a bit of anger, but thinking about it
the point is still valid. Of course that's not fitting for any production usage,
for which we'd want something at least resembling a plug-in system.

Once there is an interface defined (for example the plugin) then it can do
anything, of course. The advantage was mainly with code transfer from the
client to the server; once that is out of the question (well, it was never in)
the other advantages fall apart too.

Static plug-ins are the safest (and probably easiest) choice. The argument
about writing them, like most of the other arguments, is mostly speculation
until people from the mgmt apps weigh in with their opinions.

Daniel P. Berrangé
Dec 8, 2022, 4:24:31 AM
to Martin Kletzander, Peter Krempa, kubevi...@googlegroups.com, Andrea Bolognani, libvir...@redhat.com
In terms of privilege questions, I think kubevirt is rather a special
case, because they don't have any privilege boundary between libvirt,
qemu or virt-launcher, which are all running inside the same POD. Their
plugin would also be requesting something to be done on their behalf
outside the VM POD. In terms of privilege checks they won't need
anything more fine grained than a PID check, as that's sufficient to
identify the POD that is requesting the resource, from any other process.