Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes BZ-2123209: Real time VMs fail to change vCPU scheduler and priority in non-root deployments #8750

Merged

Conversation

jordigilh
Copy link
Contributor

@jordigilh jordigilh commented Nov 9, 2022

This PR addresses an issue found during the deployment of real time workloads in non-root deployments where libvirt was unable to change the scheduling and priorities of the vCPU threads due to lack of CAP_SYS_NICE.
The solution proposed here is to re-implement the logic that changes the scheduling and priority in virt-handler for non-root deployments. The old logic that adds the libvirt XML knobs remains there for root based deployments.

Release note:

Fixes an issue that prevented running real time workloads in non-root configurations due to libvirt's dependency on CAP_SYS_NICE to change the vcpu's thread's scheduling and priority to FIFO and 1. The change of priority and scheduling is now executed in the virt-launcher for both root and non-root configurations, removing the dependency in libvirt.

@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L labels Nov 9, 2022
@jordigilh jordigilh changed the title [DRAFT] Fixes BG2123209: Realtime VMs fail to change vCPU scheduler and priority in non-root deployments [DRAFT] Fixes BZ2123209: Realtime VMs fail to change vCPU scheduler and priority in non-root deployments Nov 9, 2022
@jordigilh jordigilh changed the title [DRAFT] Fixes BZ2123209: Realtime VMs fail to change vCPU scheduler and priority in non-root deployments [DRAFT] Fixes BZ-2123209: Realtime VMs fail to change vCPU scheduler and priority in non-root deployments Nov 9, 2022
@jordigilh jordigilh force-pushed the fix/BG2123209_RT_VM_nonroot_fail_bootup branch 2 times, most recently from 8e2ae88 to e47f95f Compare November 9, 2022 17:32
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Nov 9, 2022
@jordigilh jordigilh changed the title [DRAFT] Fixes BZ-2123209: Realtime VMs fail to change vCPU scheduler and priority in non-root deployments [DRAFT] Fixes BZ-2123209: Real time VMs fail to change vCPU scheduler and priority in non-root deployments Nov 9, 2022
@jordigilh jordigilh force-pushed the fix/BG2123209_RT_VM_nonroot_fail_bootup branch from e47f95f to 092ea36 Compare November 9, 2022 17:45
@jordigilh jordigilh force-pushed the fix/BG2123209_RT_VM_nonroot_fail_bootup branch from 092ea36 to ed8058e Compare November 16, 2022 19:33
@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Nov 16, 2022
@jordigilh jordigilh changed the title [DRAFT] Fixes BZ-2123209: Real time VMs fail to change vCPU scheduler and priority in non-root deployments Fixes BZ-2123209: Real time VMs fail to change vCPU scheduler and priority in non-root deployments Nov 16, 2022
@jordigilh jordigilh force-pushed the fix/BG2123209_RT_VM_nonroot_fail_bootup branch from ed8058e to efb6134 Compare November 16, 2022 21:21
…e <cpusched> knobs when running in a non-root environment or the VM labeled as non-root

Signed-off-by: Jordi Gil <jgil@redhat.com>
… would fail during the manifest validation

Signed-off-by: Jordi Gil <jgil@redhat.com>
…t environment

Signed-off-by: Jordi Gil <jgil@redhat.com>
@jordigilh jordigilh force-pushed the fix/BG2123209_RT_VM_nonroot_fail_bootup branch from efb6134 to 315a5fb Compare November 16, 2022 21:25
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Nov 16, 2022
@jordigilh
Copy link
Contributor Author

@vladikr @iholder-redhat @rmohr PTAL.

@jordigilh
Copy link
Contributor Author

The old logic that adds the libvirt XML knobs remains there for root based deployments.

I'm unsure whether to keep the old logic or not. Preferably it would be better to stick to one logic (this PR's implementation for both non-root and root use cases), but I'd like your input on this.

@rmohr
Copy link
Member

rmohr commented Nov 17, 2022

I'm unsure whether to keep the old logic or not. Preferably it would be better to stick to one logic (this PR's implementation for both non-root and root use cases), but I'd like your input on this.

Does this mean that with your change we could drop CAP_SYS_NICE for root and non-root? If the logic you introduce works for both cases, yep let's converge :)

}
log.Log.V(5).Object(vmi).Infof("set process %+v memlock rlimits to: Cur: %[2]d Max:%[2]d",
qemuProcess, memlockSize.Value())

return nil
}

// GetQEMUProcess encapsulates and exposes the logic to retrieve the QEMU process ID
func GetQEMUProcess(parentPID int) (ps.Process, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks to me like this should be more of a method on the isolation result, than a general purpose method where you pass in an arbitrary pid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I will move this to be a func method to IsolationResult

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Field: field.Child("domain", "cpu", "dedicatedCpuPlacement").String(),
})
}
// if nonroot {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left over

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops.. good catch! 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@iholder101 iholder101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @jordigilh! +1!
Left mostly nits and small comments

Comment on lines 27 to 28
type SchedParam C.sched_param
type Policy uint32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these be unexposed?

Suggested change
type SchedParam C.sched_param
type Policy uint32
type schedParam C.sched_param
type policy uint32

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely!. I will change their scope 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// except that in this case it uses a map[string]maskType instead of a bit array.
func parseCPUMask(mask string) (map[string]maskType, error) {

if len(strings.TrimSpace(mask)) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if len(strings.TrimSpace(mask)) == 0 {
if strings.TrimSpace(mask) != "" {

Copy link
Contributor Author

@jordigilh jordigilh Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code I should return an empty map if the mask is either empty or full of spaces. Only return error if the parsing fails.

	vcpus := make(map[string]maskType)
	if strings.TrimSpace(mask) == "" {
		return vcpus, nil
	}

The reason is that mask is a string and will never be nil, so it should always treat the empty case as default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!
Just one note, you can return a nil map instead of an initialized one. It would save the space of allocating an empty map. isRealtimeVCPU() will work with that as the len of a nil map is 0.

	if strings.TrimSpace(mask) == "" {
		return nil, nil
	}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, thanks for the suggestion! 😄

Copy link
Contributor Author

@jordigilh jordigilh Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After further thought I think a mask filled with spaces should be considered invalid, compared with an empty string.

	if len(mask) == 0 {
		return nil, nil
	}

So basically:

Entry("Empty mask", "", nil, nil),
Entry("Empty mask with spaces", "  ", nil, fmt.Errorf("invalid mask value '  ' in '  '")),

WDYT?

if startID < 0 {
return nil, fmt.Errorf("invalid vcpu mask start index `%d`", startID)
}
if endID < 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can endID be zero? Maybe switch with endID <= 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, good point. I will add a test case to cover this 0-0 range as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after adding the test it seems to be working already as it is:

Entry("Correctly extracts a single range with one core", "0-0", map[string]maskType{"0": enabled}, nil),

Are you OK to leave the logic as it is or you'd feel more comfortable changing it as you initially suggested?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting.
Well, as long as it works + covered with a unit test I guess we're fine :)

func schedSetScheduler(pid int, policy Policy, param SchedParam) error {
_, _, e1 := unix.Syscall(unix.SYS_SCHED_SETSCHEDULER, uintptr(pid), uintptr(policy), uintptr(unsafe.Pointer(&param)))
if e1 != 0 {
return syscall.Errno(e1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cast is unnecessary

Suggested change
return syscall.Errno(e1)
return e1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

. "github.com/onsi/gomega"
)

var _ = Describe("Set pthread scheduling type and priority", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I suggest to have one Describe with different Contexts inside so it would look a bit clearer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following you here. This Describe section has only 5 entries/tests to run in the same context. Are you talking in general or just to this Describe section?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'm not sure why we need 3 deffierent Describes, each with a Context and a DescribeTable. I'd go with one Describe and 3 Contexts.

But of course it's a styling manner and might be subjective. Feel free to ignore if you don't feel the same as I :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, makes sense to me. I will migrate the Describes into Context and add a top Describe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, please take a look.


var _ = Describe("Set pthread scheduling type and priority", func() {
Context("when parsing the thread command for CPU ID", func() {
DescribeTable("extracts the CPU ID", func(comm []byte, cpuID string, parseOK bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits:

  1. would change comm []byte to comm string and cast inside
  2. cpuID -> expectedCpuID, parseOK -> expectParseOK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing is that I need it to be []byte inside the function:

v := vcpuRegex.FindSubmatch(comm)

So casting it to string when the source is already a byte array and then recasting it to []byte doesn't sound that good. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point.

@jordigilh
Copy link
Contributor Author

I'm unsure whether to keep the old logic or not. Preferably it would be better to stick to one logic (this PR's implementation for both non-root and root use cases), but I'd like your input on this.

Does this mean that with your change we could drop CAP_SYS_NICE for root and non-root? If the logic you introduce works for both cases, yep let's converge :)

Terrific ... I will remove the old logic 😄

@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. size/L and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/XL labels Dec 1, 2022
@jordigilh jordigilh force-pushed the fix/BG2123209_RT_VM_nonroot_fail_bootup branch from 9b4c891 to ada476f Compare December 1, 2022 13:38
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Dec 1, 2022
@iholder101
Copy link
Contributor

@jordigilh seems like a bazel file isn't configured correctly. make should solve that

@jordigilh jordigilh force-pushed the fix/BG2123209_RT_VM_nonroot_fail_bootup branch from ada476f to 29f371c Compare December 1, 2022 14:37
Signed-off-by: Jordi Gil <jgil@redhat.com>
@jordigilh jordigilh force-pushed the fix/BG2123209_RT_VM_nonroot_fail_bootup branch from 29f371c to f365f1d Compare December 1, 2022 14:37
@iholder101
Copy link
Contributor

/lgtm
/hold cancel

thanks @jordigilh!

@kubevirt-bot kubevirt-bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Dec 1, 2022
@jordigilh
Copy link
Contributor Author

/test pull-kubevirt-e2e-k8s-1.24-sig-compute-realtime

@jordigilh
Copy link
Contributor Author

/test pull-kubevirt-e2e-k8s-1.24-sig-compute-realtime-root

@jordigilh
Copy link
Contributor Author

/retest

3 similar comments
@jordigilh
Copy link
Contributor Author

/retest

@jordigilh
Copy link
Contributor Author

/retest

@jordigilh
Copy link
Contributor Author

/retest

@kubevirt-commenter-bot
Copy link

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@jordigilh
Copy link
Contributor Author

/retest

@kubevirt-bot kubevirt-bot merged commit 458a768 into kubevirt:main Dec 2, 2022
@EdDev
Copy link
Member

EdDev commented Dec 4, 2022

Hello there,

Unfortunately, this PR seems to have broken my local setup from building/running the e2e test binary.
The change here required a CGO linkage at the virt-handler package.

Sadly, this in turn exposed a nasty dependency from the e2e tests to the package here.

We seem to have several issues we better fix soon:

  • Avoid CGO linkage on virt-handler except for very special cases. I am not sure if this was discussed here directly.
  • Do not depend in the e2e test on internal packages which are not dedicated to be formal API/s. It is preferred to have duplication over such dependency.
  • Add a build job for the e2e test, which can detect such issues. Or any other alternative to try and avoid such cases in the future.

I will try to send a quick fix later today, but it is a small workaround, not fixing the core problem.
I hope we could avoid the C binding here.

@EdDev
Copy link
Member

EdDev commented Dec 4, 2022

For completeness, the errors seen while building the e2e tests are these:

/lib64/libc.so.6: version `GLIBC_2.32'
/lib64/libc.so.6: version `GLIBC_2.34'

@xpivarc
Copy link
Member

xpivarc commented Feb 1, 2023

/cherry-pick release-0.58

@kubevirt-bot
Copy link
Contributor

@xpivarc: #8750 failed to apply on top of branch "release-0.58":

Applying: Disabled admission check for real time manifest and avoid defining the <cpusched> knobs when running in a non-root environment or the VM labeled as non-root
Using index info to reconstruct a base tree...
M	pkg/virt-api/webhooks/validating-webhook/admitters/vmi-create-admitter.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/virt-api/webhooks/validating-webhook/admitters/vmi-create-admitter.go
CONFLICT (content): Merge conflict in pkg/virt-api/webhooks/validating-webhook/admitters/vmi-create-admitter.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Disabled admission check for real time manifest and avoid defining the <cpusched> knobs when running in a non-root environment or the VM labeled as non-root
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-0.58

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants