Exponential backoff and jitter for agent reconnections

Project goal: Add improved reconnection algorithm to agents

Skills to study/improve: Java

NOTE: This idea is published as a draft under active discussion, but it is confirmed in principle. It is FINE to apply to it. The scope and the suggested implementation may change significantly before the final version is published. Sections like quickstart guide and newbie-friendly issues may be also missing. As a contributor, you are welcome to request additional information and to join the discussions using channels linked on this page.

Details

Background

Jenkins administrators are often facing a particular problem when restarting a large controller. It can trigger a thundering herd of agents trying to reconnect. Remoting and Swarm (a Remoting wrapper) have (separate) retry implementations. Swarm’s supports exponential backoff but not jitter. Remoting’s supports neither. Users have requested both in the form of bug reports and pull requests. But initial pull requests need a lot of improvement before they can be merged. Additionally, it would be desirable to unify the two implementations rather than to continue to maintain separate logic.

This would be a good GSoC project for the following reasons:

  • Intellectually stimulating and therefore a high "brag factor": who couldn’t resist spending a summer working on a topic that the AWS Architecture Blog discusses?

  • High demand: Companies like Netflix needs this feature and would likely be willing to test it

  • Feasible within three months: while Remoting is a complex subsystem, retrying logic is at the outer edges of the subsystem. One does not need to concern oneself with what happens inside of the guts of Remoting, just how to restart the connection process at a high level.

  • Stretchable and shrinkable scope: If the intern is struggling, scope can be reduced to unification of the existing implementations, or merely implementing exponential backoff rather than exponential backoff and jitter. If the intern is doing better than expected, the scope can be expanded to writing a stress testing framework for the end result.

  • Not fatal if the project fails: Even if all the intern manages to complete is a unification of the existing two implementations, that would be a net benefit. If the project fails, no harm: the status quo would simply remain.

Project Difficulty Level

Intermediate

Project Size

175 hours

Expected outcomes

Release of an updated remoting component that implements a similar exponential backoff and jitter functionality as in the Swarm plugin.

Details to be clarified interactively, together with the mentors, during the Contributor Application drafting phase.

Potential Mentors

Project Links

Organization Links

> Go back to other GSoC 2023 project ideas