Optimizing RTC bandwidth estimation with machine learning

Bandwidth estimation (BWE) and congestion control play a vital role in delivering high-quality real-time communication (RTC) across Meta’s family of apps.
We’ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport.
We’re sharing our experiment results from this approach, some of the challenges we encountered during execution, and learnings for new adopters.

Our existing bandwidth estimation (BWE) module at Meta is based on WebRTC’s Google Congestion Controller (GCC). We have made several improvements through parameter tuning, but this has resulted in a more complex system, as shown in Figure 1.
Figure 1: BWE module’s system diagram for congestion control in RTC.
One challenge with the tuned congestion control (CC)/BWE algorithm was that it had many parameters and actions that were dependent on network conditions. For example, there was a trade-off between quality and reliability; improving quality for high-bandwidth users often led to reliability regressions for low-bandwidth users, and vice versa, making it difficult to optimize the user experience for different network conditions.
Additionally, we noticed some inefficiencies in improving and maintaining the complex BWE module:

Due to the absence of realistic network conditions during our experimentation process, fine-tuning the parameters for user clients required multiple attempts.
Even after the rollout, it wasn’t clear if the optimized parameters were still applicable for the targeted network types.
This resulted in complex code logic and branches for engineers to maintain.

To solve these inefficiencies, we developed a machine learning (ML)-based, network-targeting approach that offers a cleaner alternative to hand-tuned rules. This approach also allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport.
Network characterization
An ML model-based approach leverages time series data to improve bandwidth estimation by using offline parameter tuning for characterized network types.
For an RTC call to be established, the endpoints must be connected to each other through network devices. The optimal configs that have been tuned offline are stored on the server and can be updated in real time. During call connection setup, these optimal configs are delivered to the client. During the call, media is transferred directly between the endpoints or through a relay server. Depending on the network signals collected during the call, the ML-based approach characterizes the network into different types and applies the optimal configs for the detected type.
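A minimal sketch of how such a server-side config lookup might work; the config keys, values, and function names here are hypothetical illustrations, not Meta’s actual parameters:

```python
# Hypothetical per-network-type configs, tuned offline and stored server-side.
# Keys and values are illustrative placeholders only.
OFFLINE_TUNED_CONFIGS = {
    "random_loss": {"loss_tolerance_pct": 10, "rampup_factor": 1.5, "fec_ratio": 0.2},
    "bursty_loss": {"loss_tolerance_pct": 2, "rampup_factor": 1.0, "fec_ratio": 0.1},
    "default": {"loss_tolerance_pct": 5, "rampup_factor": 1.2, "fec_ratio": 0.1},
}

def select_config(detected_network_type: str) -> dict:
    """Return the offline-tuned config for the network type the ML model detected."""
    return OFFLINE_TUNED_CONFIGS.get(detected_network_type, OFFLINE_TUNED_CONFIGS["default"])

# At call setup the server delivers the configs; during the call the client
# applies the one matching the currently detected network type.
config = select_config("random_loss")
```

Because the configs live on the server, they can be updated in real time without shipping a new client build.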
Figure 2 illustrates an example of an RTC call that’s optimized using the ML-based approach.
Figure 2: An example RTC call configuration with optimized parameters delivered from the server based on the current network type.
Model learning and offline parameter tuning
At a high level, network characterization consists of two main components, as shown in Figure 3. The first component is offline ML model learning, using ML to categorize the network type (random packet loss versus bursty loss). The second component uses offline simulations to tune parameters optimally for the categorized network types.
Figure 3: Offline ML-model learning and parameter tuning.
For model learning, we leverage the time series data (network signals and non-personally identifiable information; see Figure 6, below) from production calls and simulations. Compared to the aggregate metrics logged after the call, time series data captures the time-varying nature of the network and its dynamics. We use FBLearner, our internal AI stack, for the training pipeline and deliver the PyTorch model files on demand to the clients at the beginning of the call.
For offline tuning, we use simulations to run network profiles for the detected types and choose the optimal parameters for the modules based on improvements in technical metrics (such as quality, freezes, etc.).
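The offline tuning step can be sketched as a parameter sweep over simulated network profiles. Everything below is an illustrative stand-in under stated assumptions: `run_simulation` substitutes for a real network simulator, and the scoring heuristic (quality minus freezes) is a hypothetical metric weighting:

```python
from itertools import product

def run_simulation(profile: dict, params: dict) -> dict:
    # Stand-in for a real network simulator; returns technical metrics.
    quality = params["rampup_factor"] * profile["bandwidth_kbps"] / 1000
    freeze = max(0.0, profile["loss_pct"] - 10 * params["fec_ratio"])
    return {"quality": quality, "freeze": freeze}

def tune(profiles: list, grid: dict) -> dict:
    """Sweep candidate parameters over simulated profiles and keep the best combo."""
    best_params, best_score = None, float("-inf")
    for rampup, fec in product(grid["rampup_factor"], grid["fec_ratio"]):
        params = {"rampup_factor": rampup, "fec_ratio": fec}
        score = 0.0
        for p in profiles:
            metrics = run_simulation(p, params)
            # Reward quality improvements, penalize freezes.
            score += metrics["quality"] - metrics["freeze"]
        if score > best_score:
            best_params, best_score = params, score
    return best_params
```

A real pipeline would score many more metrics per profile, but the structure (simulate, score, keep the argmax) is the same.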
Model architecture
From our experience, we’ve found that it’s important to combine time series features with non-time series features (i.e., metrics derived from the time window) for highly accurate modeling.
To handle both time series and non-time series data, we’ve designed a model architecture that can process input from both sources.
The time series data passes through a long short-term memory (LSTM) layer that converts the time series input into a one-dimensional vector representation, such as 16×1. The non-time series, or dense, data passes through a dense layer (i.e., a fully connected layer). The two vectors are then concatenated, to fully represent the network condition in the past, and passed through a fully connected layer again. The final output from the neural network model is the predicted output of the target task, as shown in Figure 4.
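A minimal PyTorch sketch of this combined architecture, assuming a 16-dimensional LSTM summary vector as in the text; the feature counts, layer sizes, and class names are illustrative, not the production model:

```python
import torch
import torch.nn as nn

class CombinedNetworkModel(nn.Module):
    """Illustrative LSTM + dense model combining both input sources."""

    def __init__(self, ts_features: int, dense_features: int,
                 hidden: int = 16, num_classes: int = 2):
        super().__init__()
        # LSTM condenses the time series into a fixed-size vector (e.g., 16x1).
        self.lstm = nn.LSTM(input_size=ts_features, hidden_size=hidden, batch_first=True)
        # Dense branch for the non-time-series (derived/window) features.
        self.dense = nn.Linear(dense_features, hidden)
        # Fully connected head over the concatenated representation.
        self.head = nn.Linear(hidden * 2, num_classes)

    def forward(self, ts: torch.Tensor, dense: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(ts)              # h_n: (1, batch, hidden)
        ts_vec = h_n[-1]                         # (batch, hidden)
        dense_vec = torch.relu(self.dense(dense))
        combined = torch.cat([ts_vec, dense_vec], dim=1)
        return self.head(combined)               # logits for the target task

model = CombinedNetworkModel(ts_features=4, dense_features=8)
# Batch of 2 samples, each with 10 time steps of 4 signals plus 8 dense features.
logits = model(torch.randn(2, 10, 4), torch.randn(2, 8))
```

The same backbone can serve different targets (classification or prediction) by swapping the head.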
Figure 4: Combined-model architecture with LSTM and dense layers.
Use case: Random packet loss classification
Let’s consider the use case of categorizing packet loss as either random or congestion-related. The former is caused by network components, and the latter by limits in queue length (which are delay dependent). Here is the ML task definition:
Given the network conditions in the past N seconds (10), and given that the network is currently incurring packet loss, the goal is to characterize the packet loss at the current timestamp as RANDOM or not.
Figure 5 illustrates how we leverage the architecture to achieve that goal:
Figure 5: Model architecture for a random packet loss classification task.
Time series features
We leverage the following time series features gathered from logs:
Figure 6: Time series features used for model training.
BWE optimization
When the ML model detects random packet loss, we perform local optimization on the BWE module by:

Increasing the tolerance to random packet loss in the loss-based BWE (holding the bitrate).
Increasing the ramp-up speed, depending on the link capacity, at high bandwidths.
Increasing network resiliency by sending additional forward-error-correction packets to recover from packet loss.
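The three adjustments above can be sketched as follows; this is a minimal illustration, and the config field names and thresholds are hypothetical, not the actual BWE module interface:

```python
def apply_random_loss_optimizations(bwe_config: dict, link_capacity_kbps: float) -> dict:
    """Hypothetical local BWE adjustments when the model flags random packet loss."""
    cfg = dict(bwe_config)
    # 1. Tolerate more random loss in the loss-based BWE, holding the bitrate.
    cfg["loss_tolerance_pct"] = max(cfg["loss_tolerance_pct"], 10)
    # 2. Ramp up faster when the link capacity indicates high bandwidth.
    if link_capacity_kbps > 1000:
        cfg["rampup_factor"] *= 1.5
    # 3. Send extra forward-error-correction packets to recover lost media.
    cfg["fec_ratio"] = min(cfg["fec_ratio"] + 0.1, 0.5)
    return cfg

base = {"loss_tolerance_pct": 2, "rampup_factor": 1.0, "fec_ratio": 0.1}
tuned = apply_random_loss_optimizations(base, link_capacity_kbps=2000)
```

The key point is that the loss is treated as non-congestive, so the estimator holds or grows the bitrate instead of backing off.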

Network prediction
The network characterization problem discussed in the previous sections focuses on classifying network types based on past information using time series data. For such simple classification tasks, hand-tuned rules can achieve this, albeit with some limitations. The real power of leveraging ML for networking, however, comes from using it to predict future network conditions.
We have applied ML to congestion-prediction problems to optimize the experience of low-bandwidth users.
Congestion prediction
From our analysis of production data, we found that low-bandwidth users often incur congestion due to the behavior of the GCC module. By predicting this congestion, we can improve reliability for these users. Toward this, we addressed the following problem statement using round-trip time (RTT) and packet loss:
Given the historical time-series data from production/simulation (“N” seconds), the goal is to predict packet loss due to congestion, or the congestion itself, in the next “N” seconds; that is, a spike in RTT followed by a packet loss or a further growth in RTT.
Figure 7 shows an example from a simulation where the bandwidth alternates between 500 Kbps and 100 Kbps every 30 seconds. As we lower the bandwidth, the network incurs congestion, and the ML model predictions fire the green spikes even before the delay spikes and packet loss occur. This early prediction of congestion enables faster reactions and thus improves the user experience by preventing video freezes and connection drops.
Figure 7: Simulated network scenario with alternating bandwidth for congestion prediction.
Generating coaching samples
The main challenge in modeling is generating training samples for a variety of congestion situations. With simulations, it’s hard to capture the different types of congestion that real user clients encounter in production networks. As a result, we used actual production logs for labeling congestion samples, applying the RTT-spike criteria to the past and future windows according to the following assumptions:

Absent past RTT spikes, packet losses in the past and future are independent.
Absent past RTT spikes, we can’t predict future RTT spikes or fractional losses (i.e., flosses).

We split the time window into past (four seconds) and future (four seconds) for labeling.
Figure 8: Labeling criteria for congestion prediction.
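The labeling criteria above can be sketched as a simple function over the past and future windows. This is an illustration under stated assumptions: the RTT-spike threshold and the exact positive-label condition are hypothetical placeholders, not Meta’s actual criteria:

```python
def label_congestion(rtt_ms: list, loss: list, split_idx: int,
                     rtt_spike_ms: float = 300.0) -> int:
    """Label one sample: 1 = congestion, 0 = not, per the past/future window split."""
    past_rtt, future_rtt = rtt_ms[:split_idx], rtt_ms[split_idx:]
    future_loss = loss[split_idx:]
    past_spike = any(r > rtt_spike_ms for r in past_rtt)
    # Per the assumptions, without a past RTT spike the future losses and RTT
    # spikes are unpredictable, so only spike-preceded windows can be positive.
    if past_spike and (any(future_loss) or max(future_rtt) > max(past_rtt)):
        return 1
    return 0
```

In the actual pipeline `split_idx` would mark the boundary between the four-second past and future windows of each logged sample.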
Model performance
Unlike network characterization, where ground truth is unavailable, for congestion prediction we can obtain ground truth by inspecting the future time window after it has passed and then comparing it with the prediction made four seconds earlier. With this logging information gathered from real production clients, we compared the performance of offline training to online data from user clients:
Figure 9: Offline versus online model performance comparison.
Experiment results
Here are some highlights from our deployment of various ML models to improve bandwidth estimation:
Reliability wins for congestion prediction
✅ connection_drop_rate -0.326371 +/- 0.216084
✅ last_minute_quality_regression_v1 -0.421602 +/- 0.206063
✅ last_minute_quality_regression_v2 -0.371398 +/- 0.196064
✅ bad_experience_percentage -0.230152 +/- 0.148308
✅ transport_not_ready_pct -0.437294 +/- 0.400812
✅ peer_video_freeze_percentage -0.749419 +/- 0.180661
✅ peer_video_freeze_percentage_above_500ms -0.438967 +/- 0.212394
Quality and user engagement wins for random packet loss characterization in high bandwidth
✅ peer_video_freeze_percentage -0.379246 +/- 0.124718
✅ peer_video_freeze_percentage_above_500ms -0.541780 +/- 0.141212
✅ peer_neteq_plc_cng_perc -0.242295 +/- 0.137200
✅ total_talk_time 0.154204 +/- 0.148788
Reliability and quality wins for cellular low bandwidth classification
✅ connection_drop_rate -0.195908 +/- 0.127956
✅ last_minute_quality_regression_v1 -0.198618 +/- 0.124958
✅ last_minute_quality_regression_v2 -0.188115 +/- 0.138033
✅ peer_neteq_plc_cng_perc -0.359957 +/- 0.191557
✅ peer_video_freeze_percentage -0.653212 +/- 0.142822
Reliability and quality wins for cellular high bandwidth classification
✅ avg_sender_video_encode_fps 0.152003 +/- 0.046807
✅ avg_sender_video_qp -0.228167 +/- 0.041793
✅ avg_video_quality_score 0.296694 +/- 0.043079
✅ avg_video_sent_bitrate 0.430266 +/- 0.092045
Future plans for applying ML to RTC
From our project execution and experimentation on production clients, we noticed that an ML-based approach is more efficient at targeting, end-to-end monitoring, and updating than traditional hand-tuned rules for networking. However, the efficiency of ML solutions depends largely on data quality and labeling (using simulations or production logs). By applying ML-based solutions to network prediction problems, congestion in particular, we fully leveraged the power of ML.
In the future, we will consolidate all the network characterization models into a single model using a multi-task approach, to fix the inefficiency caused by redundancy in model downloads, inference, and so on. We will build a shared representation model for the time series to solve different tasks (e.g., bandwidth classification, packet loss classification, etc.) in network characterization. We will focus on building realistic production network scenarios for model training and validation. This will enable us to use ML to identify optimal network actions given the network conditions. We will continue refining our learning-based methods to improve network performance by taking existing network signals into account.

https://engineering.fb.com/2024/03/20/networking-traffic/optimizing-rtc-bandwidth-estimation-machine-learning/
