Paths to improved laser reliability
Implementing strategies that slash laser diode failure rates by at least two orders of magnitude can banish reliability-related delays in the deployment of co-packaged optics.
BY ROBERT HERRICK FROM ROBERT HERRICK CONSULTING
Over the last two decades, fibre-optic transceivers have played an increasingly critical role in modern data centres. However, as these data centres have scaled up, concerns have escalated surrounding energy consumption. For more than ten years there’s been a proven solution, ‘co-packaged optics’ – it’s an architecture that is capable of power savings of 30-70 percent – but industry is reluctant to adopt this approach, held back by concerns associated with reliability and maintainability.
A number of issues have hampered laser reliability in transceivers for the last 30 years, a situation that’s not helped by poor design choices. The emphasis has been on ease-of-fabrication, cost, performance, and supply chain diversity – an agenda that’s neglected reliability.
This has led to a number of high-profile failures that have made transceiver reliability a source of considerable apprehension amongst buyers. The earliest of these failures occurred on the first ‘low-cost’ (sub-$1,000) transceiver, known as the Gigabit Link Module, or GLM for short. In this case, failure of the entire population started after just a few years of deployment, and the product had to be recalled.
A few years later, makers of VCSELs had challenges deploying the first generation of parallel optical links. The first parallel-optic module based on this class of laser never made it to product release, while the second had to be pulled off the market and the product line permanently cancelled, only a year or so after initial deployment. It then took a few more years for engineers to figure out how to solve the problem of dark-line defects (see Figure 1) that propagate from the edge of the die.
Figure 1. Dark Line Defects (DLDs) are dislocation networks that grow
in lasers from crystallographic or mechanical defects, increasing
optical loss and reducing laser output power. These have been
responsible for most transceiver reliability failures. Examples can be
seen from a failed GaAs 850nm VCSEL (left) and an AlGaInAs 1310 nm
cleaved-facet laser (right).
Despite these woes, one could argue that lessons have not been learned, with reliability issues continuing to plague lasers. Recently, chipmakers deployed low-cost AlGaInAs edge-emitters with failure rates ten times higher than the target. This oversight had severe ramifications, including significant delays and cost over-runs during data-centre bring-ups.
There are concerns even among typical transceivers. The key metric for reliability is FIT, the number of failures in time per billion device operating hours. For an ‘average’ transceiver the FIT is around 200-300, implying that a typical switch with 32 transceivers has a 10-40 percent chance of needing maintenance during its deployment lifetime. For this reason, most switches are currently designed with front-panel pluggable transceivers, which allow failing units to be quickly removed and replaced. However, according to data centre operators, there’s a significant cost associated with identifying failed links and performing maintenance.
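The 10-40 percent figure can be reproduced with a back-of-envelope calculation. The sketch below assumes a constant, independent failure rate for each transceiver (an exponential model) and deployment lifetimes of three and five years; both assumptions are illustrative and are not taken from field data.

```python
import math

FIT_HOURS = 1e9        # FIT counts failures per billion device-hours
HOURS_PER_YEAR = 8760


def prob_any_failure(fit_per_device, n_devices, years):
    """Probability that at least one of n devices fails within the period,
    assuming constant and independent failure rates (exponential model)."""
    rate_per_hour = fit_per_device * n_devices / FIT_HOURS
    return 1.0 - math.exp(-rate_per_hour * years * HOURS_PER_YEAR)


for fit in (200, 300):
    for years in (3, 5):  # assumed deployment lifetimes
        p = prob_any_failure(fit, 32, years)
        print(f"{fit} FIT, 32 transceivers, {years} years: {p:.0%}")
```

Under these assumptions the result runs from roughly 15 percent (200 FIT, three years) to roughly 34 percent (300 FIT, five years), which sits inside the quoted range.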
Figure 2. Today’s optics (upper diagram) rely on power-hungry retimer
chips to recover rapidly degrading electrical signals that are far from
the switch ASIC. New co-packaged optics (lower diagram) place mini
fibre-optic transceivers much closer to the switch ASIC, so that
retimers can be eliminated.
The desire for embedded optics
Another consideration is that as data rates increase, it is more difficult to maintain signal integrity over links of just a few centimetres. To address this matter, engineers are adding ‘re-timer’ circuits that clean up signals after they propagate the 20 cm required by front-panel pluggable designs (see Figure 2 (a)). That’s an imperfect solution, as the power consumption of these re-timers is significant. To trim this, one can mount smaller fibre-optic transceivers that don’t require re-timers right next to the switch ASIC (see Figure 2 (b)). Note, though, that these miniaturised transceivers are usually not pluggable, and certainly not accessible from the front panel. Thus, repair cost and repair time are expected to increase by around an order of magnitude, hikes previously viewed as unacceptable with the fibre optic transceivers currently available.
A potential solution could be the uptake of co-packaged optics. Based on industry consensus, the adoption of this technology will commence with next-generation switches, introduced this year. A number of providers have showcased systems based on co-packaged optics that are not just ‘technology demos’, but expected to be mainstream products.
The need for greater reliability
Laser reliability levels that have been ‘good enough’ during the past few decades are unlikely to be acceptable for future applications. In this context, it’s worth considering an application beyond co-packaged optics – future AI clusters, involving hundreds of thousands or even millions of links. In the case of one million transceivers with a typical 200 FIT failure rate – that equates to an annual failure rate of about 0.2 percent – existing transceivers would be responsible for a link failure every 5 hours, on average. This is unacceptable. An improvement by a factor of at least 100 is needed to support next-generation million-link AI clusters.
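The arithmetic behind these figures is simple enough to check directly. The following sketch converts a FIT value into an annual failure rate and into the mean interval between failures across a fleet; the function names are illustrative, and a constant failure rate is assumed.

```python
HOURS_PER_YEAR = 8760


def annual_failure_rate(fit):
    """Fraction of devices expected to fail per year at a given FIT."""
    return fit * 1e-9 * HOURS_PER_YEAR


def hours_between_failures(fit, n_links):
    """Mean time between failures anywhere in a fleet of n_links devices."""
    return 1e9 / (fit * n_links)


fit = 200
links = 1_000_000
print(f"Annual failure rate: {annual_failure_rate(fit):.2%}")
print(f"Fleet failure interval: {hours_between_failures(fit, links):.1f} hours")
# The same fleet after a 100-fold reliability improvement
print(f"After 100x improvement: {hours_between_failures(fit / 100, links):.0f} hours")
```

For 200 FIT and one million links this gives an annual failure rate of about 0.18 percent and one failure every 5 hours. A 100-fold improvement stretches the interval to 500 hours, or roughly three weeks.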
Another application demanding consideration is lidar photonic integrated circuits (PICs). As lidar is a safety-critical component for autonomous driving and robotaxis, the automotive industry has failure-rate expectations that are typically multiple orders-of-magnitude lower than today’s state-of-the-art.
Figure 3. Co-packaged optics (or ‘CPO’, left) using a front-panel
pluggable ELSFP laser module (right) powering the optics (courtesy of
Broadcom). The lasers are accessible from the front panel, and can be
easily replaced in case of failure, while the electrical-to-optical
conversions can be close to the switch ASIC.
Five solutions
There are five existing solutions that provide significant improvements in reliability: External Laser Small Form-Factor Pluggable (ELSFP) products, heterogeneous lasers, redundancy, improved materials and improved screening. Now we will consider these approaches, one by one. Note that some are starting to be used, but most have only limited adoption.
Enjoying the widest adoption is the ELSFP. Its popularity stems from retaining the laser that provides the light to the photonics in a ‘front-panel-pluggable’ position, where it can easily be replaced if it fails. With this technology light is routed in with fibre optics (see Figure 3). In addition to moving the laser to a more maintainable location, this product employs a more expensive, more reliable laser. The light source has a number of costly-to-fabricate features that quash the failure rate by orders of magnitude over low-cost ridge lasers, which were previously used for most data communication.
The second solution, heterogeneously-integrated lasers, has been used by Intel. With this approach Intel obtained some of the lowest failure rates ever reported, with a FIT laser failure rate of just 0.09 – that’s 0.8 parts per million per year. It is thought that the superior reliability stems from elimination of cleaved facets (a modification that provides many of the benefits of ‘window’ lasers), as well as a current density 8-10 times lower than that of direct-modulated-lasers (see Figure 4). In addition, many of these designs are also used in tandem with redundancy.
Redundancy, the third solution, has much appeal. Rather than straining to improve reliability by orders of magnitude, engineers include backup channels that allow the use of conventional lasers. This redundancy-related approach boosts system availability by 100-fold or more. If there’s monitoring of the performance degradation in the link, it’s possible to predict failure hours or even months before it happens, allowing software repair to be proactive and a graceful switch-over to take place.
Two types of redundancy are being proposed for future systems (see Figure 5). One option is to provide a spare for every channel (a condition known as 1:1 sparing); and the second is having a pool of lasers with some spares (‘m of n sparing’). While gains in reliability depend on failure rates, and whether failures are correlated or fully random, it is relatively straightforward to obtain improvement by a factor of 100 in redundant systems. However, there’s still the need for ‘software repairs’ that switch to the backup channel once failure has been forecast or observed.
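To see where a factor of 100 might come from, the two sparing schemes can be compared with a simple binomial model. This is a minimal sketch, assuming fully independent failures, a per-laser failure probability of 1 percent over the service life, an eight-channel module, and a pool in which any spare can stand in for any channel; all of these values are hypothetical.

```python
from math import comb


def p_fail_m_of_n(p, m, n):
    """Probability that fewer than m of n lasers survive, when each laser
    fails independently with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n - m + 1, n + 1))


P_LASER = 0.01   # assumed per-laser failure probability over the service life
CHANNELS = 8     # assumed number of channels in the module

# No redundancy: all eight lasers must survive
no_spares = p_fail_m_of_n(P_LASER, CHANNELS, CHANNELS)
# 1:1 sparing: a channel is lost only if both of its lasers fail
one_to_one = 1 - (1 - p_fail_m_of_n(P_LASER, 1, 2)) ** CHANNELS
# 'm of n' sparing: eight working lasers needed from a pool of ten
pooled = p_fail_m_of_n(P_LASER, CHANNELS, CHANNELS + 2)

for name, p in (("no spares", no_spares),
                ("1:1 sparing", one_to_one),
                ("8 of 10 pool", pooled)):
    print(f"{name:>12}: {p:.2e}  (gain x{no_spares / p:.0f})")
```

With these numbers, 1:1 sparing lowers the module failure probability by a factor of a little under 100. Correlated failures, such as a defect shared across lasers from one wafer, would reduce the gain, which is why the independence assumption matters.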
The fourth pathway for improving reliability is associated with gains in material quality. Unfortunately, most laser material used in data communication systems is not selected with reliability in mind, so it is vulnerable to ‘dark line defects’ (DLDs). These imperfections are implicated in the failures found in 850 nm GaAs VCSELs, and in the cleaved 1310 nm AlGaInAs directly modulated lasers mentioned earlier and shown in Figure 1. But other laser materials exist. They can be adopted, especially if industry standards committees are flexible about wavelength, or less demanding about requirements relating to uncooled operation. Today there are requirements for operation at very high-temperatures, often greatly exceeding what is actually present in data centres.
Figure 4. Heterogeneous laser (right) benefits from not having a
cleaved and pumped facet like the directly modulated laser (DML) does
(left). The heterogeneous laser also benefits from having its gain
spread among a larger area, with current density less than 1/8th the
level of the DML.
Options based on this line of thinking include replacing unstrained 850 nm GaAs quantum well VCSELs with variants that emit at 980 nm, and are based on strained InGaAs. The standards committees could have supported this move in the mid 1990s, but shied away, preferring an 850 nm standard, due to wider availability of low-cost GaAs and silicon photodetectors. However, with the benefit of hindsight it’s clear that retaining the 850 nm standard has only delivered minimal cost savings associated with photodiodes, and this upside is overshadowed by problems coming from DLDs. Now, at least on proprietary links, 980 nm VCSELs are being adopted. Their merits are not limited to a lack of vulnerability to sudden DLD failures, and include a much higher modulation bandwidth under direct modulation.
Another option for improving reliability via the introduction of better materials is to turn to quantum-dot lasers. Due to the high strain of the dots, DLDs do not grow in them, instead appearing to be tangled around these nanostructures. With normal quantum-well material, the presence of just a single threading dislocation leads to growth of a DLD network during laser operation, prior to causing device failure. In sharp contrast, despite being surrounded by hundreds or thousands of threading dislocations in the III-V, quantum-dot lasers grown on silicon substrates pass reliability tests, with no DLDs observed during their aging. This class of laser has been developed or commercialised by a number of companies, with efforts targeting deployment in industrial and data-communications applications.
A third example of a more robust material is that of InGaN, which is used in lasers and LEDs. These sources, not vulnerable to DLDs, are also being explored for use in co-packaged optics by companies like Avicena.
Finally, while prior techniques are probably preferred for improving reliability, there may be times where redundancy cannot be added to the design – for example, in many lidar PICs. And there can be instances where performance or wavelength is critical to the application, creating a compelling reason for using materials that are vulnerable to DLDs.
In such cases, our company, Robert Herrick Consulting, is working on implementing the first known application of a new generation of high-speed inspection tools. These tools, capable of detecting small and isolated defects in devices prior to deployment, are slated for pilot roll-out later this year.
Figure 5. 1-to-1 sparing (left) has a spare backup laser for each channel; ‘m of n’ sparing (right) has a pool of spare lasers, switched in if any of the original channels fail.
One of the significant weaknesses of traditional laser screening methods that use burn-in is that they only provide a single indicator of performance. With that approach, lasers tend to be rejected if they degrade by more than 5 percent by the end of the burn-in aging period. However, in many cases, lasers occupy thousands of square micrometres, but the defects only impact performance in a few square microns of the device – less than 1 percent of the area. What this means is that defects have minimal impact on performance and are hard to detect. By imaging through transparent substrates, one can obtain thousands of data points for the device, and megabytes of information (see Figure 6), rather than a single indicator, allowing defects to be detected long before they impact overall device performance. A reliability improvement by a factor between 20 and 100 is expected for many failure modes, although this depends on laser design and the quality of the inspection system.
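The dilution effect described above can be illustrated with synthetic data. The sketch below is not the inspection method itself: it builds a hypothetical 50 × 50 pixel emission map with 1 percent noise, darkens a 3 × 3 pixel region by half, and compares a single aggregate indicator against a per-pixel outlier check. The map size, noise level, defect strength and 6-sigma threshold are all assumptions chosen for illustration.

```python
import random
import statistics

SIDE = 50  # 50 x 50 = 2,500 pixels


def synthetic_map(rng):
    """Uniform emission map: about 1.0 per pixel with 1% Gaussian noise."""
    return [[rng.gauss(1.0, 0.01) for _ in range(SIDE)] for _ in range(SIDE)]


def add_defect(grid, top=20, left=30, size=3, factor=0.5):
    """Return a copy with a size x size region dimmed by the given factor."""
    out = [row[:] for row in grid]
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] *= factor
    return out


def total_output(grid):
    return sum(sum(row) for row in grid)


def flag_pixels(grid, n_sigma=6.0):
    """Flag pixels far from the median, using a robust (MAD-based) spread."""
    values = [v for row in grid for v in row]
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    sigma = 1.4826 * mad  # scales MAD to a standard deviation for Gaussian noise
    return [(r, c) for r, row in enumerate(grid)
            for c, v in enumerate(row) if abs(v - med) > n_sigma * sigma]


healthy = synthetic_map(random.Random(0))
defective = add_defect(healthy)

drop = 1 - total_output(defective) / total_output(healthy)
print(f"Aggregate output drop: {drop:.2%} (a 5% criterion would pass this device)")
print(f"Pixels flagged on healthy map: {len(flag_pixels(healthy))}")
print(f"Pixels flagged on defective map: {len(flag_pixels(defective))}")
```

The defect covers 0.36 percent of the area, so the aggregate indicator moves by only about 0.2 percent, far below a 5 percent rejection threshold, while the per-pixel check isolates all nine affected pixels.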
A call for change
Unfortunately, since the turn of the millennium, there has not been a significant improvement in laser reliability. That’s not surprising, given that there has only been minimal change in the methods for manufacturing and screening them. But the substantial improvement that industry is demanding from suppliers – a factor of 100 to 1000 – is unlikely to materialise without a more proactive role from research agencies and/or the customer base.
Another impediment to progress is an absence of an industrial reliability improvement roadmap. To avoid supply chain shocks and allow for price negotiation, hyper-scale cloud service providers only buy compatible transceivers through a ‘multi-source agreement’, where at least three suppliers are available. These providers pursue aggressive cost targets, tending to leave almost no margin for suppliers – and that hampers efforts directed at fundamental research into novel materials, or many of the other potential solutions listed above.
Even if some of the more advanced suppliers could set aside millions of dollars required to demonstrate one of the solutions, industry will not shift over to supporting that technology unless multiple suppliers provide it. A transition would require low-cost licensing of competitive advantages that one firm might develop, meaning they couldn’t recoup much of the investment through increased market share.
It’s not all doom and gloom, though. There is an opportunity for industry-funded consortiums that are comparable to the SEMI global industry organisation. In this model, highly profitable hyper-scale cloud infrastructure providers – that’s the main customer base – would fund research, either at universities, research institutes such as imec or Sandia, or at multiple suppliers, based on competing research proposals. This would create IP and licenses not owned by the suppliers, but by the industry organisation, and licensed to suppliers. With this model, consortium researchers would assist suppliers in bringing chosen technology improvements into production, and getting through the qualification process.
There’s also a more traditional path available, involving government-funding agencies. However, these agencies often think that laser development is ‘mature’, and argue that further development must be funded by industry. But that’s not the reality. Many details of laser degradation are not understood at the fundamental level, including those as simple as why the addition of indium to GaAs quantum-well laser diodes ‘pins’ DLDs and stops their growth, or how to make an AlGaInAs laser diode that’s DLD-resistant.
Figure 6. Backside inspection breaks the device down into hundreds or thousands of 1 µm²
pixels, where each one can be examined for uniformity, and subtle
mechanical or crystal defects identified prior to aging. Most lasers
will appear featureless and uniform in this type of inspection, but
lasers with latent defects can be identified and removed from the
shipping population.
Another example of a lack of understanding relates to the nature of the DLD. It is a planar structure with an additional interstitial half-plane – think of it like a structure that keeps adding more ‘bricks’ as it grows. As the number of interstitials required is quite large, it is not understood how those are generated and transported – and debates dating back to the 1970s are still to be resolved by experimental evidence today.
Key questions remain: If we want to make a new generation of ‘DLD-resistant lasers’, what principles should we use to predict which materials are most likely to succeed? And would the most promising approach involve strain engineering, or instead modelling the band structure of optimised alloy compositions? Government research agencies should fund projects to answer these key questions, if they want to help to enable the next generation of reliable lasers to power AI clusters and lidar PICs.
Within industry, laser reliability is viewed along similar lines to the weather – something you complain about, but have little control of. But that’s not the case: there are many potential paths for improvement, even if few are aware of the options. In reality, the main obstacles to the simplest fixes are the cost of implementation and the need for support from the industry standards bodies that drive multi-source agreements.
For more powerful, fundamental improvements, one should consider that more than 80 percent of the semiconductor lasers manufactured to date have unfortunately been those that are vulnerable to DLDs, and in many cases have fallen far short of their reliability targets. Government research agencies could change that, ushering in an era where laser reliability is a given, rather than one subject to a number of uncontrollable unknowns.