OCP's Server Resilience Initiative: SDC Academic Research Awards Announced!

The Open Compute Project (OCP) is thrilled to announce a significant milestone in our Server Resilience initiative: awarding academic research funding to tackle Silent Data Corruption (SDC). Our recent call for proposals drew 19 proposals from universities around the world, a strong sign of the academic community's commitment to addressing this critical challenge.

 

Collaborative Review Process

The selection process involved a thorough review of all proposals by the OCP Server Component Resilience Workstream, in close collaboration with key committee members AMD, ARM, Google, Intel, Microsoft, Meta, and NVIDIA. Each proposal was evaluated on technical merit, research plan, and alignment with industry priorities and the research needs of each company. Through collaborative discussions, the participating companies identified the proposals with the highest potential for far-reaching impact.

 

Announcing the Awardees: Driving Innovation in SDC Mitigation

We are proud to announce that the following six projects have been selected for funding:

 

  • Understanding test escapes and SDC failures in ICs caused by transistors with extreme device parameters from random manufacturing variations.
    • Adit D. Singh, Auburn University
    • This research will develop new methodologies to determine causes of manufacturing test escapes leading to SDCs and screens to detect them.
  • Probabilistic fault modeling and test generation using on-chip telemetry integration & generative AI
    • Krishnendu Chakrabarty, Arizona State University
    • This research investigates ways to detect and mitigate SDCs in cloud data centers by leveraging on-chip telemetry, probabilistic fault modeling, and generative AI techniques.
  • SDC detection and correction in software via application-level coding techniques
    • Rashmi Vinayak, Carnegie Mellon University
    • This research will develop software-based tools for efficiently detecting and correcting SDCs in data centers, utilizing coding theory, machine learning, and optimization techniques.
  • Mobilizing hardware and software towards SDC testing, detection, & correction
    • Caroline Trippel, Stanford University and Baris Kasikci, University of Washington
    • This research will create offline and online techniques to test for and detect SDCs in data centers, leveraging domain-specific knowledge and device-specific defect profiles.
  • Diagnosing hardware failures & understanding their impact with focus on AI/ML accelerators
    • Yanjing Li, University of Chicago
    • This research will create a proactive system-level hardware failure diagnosis methodology for deep learning accelerators to enable failure prediction and prevention, thereby enhancing system reliability and sustainability.
  • Grade early and detect fast – Tackling Silent Data Corruption through the power of microarchitectural modeling
    • Dimitris Gizopoulos, University of Athens
    • This research leverages microarchitecture-level fault modeling and grading to develop fast, hardware-aware test programs for scalable detection of SDCs in CPUs, GPUs, and AI accelerators.

 

Collaboration for Real-World Impact

These projects represent a diverse range of innovative approaches to SDC, showcasing the potential for advancements in detection, prevention, and mitigation strategies. These institutions will receive unrestricted gifts to support their proposed research.

 

The selected academic teams will work closely with the industry partners throughout the research process, ensuring that their findings are directly applicable to real-world computing environments. This collaborative model is key to driving innovation and maximizing the impact of the research outcomes.

 

Industry Collaboration Drives Innovation

This achievement is a testament to the power of collaboration within the OCP community. By uniting industry leaders and academic researchers, we are fostering a dynamic environment that accelerates innovation and addresses the most pressing challenges facing the computing industry today.

 

A Note of Gratitude and Encouragement

We extend our sincere gratitude to all the universities and researchers who submitted proposals. The quality and depth of the submissions were truly impressive. While we were only able to fund a limited number of projects, we encourage all researchers to continue their important work in this challenging field. Your dedication to advancing knowledge and addressing SDC is invaluable to the industry. We believe that collaboration is key, and we invite you to engage with the industry through OCP and other channels.

 

Looking Ahead

We are thrilled to bring industry and academia together as a community, amplifying our collective response to the SDC challenge. We are excited for this next phase of research and innovation, where the expertise of industry and academia converges to create a more resilient computing infrastructure. Stay tuned for updates on the progress of these research projects as we continue working to combat Silent Data Corruption.

 

Next week at the IEEE RAS in Data Centers Summit, on Day Two, there will be a talk on the Project’s Server Resilience Specification 1.0.

 

Additionally, on Day One, George Tchaparian, OCP’s CEO, will discuss Challenges and OCP Community Progress towards Reliability, Availability, and Serviceability (RAS) with a Special Focus on Artificial Intelligence.

 

The event is currently sold out, but there is a waiting list. We will share the results of the sessions afterwards in a follow-up blog post. Stay tuned!