2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W)

Abstract

Current practice for mitigating DRAM hardware faults is to simply discard the entire faulty DIMM. However, this becomes increasingly expensive and wasteful as the price of memory hardware increases and moves physically closer to processing units. Accurately characterizing memory faults in real-time in order to pre-empt future potentially catastrophic failures is crucial to conserving resources by blacklisting small affected regions of memory rather than discarding an entire hardware component. We further evaluate and extend a machine learning method for DRAM fault characterization introduced in prior work by Baseman et al. at Los Alamos National Laboratory. We report on the usefulness of a variety of training sets, using a set of production-relevant metrics to evaluate the method on data from a leadership-class supercomputing facility. We observe an increase in percent of faults successfully mitigated as well as a decrease in percent of wasted blacklisted pages, regardless of training set, when using the learned algorithm as compared to a human-expert, deterministic, and rule-based approach.
