Abstract
The context of this work is the online characterization of errors in large scale systems. In particular, we address the following question: Given two successive configurations of the system, can we distinguish massive errors from isolated ones, the former ones impacting a large number of nodes while the second ones affect solely a small number of them, or even a single one? The rationale of this question is twofold. First, from a theoretical point of view, we characterize errors with respect to their neighbourhood, and we show that there are error scenarios for which isolated and massive errors are indistinguishable from an omniscient observer point of view. We then relax the definition of this problem by introducing unresolved configurations, and exhibit necessary and sufficient conditions that allow any node to determine the type of errors it has been impacted by. These conditions only depend on the close neighbourhood of each node and thus are locally computable. We present algorithms that implement these conditions, and show through extensive simulations, their performances. Now from a practical point of view, distinguishing isolated errors from massive ones is of utmost importance for networks providers. For instance, for Internet service providers that operate millions of home gateways, it would be very interesting to have procedures that allow gateways to self distinguish whether their dysfunction is caused by network-level errors or by their own hardware or software, and to notify the service provider only in the latter case.