Improved Fault Localization Using Transfer Learning and Language Modeling


Word embeddings provide efficient vector representations of words that capture the syntactic and semantic relationships in a corpus. Within an IT infrastructure, event data can be interpreted as sequences of tokens and represented in a continuous vector space. Similar tokens cluster together in this space, which can provide insight into patterns of failure and enable detection of actionable incidents. Fault localization techniques need to adapt, without requiring knowledge bases to be rebuilt from scratch, to new services or hardware deployed on existing infrastructures and to semantically equivalent incidents described by different lexicons across different infrastructures. Using the paradigm of transfer learning, word embeddings can be built and incrementally updated to introduce new vocabulary and alter the relationships between existing tokens, whilst preserving the general contextual information of the initial embedding. Features in event data procured from IT infrastructures are typically sparsely distributed, with many events duplicated with minor character mutations. We use unsupervised clustering techniques to analyse the vector representation of the event data. Our analysis shows that clustering vector representations of event data based on semantic similarity produces interpretable categories, which can be used to improve fault localization and root cause identification.
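
As an illustration of grouping events by semantic similarity, the following minimal sketch embeds each event as the mean of its token vectors and groups events whose cosine similarity exceeds a threshold. It is not the method evaluated in this work: the toy embedding values, the event strings, the function names, the 0.8 threshold and the greedy grouping scheme are all assumptions chosen for illustration, standing in for a trained embedding and an unspecified unsupervised clustering algorithm.

```python
import math

# Hypothetical toy embedding (token -> vector). The values are made up for
# illustration; in practice they would come from a trained embedding model.
TOY_EMBEDDING = {
    "disk": [0.9, 0.1, 0.0],
    "storage": [0.8, 0.2, 0.0],
    "full": [0.7, 0.0, 0.1],
    "link": [0.0, 0.9, 0.1],
    "interface": [0.1, 0.8, 0.0],
    "down": [0.0, 0.7, 0.2],
}


def embed_event(event, embedding):
    """Average the vectors of the known tokens; None if no token is known."""
    vectors = [embedding[t] for t in event.lower().split() if t in embedding]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_events(events, embedding, threshold=0.8):
    """Greedy grouping: join the first cluster whose seed vector is similar
    enough, otherwise start a new cluster. Unknown events are skipped."""
    clusters = []  # list of (seed_vector, member_events)
    for event in events:
        vec = embed_event(event, embedding)
        if vec is None:
            continue
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(event)
                break
        else:
            clusters.append((vec, [event]))
    return [members for _, members in clusters]


# Made-up event strings: two lexicons describing two kinds of incident.
EVENTS = ["disk full", "storage full", "link down", "interface down"]
CLUSTERS = cluster_events(EVENTS, TOY_EMBEDDING)
print(CLUSTERS)
```

With these toy vectors, events that describe the same incident in different words ("disk full" and "storage full") fall into one group, while the network events form another. The greedy scheme depends on event order and on the chosen threshold, so it serves only to make the idea concrete.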