5 steps to detect inconsistencies in evolving knowledge graphs

Alena Schmickl
11 min readFeb 7, 2021

Knowledge graphs provide a promising way to enhance the digitalization of many areas, such as the healthcare industry. However, changes to the underlying data pose a thread to this development, as correct knowledge deduction can be obstructed. IDA - the Inconsistency Detection Algorithm - provides a 5-step approach to tackle this problem. IDA facilitates targeted graph corrections and incremental updates and eliminates the need for graph reconstructions by enabling cross-period graph verification.

Background and Reasoning for the Necessity of IDA

Consider a knowledge graph from the medical domain that is used to assist medical coders in their daily work. Coders scan patient files and determine the appropriate medical codes which are used for the billing of the medical services. The graph contains information on coding catalogs, that is on ICD-10 (diagnoses), CHOP (procedures), and DRG (billing classes) codes. Moreover, it comprises information on snapshots of patient cases that have been coded by medical coders in the past. These cases are linked to the corresponding code representations in the graph. On the basis of this past knowledge, it is possible to predict the relevant codes which should be assigned to a new case. Much like Amazon recommendations, graph queries would indicate which codes are relevant for a case based on similar cases from the past — “Patients like this have also been assigned these codes…”. If a knee surgery patient with arthritis was assigned certain codes in the past, it is very likely that a new patient with the same diagnosis and surgery should be assigned the same codes.

BUT, here comes the challenge. The medical coding catalogs are updated (bi-)annually. What is the result of that? As the graph is subject to constant changes the goal of retrieving insights from it is compromised. ICD-10, CHOP, or DRG codes determined for a patient case in 2021 may not be consistent with the new coding guidelines for 2022 anymore. Therefore, knowledge about patient cases from past periods would partially become erroneous, resulting in the need for updating the outdated information to allow for a cross-period deduction of knowledge.

There are two options for performing such an update:

  1. Replace outdated patient cases with data from the new period.
  2. Update the outdated knowledge about codes for patient cases in the graph based on the new coding guidelines.

The first option would render the system inoperable until a sufficient number of newly coded cases becomes available. This is undesirable. Therefore, an algorithm is required to identify inconsistent elements in the graph in order to update the knowledge graph as described by option two. This prevents knowledge loss (about the coded patient cases) and decreases the inoperability time of the system at the beginning of a new period. The solution is IDA.

You might ask yourself if something like IDA doesn’t already exist. Indeed, many methods for the detection of errors in knowledge graphs exist already. They all consider a graph at a specific state. However, the aforementioned issue is not concerned with the identification of errors but rather aims for the identification of graph elements that are valid in one period and invalid in another. Also, the majority of methods require large data sets to be effective. As this is often not the case in healthcare use cases the new approach is independent of the size of the graph it is applied to.

Basic Definitions & the notion of inconsistency

Let’s dive deeper into the topic and clarify some basic definitions first. A graph is a collection of nodes and edges, just like a network. We will call the graph G, its nodes/vertices the set V, and its connections/arcs the set A. Vertices can have multiple properties and arcs can be labeled.

A periodic update on a graph can consist of multiple changes to its vertices and/or arcs. The regular updates on the medical coding catalogs represent such a composition of changes. Graphs like that are evolving graphs, given their constant transformation. An evolving graph contains an updatable (potentially subject to changes) and a static part (never subject to changes). This is also true for our exemplary graph which can be split into a subgraph capturing the medical coding domain (updatable) and one for the patient cases (static). Patients can of course change, but the graph only captures an immutable snapshot of their hospitalization.

Let’s move on to the term inconsistency. Something is considered inconsistent if a conflict between two or more values is present. The occurrence of inconsistency in an evolving graph is the result of a change to an updatable graph element. As we trust the publishers not to introduce inconsistencies to the coding catalogs by their updates, we assumed that periodic updates don’t render the updatable subgraph inconsistent. So, inconsistencies are only examined for static elements. In a nutshell: An inconsistency reflects the establishment of a value conflict for elements of the static subgraph that is triggered by a change of the updatable counterpart.

Coming back to the example, this means that coded patient cases become inconsistent if updates to the coding catalogs invalidate the assignment of a code to a case. Consider that a code is deprecated in 2021. Recommending that code for cases in 2022 would be suboptimal.

IDA explained

As all necessary concepts & terms have been described now, it is time to dig deeper and break down what the algorithm does! Simply spoken: IDA compares the states of a graph before and after a periodic update and identifies static vertices that were rendered inconsistent.

Stage sequence of inconsistency detection approach with number range in brackets indicating corresponding lines of IDA specification (see below).

IDA is based on the following concept: Potential inconsistency sources are identified. It is then checked if they materialize as the result of a change. Affected static vertices are identified as being inconsistent and their inconsistency type (basically the cause of the inconsistency) is defined.

Preparation stage

A loop iterates over all static vertices in the considered graph as IDA is intended to be performed iteratively for every single static vertex n rather than for all static vertices at once. This is due to runtime related performance issues caused by the calculation of adjacency matrices in the first stage. Also, the iterative approach allows for parallelization of the evaluation.

In this stage, those elements are identified which in the case of a change pose sources of inconsistency. The subgraphs Gₙ,ₚ and Gₙ,ₚ₊₁ for the periods p and p+1 capture these elements. Stages 1 to 4 use only these elements for the identification of added, removed, and updated vertices as well as added and removed arcs between them. This step requires the knowledge of subject matter experts.

Stage 1: Identification of added and removed arcs

The first stage of the inconsistency detection algorithm serves the purpose of identifying arcs that were removed or added in the course of a periodic update. Adjacency matrices are used to compare the existing arcs in the two compared periods p and p+1.

Two adjacency matrices Mₙ,ₚ and Mₙ,ₚ₊₁ are created for the subgraph states in each period. The matrix representing the later period is subtracted from the other to determine discrepancies. The subtraction is only feasible if the matrices are of the same dimension and represent the same vertices. This is, however, not given in the case of added or removed vertices. Therefore, the matrices are built upon the union of the vertices from both periods. An entry in row i and column j of the matrices takes the value 0 if no arc (i,j) exists and 1 if it does.

The values in the result matrix indicate if an arc was added or removed during a periodic update. The arc from i to j was removed if the value is 1. It is -1 if it was added and 0 if it wasn’t changed.

Stage 2: Identification of added and removed vertices

The second stage of IDA deals with the identification of vertices that were added or removed during a periodic update. To identify vertex additions and deletions, the sets of relevant vertex IDs for both periods are subtracted from each other. All added vertices are identified by subtracting the set of vertices from the later period from those from the earlier one, while the reversed subtraction yields all removed vertices.

Stage 3: Identification of updated vertices

The third stage of IDA identifies updated vertices. Only those vertices that were not added or removed during the update and that are part of the updatable subgraph present potential subjects of an update. Therefore, only these are considered by the algorithm and used by the following analysis.

The property values of the corresponding vertex are retrieved for both periods excluding the ID property which is already known. The property values are then hashed by a hash function and concatenated to create a hashed representation of every vertex for both periods. If the two created hash values are not equal, then there must be a difference in the property value indicating that an update of the vertex took place. The sequence in which the property values are hashed and concatenated should be based on an alphabetic ordering of the property keys to ensure that an update is not wrongly identified.

The hash values for these vertices won’t be equal, thereby indicating a change.

Stage 4: Inconsistency evaluation

In the last stage of the algorithm, the vertex of the current iteration is assessed with respect to its consistency. Therefore, the information which was gained in the previous stages is leveraged. Moreover, the type of inconsistency is determined for the evaluated vertex. The vertex is inconsistent if any of the incoming or outgoing arcs that connect it to an updatable vertex are deleted or invalidated. Given the selection of the relevant subgraph in the preparation stage and the application of the algorithm to only those elements, all changes which were identified in any of the previous stages render the vertex inconsistent.

If the vertex was identified as inconsistent, it is added to the result set. In the next step, the types of inconsistency are determined.

  • Type DD (deletion direct): An adjacent vertex was removed
  • Type DI (deletion indirect): A vertex that is not adjacent was removed
  • Type AI (addition indirect): A vertex or arc was added
  • Type UI (update indirect): A vertex property was updated

Check out the full algorithm below ⇩

Formal representation of IDA

An illustrative run-through of IDA

After the general introduction of IDA, let’s use an illustrative example to better understand each step.

Consider a small knowledge graph capturing the case of a patient called Francis (17 years old) that suffers from arthritis and has undergone knee surgery in 2021. Given an imaginary update of the coding catalogs in 2022, the rule for applicability of the billing code used in the case of Francis was changed to require a minimum age of 18 and the presence of a more specific type of arthritis. For easier readability, all these concepts which are mapped to vertices as depicted below are hereafter referenced by lower case ID (b,o,p,s1,s2) values. The static vertex is p, all others are updatable.

Visual representation of the graph as used in running example of the patient case of Francis.

Preparation stage

The algorithm loops over every single static vertex n in the graph G. Gₙ₂₀₂₁ and Gₙ₂₀₂₂ define those triples (vertex, arc, vertex) whose changes cause inconsistency of n. The subsequent stages use their vertices Vₙ₂₀₂₁ and Vₙ₂₀₂₂. In our example, the only static vertex is p (Francis). Therefore, only one iteration of IDA is performed, namely for p. In the example, all vertices are potential sources of inconsistency. Therefore, Vₙ₂₀₂₁ = {b,o,p,s1} and Vₙ₂₀₂₂ = {b,o,p,s2}.

Stage 1

Two adjacency matrices M₂₀₂₁ and M₂₀₂₂ are created in this stage. The latter is subtracted from the former to determine added and removed arcs. The adjacency matrices have the dimension 5x5, given:
Vunion = V₂₀₂₁ ∪ Vₚ,₂₀₂₂ = {b,o,p,s1} ∪ {b,o,p,s2} = {b,o,p,s1,s2}

Adjacency matrices for 2021 and 2022 with the value 1indicate existing arcs.

The subtraction of the matrices yields MΔ. As MΔ[b][s2] = -1, it is possible to deduce that the arc (b,s2) was added. The arc (b,s1) and (p,s1) were removed given that MΔ[b][s1] = MΔ[p][s1] = 1. The sets of added and removed arcs are defined as follows:

RemovedA= {(b,s1),(p,s1)}; AddedA= {(b,s2)}

Stage 2

Vertices that were added or removed during the periodic update are updated here. The following calculations show the determination of the added and removed vertices for the illustrative example. The result shows that vertex s1 was removed while vertex s2 was added.

RemovedVₚ = Vₚ₂₀₂₁ - Vₚ₂₀₂₂ = {b,o,p,s1} - {b,o,p,s2} = {s1}
AddedVₚ = Vₚ₂₀₂₂ - Vₚ₂₀₂₁ = {b,o,p,s2} - {b,o,p,s1} = {s2}

Stage 3

Vintersection defines the updatable vertices present in both periods. Its calculation gives that only p must be considered for the update analysis.

Vintersection = (V₂₀₂₁ ∩ V₂₀₂₂) - Vstatic = ({b,o,p,s1} ∩ {b,o,p,s2}) - {p} = {b,o}

propertyString₂₀₂₁, propertyString₂₀₂₂ define the concatenated property values of vertices. Given the values of the key MinAge of vertex b:

propertyString₂₀₂₁ = “15”; propertyString₂₀₂₂ = “18”

The comparison of the hash values of both property strings shows that b was updated as hash(propertyString₂₀₂₁) ≠ hash(propertyString₂₀₂₂). Vertex o has no properties so the set of updated vertices is: UpdatedV = {b}

Stage 4

In the last stage vₚ is assessed regarding its consistency. It is inconsistent if RemovedA, AddedA, RemovedV, AddedVₚ, or UpdatedVₚ is a non-empty set. In the running example, this evaluation is given by the following:

RemovedA ∪ AddedA ∪ RemovedV ∪ AddedV ∪ UpdatedV = {(b,s1),(p,s1)} ∪ {(b,s2)} ∪ {s1} ∪ {s2} ∪ {b} = {(b,s1),(p,s1),(b,s2),s1,s2,b} ≠ ∅

In this case, the union of all sets is not an empty set so vₚ is identified as inconsistent. The type DD is applicable as RemovedVₚ contains a vertex adjacent to vₚ. However, DI is not applicable, as no vertex in RemovedVₚ and no vertex referenced by an arc from RemovedAₚ is not adjacent to vₚ. As the sets AddedV, AddedA, UpdatedVₚ are not empty, the types AI and UI are also applicable. The InconcistencyTypes and InconsistentV are therefore given by:

InconcistencyTypes = {DD,AI,UI}
InconsistentV = {v
,{DD,AI,UI}}

IDA was implemented for the medical coding graph and the results confirmed that the inconsistency detection can be automated! This enables systems to leverage past knowledge even if the underlying data is updated on a regular basis.

Thanks for reading!

--

--