One of the best parts of interactive data visualizations is that they make it possible for users to tell their own data story. With an interactive visualization, you can select and interrogate the data, ask your own questions, and build your own narrative of what the data is telling you. Galvanize recently partnered with DataWeek to produce the first graph visualization of product integrations in the technology sector. Here’s how we did it:
Using the Integrate platform, the team at DataWeek compiled a list of products and their associated integrations. An integration might be as simple as two products talking to each other over an API, such as how Salesforce and Zendesk work together, or something more complex. First, the data was cleaned and duplicate entries were removed; it was then transformed into a list of nodes and edges using Pandas. Each node represents a product or company, while each edge describes an integration between two products.
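To make the cleaning step concrete, here is a minimal sketch of how a raw integration list could be turned into node and edge tables with Pandas. The column names (`product`, `integrates_with`), the function name `build_graph_tables`, and the sample rows are all hypothetical; the real DataWeek data and cleaning rules are not shown in this post, so treat this as an illustration rather than the code we ran.

```python
import pandas as pd

# Hypothetical raw export: one row per reported integration (made-up sample rows).
raw = pd.DataFrame(
    {
        "product": ["Salesforce", "Zendesk", " Salesforce ", "Mailchimp", "Verizon"],
        "integrates_with": ["Zendesk", "Salesforce", "Zendesk", "Salesforce", None],
    }
)


def build_graph_tables(raw):
    """Return (nodes, edges) DataFrames from a raw integration list."""
    df = raw.copy()
    # Clean: strip stray whitespace so " Salesforce " and "Salesforce" match.
    for col in ("product", "integrates_with"):
        df[col] = df[col].str.strip()
    # Remove rows that are exact duplicates after cleaning.
    df = df.drop_duplicates()
    # Nodes: every distinct name that appears in either column.
    names = pd.concat([df["product"], df["integrates_with"]]).dropna().unique()
    nodes = pd.DataFrame({"id": sorted(names)})
    # Edges: only rows that actually name an integration partner.
    edges = (
        df.dropna(subset=["integrates_with"])
        .rename(columns={"product": "source", "integrates_with": "target"})
        .reset_index(drop=True)
    )
    return nodes, edges


nodes, edges = build_graph_tables(raw)
```

In this sample, the whitespace-only duplicate of the Salesforce row collapses into one edge, and Verizon survives as a node with no edges.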
Loading the resulting dataset into Gephi makes it clear that a few products have no associated edges. These “island” nodes, such as Verizon, Zebra, and Pulse.io, are products that don’t have any integrations with other products in the dataset. We want to remove these from the final visualization for aesthetic and computational reasons, so we apply the “Giant Component” filter, which keeps only the largest connected component of the graph.
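The idea behind keeping only the giant component can be sketched in a few lines of plain Python. This is not Gephi's implementation, just a breadth-first search that ignores edge direction. The function name `largest_connected_component` is our own, and the two edges in the sample are illustrative (the Mailchimp edge is made up).

```python
from collections import deque


def largest_connected_component(nodes, edges):
    """Return the set of nodes in the largest connected component."""
    # Build an undirected adjacency map; island nodes keep an empty set.
    adjacency = {name: set() for name in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    best = set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


sample_nodes = ["Salesforce", "Zendesk", "Mailchimp", "Verizon", "Zebra", "Pulse.io"]
sample_edges = [("Salesforce", "Zendesk"), ("Mailchimp", "Salesforce")]  # illustrative
giant = largest_connected_component(sample_nodes, sample_edges)
```

Here the three island nodes each form a component of size one, so they drop out and only the connected products remain.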
From here, we want to get a sense of which products contribute most heavily to the underlying structure of the graph. To do this, we run a version of the HITS (Hyperlink-Induced Topic Search) algorithm, which was originally used to determine the relative importance of web pages in search results, and use its output to decide how to scale the size of the nodes in the graph.
The algorithm computes a hub score and an authority score for each node. Nodes that have many edges pointing to high-authority nodes have high hub scores, while nodes that have many edges pointing to them from good hubs have high authority scores. This is similar to how websites that link out to many authoritative pages have high hub scores, while websites that many hubs link to have high authority scores.
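For readers who want to see the mechanics, here is a small sketch of the standard HITS power iteration in plain Python, followed by one possible way to turn authority scores into node sizes. Gephi's own implementation may differ in its details. The function names (`hits`, `node_sizes`), the size range, and the edge list are assumptions made for illustration; the edges are made up and are not taken from the DataWeek dataset.

```python
import math


def _normalize(scores):
    """Scale scores so their Euclidean norm is 1 (keeps values from blowing up)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {name: v / norm for name, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) dicts for directed (source, target) edges."""
    nodes = sorted({name for edge in edges for name in edge})
    hub = {name: 1.0 for name in nodes}
    auth = {name: 1.0 for name in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the nodes pointing at you.
        auth = {name: 0.0 for name in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        auth = _normalize(auth)
        # Hub: sum of the authority scores of the nodes you point at.
        hub = {name: 0.0 for name in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        hub = _normalize(hub)
    return hub, auth


def node_sizes(auth, min_size=10.0, max_size=60.0):
    """Linearly map authority scores onto a display size range (assumed range)."""
    top = max(auth.values()) or 1.0
    return {n: min_size + (max_size - min_size) * a / top for n, a in auth.items()}


# Made-up directed edges, read as "source integrates with target".
example_edges = [
    ("Zendesk", "Salesforce"),
    ("Mailchimp", "Salesforce"),
    ("Atlassian", "Salesforce"),
    ("Zendesk", "Atlassian"),
]
hub, auth = hits(example_edges)
sizes = node_sizes(auth)
```

In this toy graph Salesforce ends up with the highest authority score because three nodes point to it, while Zendesk is the strongest hub because it points to both nodes that have any authority.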
In general, the more authority a given node has, the more influential or ‘important’ that product is to the overall graph. To show that visually, the size of each node is scaled by its authority score, so nodes with more authority appear larger. In our example, we can easily pick out Salesforce, Mailchimp, and Atlassian as the nodes with the highest authority scores. But why?
It turns out that nodes with high hub and authority scores typically sit within highly connected communities. To find out how many communities we have, if any, we use a measure called modularity. Modularity measures the structure within a graph, specifically how well it divides into clusters or communities of nodes, or in our case, products. Graphs with high modularity have dense connections between the nodes within a community, but sparse connections between nodes in different communities. From this analysis, we detect a fair modularity (0.615) and a total of 11 communities embedded within the graph. We incorporate this information into the visualization by coloring each node according to the community it belongs to, also known as its modularity class.
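To give a feel for what the modularity number means, the sketch below scores a given split of an undirected graph into communities using the standard modularity formula. It only evaluates a partition that is handed to it; it does not search for the best communities, which is the harder part that Gephi handles for us. The function name `modularity` and the toy graph of two triangles joined by one edge are our own illustration.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  (L_c / m) - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    internal = {}  # L_c: edges with both endpoints in community c
    degree = {}    # d_c: sum of node degrees in community c
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree
    )


# Toy graph: two triangles connected by a single bridge edge (hypothetical names).
toy_edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b1"),
]
two_groups = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
one_group = {name: "all" for name in two_groups}

q_split = modularity(toy_edges, two_groups)   # 5/14, about 0.357
q_single = modularity(toy_edges, one_group)   # 0.0
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the sense in which a higher modularity indicates clearer community structure.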
Now that we have computed the hub and authority scores, as well as various metrics on the graph, the next step is laying it out in a way that makes visual sense. What we want to do is keep clusters of companies that have a lot of integrations together while pushing other things apart so it’s not one big hairball. A common and effective way to do this is to use ForceAtlas.
ForceAtlas is a force-directed graph layout algorithm, which is a fancy way of saying that it models the graph like a physical system. Nodes repel each other (like magnets with matching poles) while edges pull the nodes they connect toward each other (like stretched springs). In the end, this simulated movement converges to a balanced final state. Indeed, if we compare the graph layout created by ForceAtlas with the modularity classes identified earlier, we see that they match, as expected.
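The physical analogy can be sketched in code. The example below is not ForceAtlas; it is a basic spring-embedder in the style of Fruchterman-Reingold, which follows the same idea of repulsion between all nodes and attraction along edges. The function names, the constants (`iterations`, the starting step of 0.1, the 0.98 cooling factor), and the toy graph are all assumptions chosen for illustration.

```python
import math


def force_layout(nodes, edges, iterations=200):
    """Return {node: [x, y]} from a simple force-directed simulation."""
    n = len(nodes)
    # Deterministic start: place the nodes evenly on a unit circle.
    pos = {
        name: [math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
        for i, name in enumerate(nodes)
    }
    k = 1.0 / math.sqrt(n)  # preferred distance between neighbors
    step = 0.1              # maximum move per iteration, shrinks over time
    for _ in range(iterations):
        disp = {name: [0.0, 0.0] for name in nodes}
        # Repulsion: every pair of nodes pushes apart, like matching magnets.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = k * k / dist
                disp[a][0] += dx / dist * push
                disp[a][1] += dy / dist * push
                disp[b][0] -= dx / dist * push
                disp[b][1] -= dy / dist * push
        # Attraction: each edge pulls its endpoints together, like a spring.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist / k
            disp[a][0] -= dx / dist * pull
            disp[a][1] -= dy / dist * pull
            disp[b][0] += dx / dist * pull
            disp[b][1] += dy / dist * pull
        # Move each node along its net force, capped by the current step.
        for name in nodes:
            length = math.hypot(disp[name][0], disp[name][1]) or 1e-9
            move = min(length, step)
            pos[name][0] += disp[name][0] / length * move
            pos[name][1] += disp[name][1] / length * move
        step *= 0.98  # cool down so the layout settles
    return pos


def mean_distance(pos, pairs):
    """Average Euclidean distance over a list of node pairs."""
    total = sum(math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]) for a, b in pairs)
    return total / len(pairs)


# Toy graph: two triangles joined by one bridge edge (hypothetical names).
layout_nodes = ["a1", "a2", "a3", "b1", "b2", "b3"]
layout_edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b1"),
]
positions = force_layout(layout_nodes, layout_edges)
linked = {frozenset(edge) for edge in layout_edges}
unlinked_pairs = [
    (a, b)
    for i, a in enumerate(layout_nodes)
    for b in layout_nodes[i + 1:]
    if frozenset((a, b)) not in linked
]
```

Comparing `mean_distance(positions, layout_edges)` with `mean_distance(positions, unlinked_pairs)` shows the intended effect: connected nodes settle closer together than unconnected ones, which is why densely connected communities show up as visible clusters.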
The Integrate platform is a useful tool for companies in a particular space to identify which other companies they should be integrating with in order to reach new customers. By representing the data visually, it’s much easier for a company to identify these relationships than by attempting to navigate the huge and complex original list of data. You can explore the full interactive visualization here.