Clustering Authoring Tools by Feature

An experiment during my PhD required me to evaluate the authoring workflow in a range of authoring tools with different interface and user experience paradigms. Evaluating every single tool is impractical, both in terms of time and participant count. The solution was to cluster all of the candidate tools by their features, and then select a few tools from those clusters that broadly represent the capabilities of the whole set.

In a recent publication of mine, Define ‘Authoring Tool’: A Survey of Interactive Narrative Authoring Tools, I discussed this process at length in the context of discovering what authoring tools are, i.e., how they differ from one another. For technical details and results, please see the paper.

In this article, I will briefly discuss the simple yet effective methods that I used to create this hierarchical clustering of authoring tool features.

The Data

The data was gathered by manually exploring each tool. Where a tool was not directly available, I used any other resources I could find, such as papers, presentations, and videos, to determine its capabilities. A list of features was determined and analyzed for each of the tools. Imputation was not used for missing data, so as not to assume capabilities that a program may not have.

After a lot of analysis and preprocessing, the results were stored in a one-hot format in a CSV file. Each row represented a single tool’s observations and each column was one of the encoded features with either a 1 (has the feature) or 0 (does not have the feature).
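As a minimal sketch of what loading such a file could look like (this is not the exact preprocessing from my analysis), the snippet below reads a one-hot table and converts each column to a factor, since MCA operates on categorical variables. The tool and feature names are hypothetical, and the CSV is inlined as text so that the example runs on its own.

```r
# Hypothetical stand-in for the real CSV: three tools, three one-hot features.
csv_text <- "tool,branching,visual_editor,scripting
ToolA,1,1,0
ToolB,0,1,1
ToolC,1,0,0"

# Use the tool names as row names so every remaining column is a feature.
tools.data <- read.csv(text = csv_text, row.names = 1)

# Turn each 0/1 column into a factor with explicit levels.
tools.data[] <- lapply(tools.data, factor, levels = c(0, 1))

str(tools.data)
```

With a real file, read.csv would take the file path instead of the text argument.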

MCA and HCPC

Once the data was loaded in, the FactoMineR package was used to perform Multiple Correspondence Analysis (MCA) followed by Hierarchical Clustering on Principal Components (HCPC) to determine relationships and clusters.

Simple MCA with FactoMineR looks like this:

library(FactoMineR)
tools.mca <- MCA(tools.data, ncp = Inf, graph = FALSE)

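Here, tools.data contains the loaded and preprocessed CSV. Two components of the result are of interest: ind and var. The former holds the results for the actual tools themselves (i.e., the rows), while the latter holds the results for the feature categories (i.e., the columns).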
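To make the two components more concrete, here is a small sketch on made-up data (eight hypothetical tools and three hypothetical features, not my real dataset). It assumes FactoMineR is installed and uses the default ncp for simplicity.

```r
library(FactoMineR)

# Hypothetical toy data: 8 tools (rows) x 3 one-hot features (columns).
toy <- data.frame(
  branching     = c(1, 1, 0, 0, 1, 0, 1, 0),
  visual_editor = c(1, 0, 1, 0, 1, 1, 0, 0),
  scripting     = c(0, 0, 1, 1, 0, 1, 1, 0),
  row.names     = paste0("Tool", 1:8)
)
toy[] <- lapply(toy, factor, levels = c(0, 1))

toy.mca <- MCA(toy, graph = FALSE)

# One row per tool: the coordinates of each tool on the MCA dimensions.
head(toy.mca$ind$coord)

# One row per feature category (each feature contributes a 0 and a 1 level).
toy.mca$var$coord
```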

Next, we can perform HCPC on the output of the MCA step. By default, nb.clust is 0, which gives us interactive tree cutting, but in the version of RStudio I was using at the time, this froze the program. Setting it to a positive integer would solve this, but then we would have to decide the number of clusters ourselves using, for example, the inertia gain shown on the plots. If it is set to -1 instead, the algorithm cuts the tree for you, which is what I did.

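# -1 lets HCPC choose where to cut the tree; a positive integer fixes the count.
tools.cluster_count <- -1
tools.hcpc <- HCPC(tools.mca,
  nb.clust = tools.cluster_count,
  nb.par = Inf,
  graph = FALSE,
  consol = FALSE
)
# If the cut was automatic, record how many clusters were actually produced.
if (tools.cluster_count == -1) {
  tools.cluster_count <- max(as.integer(tools.hcpc$data.clust$clust))
}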

Above, I use tools.cluster_count to control the number of clusters, and if it’s -1, update it to the automatically chosen number of clusters.
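For illustration, the sketch below runs the same two steps end to end on made-up data and then lists which tools landed in which cluster. The ten tools and five features are hypothetical, and the resulting clusters say nothing about my real results; it assumes FactoMineR is installed.

```r
library(FactoMineR)

# Hypothetical toy data: 10 tools (rows) x 5 one-hot features (columns).
toy <- data.frame(
  f1 = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 1),
  f2 = c(1, 1, 0, 0, 0, 0, 1, 0, 0, 0),
  f3 = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0),
  f4 = c(0, 0, 1, 1, 1, 0, 0, 0, 1, 0),
  f5 = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0),
  row.names = paste0("Tool", 1:10)
)
toy[] <- lapply(toy, factor, levels = c(0, 1))

toy.mca <- MCA(toy, ncp = Inf, graph = FALSE)

# nb.clust = -1 asks HCPC to cut the tree automatically.
toy.hcpc <- HCPC(toy.mca, nb.clust = -1, graph = FALSE, consol = FALSE)

# data.clust is the input data plus a 'clust' factor with the assignments.
table(toy.hcpc$data.clust$clust)

# Group the tool names by the cluster they were assigned to.
split(rownames(toy.hcpc$data.clust), toy.hcpc$data.clust$clust)
```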

Also important is setting consol to FALSE, due to what I believe was a bug in the version of factoextra I was using at the time. Running HCPC with consol set to TRUE worked fine, but when passing the output to an fviz_* function, everything was drawn as if consol were FALSE. It is possible that I used the library incorrectly, but the safest bet was to set consol to FALSE.

I then generated some lovely graphs using factoextra. Before calling any of the fviz_* functions, I manually updated the cluster count on the HCPC object as described here.

tools.hcpc$call$t$nb.clust <- tools.cluster_count

Finally, I generated the actual graphs. For dendrograms:

library(factoextra)

# Dendrogram with a filled, colored rectangle around each cluster.
fviz_dend(tools.hcpc,
  repel = TRUE,
  rect = TRUE, rect_fill = TRUE, lower_rect = -0.1,
  palette = "jco", rect_border = "jco",
  main = "", xlab = ""
)

And for cluster maps:

colors <- c("#cd534c", "#7aa6dc", "#868686", "#efc000")
fviz_cluster(tools.hcpc,
  repel = TRUE, show.clust.cent = TRUE,
  palette = "jco", ggtheme = theme_minimal(),
  main = "", stand = FALSE
) +
  scale_color_manual(values = colors) +
  scale_fill_manual(values = colors) +
  theme(legend.title = element_blank())

Outcomes

With clusters identified and authoring tools assigned to them, the next step was to find out why. Further analysis was done on the relationship between the factors themselves (i.e., the input features; my columns) and their presence (or absence) within a given cluster. Two percentages were analyzed for each feature and cluster: the percentage of all tools with the feature that fall in the cluster, and the percentage of tools in the cluster that have the feature. More details can be found in the paper linked at the beginning of this post if you’re interested in learning more.
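The two percentages can be illustrated with a few lines of base R. This is a sketch on made-up numbers (eight hypothetical tools, one hypothetical feature), not the computation from the paper.

```r
# Hypothetical toy result: cluster assignment and one feature for 8 tools.
cluster <- factor(c(1, 1, 1, 2, 2, 3, 3, 3))
feature <- c(1, 1, 0, 1, 0, 0, 0, 0)

# Cross-tabulate cluster membership against having the feature.
counts <- table(cluster, feature)

# Of all tools that have the feature, what share falls in each cluster?
share_of_feature_in_cluster <- 100 * counts[, "1"] / sum(counts[, "1"])

# Of the tools in each cluster, what share has the feature?
share_of_cluster_with_feature <- 100 * counts[, "1"] / rowSums(counts)

round(share_of_feature_in_cluster, 1)
round(share_of_cluster_with_feature, 1)
```

FactoMineR's cluster description (the desc.var component of the HCPC result) reports comparable figures, labelled Cla/Mod and Mod/Cla as far as I am aware; treat that mapping as an assumption to check against the package documentation.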

Finally, the clusters were used to pick a selection of authoring tools that exhibit a broad sample of distinct feature sets. This means I can reduce the number of tests from one per tool to just 3 or 4.