Jump to: Page Content, Section Navigation, Site Navigation, Site Search, Account Information, or Site Tools.
|
|
Technical CommentsResponse to Comment on "Clustering by Passing Messages Between Data Points"
Affinity propagation (AP) can be viewed as a generalization of the vertex substitution heuristic (VSH), whereby probabilistic exemplar substitutions are performed concurrently. Although results on small data sets (
Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto, Ontario M5S 3G4, Canada.
900 points) demonstrate that VSH is competitive with AP, we found VSH to be prohibitively slow for moderate-to-large problems, whereas AP was much faster and could achieve lower error.
* To whom correspondence should be addressed. E-mail: frey{at}psi.toronto.edu
Affinity propagation (AP) is an algorithm that clusters data and identifies exemplar data points that can be used for summarization and subsequent analysis (1). Dozens of clustering algorithms have been invented in the past 40 years, but in (1) we compared AP with three commonly used methods and found that AP could find solutions with lower error and do so much more quickly. Brusco and Köhn (2) compared AP with the best of 20 runs of a randomly initialized vertex substitution heuristic (VSH) described in 1997 (3), which is based on a previously introduced method (4). They found that for some small data sets ( As explained in our original report (1), regardless of whether or not the measure of data similarity is symmetric, "[e]xactly minimizing [AP's cost function] is computationally intractable, because a special case of this minimization problem is the NP-hard k-median problem" (also known as the p-median model, or PMM). [Also see (6).] Consequently, it is expected that different algorithms may work better for different data sets. In fact, we reported results using a brute force method that can exactly solve trivially small problems (e.g., 70 points and six clusters) in a minute or two on any modern computer. We also pointed out that linear programming relaxations (i.e., Lagrangian relaxations) have been used when the data set is not too large (7). We were curious about how close AP and VSH could get to the best possible (exact) solution, so we studied the original Olivetti face data set (n = 400 data points) for which the exact clustering solution could be found. Because the error for VSH varies depending on the random initialization, in all of our experiments we used the best of 20 runs. Figure 1A plots squared error versus the number of exemplars (k) for AP (single run), VSH (best of 20 runs), k-centers clustering (all of one million runs), and the exact solution (7). Both AP and VSH performed substantially better than k-centers clustering and, for practical purposes, achieved the exact solution (for k < 135, the difference in error between AP or VSH and the exact solution is less than the error reduction obtained by including an additional exemplar) (5).
The results presented by Brusco and Köhn (2) demonstrate that for small data sets (
To further compare AP and VSH, we studied six additional, diverse data sets: a randomly generated 400 x 400 nonmetric similarity matrix (8), an extended circuit-board data set consisting of 1272 drill-hole coordinates (9), 1965 video frames containing faces (10), 10,000 of the 75,066 putative exons from (1), 10,000 of the 22,709 gene expression patterns from (11), and 17,770 Netflix movies (12). We applied AP and VSH to these data sets, and we report error (measured relative to the minimum error, because n was too large to compute the lowest possible error exactly) and central processing unit (CPU) time in Table 1. These results show that although VSH sometimes achieves lower error than AP, the converse is frequently true. These results indicate that AP has a considerable advantage over VSH in terms of speed, especially as the number of data points (n) or exemplars (k) grows. Although using more VSH runs may decrease its error, doing so will require additional CPU time. Conversely, using fewer than 20 runs would likely increase the VSH error; in the one case in which VSH achieved lower error than AP, it did so in only two of 20 runs (if VSH were run once, the AP error would be lower with
To more closely investigate how the computation times of AP and VSH scale with problem size, we varied n from 1000 to 17,770 for the Netflix movie data (taking nested data sets) while fixing the preference at –1 to obtain moderately sized clusters with 10 to 40 points. Figure 1B shows that AP is much faster than VSH for large problems, where a speedup of roughly 100 is achieved. Because VSH explicitly tracks k exemplars, whereas AP does not, we expected that VSH's CPU time would scale more poorly with k. For all data (n = 17,770), Fig. 1C shows that the CPU time required by VSH grows linearly with the number of exemplars (k), whereas the CPU time required by AP is largely independent of k (13). What is the relationship between AP and VSH and how they reorganize clusters to better account for the data? VSH starts with a randomly selected set of exemplars and then examines every nonexemplar data point in turn and considers replacing a current exemplar with that point. It is well known that VSH can get stuck in a suboptimal solution and that randomly initializing VSH works well only if chances are good that at least one random initialization is close to a good solution (14). As a simple example, if the current exemplars are the "suboptimal exemplars" shown in Fig. 1D, replacing one of these with either of the two other points will not reduce the total squared error. If two initial exemplars are chosen at random, VSH will get stuck in that suboptimal solution 17% of the time (Fig. 1E). Interestingly, AP can be viewed as a generalization of VSH, wherein all points simultaneously compete for replacing every exemplar, but in a probabilistic manner. AP's availabilities (1) represent a "soft exemplar set," and responsibilities provide probabilistic evidence that a candidate exemplar should replace an existing exemplar. Indeed, AP always avoids the poor solution in the above example. If a data set contains many similar data point configurations, chances are high that VSH will get stuck in a suboptimal solution because of exemplars involved in some of those configurations. When we replicated the data from Fig. 1D 100 times with horizontal and vertical shifts drawn uniformly from [–50, +50], for k = 168, the best error in 20 VSH runs was 0.951% (relative to optimal), whereas the best error using AP was 0.102%. Another difference is that AP automatically determines the number of clusters (k) by taking into account exemplar costs (negative preferences), whereas VSH finds a prespecified number of clusters. Automatic model selection is widely considered to be advantageous (15). Many tasks in science, engineering, and medicine require that the solution satisfy a stringency threshold based on a desired false detection rate, a Bayesian prior probability (15), or an information-theoretic coding cost (16). In AP, the exemplar cost exactly accounts for this stringency threshold. Based on the results we obtained on moderate-to-large problems and AP's ease of implementation and extendability, we believe that AP offers considerable advantages over existing methods.
Received for publication 5 November 2007. Accepted for publication 15 January 2008.
The editors suggest the following Related Resources on Science sites:In Science Magazine
|
Science. ISSN 0036-8075 (print), 1095-9203 (online)