Improved Techniques for The K-clique Densest Subgraph Problem

This is a class project.

Abstract

Extracting dense subgraph has wide application and one of the core research topics in data mining. For Example, it’s applied to DNA motif search [2], social net [3] e.t.c. In general, however, finding maximum clique is NP-hard and we cannot get them within practical computational time [4]. Because of that, many other formulations have been proposed. Especially, Densest Subgraph Problem [5] is a polynomial-time solvable formulation and applied to many applications [6]. And in recent years, the new formulation has emerged: K-clique Densest Subgraph Problem. This problem is also solvable in polynomial time. Moreover, this formulation can extract dense subgraph nearer clique than Densest Subgraph Problem. In this work, we propose methods for previous exact/approximation algorithm which improve speed and memory efficiency.

Introduction

Let $c_k(S)$ be the number of k-cliques in a subgraph induced by a vertex set $S$ of simple undirected graph $G$, and define k-clique density $\tau(S)$ as $\frac{c(S)}{|S|}$. K-clique Densest Subgraph Problem is a problem which finds a vertex set is G that achieves the maximum k-clique density. Especially, when $k=3$, the problem is called Triangle Densest Subgraph Problem (TDS). [1] proposes polynomial time exact algorithm and $1/k$-approximation algorithm to this problem. We describe each of them here. Though we explain about TDS, you can extend them to general $k$.

First, we describe the exact solution. Here, we use improved one in [9]. Using triangle listing algorithm [7][8], list all triangle in the instance. Then construct the following directed bipartite graph $H$. $V(H) = {s}\cup{t}\cup V(G)\cup E(G)$. Add arcs to $H$ as follows.

• from $s$ to each vertex $v \in V(G)$ with capacity $t(v)$, where $t(v)$ is the number of triangles $v$ belongs
• from each vertex $v\in V(G)$ to edge $e\in E(G)$ with capacity one if $v$ and $e$ form a triangle
• from each edge $e \in E(G)$ to its end points with infinite capacity
• from each vertex $v \in V(G)$ to $t$ with capacity $3\alpha$

In this graph, “minimum cut of s and t is equal to $({s}$,${t}\cup V(G)\cup E(G))$” and “no subgraph with larger triangle density than $\alpha$” are proved to be equivalent. Using this fact, you can do a binary search. The iteration stops in $O(\log V)$, and you can get the exact solution in about $\tilde{O}(c_k(V))$ times and $O(c_k(V))$ memory.

Next, we describe the approximation algorithm. Again, list the all triangles and memo to each vertex the number of triangles that the vertex belongs. Then remove the vertex with the minimum number of triangles. After the removal, check all edge pairs having the vertex as its endpoint and update other vertices’ triangle count. Keep going until remained vertices are less than three and output the subgraph which achieved the maximum triangle density between all time points. Computational time is about $O(nm)$, and this produces $1/3$-approximation.

The sampling method is proposed to the scheme of the above exact algorithm [9]. With the method, you can gain tens of times of speed up in TDS and tens of thousands of speed up in larger k. However, it no more can calculate the exact solution. In this work, we introduce techniques which achieve increased speed and efficient memory usage without reducing accuracy.

Proposed Methods

With an intuition that only “probably dense” subgraph should be checked in the algorithm, reduce vertices, edges, cliques to consider and increase efficiency. For the purpose, we prove several lemmas.

Lemma 1:
If an optimal solution $S$ has the k-clique density larger than $D$, $S$ does not contain any vertices which belong less than $D$ k-cliques.
proof:
Let $S$ be an arbitral optimal solution. Assume that $S$ has a vertex $v$ which belongs to up to $d’(<D)$ k-cliques. Let $d(<d’)$ be the number of k-cliques which $v$ belongs and all vertices are in $S$. By assumption, $d\leq d’< D \leq \frac{c_k(S)}{|S|}$. Thus,
\begin{align} \frac{c_k(S)}{|S|}-\frac{c_k(S-{v})}{|S|-1} &= \frac{c_k(S)}{|S|}-\frac{c_k(S)-d)}{|S|-1} &= \frac{|S|d - c_k(S)}{|S|(|S|-1)} &< 0 \end{align}
This contradicts the optimality of S.

Lemma 2:
There exists an optimal solution which is contained by a connected component.</p>
proof:
Assume an optimal solution $S$ can be decomposed to $S_1$ and $S_2$ which have no edges between them. W.l.o.g. assume that $\frac{c_k(S_1)}{|S_1|}\leq \frac{c_k(S_2)}{|S_2|}$. Now,
\begin{align} \frac{c_k(S)}{|S|}-\frac{c_k(S_1)}{|S_1|} &= \frac{c_k(S_1)+c_k(S_2)}{|S_1|+|S_2|} - \frac{c_k(S_1)}{|S_1|} &= \frac{|S_1|c_k(S_2)-|S_2|c_k(S_1)}{|S||S_1|} &\leq 0 \end{align}
Because $S$ is optimal, $S_1$ is also an optimal solution. $S$ contains finite vertices and $|S_1|<|S|$, thus if we repeatedly apply above discussion to $S_1$, we can gain a solution which is contained by a connected component. Lemma 1, 2 shows that it’s enough that we first list all k-cliques and remove unnecessary vertices, then consider each connected component at a time. This removal of unncecessary vertices can gain more effect if we apply this method several times. Because we can reduce much vertices, edges, k-cliques to consider at a time, it’s expected that we can improve both speed and memory. In binary search, the initial maximum can be improved by previous $\frac{(V-1)(V-2)}{6}$[1].

Lemma 3:
Let M be the maximum number of k-cliques which a vertex belongs to within all vertices. Optimal solution has k-clique density less than or equal to $\frac{M}{k}$.</p>
proof:
Let $S$ be an arbitral optimal solution. The sum of the number k-cliques which the vertices in $S$ belongs is at most $M|S|$. Because we count each k-clique k times, $c_k(S)$$is at most \frac{M|S|}{k}$. Thus, S’ k-clique density is at most $\frac{M}{k}$. Next, we describe approximation method. In previous method, cliques are recounted at each time any vertex is removed. However, if we keep all k-cliques when we listed them, we can reduce computational time about $\tilde{O}(c_k(V))$ except for the time to list all k-cliques. Note that this method comsumes $O(c_k(V))$ memory. Now, we apply the same method as exact algorithm and reduce k-cliques to be considered at a time. In order to apply the method, we have to change the approximation to use the framework of the binary search.

Lemma 4:
Consider that after removing all vertices with less than $\alpha$ k-cliques we applyed the approximation algorithm to each connected component. If the best output had k-clique density larger than $\alpha/k$, the output achieves $1/k$-approximation.</p>
proof:
If the k-clique density of optimal solutions are less than $\alpha$, the claim is trivial. Consider not. Then from lemma 1 and 2, there exist a connected component which contains a optimal solution after removal. Because we applied $1/k$-approximation algorithm to the connected component, the lemma follows. Now, it’s expected that if we decreasing threshold $\alpha$ as $\alpha=(\alpha+\min)/2$ and apply the approximation algorithm with our method until the condition of lemma 4 is satisfied, we can gain efficient algorithm. Moreover, this new approximation algorithm can be easily combined with another work [9].

Experiments

The experiments was performed on a single machine, with an Intel Corei7 6700K CPU. We measured the time from the program enter the while loop and exit, which was the previous bottleneck. All experiments were that of TDS, that is, k=3. We userd [10] for triangle listing and [11] for max flow. There are no reason why we didn’t use [12] as [9] except for our laziness. We evaluated with the following 7 datasets. karate[13], C-elegans[14], polblogs[15], CA-HepTh[16], soc-slashdot0902[17], dblp[18], web-Google[19].

speed:

dataset V E n_triangles privious exact method (ms) our exact method (ms) previous approximation method (ms) our approximation method (ms)
karate 35 78 45 2 0 0 0
C-elegans 131 687 639 60 42 2 1
polblogs 1,224 16,715 101,043 9,397 2,329 377 35
CA-HepTh 9,877 25,973 28,339 1,841 112 104 3
soc-slashdot0902 82,168 504,230 602,592 137,071 12,410 23,906 116
dblp 317,080 1,049,866 2.22438E+06 165,402 6,243 9,026 278
web-google 875,713 4,322,051 1.33919E+07 / 72,152 251,130 564

the number of triangles which was considered at the same time:

dataset n_triangles our exact method (ms) our approximation method (ms)
karate 45 28 16
C-elegans 639 547 326
polblogs 101,043 62,451 27,461
CA-HepTh 28,339 4,960 4,960
soc-slashdot0902 602,592 213,490 101,234
dblp 2.22438E+06 240,464 240,464

$3\times$ (triangle density of the output):

dataset exact solution our approximation method (ms)
karate 8.0041 8
C-elegans 19.4999 18.7059
polblogs 986.471 882.405
CA-HepTh 465 465
soc-slashdot0902 2383.6 2172.52
dblp 6328 6328

memory consumption (approximation via ps command):

dataset previous exact method our exact method previous approximation method fast approximation method our approximation method
web-google 10.4 GB (killed before the end) 3.65 GB 336 MB 3.65 GB 384 MB

comparison of each technique in exact algorithm (dataset: dblp):

naive only connected component decomposition only removal our method
speed (ms) 150,563 171,394 9,393 6,286
triangles considered 2.22438E+06 2.22438E+06 412,164 240,464

comparison of each technique in approximation algorithm (dataset: web-Google):

naive only connected component decomposition only removal our method
speed (ms) 81,181 84,312 9,393 6,286
triangles considered 1.33919E+07 1.33563E+07 189,442 38,319
3 * triangle density 1262.28 1253.04 1253.04 1334.17

Approximation accuracy was improved in all other datasets. Producing more than one candidate of the output and checking each connected component one by one is considered to be contributing this byproduct.

Conclusion

We proposed improved method for the K-clique Densest Subgraph Problem and verified the efficiency of the method via experiments. As future works, better bound of complexity with restricted classes of graph, better way of reducing the threshold in our approximation algorithm, accuracy garantee of our approximation algorithm combined with a graph sparsifier [9] are open problem.

References

[1] C. E. Tsourkakis. The K-clique Densest Subgraph Problem. In WWW 2015.
[2] E. Fratkin, B. T. Naughton, D. Brutlag, S. Batzoglou. Motifcut: regulatory motifs finding with maximum density subgraphs. Bioinformatics, 22(14):e150–e157, 2006
[3] A. Gionis, F. Junqueira, V. Leroy, M. Serafini, I. Weber. Piggybacking on social networks. Proceedings of the VLDB Endowment, 2013
[4] Karp, Richard M. Reducibility among combinatorial problems. In Miller, R. E.; Thatcher, J. W., Complexity of Computer Computations, New York: Plenum, pp. 85–103, 1972
[5] A. V. Goldberg. Finding a maximum density subgraph. Technical report, University of California at Berkeley, 1984.
[6] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. In SODA, 2002.
[7] A. Itai and M. Rodeh. Finding a minimum circuit in a graph. SIAM Journal on Computing, 7(4):413–423, 1978.
[8] A. Bjo ̈rklund, R. Pagh, V. Williams Vassilevska, and U. Zwick. Listing triangles. In Proceedings of 41st International Colloquium on Automata, Languages and Programming (ICALP), 2014.
[9] M. Mitzenmacher, J. Pachocki, R. Peng, C. E. Tsourkakis, and S. C. Xu. Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling.
[10] N.Chiba and T.Nishizeki. Arbocity and subgraph listing algorithms. In Society for Industrial and Applied Mathematics 1985.
[11] E. A. Dinic. Algorithm for solution of a problem of maximum flow in a network with power estimation. Doklady Akademii nauk SSSR. 11: 1277–1280. 1970.
[12] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum-flow problem. In Journal of the ACM (JACM), 35(4), pages 921–940, 1988.
[13] http://konect.uni-koblenz.de/networks/ucidata-zachary
[14] https://snap.stanford.edu/data/C-elegans-frontal.html
[15] https://networkdata.ics.uci.edu/data.php?id=102
[16] https://snap.stanford.edu/data/ca-HepTh.html
[17] http://snap.stanford.edu/data/soc-Slashdot0902.html
[18] https://snap.stanford.edu/data/com-DBLP.html