18 Jun 2017 by Yusuke Tsuzuku

Improved Techniques for The K-clique Densest Subgraph Problem

This is a class project.

Abstract

Extracting dense subgraph has wide application and one of the core research topics in data mining. For Example, it’s applied to DNA motif search [2], social net [3] e.t.c. In general, however, finding maximum clique is NP-hard and we cannot get them within practical computational time [4]. Because of that, many other formulations have been proposed. Especially, Densest Subgraph Problem [5] is a polynomial-time solvable formulation and applied to many applications [6]. And in recent years, the new formulation has emerged: K-clique Densest Subgraph Problem. This problem is also solvable in polynomial time. Moreover, this formulation can extract dense subgraph nearer clique than Densest Subgraph Problem. In this work, we propose methods for previous exact/approximation algorithm which improve speed and memory efficiency.

Introduction

Let $c_k(S)$ be the number of k-cliques in a subgraph induced by a vertex set $S$ of simple undirected graph $G$, and define k-clique density $\tau(S)$ as $\frac{c(S)}{|S|}$. K-clique Densest Subgraph Problem is a problem which finds a vertex set is G that achieves the maximum k-clique density. Especially, when $k=3$, the problem is called Triangle Densest Subgraph Problem (TDS). [1] proposes polynomial time exact algorithm and $1/k$-approximation algorithm to this problem. We describe each of them here. Though we explain about TDS, you can extend them to general $k$.

First, we describe the exact solution. Here, we use improved one in [9]. Using triangle listing algorithm [7][8], list all triangle in the instance. Then construct the following directed bipartite graph $H$. $V(H) = {s}\cup{t}\cup V(G)\cup E(G)$. Add arcs to $H$ as follows.

from $s$ to each vertex $v \in V(G)$ with capacity $t(v)$, where $t(v)$ is the number of triangles $v$ belongs
from each vertex $v\in V(G)$ to edge $e\in E(G)$ with capacity one if $v$ and $e$ form a triangle
from each edge $e \in E(G)$ to its end points with infinite capacity
from each vertex $v \in V(G)$ to $t$ with capacity $3\alpha$

In this graph, “minimum cut of s and t is equal to $({s}$,${t}\cup V(G)\cup E(G))$” and “no subgraph with larger triangle density than $\alpha$” are proved to be equivalent. Using this fact, you can do a binary search. The iteration stops in $O(\log V)$, and you can get the exact solution in about $\tilde{O}(c_k(V))$ times and $O(c_k(V))$ memory.

Next, we describe the approximation algorithm. Again, list the all triangles and memo to each vertex the number of triangles that the vertex belongs. Then remove the vertex with the minimum number of triangles. After the removal, check all edge pairs having the vertex as its endpoint and update other vertices’ triangle count. Keep going until remained vertices are less than three and output the subgraph which achieved the maximum triangle density between all time points. Computational time is about $O(nm)$, and this produces $1/3$-approximation.

The sampling method is proposed to the scheme of the above exact algorithm [9]. With the method, you can gain tens of times of speed up in TDS and tens of thousands of speed up in larger k. However, it no more can calculate the exact solution. In this work, we introduce techniques which achieve increased speed and efficient memory usage without reducing accuracy.

Proposed Methods

With an intuition that only “probably dense” subgraph should be checked in the algorithm, reduce vertices, edges, cliques to consider and increase efficiency. For the purpose, we prove several lemmas.

Lemma 1:
If an optimal solution $S$ has the k-clique density larger than $D$, $S$ does not contain any vertices which belong less than $D$ k-cliques.
proof:
Let $S$ be an arbitral optimal solution. Assume that $S$ has a vertex $v$ which belongs to up to $d’(<D)$ k-cliques. Let $d(<d’)$ be the number of k-cliques which $v$ belongs and all vertices are in $S$. By assumption, $d\leq d’< D \leq \frac{c_k(S)}{|S|}$. Thus,
$\begin{align} \frac{c_k(S)}{|S|}-\frac{c_k(S-{v})}{|S|-1} &= \frac{c_k(S)}{|S|}-\frac{c_k(S)-d)}{|S|-1}
&= \frac{|S|d - c_k(S)}{|S|(|S|-1)}
&< 0 \end{align}$
This contradicts the optimality of S.

Lemma 2:
There exists an optimal solution which is contained by a connected component.</p>
proof:
Assume an optimal solution $S$ can be decomposed to $S_1$ and $S_2$ which have no edges between them. W.l.o.g. assume that $\frac{c_k(S_1)}{|S_1|}\leq \frac{c_k(S_2)}{|S_2|}$. Now,
$\begin{align} \frac{c_k(S)}{|S|}-\frac{c_k(S_1)}{|S_1|} &= \frac{c_k(S_1)+c_k(S_2)}{|S_1|+|S_2|} - \frac{c_k(S_1)}{|S_1|}
&= \frac{|S_1|c_k(S_2)-|S_2|c_k(S_1)}{|S||S_1|}
&\leq 0 \end{align}$
Because $S$ is optimal, $S_1$ is also an optimal solution. $S$ contains finite vertices and $|S_1|<|S|$, thus if we repeatedly apply above discussion to $S_1$, we can gain a solution which is contained by a connected component. Lemma 1, 2 shows that it’s enough that we first list all k-cliques and remove unnecessary vertices, then consider each connected component at a time. This removal of unncecessary vertices can gain more effect if we apply this method several times. Because we can reduce much vertices, edges, k-cliques to consider at a time, it’s expected that we can improve both speed and memory. In binary search, the initial maximum can be improved by previous $\frac{(V-1)(V-2)}{6}$[1].

Lemma 3:
Let M be the maximum number of k-cliques which a vertex belongs to within all vertices. Optimal solution has k-clique density less than or equal to $\frac{M}{k}$.</p>
proof:
Let $S$ be an arbitral optimal solution. The sum of the number k-cliques which the vertices in $S$ belongs is at most $M|S|$. Because we count each k-clique k times, $c_k(S)$$ is at most \frac{M|S|}{k}$. Thus, S’ k-clique density is at most $\frac{M}{k}$. Next, we describe approximation method. In previous method, cliques are recounted at each time any vertex is removed. However, if we keep all k-cliques when we listed them, we can reduce computational time about $\tilde{O}(c_k(V))$ except for the time to list all k-cliques. Note that this method comsumes $O(c_k(V))$ memory. Now, we apply the same method as exact algorithm and reduce k-cliques to be considered at a time. In order to apply the method, we have to change the approximation to use the framework of the binary search.

Lemma 4:
Consider that after removing all vertices with less than $\alpha$ k-cliques we applyed the approximation algorithm to each connected component. If the best output had k-clique density larger than $\alpha/k$, the output achieves $1/k$-approximation.</p>
proof:
If the k-clique density of optimal solutions are less than $\alpha$, the claim is trivial. Consider not. Then from lemma 1 and 2, there exist a connected component which contains a optimal solution after removal. Because we applied $1/k$-approximation algorithm to the connected component, the lemma follows. Now, it’s expected that if we decreasing threshold $\alpha$ as $\alpha=(\alpha+\min)/2$ and apply the approximation algorithm with our method until the condition of lemma 4 is satisfied, we can gain efficient algorithm. Moreover, this new approximation algorithm can be easily combined with another work [9].

Experiments

The experiments was performed on a single machine, with an Intel Corei7 6700K CPU. We measured the time from the program enter the while loop and exit, which was the previous bottleneck. All experiments were that of TDS, that is, k=3. We userd [10] for triangle listing and [11] for max flow. There are no reason why we didn’t use [12] as [9] except for our laziness. We evaluated with the following 7 datasets. karate[13], C-elegans[14], polblogs[15], CA-HepTh[16], soc-slashdot0902[17], dblp[18], web-Google[19].

speed:

dataset	V	E	n_triangles	privious exact method (ms)	our exact method (ms)	previous approximation method (ms)	our approximation method (ms)
karate	35	78	45	2	0	0	0
C-elegans	131	687	639	60	42	2	1
polblogs	1,224	16,715	101,043	9,397	2,329	377	35
CA-HepTh	9,877	25,973	28,339	1,841	112	104	3
soc-slashdot0902	82,168	504,230	602,592	137,071	12,410	23,906	116
dblp	317,080	1,049,866	2.22438E+06	165,402	6,243	9,026	278
web-google	875,713	4,322,051	1.33919E+07	/	72,152	251,130	564

the number of triangles which was considered at the same time:

dataset	n_triangles	our exact method (ms)	our approximation method (ms)
karate	45	28	16
C-elegans	639	547	326
polblogs	101,043	62,451	27,461
CA-HepTh	28,339	4,960	4,960
soc-slashdot0902	602,592	213,490	101,234
dblp	2.22438E+06	240,464	240,464
web-google	1.33919E+07	134,076	38,319

$3\times$ (triangle density of the output):

dataset	exact solution	our approximation method (ms)
karate	8.0041	8
C-elegans	19.4999	18.7059
polblogs	986.471	882.405
CA-HepTh	465	465
soc-slashdot0902	2383.6	2172.52
dblp	6328	6328
web-google	1337.45	1334.17

memory consumption (approximation via ps command):

dataset	previous exact method	our exact method	previous approximation method	fast approximation method	our approximation method
web-google	10.4 GB (killed before the end)	3.65 GB	336 MB	3.65 GB	384 MB

comparison of each technique in exact algorithm (dataset: dblp):

	naive	only connected component decomposition	only removal	our method
speed (ms)	150,563	171,394	9,393	6,286
triangles considered	2.22438E+06	2.22438E+06	412,164	240,464

comparison of each technique in approximation algorithm (dataset: web-Google):

	naive	only connected component decomposition	only removal	our method
speed (ms)	81,181	84,312	9,393	6,286
triangles considered	1.33919E+07	1.33563E+07	189,442	38,319
3 * triangle density	1262.28	1253.04	1253.04	1334.17

Approximation accuracy was improved in all other datasets. Producing more than one candidate of the output and checking each connected component one by one is considered to be contributing this byproduct.

Conclusion

We proposed improved method for the K-clique Densest Subgraph Problem and verified the efficiency of the method via experiments. As future works, better bound of complexity with restricted classes of graph, better way of reducing the threshold in our approximation algorithm, accuracy garantee of our approximation algorithm combined with a graph sparsifier [9] are open problem.

References

[1] C. E. Tsourkakis. The K-clique Densest Subgraph Problem. In WWW 2015.
[2] E. Fratkin, B. T. Naughton, D. Brutlag, S. Batzoglou. Motifcut: regulatory motifs finding with maximum density subgraphs. Bioinformatics, 22(14):e150–e157, 2006
[3] A. Gionis, F. Junqueira, V. Leroy, M. Serafini, I. Weber. Piggybacking on social networks. Proceedings of the VLDB Endowment, 2013
[4] Karp, Richard M. Reducibility among combinatorial problems. In Miller, R. E.; Thatcher, J. W., Complexity of Computer Computations, New York: Plenum, pp. 85–103, 1972
[5] A. V. Goldberg. Finding a maximum density subgraph. Technical report, University of California at Berkeley, 1984.
[6] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. In SODA, 2002.
[7] A. Itai and M. Rodeh. Finding a minimum circuit in a graph. SIAM Journal on Computing, 7(4):413–423, 1978.
[8] A. Bjo ̈rklund, R. Pagh, V. Williams Vassilevska, and U. Zwick. Listing triangles. In Proceedings of 41st International Colloquium on Automata, Languages and Programming (ICALP), 2014.
[9] M. Mitzenmacher, J. Pachocki, R. Peng, C. E. Tsourkakis, and S. C. Xu. Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling.
[10] N.Chiba and T.Nishizeki. Arbocity and subgraph listing algorithms. In Society for Industrial and Applied Mathematics 1985.
[11] E. A. Dinic. Algorithm for solution of a problem of maximum flow in a network with power estimation. Doklady Akademii nauk SSSR. 11: 1277–1280. 1970.
[12] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum-flow problem. In Journal of the ACM (JACM), 35(4), pages 921–940, 1988.
[13] http://konect.uni-koblenz.de/networks/ucidata-zachary
[14] https://snap.stanford.edu/data/C-elegans-frontal.html
[15] https://networkdata.ics.uci.edu/data.php?id=102
[16] https://snap.stanford.edu/data/ca-HepTh.html
[17] http://snap.stanford.edu/data/soc-Slashdot0902.html
[18] https://snap.stanford.edu/data/com-DBLP.html
[19] http://snap.stanford.edu/data/web-Google.html

Experiment3 2nd