SNAP Library 2.0, User Reference
2013-05-13 16:33:57
SNAP, a general purpose, high performance system for analysis and manipulation of large networks
|
Local Spectral Clustering algorithm. More...
#include <ncp.h>
Public Member Functions | |
TLocClust (const PUNGraph &GraphPt, const double &AlphaVal) | |
int | Len () const |
Returns the support of the approximate random walk, the number of nodes with non-zero PageRank score. | |
int | GetRndWalkSup () const |
Returns the support of the approximate random walk, the number of nodes with non-zero PageRank score. | |
int | GetNId (const int &NodeN) const |
Returns the ID of the NodeN-th node in the sweep vector. | |
int | GetVol (const int &Nodes) const |
Returns the volume of the set of first NodeN nodes in the sweep vector. | |
int | GetCut (const int &Nodes) const |
Returns the size of the cut of the first Nodes nodes in the sweep vector. | |
double | GetPhi (const int &ValId) const |
Returns the conductance of the cut separating the first Nodes nodes in the sweep vector from the rest of the graph. | |
const TIntV & | GetNIdV () const |
Vector of node IDs sorted in the random walk order. | |
const TIntV & | GetVolV () const |
Volume (sum of the degrees) of first K-nodes in the node-id vector (GetNIdV()). | |
const TIntV & | GetCutV () const |
Edges cut to separate the first K-nodes in the node-id vector (GetNIdV()) from the rest of the graph. | |
const TFltV & | GetPhiV () const |
Conducatance of the cut separating first K-nodes in the node-id vector (GetNIdV()) from the rest of the graph. | |
int | BestCut () const |
Index K of the cut of the minimum conductance around the seed node. | |
int | BestCutNodes () const |
Number of nodes inside the 'best' (minimum conductance) cut. | |
int | GetCutEdges () const |
Number of edges in the 'best' (minimum conductance) cut. | |
int | GetCutVol () const |
Volume of the 'best' (minimum conductance) cut. | |
double | GetCutPhi () const |
Conductance of the 'best' (minimum conductance) cut. | |
int | ApproxPageRank (const int &SeedNode, const double &Eps) |
Computes Approximate PageRank from the seed node SeedNId and with tolerance Eps. | |
void | SupportSweep () |
After the function ApproxPageRank() has been run the SupportSweep() computes the volume, cut size, node ids, conductance vectors. | |
void | FindBestCut (const int &SeedNode, const int &ClustSz, const double &MinSizeFrac=0.2) |
Finds minimum conductance cut in the graph around the seed node. | |
void | PlotVolDistr (const TStr &OutFNm, TStr Desc=TStr()) const |
Plots the cluster volume vs. cluster size K (cluster is composed of nodes NIdV[1...K]). | |
void | PlotCutDistr (const TStr &OutFNm, TStr Desc=TStr()) const |
Plots the cluster cut size vs. cluster size K (cluster is composed of nodes NIdV[1...K]). | |
void | PlotPhiDistr (const TStr &OutFNm, TStr Desc=TStr()) const |
Plots the cluster conductance vs. cluster size K (cluster is composed of nodes NIdV[1...K]). | |
void | SavePajek (const TStr &OutFNm) const |
Saves the network in the Pajek format so it can be visualized. Red node represents the seed and color the cluster membership. | |
Static Public Member Functions | |
static void | DrawWhiskers (const PUNGraph &Graph, TStr FNmPref, const int &PlotN) |
Draws the 'whiskers' of the graph. Whiskers are small sub-graphs that are attached to the rest of the graph via a single edge. | |
static void | GetCutStat (const PUNGraph &Graph, const TIntV &NIdV, int &Vol, int &Cut, double &Phi, int GraphEdges=-1) |
For a given Graph and a set of nodes NIdV the function returns the Volume, CutSize and the Conductance of the cut. | |
static void | GetCutStat (const PUNGraph &Graph, const TIntSet &NIdSet, int &Vol, int &Cut, double &Phi, int GraphEdges=-1) |
For a given Graph and a set of nodes NIdV the function returns the Volume, CutSize and the Conductance of the cut. | |
static void | PlotNCP (const PUNGraph &Graph, const TStr &FNm, const TStr Desc="", const bool &BagOfWhiskers=true, const bool &RewireNet=false, const int &KMin=10, const int &KMax=Mega(100), const int &Coverage=10, const bool &SaveTxtStat=false, const bool &PlotBoltzman=false) |
Static Public Attributes | |
static bool | Verbose = true |
Private Attributes | |
PUNGraph | Graph |
int | Nodes |
int | Edges2 |
TIntFltH | ProbH |
TIntFltH | ResH |
TIntQ | NodeQ |
double | Alpha |
int | SeedNId |
TIntV | NIdV |
TIntV | VolV |
TIntV | CutV |
TFltV | PhiV |
int | BestCutIdx |
Friends | |
class | TLocClustStat |
Local Spectral Clustering algorithm.
The code implements the PageRank Nibble local clustering algorithm of Andersen, Chung and Lang. Given a single starting seed node, the algorithm will then find the clusters around that node. This is achieved by the algorithm finding the approximate personalized PageRank score of every node with respect to the Seed node. Nodes are then ordered by the PageRank score and the idea is then that by 'sweeping' the vector of PageRank scores one can find communities around the chosen seed node. The idea is to try out K = 1...N/2 and then for a set of {node_1 ... node_K} test the value of the conductance (Phi). If the conductance at certain value of K achieves a local minima, then we found a good cut in the graph. This method is also used for computing the Network Community Profile plots. See: Local Graph Partitioning using PageRank Vectors by R. Andersen, F. Chung and K. Lang URL: http://www.math.ucsd.edu/~fan/wp/localpartition.pdf
TLocClust::TLocClust | ( | const PUNGraph & | GraphPt, |
const double & | AlphaVal | ||
) | [inline] |
int TLocClust::ApproxPageRank | ( | const int & | SeedNode, |
const double & | Eps | ||
) |
Computes Approximate PageRank from the seed node SeedNId and with tolerance Eps.
The algorithm basically sets the PageRank scores of nodes with score <Eps to zero. So the lower the value of Eps the longer the algorithm will run.
Definition at line 9 of file ncp.cpp.
{ for (int i = 0; i < ProbH.Len(); i++) { ProbH[i]=0.0; } for (int i = 0; i < ResH.Len(); i++) { ResH[i]=0.0; } ProbH.Clr(false, -1, false); ResH.Clr(false, -1, false); ResH.AddDat(SeedNode, 1.0); int iter = 0; double OldRes = 0.0; NodeQ.Clr(false); NodeQ.Push(SeedNode); TExeTm ExeTm; while (! NodeQ.Empty()) { const int NId = NodeQ.Top(); NodeQ.Pop(); const TUNGraph::TNodeI& Node = Graph->GetNI(NId); const int NIdDeg = Node.GetOutDeg(); const double PushVal = ResH.GetDat(NId) - 0.5*Eps*NIdDeg; const double PutVal = (1.0-Alpha) * PushVal / double(NIdDeg); ProbH.AddDat(NId) += Alpha*PushVal; ResH.AddDat(NId) = 0.5 * Eps * NIdDeg; for (int e = 0; e < NIdDeg; e++) { const int DstNId = Node.GetOutNId(e); const int DstDeg = Graph->GetNI(DstNId).GetOutDeg(); double& ResVal = ResH.AddDat(DstNId).Val; OldRes = ResVal; ResVal += PutVal; if (ResVal >= Eps*DstDeg && OldRes < Eps*DstDeg) { NodeQ.Push(DstNId); } } iter++; if (iter % Mega(1) == 0) { printf(" %d[%s]", NodeQ.Len(), ExeTm.GetStr()); if (iter/1000 > Graph->GetNodes() || ExeTm.GetSecs() > 4*3600) { // more than 2 hours printf("Too many iterations! Stop to save time.\n"); return iter; } } } // check that the residuals are sufficiently small /*for (int i =0; i < ProbH.Len(); i++) { const int Deg = Graph->GetNI(ResH.GetKey(i)).GetOutDeg(); IAssert(ResH[i] < Eps*Deg); } //*/ return iter; }
int TLocClust::BestCut | ( | ) | const [inline] |
Index K of the cut of the minimum conductance around the seed node.
This means that the set of GetNId(0)...GetNId(K) forms the best cut around the seed node.
Definition at line 73 of file ncp.h.
{ return BestCutIdx; }
int TLocClust::BestCutNodes | ( | ) | const [inline] |
Number of nodes inside the 'best' (minimum conductance) cut.
Definition at line 75 of file ncp.h.
{ return BestCutIdx+1; }
void TLocClust::DrawWhiskers | ( | const PUNGraph & | Graph, |
TStr | FNmPref, | ||
const int & | PlotN = 10 |
||
) | [static] |
Draws the 'whiskers' of the graph. Whiskers are small sub-graphs that are attached to the rest of the graph via a single edge.
Definition at line 172 of file ncp.cpp.
{ TCnComV CnComV; TSnap::Get1CnCom(Graph, CnComV); CnComV.Sort(false); // plot size distribution { TIntH SzCntH; for (int i = 0; i < CnComV.Len(); i++) { SzCntH.AddDat(CnComV[i].Len()) += 1; } TGnuPlot::PlotValCntH(SzCntH, "whiskSz."+FNmPref, TStr::Fmt("%s. G(%d, %d)", FNmPref.CStr(), Graph->GetNodes(), Graph->GetEdges()), "Whisker size (Maximal component connected with a bridge edge)", "Count", gpsLog10XY, false); } // draw them int BrNodeId = -1; for (int c = 0; c < TMath::Mn(CnComV.Len(), PlotN); c++) { const PUNGraph BrClust = TSnap::GetSubGraph(Graph, CnComV[c].NIdV); for (TUNGraph::TNodeI NI = BrClust->BegNI(); NI < BrClust->EndNI(); NI++) { if (NI.GetOutDeg() != Graph->GetNI(NI.GetId()).GetOutDeg()) { BrNodeId=NI.GetId(); break; } } TIntStrH ClrH; ClrH.AddDat(BrNodeId, "red"); TSnap::DrawGViz(BrClust, gvlNeato, TStr::Fmt("whisk-%s-%02d.gif", FNmPref.CStr(), c+1), TStr::Fmt("Bridge node id: %d, n=%d, e=%d", BrNodeId, BrClust->GetNodes(), BrClust->GetEdges()), false, ClrH); } }
void TLocClust::FindBestCut | ( | const int & | SeedNode, |
const int & | ClustSz, | ||
const double & | MinSizeFrac = 0.2 |
||
) |
Finds minimum conductance cut in the graph around the seed node.
Function first computes the ApproxPageRank(), initializes the SupportSweep() and then find the minimum conductance cluster. Parameter ClustSz controls the expected cluster size and is used to determine the tolerance (Eps) of the approximate PageRank calculation.
Definition at line 81 of file ncp.cpp.
{ double MaxPhi = TFlt::Mx; BestCutIdx = -1; SeedNId = SeedNode; // calculate pagerank and cut sets ApproxPageRank(SeedNId, 1.0/double(ClustSz)); for (int i = 0; i < ProbH.Len(); i++) { ProbH[i] /= Graph->GetNI(ProbH.GetKey(i)).GetOutDeg(); } ProbH.SortByDat(false); SupportSweep(); // find best cut NIdV.Clr(false); for (int i = 0; i < PhiV.Len(); i++) { const double Phi = PhiV[i]; NIdV.Add(ProbH.GetKey(i)); if (Phi < MaxPhi) { MaxPhi = Phi; BestCutIdx = i; } } }
int TLocClust::GetCut | ( | const int & | Nodes | ) | const [inline] |
int TLocClust::GetCutEdges | ( | ) | const [inline] |
double TLocClust::GetCutPhi | ( | ) | const [inline] |
void TLocClust::GetCutStat | ( | const PUNGraph & | Graph, |
const TIntV & | NIdV, | ||
int & | Vol, | ||
int & | Cut, | ||
double & | Phi, | ||
int | GraphEdges = -1 |
||
) | [static] |
For a given Graph and a set of nodes NIdV the function returns the Volume, CutSize and the Conductance of the cut.
Definition at line 192 of file ncp.cpp.
{ TIntSet NIdSet(NIdV.Len()); for (int i = 0; i < NIdV.Len(); i++) { NIdSet.AddKey(NIdV[i]); } GetCutStat(Graph, NIdSet, Vol, Cut, Phi, GraphEdges); }
void TLocClust::GetCutStat | ( | const PUNGraph & | Graph, |
const TIntSet & | NIdSet, | ||
int & | Vol, | ||
int & | Cut, | ||
double & | Phi, | ||
int | GraphEdges = -1 |
||
) | [static] |
For a given Graph and a set of nodes NIdV the function returns the Volume, CutSize and the Conductance of the cut.
Definition at line 198 of file ncp.cpp.
{ const int Edges2 = GraphEdges>=0 ? 2*GraphEdges : Graph->GetEdges(); Vol=0; Cut=0; Phi=0.0; for (int i = 0; i < NIdSet.Len(); i++) { if (! Graph->IsNode(NIdSet[i])) { continue; } TUNGraph::TNodeI NI = Graph->GetNI(NIdSet[i]); for (int e = 0; e < NI.GetOutDeg(); e++) { if (! NIdSet.IsKey(NI.GetOutNId(e))) { Cut += 1; } } Vol += NI.GetOutDeg(); } // get conductance if (Vol != Edges2) { if (2*Vol > Edges2) { Phi = Cut / double(Edges2-Vol); } else if (Vol == 0) { Phi = 0.0; } else { Phi = Cut / double(Vol); } } else { if (Vol == Edges2) { Phi = 1.0; } } }
const TIntV& TLocClust::GetCutV | ( | ) | const [inline] |
int TLocClust::GetCutVol | ( | ) | const [inline] |
int TLocClust::GetNId | ( | const int & | NodeN | ) | const [inline] |
const TIntV& TLocClust::GetNIdV | ( | ) | const [inline] |
double TLocClust::GetPhi | ( | const int & | ValId | ) | const [inline] |
const TFltV& TLocClust::GetPhiV | ( | ) | const [inline] |
int TLocClust::GetRndWalkSup | ( | ) | const [inline] |
int TLocClust::GetVol | ( | const int & | Nodes | ) | const [inline] |
const TIntV& TLocClust::GetVolV | ( | ) | const [inline] |
int TLocClust::Len | ( | ) | const [inline] |
Returns the support of the approximate random walk, the number of nodes with non-zero PageRank score.
Definition at line 42 of file ncp.h.
{ return GetRndWalkSup(); }
void TLocClust::PlotCutDistr | ( | const TStr & | OutFNm, |
TStr | Desc = TStr() |
||
) | const |
Plots the cluster cut size vs. cluster size K (cluster is composed of nodes NIdV[1...K]).
Definition at line 113 of file ncp.cpp.
{ TFltPrV RankValV(CutV.Len(), 0); for (int i = 0; i < CutV.Len(); i++) { RankValV.Add(TFltPr(i+1, CutV[i].Val)); } if (Desc.Empty()) { Desc = OutFNm; } TFltPrV NewV; TLocClustStat::MakeExpBins(RankValV, NewV); TGnuPlot GP(OutFNm+"-CUT", TStr::Fmt("CUT SIZE. Seed node %d.", SeedNId, Desc.CStr())); GP.AddPlot(NewV, gpwLinesPoints, ""); //, "pointtype 6 pointsize 1.5" GP.SetXYLabel("Rank", "Cut size"); //GP.SetScale(gpsLog10X); GP.SavePng(); }
void TLocClust::PlotNCP | ( | const PUNGraph & | Graph, |
const TStr & | FNm, | ||
const TStr | Desc = "" , |
||
const bool & | BagOfWhiskers = true , |
||
const bool & | RewireNet = false , |
||
const int & | KMin = 10 , |
||
const int & | KMax = Mega(100) , |
||
const int & | Coverage = 10 , |
||
const bool & | SaveTxtStat = false , |
||
const bool & | PlotBoltzman = false |
||
) | [static] |
Plots the Network Community Profile (NCP) of a given graph Graph. The NCP plot of a network captures the global community structure of the network. The NCP plot of a network captures the global community structure of the network. Refer to 'Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters by J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Internet Mathematics 6(1) 29--123, 2009' for the explanation of how to read these plots. URL: http://arxiv.org/abs/0810.1355
Definition at line 219 of file ncp.cpp.
{ const double Alpha = 0.001, KFac = 1.5, SizeFrac = 0.001; //const int KMin = 2, KMax = Mega(100), Coverage = 10; TLocClustStat ClusStat1(Alpha, KMin, KMax, KFac, Coverage, SizeFrac); ClusStat1.Run(Graph, false, PlotBoltzman, SaveTxtStat); if (BagOfWhiskers) { ClusStat1.AddBagOfWhiskers(); } TLocClustStat ClusStat2(Alpha, KMin, KMax, KFac, Coverage, SizeFrac); ClusStat1.ImposeNCP(ClusStat2, FNm, Desc, "ORIGINAL", "REWIRED"); // plot before rewiring if (SaveTxtStat) { // for every piece size store modularity ClusStat1.SaveTxtInfo(FNm, Desc, false); } if (PlotBoltzman) { ClusStat1.PlotBoltzmanCurve(FNm, Desc); } if (RewireNet) { ClusStat2.Run(TSnap::GenRewire(Graph, 50, TInt::Rnd), false, false, false); if (BagOfWhiskers) { ClusStat2.AddBagOfWhiskers(); } ClusStat1.ImposeNCP(ClusStat2, FNm, Desc, "ORIGINAL", "REWIRED"); // if rewire, plot again } }
void TLocClust::PlotPhiDistr | ( | const TStr & | OutFNm, |
TStr | Desc = TStr() |
||
) | const |
Plots the cluster conductance vs. cluster size K (cluster is composed of nodes NIdV[1...K]).
Definition at line 126 of file ncp.cpp.
{ TFltPrV RankValV(PhiV.Len(), 0); for (int i = 0; i < PhiV.Len(); i++) { RankValV.Add(TFltPr(i+1, PhiV[i])); } if (Desc.Empty()) { Desc = OutFNm; } TFltPrV NewV; TLocClustStat::MakeExpBins(RankValV, NewV); TGnuPlot GP(OutFNm+"-PHI", TStr::Fmt("CONDUCTANCE (Cut size / Volume). Seed node %d.", SeedNId, Desc.CStr())); GP.AddPlot(NewV, gpwLinesPoints, ""); //, "pointtype 6 pointsize 1.5" GP.SetXYLabel("Rank", "Conductance (Cut size / Volume)"); //GP.SetScale(gpsLog10X); GP.SavePng(); }
void TLocClust::PlotVolDistr | ( | const TStr & | OutFNm, |
TStr | Desc = TStr() |
||
) | const |
Plots the cluster volume vs. cluster size K (cluster is composed of nodes NIdV[1...K]).
Definition at line 100 of file ncp.cpp.
{ TFltPrV RankValV(VolV.Len(), 0); for (int i = 0; i < VolV.Len(); i++) { RankValV.Add(TFltPr(i+1, VolV[i].Val)); } if (Desc.Empty()) { Desc = OutFNm; } TFltPrV NewV; TLocClustStat::MakeExpBins(RankValV, NewV); TGnuPlot GP(OutFNm+"-VOL", TStr::Fmt("VOLUME. Seed node %d.", SeedNId, Desc.CStr())); GP.AddPlot(NewV, gpwLinesPoints, ""); //, "pointtype 6 pointsize 1.5" GP.SetXYLabel("Rank", "Volume"); //GP.SetScale(gpsLog10X); GP.SavePng(); }
void TLocClust::SavePajek | ( | const TStr & | OutFNm | ) | const |
Saves the network in the Pajek format so it can be visualized. Red node represents the seed and color the cluster membership.
Definition at line 139 of file ncp.cpp.
{ FILE *F = fopen(TStr::Fmt("%s.net", OutFNm.CStr()).CStr(), "wt"); TIntH NIdToIdH(Graph->GetNodes(), true); TIntH ClustSet(BestCut()+1); const int BucketSz = BestCutNodes()/ 4; TStrV ClrV = TStrV::GetV("Black", "Gray80", "Gray60", "Gray40", "Gray20", "RedViolet"); for (int a = 0; a < BestCutNodes(); a++) { ClustSet.AddDat(NIdV[a], (a+1)/BucketSz); } fprintf(F, "*Vertices %d\n", Graph->GetNodes()); int i = 0; for (TUNGraph::TNodeI NI = Graph->BegNI(); NI < Graph->EndNI(); NI++) { const int NId = NI.GetId(); if (NId == SeedNId) { fprintf(F, "%d \"%d\" diamond x_fact 2 y_fact 2 ic Green fos 10\n", i+1, NId); } else if (ClustSet.IsKey(NId)) { //fprintf(F, "%d \"%d\" box x_fact 1 y_fact 1 ic Red fos 10\n", i+1, NId); } fprintf(F, "%d \"%d\" box x_fact 2 y_fact 2 ic %s fos 10\n", i+1, NId, ClrV[ClustSet.GetDat(NId)].CStr()); } else { fprintf(F, "%d \"%d\" ellipse x_fact 2 y_fact 2 ic Blue fos 10\n", i+1, NId); } NIdToIdH.AddDat(NId, i+1); i++; } fprintf(F, "*Arcs %d\n", Graph->GetEdges()); // arcs are directed, edges are undirected for (TUNGraph::TNodeI NI = Graph->BegNI(); NI < Graph->EndNI(); NI++) { const int NId = NIdToIdH.GetDat(NI.GetId()); for (int e = 0; e < NI.GetOutDeg(); e++) { fprintf(F, "%d %d %g c Tan\n", NId, NIdToIdH.GetDat(NI.GetOutNId(e)).Val, 1.0); } } fclose(F); }
void TLocClust::SupportSweep | ( | ) |
After the function ApproxPageRank() has been run the SupportSweep() computes the volume, cut size, node ids, conductance vectors.
Definition at line 52 of file ncp.cpp.
{ TExeTm ExeTm; VolV.Clr(false); CutV.Clr(false); PhiV.Clr(false); if (ProbH.Empty()) { return; } const int TopNId = ProbH.GetKey(0); const int TopNIdDeg = Graph->GetNI(TopNId).GetOutDeg(); int Vol = TopNIdDeg, Cut = TopNIdDeg; double Phi = Cut/double(Vol); VolV.Add(Vol); CutV.Add(Cut); PhiV.Add(1.0); for (int i = 1; i < ProbH.Len(); i++) { const int NId = ProbH.GetKey(i); const TUNGraph::TNodeI& Node = Graph->GetNI(NId); const int OutDeg = Node.GetOutDeg(); int CutSz = OutDeg; // edges outside for (int e = 0; e < OutDeg; e++) { const int Rank = ProbH.GetKeyId(Node.GetOutNId(e)); if ( Rank > -1 && Rank < i) { CutSz -= 2; } } Vol += OutDeg; Cut += CutSz; if (Vol < Edges2) { if (2*Vol > Edges2) { Phi = Cut / double(Edges2-Vol); } else { Phi = Cut / double(Vol); } } else { Phi = 1.0; } IAssert((Phi+1e-6) >= double(1.0)/double(i*(i+1)+1)); // conductance is worse than the best possible VolV.Add(Vol); CutV.Add(Cut); PhiV.Add(Phi); }}
friend class TLocClustStat [friend] |
double TLocClust::Alpha [private] |
int TLocClust::BestCutIdx [private] |
TIntV TLocClust::CutV [private] |
int TLocClust::Edges2 [private] |
PUNGraph TLocClust::Graph [private] |
TIntV TLocClust::NIdV [private] |
TIntQ TLocClust::NodeQ [private] |
int TLocClust::Nodes [private] |
TFltV TLocClust::PhiV [private] |
TIntFltH TLocClust::ProbH [private] |
TIntFltH TLocClust::ResH [private] |
int TLocClust::SeedNId [private] |
bool TLocClust::Verbose = true [static] |
TIntV TLocClust::VolV [private] |