Clusters

From eotcd

Clustering the EOTCD

LinLog Results

Here are our cluster results from applying LinLog clustering as discussed in the paper Energy Models for Graph Clustering (http://jgaa.info/accepted/2007/Noack2007.11.2.pdf). This work developed two energy models, node-repulsion LinLog and edge-repulsion LinLog, which cluster according to the density of the cut and the normalized cut, respectively. The clustering method is briefly discussed in the LinLogLayout source code in OptimizerModularity.java:

 * The Modularity measure is generalized to arbitrary node weights;
*   it is recommended to set the weight of each node to its degree,
*   i.e. the total weight of its edges, as Newman did.
* For more information on the (used version of the) Modularity measure, see
*   M. E. J. Newman: "Analysis of weighted networks", 
*   Physical Review E 70, 056131, 2004.
* For the relation of Modularity to the LinLog energy model, see
*   Andreas Noack: "Energy Models for Graph Clustering",
*   Journal of Graph Algorithms and Applications 11(2):453-480, 2007.
* Freely available at http://jgaa.info/

In this first set of results we have the clusters from running the LinLog algorithm on the edges whose source and target are both in our EOTCD collection. The weight on each edge is the ratio of outlinks from the source to that specific target over all outlinks from the source. This run produces 20 clusters.

Linlog clusters ratio weights.txt
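The ratio weighting described above can be sketched as follows. The link counts here are made up for illustration and are not taken from the EOTCD graph:

```python
from collections import defaultdict

# Hypothetical raw counts: (source, target) -> links observed in the crawl.
link_counts = {
    ("nutrition.gov", "usda.gov"): 67,
    ("nutrition.gov", "whitehouse.gov"): 33,
    ("usda.gov", "whitehouse.gov"): 10,
    ("usda.gov", "nutrition.gov"): 20,
}

# Total outlinks per source domain.
totals = defaultdict(int)
for (src, _), n in link_counts.items():
    totals[src] += n

# Edge weight = outlinks to this target / all outlinks from the source.
ratio_weights = {
    (src, dst): n / totals[src] for (src, dst), n in link_counts.items()
}
```

The weights for each source sum to 1, so a weight can be read directly as the share of a domain's outlinks pointing at a given target.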

In this second set of results we again run the LinLog algorithm on the edges whose source and target are both in our EOTCD collection, but here the weight on each edge is the actual number of occurrences of a link between source and target. This run produces 18 clusters.

Linlog clusters not ratio weights.txt

Using the LinLog method, some of the resulting clusters are larger than expected; we would have liked to see more clusters break out of these large groups. We end up with fewer than half the number of clusters we hoped for based on the number of top-level government author agencies.

LinLog with Author Agency Domains Marked
LinLog with Author Agency Domains Marked (same as above, but with only authors labeled)

Linlog Coordinates With Agglomerative Hierarchical Clustering

Here we make use of LinLog layout's force-directed layout techniques for weighted graphs to map our web graph into Euclidean space. We then determine clusters using the agglomerative hierarchical clustering algorithm with Euclidean distance. Since most popular clustering algorithms use Euclidean distance as their distance measure, this lets us create clusters based on distance in a geometric space. Linial, London, and Rabinovich's The Geometry of Graphs and Some of its Algorithmic Applications states "if the metric is non-Euclidean, reliable clustering is a notoriously difficult problem" (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.54.190&rep=rep1&type=pdf, p. 10).

We find that clustering in geometric space can be problematic when the graph is highly linked and its density varies widely throughout. Laying out such a graph yields varied shapes and distances from what we would like to see as our centroids. In the EOTCD data, trying to achieve clusters that each represent a single author agency is difficult because agencies differ widely in size, in the number and size of their sub-agencies, and in how much they publish.

Running the code to generate clusters, we specify a desired 60 clusters; once those are determined, any cluster made up of a single node is merged into its nearest cluster, leaving us with 59 clusters. When writing out the file below, we list each cluster's members under the member node with the highest total link degree, not under a physically central node.

Clusters_linlog_agglom_euclid.txt‎
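The layout-then-cluster step can be sketched in a few lines. This is a minimal single-linkage agglomerative loop over made-up 2-D layout coordinates, not the actual LinLog output or the linkage variant used for the files above:

```python
import math

# Hypothetical 2-D layout positions for six domains: two tight groups.
coords = {
    "a.gov": (0.0, 0.0), "b.gov": (0.1, 0.1), "c.gov": (0.2, 0.0),
    "x.gov": (5.0, 5.0), "y.gov": (5.1, 5.2), "z.gov": (5.0, 4.9),
}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Start with each node in its own cluster, then repeatedly merge the two
# closest clusters (single linkage) until the target count is reached.
clusters = [{n} for n in coords]
target = 2
while len(clusters) > target:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(dist(coords[u], coords[v])
                    for u in clusters[i] for v in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    clusters[i] |= clusters.pop(j)
```

With the coordinates above, the loop recovers the two geometric groups; in practice the merge criterion (single, complete, or average linkage) changes how elongated clusters behave.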

This method depends on the LinLog layout to produce its clusters. These layouts, however, differ each time they are produced from the same data set because the initial positioning of nodes is random. Here are two more cluster sets based on a LinLog layout calculated with these parameter values:

 new MinimizerBarnesHut(nodes, edges, -0.2, 1.2, 0.1).minimizeEnergy(nodeToPosition, 450);

After running this clustering method several times, we decided not to force single nodes into their nearest clusters, as there is a good chance they would end up in the wrong place.

Here are the groupings with a limit of 55 clusters set.

Clusters_linlog_agglom_euclid_55.txt‎

Here are the groupings with a limit of 75 clusters set.

Clusters_linlog_agglom_euclid_75.txt‎

This is perhaps our most successful clustering method.

Normalized Google Distance Results

In our further attempts at clustering the EOTCD we have leveraged the normalized Google distance measure discussed in The Google Similarity Distance (http://eprints.pascal-network.org/archive/00002784/01/tkde06.pdf). While this is actually a semantic similarity measure, we have found that it translates well to our study of link analysis. The measure is typically applied to terms used in a Google search: the distance between two terms is smaller when the terms often appear on the same page. If the terms appear on separate pages but never on the same page, then their normalized Google distance is infinite.

Ngd formula.png

In our application of this formula, we measure the distance between government domains based on the similarity of their outlinks.

In our utilization of the formula we have the following:

x and y are domains

M is the total number of domains in the graph

f(x) and f(y) are, respectively, the number of outlinks from x and y

f(x, y) is the number of domains to which both x and y link
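The formula pictured above (from Cilibrasi and Vitanyi) is NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log M - min{log f(x), log f(y)}). A minimal sketch of the calculation under the definitions above (the function name and counts are ours):

```python
import math

def ngd(f_x, f_y, f_xy, M):
    """Normalized Google distance between domains x and y.

    f_x, f_y: outlink counts of x and y; f_xy: number of domains both
    link to; M: total number of domains in the graph.
    """
    if f_xy == 0:
        return math.inf  # disjoint outlink sets: infinite distance
    num = max(math.log(f_x), math.log(f_y)) - math.log(f_xy)
    den = math.log(M) - min(math.log(f_x), math.log(f_y))
    return num / den
```

When the two outlink sets coincide, the numerator is zero and the distance is 0; as the overlap shrinks, the distance grows, reaching infinity when there is no overlap at all.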

To experiment, we begin with a web graph of edges between government domains that are present in our EOTCD collection. Because Google distance looks only at the co-occurrence of two terms and not the number of times they occur together on a single page, we must set a threshold for which links to include in our calculations. Thus, we consider only edges where the ratio of outlinks from a source to a specific target over all outlinks from that source is at least 1%. We assume that links less common than that are incidental and should not be counted as an endorsement of one domain by another.

Using this graph, we calculate the normalized Google distance between domains based on the intersection of their sets of outlinks. Nodes without outlinks at a ratio of 1% or greater, and pairs of nodes whose outlink sets have no intersection, cannot be grouped this way (currently we leave these ungrouped until we improve our methods), though this affects only 3% of the domains. Once initial distances are computed, each domain is clustered with its nearest neighbor. We then repeatedly combine clusters based on their scores until the groupings stop changing.
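The nearest-neighbor grouping step described above might look like the following. The distance table is made up for illustration, with math.inf standing in for pairs whose outlink sets have no intersection:

```python
import math

# Hypothetical pairwise NGD values between four domains.
distances = {
    ("a.gov", "b.gov"): 0.1, ("a.gov", "c.gov"): 0.9,
    ("b.gov", "c.gov"): 0.8, ("a.gov", "d.gov"): math.inf,
    ("b.gov", "d.gov"): math.inf, ("c.gov", "d.gov"): math.inf,
}

def d(x, y):
    """Symmetric lookup; unknown pairs are treated as infinitely far."""
    return distances.get((x, y), distances.get((y, x), math.inf))

domains = ["a.gov", "b.gov", "c.gov", "d.gov"]
nearest = {}
for x in domains:
    finite = {y: d(x, y) for y in domains
              if y != x and math.isfinite(d(x, y))}
    if finite:  # domains at infinite distance from everything stay ungrouped
        nearest[x] = min(finite, key=finite.get)
```

Here d.gov, whose outlink set intersects nothing, is left out of the grouping, mirroring the ~3% of domains we currently do not cluster.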

The results of the first experiment are given here; further work on better utilizing NGD for clustering is underway.

ngd_clusters_1.txt


Strongest Outlinks and Majority Inlinks

In this method, our starting point is our weighted web graph, where the weights are the ratio of a source's outlinks to a target over its total outlinks, and links with weights below 1% are excluded. Clusters are initialized by assigning each node to the cluster whose centroid is the target to which the node is most strongly outlinked. For example, nutrition.gov links to usda.gov 67% of the time, so it is put into a cluster with centroid usda.gov; usda.gov, while itself a centroid, links most strongly to whitehouse.gov at 33%, so it joins whitehouse.gov's cluster (while remaining a centroid). We then determine what to merge by iterating over the clusters' centroids until no changes occur. For each cluster, we look at the inlinks of the centroid and the inlinks of the node to which it most strongly links (the cluster it would belong to if it were not itself a centroid), which we call the parent. If more of the centroid's inlinks name the parent as their most strongly linked target (and are thus in the parent's cluster), the centroid's cluster is merged with its parent's cluster.

For nodes without any outlinks weighted above 1%, we fall back on initializing them to clusters based on their strongest inlink. When there is a tie for a node's most strongly outlinked target, the tie goes to whichever competitor shares the greatest percentage of outlinks with the node being assigned.
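The initialization step can be sketched as follows, reusing the same hypothetical ratio weights as in the nutrition.gov example (tie-breaking by shared outlinks is omitted here for brevity):

```python
# Hypothetical (source, target) -> share of source's outlinks.
ratio_weights = {
    ("nutrition.gov", "usda.gov"): 0.67,
    ("nutrition.gov", "whitehouse.gov"): 0.33,
    ("usda.gov", "whitehouse.gov"): 0.33,
    ("usda.gov", "nutrition.gov"): 0.20,
}

# Initialize: each node joins the cluster of its most strongly
# outlinked target, ignoring edges below the 1% threshold.
assignment = {}
sources = {src for src, _ in ratio_weights}
for src in sources:
    candidates = {dst: w for (s, dst), w in ratio_weights.items()
                  if s == src and w >= 0.01}
    if candidates:
        assignment[src] = max(candidates, key=candidates.get)
```

This reproduces the example above: nutrition.gov is assigned to usda.gov's cluster, while usda.gov itself is assigned to whitehouse.gov's.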

Here are the initial clusters: Initial_clusters_strongest_outlink.txt

By initializing with strongest-outlink clusters in this way, we have unfortunately already eliminated these author agencies as centroids:

 nlrb.gov - National Labor Relations Board
 opic.gov - Overseas Private Investment Corporation
 nmb.gov - National Mediation Board
 rrb.gov - U.S. Railroad Retirement Board
 cfa.gov - U.S. Commission of Fine Arts
 fmc.gov - Federal Maritime Commission
 fca.gov - Farm Credit Administration
 fhfa.gov - Federal Housing Finance Agency
 mspb.gov - Merit Systems Protection Board
 nea.gov - National Endowment for the Arts--this is a weird case because the National Foundation
   on the Arts and the Humanities is separately represented on the web by nea.gov and neh.gov. 
   neh.gov is initialized to a cluster.
 ncpc.gov - National Capital Planning Commission
 fmcs.gov - Federal Mediation and Conciliation Service
 usccr.gov - U.S. Commission on Civil Rights

The above agencies tend to be relatively small independent agencies with very specific missions. The dearth of sub-agencies, and of the links they would bring, makes it difficult for these to be recognized as independent agencies using the methods here.

The following author agencies' domains lack .gov or .mil, so they were out of scope from the beginning of our work.

 si.edu - Smithsonian Institution
 usps.com - United States Postal Service

Unclusterable Domains

These sites have no outlink data available for the following reasons:

 cars.gov - home page only/not crawled
 contractdirectory.gov - just a search interface for database/no outlinks
 dodig.mil - home page only/not crawled
 dodvclips.mil - no html
 eftps.gov - home page only/not crawled
 employeeexpress.gov - not crawled
 esgr.mil - home page only/not crawled
 firstresponder.gov - home page only/not crawled
 freedomaward.mil - home page only/not crawled
 gpc.gov - domain just points to site at cfoc.gov
 helpcommission.gov - no outlinks on site (only internal)
 jfmip.gov - domain just points to fsio.gov
 mtbs.gov - home page only/not crawled
 peacecorp.gov - domain points to peacecorps.gov
 sam.gov - requires login/not crawled
 smartcard.gov - not crawled

Because we have no outlink data for these domains, they are removed from the cluster calculations and counted/listed as not clustered.

This method leaves us with 139 clusters, but the clusters appear well related.

Who thinks this method is too reliant on heuristics?

Clusters outlinks lauren webgraph.txt

Web Communities

Related to the paper Efficient Identification of Web Communities by Flake, Lawrence, and Giles.

Once again, our starting point is our weighted web graph, where the weights are the ratio of a source's outlinks to a target over its total outlinks. The web graph excludes links with weights below 1%, which lowers the number of nodes being clustered from 1133 to 1079. Clusters are initialized by assigning each node to the cluster whose centroid is the target to which the node is most strongly outlinked. We then examine whether a centroid has more ties (unweighted outlinks and inlinks) with the members of its own cluster or with the members of the potential parent cluster, where the parent cluster is the one to which the centroid is most strongly linked. Running this produces 122 good clusters.

Clusters_communities_centroid_based.txt
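The tie comparison driving the merge decision can be sketched like this, over a toy unweighted edge set of our own invention:

```python
# Hypothetical directed edges; both directions count as "ties".
edges = {("a", "b"), ("b", "a"), ("a", "c"),
         ("d", "a"), ("a", "p"), ("p", "q")}

def ties(node, members):
    """Count unweighted in- and outlinks between node and a member set."""
    return sum(1 for (s, t) in edges
               if (s == node and t in members) or (t == node and s in members))

own_cluster = {"b", "c", "d"}       # members of centroid a's cluster
parent_cluster = {"p", "q"}         # members of the potential parent cluster

# Merge a's cluster into the parent only if a has more ties there.
merge = ties("a", parent_cluster) > ties("a", own_cluster)
```

In this toy case the centroid has four ties inside its own cluster and only one with the parent, so no merge occurs.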

Visualization of clusters

Strongest Outlinks/Majority Inlinks clusters by color
Linlog assigned clusters by color
Clusters with Agglomerative hierarchical method and Euclidean distance
Clusters when SuDoc authors are supplied


Full Subdomain Results

Clustering of full subdomains derived using the agglomerative hierarchical method and Euclidean distance.

75 clusters:

Full_subdomain_75_cluster_output.txt‎

Full_subdomain_clusters_75.csv

250 clusters:

250_cluster_output.txt‎

Full_subdomain_clusters_250.csv

500 clusters:

500_cluster_output.txt‎

Full_subdomain_clusters_500.csv