The Trajectory Cluster Analysis window contains the series of tasks needed to run a trajectory cluster analysis. This differs from most, if not all, other HYSPLIT GUI windows, which run only one program or do one task. Given a set of trajectories beginning at one location, the cluster analysis objectively groups them into subsets, called clusters, that each differ from the other subsets. The program will usually produce at least one possible outcome set of clusters. If more than one outcome is given, the user must then subjectively choose one for the final result.
Step 1. Inputs.
Run ID - A label to identify each run. The label ends at the first blank space. The other numeric inputs may be part of the label. For instance, if trajectories during 2004 from Ohio were clustered, the Run_ID could be Ohio_2004. If you used 48-h trajectories, hourly endpoints, and every other trajectory, a label of Ohio_2004_48_1_2 could be used. If you later decided to use only the first 36 h of the trajectories, Ohio_2004_36_1_2 might be used.
Hours to cluster - Trajectory durations up to the given hour are used. Must be a positive number. Time is measured from the trajectory origin, whether the trajectory runs backward or forward. Trajectories terminating before the given hour will not be included in the clustering. Premature terminations commonly result from missing meteorology data or from the trajectory reaching the horizontal or top edge of the meteorological grid.
Time interval - Identifies which endpoints along a trajectory to use. Typically every hourly endpoint is used.
Trajectory skip - Identifies which trajectories in a folder to use. A value of 1 means every trajectory will be used; 2 means every other trajectory; 5 every 5th trajectory, etc. Useful with very large sets of trajectories.
Endpoints parent folder - With the default endpoints parent folder of ../endpts, the trajectory files would be in the /hysplit4/endpts/Run_ID folder.
Archive parent folder - With the default of ./cluster_output, the archive folder would be /hysplit4/working/cluster_output/Run_ID.
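The Step 1 inputs can be illustrated with a minimal Python sketch. It is not the HYSPLIT code; the function names are hypothetical, endpoint times are assumed to be whole hours from the origin, and leaving the origin itself (hour 0) out of the selected endpoints is an assumption.

```python
def run_folders(label, endpoints_parent="../endpts", archive_parent="./cluster_output"):
    """Return (Run_ID, endpoints folder, archive folder) for a label."""
    run_id = label.split(" ", 1)[0]          # the label ends at the first blank space
    return run_id, f"{endpoints_parent}/{run_id}", f"{archive_parent}/{run_id}"


def select_trajectories(filenames, skip):
    """Keep every skip-th trajectory file; skip=1 keeps all of them."""
    if skip < 1:
        raise ValueError("skip must be at least 1")
    return filenames[::skip]


def select_endpoints(hours, hours_to_cluster, interval):
    """Return the endpoint hours used for clustering, or None when the
    trajectory terminates before hours_to_cluster.

    hours holds the time of each endpoint in hours from the origin,
    counted as positive for both forward and backward trajectories.
    """
    if hours_to_cluster <= 0:
        raise ValueError("hours_to_cluster must be positive")
    if not hours or max(hours) < hours_to_cluster:
        return None                          # terminated early: not clustered
    # Assumption: hour 0 (the shared origin) is not used as an endpoint.
    return [h for h in hours if 0 < h <= hours_to_cluster and h % interval == 0]
```

For example, with 48-h trajectories, 36 hours to cluster, and a time interval of 6, this sketch would keep the endpoints at hours 6, 12, 18, 24, 30, and 36.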
Step 2. Run Cluster Program. Possible solutions to the cluster analysis are available at the end of this step.
Rename web tdumps. Trajectories created on the web with autotraj have filenames of the format tdump.[Julian Day]. These can be used as is, or this rename script can be run to rename the files to the format used by the HYSPLIT PC Run Daily option, with the year-month-day-hour in the filename.
Make INFILE. Trajectories must have been run previously, such as via TRAJECTORY / Special Simulations / Run Daily. All the trajectory endpoints files need to be in one folder (directory) and each must have the name tdump within its filename. In this step, a file, INFILE, listing all the tdump files will be created in ../working.
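As an illustration of what this step does, the following Python sketch lists every file in a folder whose name contains tdump and writes the list to a file. The one-path-per-line layout, the sorted order, and the function name are assumptions, not the documented INFILE format.

```python
from pathlib import Path


def make_infile(endpoints_dir, infile_path):
    """Write the path of every tdump file in endpoints_dir to infile_path.

    Assumption: one path per line, in sorted order.
    Returns the list of paths written.
    """
    names = sorted(
        str(p)
        for p in Path(endpoints_dir).iterdir()
        if p.is_file() and "tdump" in p.name
    )
    Path(infile_path).write_text("".join(name + "\n" for name in names))
    return names
```

A call such as make_infile("../endpts/Ohio_2004", "../working/INFILE") would mirror the default folder layout described in Step 1.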
Run Cluster. The cluster analysis program is run using the INFILE file, the number of hours from the beginning of the trajectories on which to base the clustering, the time interval along the trajectory at which to read the endpoints information, and the trajectories selected by the "skip" parameter. Typical values are 36 hours to cluster and hourly endpoints. For very long trajectories, a longer time interval may give adequate results.
The cluster process is as follows: Initially, each trajectory is defined to be a cluster. There are N trajectories and N clusters. The cluster process is an iterative process consisting of N-1 passes through all the clusters. In the 1st pass, the two closest clusters (trajectories) are paired, resulting in N-1 clusters. Similarly in the 2nd pass, the two closest clusters are paired, resulting in N-2 clusters. In this case, either the cluster having two trajectories could be paired with another trajectory or two clusters, each with one trajectory, could be paired. This process continues until all trajectories are in 1 cluster. The cluster program produces six output files:
CLUSTER trajectory start date/time and endpoints (tdump) filename for all the trajectories in INFILE; then for each pass, a listing of the trajectories in each cluster.
DELPCT the change in total spatial variance of all the clusters from one pass to the next.
TCLUS the filenames of the endpoints files clustered
TNOCLUS the filenames of the endpoints files not clustered (i.e., trajectories that terminate before the hours to cluster duration given in Step 1).
CLUSTERno the filename and trajectory start date/time of trajectories, if any, not clustered; used to create cluster results (CLUSLIST)
CMESSAGE diagnostics output file
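The pairing process described above can be sketched in a few lines of Python. This is a minimal illustration, not the HYSPLIT program: it assumes that the "closest" pair is the one whose pairing adds the least to the total spatial variance, it treats endpoints as planar (x, y) positions, and the function names are hypothetical.

```python
def spatial_variance(cluster):
    """Sum of squared distances between each trajectory's endpoints and the
    cluster-mean trajectory. cluster is a list of trajectories, each a
    list of (x, y) endpoints of equal length."""
    n = len(cluster)
    total = 0.0
    for k in range(len(cluster[0])):
        mean_x = sum(t[k][0] for t in cluster) / n
        mean_y = sum(t[k][1] for t in cluster) / n
        total += sum((t[k][0] - mean_x) ** 2 + (t[k][1] - mean_y) ** 2
                     for t in cluster)
    return total


def cluster_passes(trajectories):
    """Run the N-1 pairing passes and return a list of
    (number of clusters after the pass, total spatial variance)."""
    clusters = [[t] for t in trajectories]   # initially N trajectories, N clusters
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Increase in spatial variance if clusters i and j were paired
                cost = (spatial_variance(clusters[i] + clusters[j])
                        - spatial_variance(clusters[i])
                        - spatial_variance(clusters[j]))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((len(clusters), sum(spatial_variance(c) for c in clusters)))
    return history
```

The search over all pairs in every pass is the reason run time and memory grow quickly with the number of trajectories in this sketch.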
On a typical PC, a cluster run with 365 trajectories, 36-h duration, and every hourly endpoint will take a couple of minutes. Going beyond several years of trajectories will result in a run that takes a long time and/or uses a large amount of memory. A warning message is given for larger runs, but there is no simple way to identify how large a cluster job is possible.
Display plot shows the percent change in total spatial variance (TSV) for the final 30 passes through the cluster program. The data come from the file DELPCT. Generally, at least one pass shows a large increase in the total spatial variance, indicating that different, rather than similar, clusters are being paired and that the cluster process should stop before that pass.
View possible final number of clusters. Typically a pairing of "different" clusters is indicated by a 30% change in the percent change in total spatial variance (see Step 2, Display plot). Run lists the possible final cluster numbers. If the 30% criterion does not identify any, the 20% criterion may be chosen.
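One possible reading of the 30% and 20% criteria is sketched below in Python. It is an assumption, not the HYSPLIT implementation: the "change in the percent change" is taken as the difference, in percentage points, between the percent change in TSV of one pass and that of the pass before it, and the function names and input layout are hypothetical.

```python
def candidate_cluster_counts(pct_change, threshold=30.0):
    """Return possible final numbers of clusters.

    pct_change maps the number of clusters remaining after a pass to the
    percent change in TSV for that pass, for consecutive passes. A pass
    whose percent change rises by at least threshold over the previous
    pass is taken to pair "different" clusters, so the number of
    clusters just before that pass is a candidate.
    """
    counts = sorted(pct_change, reverse=True)      # order in which passes occur
    candidates = []
    for before, after in zip(counts, counts[1:]):
        jump = pct_change[after] - pct_change[before]
        if jump >= threshold:
            candidates.append(before)              # stop before this pairing
    return candidates


def possible_final_numbers(pct_change):
    """Apply the 30% criterion, falling back to 20% if it finds nothing."""
    return (candidate_cluster_counts(pct_change, 30.0)
            or candidate_cluster_counts(pct_change, 20.0))
```

With hypothetical values {6: 2.0, 5: 3.0, 4: 5.0, 3: 40.0, 2: 45.0, 1: 120.0}, this sketch would suggest 4 or 2 as the final number of clusters.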
Step 3. Get Results. This step may be repeated using different numbers of clusters. If you exit the GUI but have not archived your results, enter the Run_ID and the Archive parent folder again from Step 1, then continue with Step 3. If you have already archived the results but want to try a different number of clusters, manually copy everything from the archive directory to /hysplit4/working, then enter the number of clusters, etc.
Number of clusters - Enter the final number of clusters, one of the values listed in Step 2, Run. In general, use the value where the plot from Step 2, Display plot shows a sharp upward turn.
Text creates a text file listing the trajectory start date/times and filenames in each cluster (CLUSLIST_N, where N is the final number of clusters). Note that cluster #0 holds the trajectories not clustered.
Display Means produces one map with the mean trajectories of all the clusters for the given final number of clusters.
Cluster produces one map for each cluster, showing the trajectories in each cluster.
Trajectories not used are those input to the cluster program, i.e. in the endpoints directory, and at the given skip interval, that terminate before the trajectory duration equal to the Step 1, Hours to cluster. This displays the plot showing the trajectories not used and the cluster-mean trajectories for the trajectories not used (cluster #0) and all the other clusters. Note the plot showing the trajectories must have been previously created in Step 3, Display Clusters.
Archive. All files are moved, not copied, to the given directory. Files created in Step 3 contain the final number of clusters in the filename so Archive can be run once after all analysis is complete.
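The difference between moving and copying can be shown with a short Python sketch. Which files the GUI actually archives is not stated here, so the list of files is left to the caller; the function name is hypothetical.

```python
import shutil
from pathlib import Path


def archive_results(files, archive_parent, run_id):
    """Move (not copy) the given files into archive_parent/run_id."""
    dest = Path(archive_parent) / run_id
    dest.mkdir(parents=True, exist_ok=True)
    for f in files:
        # After the move, the file no longer exists at its original location.
        shutil.move(str(f), str(dest / Path(f).name))
    return dest
```

Because the originals are removed from the working folder, trying a different number of clusters afterward requires copying the files back, as described in Step 3.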