Sandalphon Tutorial (Criminal Incidents)

Sandalphon is simMachines’ clustering algorithm; it classifies data by grouping similar objects found in that data. Beyond classification, Sandalphon’s clustering makes it easy to create visualizations of data groups on the simMachines platform.

  • Tutorial Description

    This tutorial will demonstrate how to use simMachines’ Sandalphon Angel for cluster identification and visualization on the simMachines platform. If you have a simMachines account and wish to follow the tutorial step-by-step, the data file can be found at this link.

    We are going to use Sandalphon to identify clusters of similar crimes in a city. To identify patterns of similar criminal acts, Sandalphon will examine the location where a crime occurred, the type of crime, and what weapon (if any) was used. We will also use Sandalphon to create visualizations of the results. This knowledge could then be used to guide the distribution of police resources or influence other policy responses.

  • Getting Started

    Create a new folder for this tutorial and upload the data. If you are unfamiliar with this process, please see the Platform Navigation Tutorial.

    From the “Create Angel” window select Sandalphon, then choose the folder in which the data is saved.

    Once the folder is selected, you will see the first few rows of the data. Directly above the data are the column headers and a column type selection dropdown.

  • File Specifications

    Next we tell Sandalphon about the structure of our data. This ensures that similarities are identified in the correct way. Sandalphon accepts the following specifications:

    ID              A field which uniquely identifies each object.
    REAL            Numerical values.
    NOMINAL         Values that do not bear a quantitative relationship with each other (i.e., strings and numbers which represent non-numerical information).
    MULTI_PLAIN     Multiple NOMINAL values separated by spaces. Not language-specific.
    MULTI_ENGLISH   Multiple NOMINAL values separated by spaces. The text is in English.
    MULTI_SPANISH   Multiple NOMINAL values separated by spaces. The text is in Spanish.
    MULTI_JAPANESE  Multiple NOMINAL values separated by spaces. The text is in Japanese.
    ITEM_SET        A series of values with weights. (Formatted as item1:weight1;item2:weight2;item3:weight3)
    IGNORE          The column shall be ignored by the program.
    META            This column is for metadata and shall be ignored by the program, but its information will be retained in the output.
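The ITEM_SET format described above can be illustrated with a short Python sketch. This is not part of the simMachines platform; the function names `parse_item_set` and `format_item_set` are hypothetical, and the sketch assumes that weights are plain decimal numbers and that item names contain no semicolons.

```python
def parse_item_set(field: str) -> dict[str, float]:
    """Parse 'item1:weight1;item2:weight2' into {item: weight}."""
    items: dict[str, float] = {}
    for pair in field.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        # Split on the last ':' so that item names may themselves contain ':'.
        name, sep, weight = pair.rpartition(":")
        if not sep or not name:
            raise ValueError(f"expected item:weight, got {pair!r}")
        items[name] = float(weight)  # assumption: weights are numeric
    return items


def format_item_set(items: dict[str, float]) -> str:
    """Write {item: weight} back out in the item:weight;item:weight form."""
    return ";".join(f"{name}:{weight:g}" for name, weight in items.items())


# Illustrative values only; the tutorial data set has no ITEM_SET column.
example = parse_item_set("gun:0.7;knife:0.2;others:0.1")
print(example)
print(format_item_set(example))
```

A check like this can be useful before upload, since a malformed pair would otherwise only surface once the Angel reads the file.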

    Giving Sandalphon the correct specifications to work with is critical for ensuring that you get ideal results. Let’s look at a sample of the data and use it to assign the correct specifications.


    CCN       REPORTDATETIME         OFFENSE                     METHOD  ANC  DISTRICT  NEIGHBORHOODCLUSTER  BLOCK_GROUP
    05079932  8/28/2012 12:00:00 AM  SEX ABUSE                   OTHERS  1A   FOURTH    2                    003200 1
    07156111  9/18/2012 12:00:00 AM  HOMICIDE                    GUN     8A   SEVENTH   28                   007504 1
    08075756  9/21/2012 12:00:00 AM  HOMICIDE                    GUN     8A   SEVENTH   28                   007503 1
    08254628  10/6/2012 12:00:00 AM  SEX ABUSE                   OTHERS  2A   SECOND    5                    005600 3
    09074624  4/25/2012 12:00:00 AM  SEX ABUSE                   OTHERS  6C   FIRST     25                   010600 2
    10123633  2/29/2012 12:00:00 AM  SEX ABUSE                   OTHERS  6C   FIFTH     25                   010600 1
    10146732  6/8/2012 12:00:00 AM   SEX ABUSE                   OTHERS  3C   SECOND    15                   000501 2
    11102619  5/14/2012 12:00:00 AM  HOMICIDE                    GUN     7F   SIXTH     32                   007703 3
    11141272  6/25/2012 12:00:00 AM  HOMICIDE                    OTHERS  8B   SEVENTH   36                   007502 2
    11142230  8/23/2012 12:00:00 AM  SEX ABUSE                   OTHERS  7F   SIXTH     32                   007703 1
    11158196  1/5/2012 12:00:00 AM   HOMICIDE                    OTHERS  7D   SIXTH     29                   009601 1
    11190860  1/1/2012 12:00:00 AM   HOMICIDE                    OTHERS  7D   SIXTH     30                   007803 1
    12000005  1/1/2012 12:10:00 AM   THEFT F/AUTO                OTHERS  1A   THIRD     2                    003200 3
    12000041  1/1/2012 12:58:00 AM   ROBBERY                     KNIFE   5D   FIFTH     23                   008803 1
    12000056  1/1/2012 12:20:00 AM   ASSAULT W/DANGEROUS WEAPON  OTHERS  6D   FIRST     27                   007200 1

    This file contains data about criminal incidents reported in Washington, DC over a two-year period. The table above shows the first fifteen lines of the file.

    Each row of this file corresponds to a criminal incident. Each entry consists of 8 columns:

    1. CCN Criminal Complaint Number. It is a unique numerical code assigned to the event. As we are creating a visualization we do not need to assign an ID field, so we will set CCN to IGNORE.
    2. REPORTDATETIME The date and time. As we are visualizing the location, method, and type of crime, we will IGNORE this field.
    3. OFFENSE The type of crime which was committed. This is a single string (text) value which represents information of a classification, so we will label it NOMINAL.
    4. METHOD What type of weapon (if any) was used in the crime. Like OFFENSE, this is NOMINAL.
    5. ANC The Advisory Neighborhood Commission zone in which the crime occurred. We will use DISTRICT as our means of grouping crimes by their location, but this would be an acceptable alternative. IGNORE.
    6. DISTRICT The district in which the crime occurred. This is a single string (text) value which represents information of a classification, so we will label it NOMINAL.
    7. NEIGHBORHOODCLUSTER Another method of classifying the location of the crime. IGNORE.
    8. BLOCK_GROUP Another method of classifying the location of the crime. IGNORE.
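The column assignments above can be written down as a simple mapping, which is a convenient way to record the choices made in the dropdowns. The Python sketch below is only an illustration and does not call any simMachines API; `COLUMN_SPECS` and `used_fields` are hypothetical names, and treating ID columns as not compared is an assumption.

```python
# Specification chosen for each column in this tutorial.
COLUMN_SPECS = {
    "CCN": "IGNORE",
    "REPORTDATETIME": "IGNORE",
    "OFFENSE": "NOMINAL",
    "METHOD": "NOMINAL",
    "ANC": "IGNORE",
    "DISTRICT": "NOMINAL",
    "NEIGHBORHOODCLUSTER": "IGNORE",
    "BLOCK_GROUP": "IGNORE",
}

# IGNORE and META are described as ignored by the program; excluding ID
# from comparison as well is an assumption of this sketch.
NOT_COMPARED = {"ID", "IGNORE", "META"}


def used_fields(record: dict[str, str]) -> dict[str, str]:
    """Keep only the columns that take part in the similarity comparison."""
    return {
        column: value
        for column, value in record.items()
        if COLUMN_SPECS.get(column, "IGNORE") not in NOT_COMPARED
    }


# First row of the sample data.
row = {
    "CCN": "05079932",
    "REPORTDATETIME": "8/28/2012 12:00:00 AM",
    "OFFENSE": "SEX ABUSE",
    "METHOD": "OTHERS",
    "ANC": "1A",
    "DISTRICT": "FOURTH",
    "NEIGHBORHOODCLUSTER": "2",
    "BLOCK_GROUP": "003200 1",
}
print(used_fields(row))
# {'OFFENSE': 'SEX ABUSE', 'METHOD': 'OTHERS', 'DISTRICT': 'FOURTH'}
```

With these settings each incident is reduced to three NOMINAL values: the offense, the method, and the district.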


  • Angel Specifications and Creation

    Angels also accept specifications, which can modify their behavior and access to server resources. For the purposes of this tutorial, and for most circumstances, the default parameters are fine, but let’s go over the options.

    Storage Units          Specifies the amount of memory devoted by the server to this Angel. Larger files or Angels with stricter search parameters may require additional memory. Each unit is 2 GB. (Range 1 to 6, default 1)
    Parallelism            Specifies the number of servers redundantly running the Angel. (Default 2)
    Ramiel K               Specifies the number of results for the nearest neighbor search. (Default 10)
    Pivots                 The number of primary search points in the engine. (Range 256 to 1024, default 256)
    Probability            Minimum accepted probability for the results; any result with a lower probability will be discarded. (Range 0 to 1, default 0.95)
    Accepted Error         Maximum accepted difference in distance between returned objects and the query object. (Minimum 1, default 1.2)
    Sandalphon Range       Maximum accepted distance between the center of a cluster and a given element. (Range 0 to 1, default 0.2)
    Sandalphon Iterations  Number of passes taken by Sandalphon to identify new cluster centers. (Minimum 1, default 3)
    Sandalphon Percentage  Proportion of elements used to identify cluster centers for each pass. (Range 0 to 1, default 0.5)
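The defaults and ranges listed above can be captured in a small Python sketch that checks a set of values before they are typed into the form. This is an illustration only, not a simMachines API: the class `AngelSettings` and its field names are hypothetical, and the sketch assumes that every stated range includes its endpoints.

```python
from dataclasses import dataclass

GB_PER_STORAGE_UNIT = 2  # "Each unit is 2 GB."


@dataclass
class AngelSettings:
    # Defaults as listed in this tutorial.
    storage_units: int = 1
    parallelism: int = 2
    ramiel_k: int = 10
    pivots: int = 256
    probability: float = 0.95
    accepted_error: float = 1.2
    sandalphon_range: float = 0.2
    sandalphon_iterations: int = 3
    sandalphon_percentage: float = 0.5

    @property
    def memory_gb(self) -> int:
        """Memory requested from the server, in GB."""
        return self.storage_units * GB_PER_STORAGE_UNIT

    def problems(self) -> list[str]:
        """Return a message for each value outside its documented range."""
        # Parallelism and Ramiel K have no documented range, so they are not checked.
        checks = [
            (1 <= self.storage_units <= 6, "storage_units must be in 1..6"),
            (256 <= self.pivots <= 1024, "pivots must be in 256..1024"),
            (0 <= self.probability <= 1, "probability must be in 0..1"),
            (self.accepted_error >= 1, "accepted_error must be at least 1"),
            (0 <= self.sandalphon_range <= 1, "sandalphon_range must be in 0..1"),
            (self.sandalphon_iterations >= 1, "sandalphon_iterations must be at least 1"),
            (0 <= self.sandalphon_percentage <= 1, "sandalphon_percentage must be in 0..1"),
        ]
        return [message for ok, message in checks if not ok]


defaults = AngelSettings()
print(defaults.memory_gb, defaults.problems())      # 2 []
print(AngelSettings(storage_units=3).memory_gb)     # 6
print(AngelSettings(pivots=100).problems())         # ['pivots must be in 256..1024']
```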

    Once all desired specifications have been entered and you have given your Angel a name, click “Create”.

    You will be taken to the “Your Angels” page where you can see the status of the Angels that you have created. Once your Angel’s status is “COMPLETED” it is ready to generate visualizations. Depending on the file size it may take a few seconds to a few minutes for an Angel to complete initialization.

  • Creating Visualizations

    You now have a functional Sandalphon Angel which has identified clusters in the data and is ready to create visualizations. Let’s try it out!

    Sandalphon outputs its results in two ways: as a table with each data point assigned to a cluster, which can be viewed through the simMachines platform or downloaded as a CSV file, or as a visualization depicting the identified clusters. Both options are accessed from the “Angel Actions” dropdown on the “Your Angels” screen.

    Downloaded results provide all data points with a “clusterID” field denoting their assigned cluster.
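Once downloaded, the results can be summarized outside the platform. The Python sketch below groups rows by the “clusterID” field and reports each cluster's size and most common values. It assumes the CSV has a header row and keeps the original column names, which this tutorial does not confirm; the function name `summarize_clusters` is hypothetical, and the cluster IDs in the inline sample are invented for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def summarize_clusters(lines, fields=("OFFENSE", "METHOD", "DISTRICT")):
    """Return {clusterID: {"size": n, field: most common value, ...}}."""
    by_cluster = defaultdict(list)
    for row in csv.DictReader(lines):
        by_cluster[row["clusterID"]].append(row)

    summary = {}
    for cluster_id, rows in by_cluster.items():
        entry = {"size": len(rows)}
        for field in fields:
            # most_common(1) returns [(value, count)] for the top value.
            entry[field] = Counter(r[field] for r in rows).most_common(1)[0][0]
        summary[cluster_id] = entry
    return summary


# Inline stand-in for a downloaded file; the cluster IDs are made up.
sample = io.StringIO(
    "OFFENSE,METHOD,DISTRICT,clusterID\n"
    "HOMICIDE,GUN,SEVENTH,0\n"
    "HOMICIDE,GUN,SEVENTH,0\n"
    "SEX ABUSE,OTHERS,SECOND,1\n"
)
for cluster_id, entry in summarize_clusters(sample).items():
    print(cluster_id, entry)
```

For a real download, pass an open file instead of the `io.StringIO` object, for example `open(path, newline="")`.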

    To view existing visualizations, or create new ones, select “Visualizations” from the “Angel Actions” dropdown on the “Your Angels” screen.

    A visualization is produced when a Sandalphon Angel is first generated. This visualization includes all data and clusters. Additional visualizations can be created using only selected fields, or with varied visualization settings.

    New visualizations can be produced at the bottom of the page. Columns can be turned off by toggling “USE” to “IGNORE”.

  • Reading Visualizations

    Click “View Visualization” to see a visualization of the data. There are three views: Circular Graph, Sunburst, and Cluster List.

    Circular Graph

    Each outermost segment of a Circular Graph visualization represents a cluster composed of the elements of that segment and of the segments nearer the center to which it is attached.

    Here we see that there is a large cluster of thefts without a knife or gun occurring in district two. We can mouse over each cluster to see its details, or click on it to be brought to a chart detailing the cluster’s elements. Let’s select a cluster with overlapping elements (these will be the ones extending into additional rings).

    This graph shows a cluster of sex abuse crimes occurring primarily in the seventh district without a weapon, but also shows that there are shared traits with crimes committed with knives and with crimes in the first district.

    At the bottom of the page is a list of the elements in the cluster, each with a measure of its distance from the center of the cluster on a scale of 0 to 1.

    Sunburst Graph

    A Sunburst Graph is similar in form to the Circular Graph, but has the ability to zoom into a keyword and show which clusters are associated with it. To switch to a Sunburst Graph, click “Sunburst Graph” on the top right of the “Visualization” page.

    As with the Circular Graph, the outermost ring of the Sunburst Graph represents the clusters.

    Click into one of the inner rings to see only the clusters associated with that keyword.

    Cluster List

    A list of all clusters can be seen by clicking the “All Clusters” button at the top right of the page. Selecting “View” on a listed cluster will bring you to the same chart accessed by clicking a cluster in the visualization.

  • Danny Shayman