Leliel Grid Tutorial (Adult Income Census)

In this tutorial we will create a classifier that predicts whether a person earns more or less than 50k per year based on their characteristics. We will run a list of experiments to find the parameters that improve the prediction score. To follow the tutorial step by step, you can download the example data file from this link.

Angel Description

A classifier is a machine learning algorithm that receives an object and attaches “tags” or “labels” to it. Leliel is a classifier built on top of the Ramiel similarity engine and provides unmatched performance and accountability features. With this version of the classifier you can also execute a list of experiments with different parameters to find the parameters that increase the prediction score.


File Specification:

REAL Real values.
NOMINAL String values or what you would consider an ENUM.
MULTI_ENGLISH Free-form text in English.
MULTI_SPANISH Free-form text in Spanish.
IGNORE The column shall be ignored by the program.
ID Your internal object ID, useful when returning results.
META You can store arbitrary information in a string. Currently the metadata is a reserved column type.
CLASS Column that will be classified; we accept only one class per instance.

 

The file should meet these specifications:

  • The file should be in TSV format, that is, with values separated by tabs (‘\t’).
  • We expect a header row with the same column names that were specified.
  • We read every file in the folder, so all of them should follow this format.
  • TSV Quote Character = ‘"’
  • TSV Line End = ‘\n’
  • TSV Escape Character= ‘\’
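
As a minimal sketch, the format rules above map onto the settings of Python's standard csv module as shown below. The check_tsv helper and the shortened sample header are hypothetical and only illustrate the settings; they are not part of the platform.

```python
import csv
import io

# Dialect settings taken from the list above: tab delimiter, double-quote
# quote character, backslash escape character, "\n" line end.
TSV_OPTIONS = dict(delimiter="\t", quotechar='"', escapechar="\\", lineterminator="\n")


def check_tsv(text, expected_columns):
    """Return the data rows of a TSV document, raising ValueError if malformed."""
    reader = csv.reader(io.StringIO(text), **TSV_OPTIONS)
    header = next(reader)
    if header != expected_columns:
        raise ValueError(f"unexpected header: {header}")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        rows.append(row)
    return rows


# Hypothetical, shortened sample with four of the tutorial's columns.
sample = (
    "IDENTIFICATION\tAGE\tSEX\tINCOME\n"
    "2\t50\tMale\t<=50K\n"
    "3\t38\tMale\t<=50K\n"
)
rows = check_tsv(sample, ["IDENTIFICATION", "AGE", "SEX", "INCOME"])
print(rows)
```

Running a check like this on every file before uploading helps catch a missing header or a row with the wrong number of fields early.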

Example:

This file contains information about people, such as age, occupation, education and marital status, among others. This data was collected by a bank to determine whether a person earns more or less than 50k per year. The table shows a few lines of the file.

IDENTIFICATION AGE WORKCLASS EDUCATION MARITAL-STATUS OCCUPATION RELATIONSHIP RACE SEX HOURS-PER-WEEK COUNTRY INCOME
2 50 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial Husband White Male 13 United-States <=50K
3 38 Private HS-grad Divorced Handlers-cleaners Not-in-family White Male 40 United-States <=50K
4 53 Private 11th Married-civ-spouse Handlers-cleaners Husband Black Male 40 United-States <=50K
5 28 Private Bachelors Married-civ-spouse Prof-specialty Wife Black Female 40 Cuba <=50K
6 37 Private Masters Married-civ-spouse Exec-managerial Wife White Female 40 United-States <=50K
7 49 Private 9th Married-spouse-absent Other-service Not-in-family Black Female 16 Jamaica <=50K
8 52 Self-emp-not-inc HS-grad Married-civ-spouse Exec-managerial Husband White Male 45 United-States >50K
9 31 Private Masters Never-married Prof-specialty Not-in-family White Female 50 United-States >50K
10 42 Private Bachelors Married-civ-spouse Exec-managerial Husband White Male 40 United-States >50K
11 37 Private Some-college Married-civ-spouse Exec-managerial Husband Black Male 80 United-States >50K
12 30 State-gov Bachelors Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 40 India >50K
13 23 Private Bachelors Never-married Adm-clerical Own-child White Female 30 United-States <=50K
14 32 Private Assoc-acdm Never-married Sales Not-in-family Black Male 50 United-States <=50K
15 40 Private Assoc-voc Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male 40 Jamaica >50K

This file has 12 columns. The first column is the person's ID. The second column is the person's age. The third column specifies the type of work that the person has. The fourth column specifies the person's education level. The fifth column is the person's marital status. The sixth column specifies the occupation. The seventh column shows the person's relationship within the family. The eighth column specifies the person's race. The ninth column specifies the sex. The tenth column specifies the hours per week worked by the person. The eleventh column specifies the person's country, and the last column shows the person's yearly earnings.

The purpose of this tutorial is to predict whether a person earns more or less than 50k per year.


Columns Specs:

Our cloud suggests the following columns and types. This is only a recommendation; you can choose a different type for each column depending on your interests:

IDENTIFICATION ID
AGE REAL
WORKCLASS NOMINAL
EDUCATION NOMINAL
MARITAL-STATUS NOMINAL
OCCUPATION NOMINAL
RELATIONSHIP NOMINAL
RACE NOMINAL
SEX NOMINAL
HOURS-PER-WEEK REAL
COUNTRY NOMINAL
INCOME CLASS
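
The spec above can be written down and sanity-checked before upload. The following is a rough sketch, not the platform's own validation: validate_spec and parse_row are hypothetical helpers, and the rule of exactly one ID column and exactly one CLASS column is an assumption based on the requirement that Leliel needs a column of each type.

```python
from collections import Counter

# Column types listed in the File Specification section.
COLUMN_TYPES = {
    "REAL", "NOMINAL", "MULTI_ENGLISH", "MULTI_SPANISH",
    "IGNORE", "ID", "META", "CLASS",
}

# Recommended spec for the Adult Income file, copied from the list above.
ADULT_SPEC = {
    "IDENTIFICATION": "ID",
    "AGE": "REAL",
    "WORKCLASS": "NOMINAL",
    "EDUCATION": "NOMINAL",
    "MARITAL-STATUS": "NOMINAL",
    "OCCUPATION": "NOMINAL",
    "RELATIONSHIP": "NOMINAL",
    "RACE": "NOMINAL",
    "SEX": "NOMINAL",
    "HOURS-PER-WEEK": "REAL",
    "COUNTRY": "NOMINAL",
    "INCOME": "CLASS",
}


def validate_spec(spec):
    """Raise ValueError if the spec uses an unknown type or lacks ID/CLASS."""
    unknown = sorted(set(spec.values()) - COLUMN_TYPES)
    if unknown:
        raise ValueError(f"unknown column types: {unknown}")
    counts = Counter(spec.values())
    # Assumption: exactly one ID column and exactly one CLASS column.
    for required in ("ID", "CLASS"):
        if counts[required] != 1:
            raise ValueError(
                f"expected exactly one {required} column, found {counts[required]}"
            )


def parse_row(row, spec):
    """Convert REAL columns to float and drop IGNORE columns."""
    parsed = {}
    for name, kind in spec.items():
        if kind == "IGNORE":
            continue
        parsed[name] = float(row[name]) if kind == "REAL" else row[name]
    return parsed


validate_spec(ADULT_SPEC)

# First data row of the example table.
first_row = dict(zip(ADULT_SPEC, [
    "2", "50", "Self-emp-not-inc", "Bachelors", "Married-civ-spouse",
    "Exec-managerial", "Husband", "White", "Male", "13", "United-States", "<=50K",
]))
print(parse_row(first_row, ADULT_SPEC))
```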

Angel Parameters Specification:

These are the parameters needed for the angel creation:

Storage Units Specify the angel unit size reserved for creation.
Parallelism Specify the number of replications for the angel that you want to create.
Ramiel K Specify the number of results for the nearest neighbor search.
Pivots The number of primary search points in the engine.
Probability Minimum accepted probability for the results; any result with a lower probability will be discarded.
Accepted Error Accepted search error, that is, the tolerated difference between the distance calculated by the engine and the real distance.
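
When you create a grid, each of these parameters takes a list of values separated by ‘;’. The sketch below shows one way such lists could expand into individual experiments. It assumes the grid runs every combination of values (a Cartesian product), which this tutorial does not confirm; expand_grid and the sample values are hypothetical.

```python
from itertools import product


def expand_grid(raw):
    """Turn {'param': 'v1;v2'} into a list of dicts, one per combination."""
    names = list(raw)
    value_lists = [
        [value.strip() for value in raw[name].split(";") if value.strip()]
        for name in names
    ]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical values, written the way they would be typed into the form.
raw_grid = {"Ramiel K": "1;3;5", "Pivots": "64;128", "Probability": "0.5"}
experiments = expand_grid(raw_grid)

print(len(experiments))  # 3 * 2 * 1 = 6 combinations
for experiment in experiments:
    print(experiment)
```

Under that assumption the number of experiments is the product of the list lengths, so adding one more value to a parameter multiplies the total work rather than adding to it.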

  • Create Folder

    • Click on “Create Folder” to create a container for the CSV, TSV or JSON files that our similarity engine will search.

    • Provide the folder name and click on “Create Folder”.

    • Once the folder is created you will return to a folder list view.

  • Upload File(s)

    • In the folder that you created, click on “Upload File” to open the upload modal.

    • After choosing your files click on “Upload File”.

    • You can see the progress bar while the file is being uploaded.

    • Once the files are uploaded you will return to a folder list view.

  • Create your Angel

    • Go to the “Create Angel” section to choose the angel that works for your project.

    • For this example we are going to create a Leliel Grid. Click “Create” on the Leliel image.

    • The next step is choosing the folder containing the files that you want to use to train the angel. When you choose the folder you can see a preview of the files.

    • If you want to change the type of any column, you can do it by choosing an option in the list of types. Leliel requires a column of type ID and a column of type CLASS.

      Then click on “Next” to continue the creation.

    • In the next step, choose the “Create Grid” tab. Then fill in the Leliel Grid parameters (the default parameters will work fine) and choose a name for your angel. In this step you also need to specify a list of values separated by ‘;’ for each parameter.

      Click on “Create” to start the creation of the angel.

    • Once the creation has started, you can see a table with your current angels and the progress of the creation. When the state is processing, the angel starts executing the experiments.

  • Monitoring Your Experiments

    • Once the angel is processing, you can access the Grid Results action from the “Your Angels” screen.

    • Grid Results

      On this page you can see the list of experiments that are being executed.

    • You can also see the progress of a specific experiment. To do so, click on “View Fold” in its row.

    • After an experiment is completed, you can download the results or view a visualization of them. To download the results click on “Download Results”, and to see the visualization click on “Visualization”.

    • The rest of the row shows a summary of the results by percentage. In this example we show the results for the P0 percentage.

    • As mentioned before, you can see the results of an experiment in a visualization. To do so, click on “Visualization” in the row of the experiment that you want to display. The next section describes what the visualization shows.

    • Visualization

      The first part of the visualization shows a graph of the score by percentage. This graph also shows the number of objects processed in each percentage.

    • The next part shows the confusion matrix for the experiment. We show a confusion matrix for each percentage, and each matrix shows the score obtained for each class in the experiment. To see the score of a class, move the mouse over the corresponding cell of the matrix.

    • The last part of the visualization shows each line of the k tests used in the experiments (k is the number of folds that you chose at creation). Each line has the prediction in the column “PREDICTED_CLASS” and the real value in the column “ORIGINAL_CLASS”.
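
To make the confusion matrix concrete, here is a small sketch that rebuilds one from result rows carrying the “ORIGINAL_CLASS” and “PREDICTED_CLASS” columns. It assumes the downloaded results can be loaded as one record per test line with those two column names; the helper functions and sample rows are hypothetical, and the score shown in the visualization may be computed differently.

```python
from collections import Counter


def confusion_matrix(rows):
    """Count how often each (original, predicted) pair occurs."""
    return Counter((row["ORIGINAL_CLASS"], row["PREDICTED_CLASS"]) for row in rows)


def accuracy(matrix):
    """Fraction of rows whose predicted class equals the original class."""
    total = sum(matrix.values())
    correct = sum(n for (orig, pred), n in matrix.items() if orig == pred)
    return correct / total if total else 0.0


def recall_per_class(matrix):
    """For each original class, the fraction of its rows predicted correctly."""
    totals, hits = Counter(), Counter()
    for (orig, pred), n in matrix.items():
        totals[orig] += n
        if orig == pred:
            hits[orig] += n
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Hypothetical result rows, not real output from the platform.
results = [
    {"ORIGINAL_CLASS": "<=50K", "PREDICTED_CLASS": "<=50K"},
    {"ORIGINAL_CLASS": "<=50K", "PREDICTED_CLASS": ">50K"},
    {"ORIGINAL_CLASS": ">50K", "PREDICTED_CLASS": ">50K"},
    {"ORIGINAL_CLASS": ">50K", "PREDICTED_CLASS": ">50K"},
]
matrix = confusion_matrix(results)
print(matrix)
print(accuracy(matrix))          # 3 of 4 rows match: 0.75
print(recall_per_class(matrix))  # {'<=50K': 0.5, '>50K': 1.0}
```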

Manfred Calvo