Leliel Tutorial (PolyA Signals)

In this tutorial we create a classifier that predicts polyadenylation signals (PAS) in human mRNA sequences. To follow the tutorial step by step, you can download the example data file from this link.

Angel Description

A classifier is a machine learning algorithm that receives an object and attaches “tags” or “labels” to it. Leliel is a classifier built on top of the Ramiel similarity engine and provides unmatched performance and accountability features.

 


File Specification:

ID A mandatory field which uniquely identifies each object.
CLASS A mandatory field which holds the class label to be predicted.
REAL Numerical values.
NOMINAL Values that do not bear a quantitative relationship with each other (i.e., strings and numbers which represent non-numerical information).
MULTI_PLAIN Multiple NOMINAL values separated by spaces. Non-language specific.
MULTI_ENGLISH Multiple NOMINAL values separated by spaces. The text is English language.
MULTI_SPANISH Multiple NOMINAL values separated by spaces. The text is Spanish language.
MULTI_JAPANESE Multiple NOMINAL values separated by spaces. The text is Japanese language.
ITEM_SET A series of values with weights. (Formatted as item1:weight1;item2:weight2;item3:weight3)
IGNORE The column shall be ignored by the program.
META This column is for metadata and shall be ignored by the program, but information will be retained in the output.
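
As an illustration of the ITEM_SET format, the following Python sketch splits a cell such as item1:weight1;item2:weight2 into a mapping. The function name parse_item_set is hypothetical, and treating weights as decimal numbers is an assumption; the platform's own parser may behave differently.

```python
def parse_item_set(cell: str) -> dict[str, float]:
    """Parse 'item1:weight1;item2:weight2' into {item: weight}.

    Hypothetical helper: assumes weights are decimal numbers.
    """
    items: dict[str, float] = {}
    for pair in cell.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        # Split on the last ':' so item names may themselves contain ':'.
        name, sep, weight = pair.rpartition(":")
        if not sep or not name:
            raise ValueError(f"expected item:weight, got {pair!r}")
        items[name] = float(weight)
    return items
```

For example, parse_item_set("apple:0.5;pear:2") returns a dictionary with the two items and their weights as floats.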

 

The file must meet these specifications:

  • The file should be in TSV (tab-separated values) format, with ‘\t’ as the separator.
  • The file should have a header row with the same column names that were specified.
  • All files in the folder are read, so all of them should follow this format.
  • TSV Quote Character = ‘"’
  • TSV Line End = ‘\n’
  • TSV Escape Character = ‘\’
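
A file with these settings can be produced and checked locally before upload. The sketch below uses Python's standard csv module with the separator, quote, line-end and escape characters listed above; the helper names write_tsv and read_tsv are hypothetical, and every other csv option is left at the module default, which is an assumption about what the platform accepts.

```python
import csv
import io

# Dialect settings taken from the specification list above.
TSV_OPTIONS = dict(
    delimiter="\t",
    quotechar='"',
    escapechar="\\",
    lineterminator="\n",
)

def write_tsv(rows, handle):
    """Write rows (the first row is the header) as TSV."""
    writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL, **TSV_OPTIONS)
    writer.writerows(rows)

def read_tsv(handle):
    """Read a TSV file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, **TSV_OPTIONS))

# Round trip through an in-memory buffer; a real file would be opened
# with open(path, "w", newline="") and open(path, newline="").
buffer = io.StringIO(newline="")
write_tsv([["Identification", "UP_A", "PAS"], ["1", "21", "pos"]], buffer)
records = read_tsv(io.StringIO(buffer.getvalue(), newline=""))
```

Reading the file back this way is a quick check that the header and every row parse into the expected number of fields.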

Example:

This data set is derived from sequence data and is used to predict polyadenylation signals (PAS) in human mRNA sequences. The table shows the first fifteen lines of the file (the header and fourteen rows).

Identification UP_A DOWN_A UP_C DOWN_C UP_G DOWN_G UP_T DOWN_T UP_AA DOWN_AA UP_AC DOWN_AC UP_AG DOWN_AG UP_AT DOWN_AT UP_CA DOWN_CA UP_CC DOWN_CC UP_CG DOWN_CG UP_CT DOWN_CT UP_GA DOWN_GA UP_GC DOWN_GC UP_GG DOWN_GG UP_GT DOWN_GT UP_TA DOWN_TA UP_TC DOWN_TC UP_TG DOWN_TG UP_TT DOWN_TT UP_AAA DOWN_AAA UP_AAC DOWN_AAC UP_AAG DOWN_AAG UP_AAT DOWN_AAT UP_ACA DOWN_ACA UP_ACC DOWN_ACC UP_ACG DOWN_ACG UP_ACT DOWN_ACT UP_AGA DOWN_AGA UP_AGC DOWN_AGC UP_AGG DOWN_AGG UP_AGT DOWN_AGT UP_ATA DOWN_ATA UP_ATC DOWN_ATC UP_ATG DOWN_ATG UP_ATT DOWN_ATT UP_CAA DOWN_CAA UP_CAC DOWN_CAC UP_CAG DOWN_CAG UP_CAT DOWN_CAT UP_CCA DOWN_CCA UP_CCC DOWN_CCC UP_CCG DOWN_CCG UP_CCT DOWN_CCT UP_CGA DOWN_CGA UP_CGC DOWN_CGC UP_CGG DOWN_CGG UP_CGT DOWN_CGT UP_CTA DOWN_CTA UP_CTC DOWN_CTC UP_CTG DOWN_CTG UP_CTT DOWN_CTT UP_GAA DOWN_GAA UP_GAC DOWN_GAC UP_GAG DOWN_GAG UP_GAT DOWN_GAT UP_GCA DOWN_GCA UP_GCC DOWN_GCC UP_GCG DOWN_GCG UP_GCT DOWN_GCT UP_GGA DOWN_GGA UP_GGC DOWN_GGC UP_GGG DOWN_GGG UP_GGT DOWN_GGT UP_GTA DOWN_GTA UP_GTC DOWN_GTC UP_GTG DOWN_GTG UP_GTT DOWN_GTT UP_TAA DOWN_TAA UP_TAC DOWN_TAC UP_TAG DOWN_TAG UP_TAT DOWN_TAT UP_TCA DOWN_TCA UP_TCC DOWN_TCC UP_TCG DOWN_TCG UP_TCT DOWN_TCT UP_TGA DOWN_TGA UP_TGC DOWN_TGC UP_TGG DOWN_TGG UP_TGT DOWN_TGT UP_TTA DOWN_TTA UP_TTC DOWN_TTC UP_TTG DOWN_TTG UP_TTT DOWN_TTT polyA_Dis PAS
1 21 12 37 39 35 21 7 28 1 2 2 1 16 9 1 0 8 5 16 16 8 2 5 16 12 3 15 7 7 4 1 7 0 2 4 14 3 6 0 5 0 0 1 0 0 2 0 0 0 0 2 1 0 0 0 0 1 0 13 5 1 2 1 2 0 0 1 0 0 0 0 0 0 1 0 1 7 3 0 0 3 3 4 6 8 1 1 6 3 0 0 0 5 0 0 2 0 2 2 6 3 4 0 3 1 1 1 0 9 2 1 0 4 2 9 3 0 0 2 2 6 0 1 2 0 1 0 1 0 0 1 5 0 1 0 1 0 0 0 0 0 2 0 0 1 0 1 6 0 0 2 8 2 3 0 0 1 1 0 2 0 0 0 3 0 1 0 1 Y pos
2 27 31 17 20 23 13 33 36 9 11 4 6 7 5 7 9 6 6 4 3 0 1 7 10 8 3 0 3 7 2 8 4 4 11 9 8 9 4 10 13 3 6 1 1 1 2 4 2 1 1 1 0 0 1 2 4 3 2 0 1 3 0 1 2 0 3 3 2 2 1 2 3 2 1 1 2 2 0 1 3 2 2 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 3 3 2 1 2 1 3 4 0 1 0 3 1 0 2 0 1 0 0 0 0 0 2 2 0 0 1 2 0 3 0 2 2 1 0 4 0 1 2 0 4 1 3 1 2 2 2 3 2 2 2 0 0 4 4 3 0 0 1 2 2 4 1 1 3 2 4 2 1 5 5 N pos
3 17 16 37 32 31 32 15 20 3 1 7 6 6 7 1 1 5 7 12 9 14 6 6 10 6 6 13 7 7 15 5 4 3 1 4 10 4 4 3 5 0 0 1 0 2 1 0 0 2 1 1 1 2 1 2 3 2 2 2 2 0 3 2 0 0 0 0 1 1 0 0 0 1 0 2 3 2 3 0 1 2 3 3 2 6 1 1 3 1 0 9 0 3 4 1 2 2 0 2 5 0 4 2 1 2 1 2 3 1 1 1 0 0 1 6 2 6 3 1 1 2 4 2 3 2 8 1 0 1 0 0 3 3 0 1 1 0 0 2 0 1 1 0 0 1 2 1 4 0 1 2 3 1 0 0 2 2 0 1 2 0 1 2 1 0 0 0 3 N pos
4 19 16 18 13 16 47 47 24 5 4 3 3 4 5 7 3 2 4 5 2 0 1 10 6 3 7 1 4 5 28 7 8 8 1 9 4 7 13 23 6 0 1 0 1 3 1 2 1 0 1 1 1 0 0 1 1 2 0 1 0 0 5 1 0 0 0 1 0 0 0 6 3 0 1 1 1 0 1 1 0 0 0 1 0 0 0 4 2 0 0 0 0 0 1 0 0 2 0 1 1 3 5 4 0 2 2 0 1 1 3 0 1 0 1 0 1 0 1 1 1 1 6 0 2 2 15 2 5 2 0 1 2 1 6 3 0 3 0 1 0 0 0 4 1 2 2 3 0 0 0 4 2 0 1 0 2 3 7 4 3 4 1 6 1 3 1 10 3 Y pos
5 16 9 37 42 29 24 18 25 2 0 5 2 4 6 5 1 7 4 16 15 8 6 5 16 5 3 11 12 11 7 2 2 2 1 4 13 6 5 6 6 0 0 0 0 1 0 1 0 0 0 3 1 0 0 2 1 0 1 2 5 2 0 0 0 2 0 1 1 1 0 1 0 1 0 2 2 2 2 2 0 4 2 8 5 3 0 1 8 1 0 4 3 2 2 1 1 0 0 0 11 3 3 2 2 1 0 2 0 1 2 1 1 3 2 5 4 1 2 2 4 2 2 4 3 5 2 0 0 0 0 2 0 0 0 0 2 0 0 1 0 0 1 1 0 0 0 0 5 3 4 0 3 2 0 1 1 2 3 1 1 0 1 1 1 2 2 3 2 Y pos
6 26 25 21 25 23 24 30 26 10 7 3 10 7 2 6 5 4 7 7 6 0 1 9 11 4 6 6 2 4 10 9 6 7 5 5 7 12 10 6 4 3 1 1 1 2 1 4 4 1 2 1 3 0 1 1 4 0 0 1 0 1 2 5 0 1 2 1 1 4 1 0 1 2 3 1 2 1 1 0 0 2 5 2 0 0 0 3 1 0 0 0 0 0 0 0 1 0 1 3 3 3 5 3 2 2 2 0 3 2 0 0 1 0 0 2 1 0 0 3 1 2 4 0 0 0 4 2 2 3 2 1 1 4 2 1 1 2 1 1 4 2 0 2 0 1 0 2 2 0 0 2 5 2 2 5 2 3 4 2 2 3 0 0 2 1 2 2 0 Y pos
7 36 27 12 12 11 20 41 41 13 11 6 1 5 3 11 12 7 4 0 2 0 1 5 5 6 2 2 5 0 6 3 6 10 10 4 4 6 10 21 17 6 2 1 1 3 1 3 7 4 0 0 0 0 0 2 1 3 0 1 1 0 1 1 1 2 5 1 0 2 4 6 3 2 1 1 0 2 1 2 2 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 3 3 1 2 2 0 0 0 2 0 0 1 0 2 0 0 2 2 0 1 0 2 0 1 0 1 1 2 0 1 0 3 2 0 4 6 2 0 0 1 4 3 3 2 0 0 0 1 1 1 3 1 1 2 0 4 2 3 5 3 2 2 4 2 10 10 N pos
8 24 28 14 10 15 20 47 42 5 9 5 3 4 7 10 9 8 5 1 2 0 0 5 3 3 4 3 1 1 3 8 12 7 9 5 4 10 10 24 18 1 4 0 0 0 3 4 2 4 2 1 0 0 0 0 1 1 3 1 0 0 1 2 3 1 3 1 1 3 3 5 2 1 1 2 1 2 1 3 2 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 1 2 2 1 1 1 1 1 1 0 1 1 0 0 1 0 0 2 0 0 0 0 1 0 0 1 2 3 3 1 1 1 2 3 6 2 3 2 1 1 2 2 3 2 2 0 1 0 0 3 1 2 1 2 0 1 2 5 7 3 3 3 2 4 4 14 8 Y pos
9 43 30 11 20 12 9 34 41 19 8 3 7 5 4 16 10 6 8 1 6 0 0 3 6 5 2 4 1 2 0 1 6 12 11 3 6 5 5 14 19 8 3 2 1 4 1 5 2 2 1 0 3 0 0 1 3 3 1 2 1 0 0 0 2 7 3 1 0 3 2 5 5 1 1 0 1 1 0 4 6 1 3 0 2 0 0 0 1 0 0 0 0 0 0 0 0 1 2 0 1 1 1 1 2 2 0 0 1 0 1 3 0 2 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 1 0 3 7 4 1 3 0 2 4 2 1 3 0 1 0 0 2 2 2 1 1 0 2 0 0 4 3 5 2 4 1 1 8 9 Y pos
10 31 21 7 19 21 21 41 39 12 7 2 2 8 4 9 7 1 2 2 5 0 0 4 12 5 7 1 3 4 5 11 6 13 4 2 9 8 12 17 14 6 2 0 0 2 2 4 2 1 0 0 0 0 0 1 2 1 2 1 1 2 0 4 1 3 3 1 0 2 3 3 1 0 0 0 1 1 0 0 1 0 1 0 1 0 0 2 3 0 0 0 0 0 0 0 0 1 0 0 3 1 4 2 5 1 3 0 1 4 1 0 2 0 0 1 0 0 0 0 3 2 1 0 2 0 0 2 2 5 0 0 2 2 3 4 1 5 2 2 0 1 1 5 1 0 1 1 4 0 0 1 4 2 4 0 0 1 5 5 3 4 1 1 4 3 2 8 7 N pos
11 22 34 17 14 26 11 35 41 6 17 5 6 8 1 2 9 2 4 4 4 1 0 10 6 5 3 4 0 7 5 10 3 9 9 4 4 9 5 13 23 1 8 4 3 1 0 0 6 1 3 2 1 0 0 2 2 1 0 2 0 2 1 3 0 0 4 0 0 1 0 1 5 1 4 0 0 0 0 1 0 1 1 1 0 1 0 1 3 0 0 0 0 0 0 1 0 2 0 2 1 2 2 4 3 1 1 0 0 4 0 0 1 0 0 0 0 0 0 4 0 2 0 1 0 0 2 4 3 3 1 0 0 3 0 4 2 3 3 1 3 3 1 1 2 0 0 1 3 0 0 3 1 2 3 1 0 4 2 2 0 4 4 2 3 3 3 4 13 Y pos
12 28 30 15 23 13 17 44 30 7 13 8 4 4 8 9 5 6 5 0 7 0 1 9 10 3 6 0 4 1 3 9 3 12 5 7 8 7 5 17 12 1 4 0 3 2 4 4 2 4 0 0 1 0 1 4 2 0 3 0 1 1 2 3 2 4 2 2 1 0 1 3 1 2 2 1 0 1 1 2 2 0 2 0 2 0 0 0 3 0 0 0 0 0 1 0 0 1 2 1 2 1 1 6 5 0 3 2 1 0 2 1 0 0 2 0 1 0 0 0 1 0 1 0 2 0 0 1 0 3 1 2 0 2 1 1 1 4 3 5 0 1 1 2 1 2 1 0 3 0 0 5 4 3 2 0 1 0 0 4 1 4 0 2 5 4 2 7 5 N pos
13 23 34 16 12 14 17 47 37 6 5 5 5 2 3 10 21 1 5 4 2 2 1 9 4 5 5 4 2 2 1 3 9 10 19 3 3 8 12 25 2 1 3 2 1 1 0 2 1 0 2 0 1 2 1 3 1 0 2 0 0 0 0 2 1 3 13 0 1 4 6 2 0 1 1 0 0 0 0 0 4 0 1 1 0 0 0 3 1 1 0 1 1 0 0 0 0 4 2 0 0 1 0 4 2 0 0 1 3 0 1 4 1 1 1 1 0 0 0 2 1 2 0 0 1 0 0 0 0 0 3 0 2 0 4 3 0 4 1 2 1 1 2 3 15 0 1 2 1 0 0 1 1 2 3 3 0 2 1 1 8 3 1 3 0 3 1 16 0 Y pos
14 13 13 32 29 33 24 22 34 2 3 3 5 7 4 1 1 6 4 15 8 2 1 9 15 5 5 5 6 17 7 5 6 0 1 8 9 7 12 7 12 0 1 2 1 0 1 0 0 0 0 1 1 0 0 2 3 1 0 3 2 3 2 0 0 0 0 1 0 0 0 0 1 0 0 1 2 4 2 1 0 3 1 7 4 2 1 3 2 0 1 0 0 1 0 1 0 0 0 2 5 4 7 3 3 2 1 0 2 3 1 0 1 0 0 4 1 0 0 1 5 4 2 2 1 10 2 0 2 0 1 3 1 0 1 2 3 0 1 0 0 0 0 0 0 3 3 3 2 0 0 2 4 0 2 0 3 3 3 4 4 0 0 2 3 3 4 2 5 N pos

 

Besides the Identification column, this file has 170 columns. The first 168 (the UP_ and DOWN_ columns) describe the mRNA sequences whose PAS status we want to predict, followed by the polyA_Dis column. The last column, PAS, indicates whether the mRNA sequence has a PAS.
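
The header follows a regular pattern: every 1-, 2- and 3-letter combination of A, C, G and T appears once with the prefix UP_ and once with DOWN_. The Python sketch below rebuilds the header from that pattern, which can help when checking your own files. The function name expected_header is hypothetical, and the pattern is inferred from the example header rather than from a published specification.

```python
from itertools import product

def expected_header(max_k: int = 3) -> list[str]:
    """Rebuild the example header: ID, UP_/DOWN_ features, polyA_Dis, PAS."""
    features = [
        f"{side}_{''.join(kmer)}"
        for k in range(1, max_k + 1)           # 1-, 2- and 3-letter combinations
        for kmer in product("ACGT", repeat=k)  # lexicographic order: A, C, G, T
        for side in ("UP", "DOWN")
    ]
    return ["Identification", *features, "polyA_Dis", "PAS"]

header = expected_header()
feature_count = len(header) - 3  # exclude Identification, polyA_Dis and PAS
```

With max_k = 3 this gives 2 x (4 + 16 + 64) = 168 feature names, or 171 names in total once Identification, polyA_Dis and PAS are included.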

The purpose of this tutorial is to predict the polyadenylation signals (PAS) in human mRNA sequences.


Column Specs:

Our cloud suggests the following column types. This is only a recommendation; you can choose a different type for each column depending on your needs:

Identification ID
UP_A REAL
DOWN_A REAL
UP_C REAL
DOWN_C REAL
UP_G REAL
DOWN_G REAL
UP_T REAL
DOWN_T REAL
UP_AA REAL
DOWN_AA REAL
UP_AC REAL
DOWN_AC REAL
UP_AG REAL
DOWN_AG REAL
UP_AT REAL
DOWN_AT REAL
UP_CA REAL
DOWN_CA REAL
UP_CC REAL
DOWN_CC REAL
UP_CG REAL
DOWN_CG REAL
UP_CT REAL
DOWN_CT REAL
UP_GA REAL
DOWN_GA REAL
UP_GC REAL
DOWN_GC REAL
UP_GG REAL
DOWN_GG REAL
UP_GT REAL
DOWN_GT REAL
UP_TA REAL
DOWN_TA REAL
UP_TC REAL
DOWN_TC REAL
UP_TG REAL
DOWN_TG REAL
UP_TT REAL
DOWN_TT REAL
UP_AAA REAL
DOWN_AAA REAL
UP_AAC REAL
DOWN_AAC REAL
UP_AAG REAL
DOWN_AAG REAL
UP_AAT REAL
DOWN_AAT REAL
UP_ACA REAL
DOWN_ACA REAL
UP_ACC REAL
DOWN_ACC REAL
UP_ACG REAL
DOWN_ACG REAL
UP_ACT REAL
DOWN_ACT REAL
UP_AGA REAL
DOWN_AGA REAL
UP_AGC REAL
DOWN_AGC REAL
UP_AGG REAL
DOWN_AGG REAL
UP_AGT REAL
DOWN_AGT REAL
UP_ATA REAL
DOWN_ATA REAL
UP_ATC REAL
DOWN_ATC REAL
UP_ATG REAL
DOWN_ATG REAL
UP_ATT REAL
DOWN_ATT REAL
UP_CAA REAL
DOWN_CAA REAL
UP_CAC REAL
DOWN_CAC REAL
UP_CAG REAL
DOWN_CAG REAL
UP_CAT REAL
DOWN_CAT REAL
UP_CCA REAL
DOWN_CCA REAL
UP_CCC REAL
DOWN_CCC REAL
UP_CCG REAL
DOWN_CCG REAL
UP_CCT REAL
DOWN_CCT REAL
UP_CGA REAL
DOWN_CGA REAL
UP_CGC REAL
DOWN_CGC REAL
UP_CGG REAL
DOWN_CGG REAL
UP_CGT REAL
DOWN_CGT REAL
UP_CTA REAL
DOWN_CTA REAL
UP_CTC REAL
DOWN_CTC REAL
UP_CTG REAL
DOWN_CTG REAL
UP_CTT REAL
DOWN_CTT REAL
UP_GAA REAL
DOWN_GAA REAL
UP_GAC REAL
DOWN_GAC REAL
UP_GAG REAL
DOWN_GAG REAL
UP_GAT REAL
DOWN_GAT REAL
UP_GCA REAL
DOWN_GCA REAL
UP_GCC REAL
DOWN_GCC REAL
UP_GCG REAL
DOWN_GCG REAL
UP_GCT REAL
DOWN_GCT REAL
UP_GGA REAL
DOWN_GGA REAL
UP_GGC REAL
DOWN_GGC REAL
UP_GGG REAL
DOWN_GGG REAL
UP_GGT REAL
DOWN_GGT REAL
UP_GTA REAL
DOWN_GTA REAL
UP_GTC REAL
DOWN_GTC REAL
UP_GTG REAL
DOWN_GTG REAL
UP_GTT REAL
DOWN_GTT REAL
UP_TAA REAL
DOWN_TAA REAL
UP_TAC REAL
DOWN_TAC REAL
UP_TAG REAL
DOWN_TAG REAL
UP_TAT REAL
DOWN_TAT REAL
UP_TCA REAL
DOWN_TCA REAL
UP_TCC REAL
DOWN_TCC REAL
UP_TCG REAL
DOWN_TCG REAL
UP_TCT REAL
DOWN_TCT REAL
UP_TGA REAL
DOWN_TGA REAL
UP_TGC REAL
DOWN_TGC REAL
UP_TGG REAL
DOWN_TGG REAL
UP_TGT REAL
DOWN_TGT REAL
UP_TTA REAL
DOWN_TTA REAL
UP_TTC REAL
DOWN_TTC REAL
UP_TTG REAL
DOWN_TTG REAL
UP_TTT REAL
DOWN_TTT REAL
polyA_Dis NOMINAL
PAS CLASS
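
The recommended mapping above can be written compactly: every column is REAL except Identification (ID), polyA_Dis (NOMINAL) and PAS (CLASS). The following Python sketch builds such a mapping for a shortened header and checks that an ID column and a CLASS column are present. The names recommended_types and check_types are hypothetical and are not part of the platform.

```python
VALID_TYPES = {
    "ID", "CLASS", "REAL", "NOMINAL", "MULTI_PLAIN", "MULTI_ENGLISH",
    "MULTI_SPANISH", "MULTI_JAPANESE", "ITEM_SET", "IGNORE", "META",
}

# Columns whose recommended type is not REAL.
OVERRIDES = {"Identification": "ID", "polyA_Dis": "NOMINAL", "PAS": "CLASS"}

def recommended_types(header: list[str]) -> dict[str, str]:
    """Map each column name to its recommended type."""
    return {name: OVERRIDES.get(name, "REAL") for name in header}

def check_types(types: dict[str, str]) -> None:
    """Raise ValueError for unknown types or a missing ID or CLASS column."""
    unknown = set(types.values()) - VALID_TYPES
    if unknown:
        raise ValueError(f"unknown column types: {sorted(unknown)}")
    for required in ("ID", "CLASS"):
        if required not in types.values():
            raise ValueError(f"no column of type {required}")

# Shortened header used only for illustration.
sample = ["Identification", "UP_A", "DOWN_A", "polyA_Dis", "PAS"]
types = recommended_types(sample)
check_types(types)
```

If you change a type for your own experiments, for example setting polyA_Dis to IGNORE, the same check still applies.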

Angel Parameters Specification:

These are the parameters needed to create the angel:

Storage Units Specify the angel unit size reserved for creation.
Parallelism Specify the number of replications for the angel that you want to create.
Ramiel K Specify the number of results for the nearest neighbor search.
Pivots The number of primary search points in the engine.
Probability Minimum accepted probability for the results; any result with a lower probability is discarded.
Accepted Error Accepted error between the distance calculated by the engine and the real distance.
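
To give an intuition for how Ramiel K and Probability interact, the sketch below shows a generic nearest-neighbour vote in Python: the K closest training objects vote for their class, and classes whose share of the votes falls below the minimum probability are dropped. This is a plain brute-force illustration and not Leliel's or Ramiel's actual algorithm; the function name knn_classify and the use of Euclidean distance are assumptions, and Pivots and Accepted Error are not modelled.

```python
import math
from collections import Counter

def knn_classify(query, training, k, min_probability):
    """Generic k-nearest-neighbour vote (illustration only).

    training is a list of (vector, label) pairs. Returns a list of
    (label, probability) pairs, most probable first, keeping only
    labels whose vote share is at least min_probability.
    """
    # Brute-force search: sort every training object by distance.
    neighbours = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    total = sum(votes.values())
    ranked = [(label, count / total) for label, count in votes.most_common()]
    return [(label, p) for label, p in ranked if p >= min_probability]

# Toy data, not taken from the PAS file.
toy = [((0, 0), "pos"), ((0, 1), "pos"), ((5, 5), "neg"), ((6, 5), "neg"), ((1, 0), "neg")]
all_results = knn_classify((0, 0), toy, k=3, min_probability=0.0)
filtered = knn_classify((0, 0), toy, k=3, min_probability=0.5)
```

In this toy case two of the three nearest neighbours are "pos", so a threshold of 0.5 keeps only that class, while a threshold of 0.0 returns both classes.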

  • Create Folder

    • Click on “Create Folder” to create a container for your csv, tsv or json files that our similarity engine will search.

    • Provide the folder name and click on “Create Folder”.

    • Once the folder is created you will return to a folder list view.

  • Upload File(s)

    • In the folder that you created, click on “Upload File” to open the upload dialog.

    • After choosing your files click on “Upload File”.

    • You can see the progress bar while the file is being uploaded.

    • Once the files are uploaded you will return to a folder list view.

  • Create your Angel

    • Go to the “Create Angel” section to choose the angel that suits your project.

    • For this example we are going to create a Leliel. Click “Create” on the Leliel image.

    • The next step is choosing the folder containing the files that you want to use to train the angel. When you choose the folder you can see a preview of the files.

    • If you want to change the type of any column, choose an option from the list of types. Leliel requires a column of type ID and a column of type CLASS.

      Then click on “Next” to continue the creation.

    • The next step is to fill the Leliel parameters (default parameters will work fine) and to choose the name for your angel.

      Click on “Create” to start the creation of the angel.

    • Once the creation has started, you can see a table with your current angels and the progress of the creation. When the state is “running”, the angel is ready to answer queries.

  • Query Your Angel

    • There are two ways to query the angel: Execute Query and Batch Query. Both can be accessed from the “Your Angels” screen.

    • Execute Query

      Provide the values for the object that you want to query and then click on “Execute Query”.

    • Another option is to choose a folder containing the files that you want to use to query the angel.

    • Then you can click on “Fill Query Fields” to choose a row from the file to fill the query object. In this example we fill the fields with the sixth row shown in the preview of the data.

    • Now you only need to click on “Execute Query” to obtain your result. In the query results, the first value is the class, the second is a score (higher is better), the third is the confidence of the result, and the last is the probability that the class is the correct one among the results (the query can return more than one class).

    • Batch Query

      First choose a folder containing the files that you want to use to query the angel and click on “Execute Query”. This creates a batch process.

    • Once the batch query has started, you can see a table with your current batch files and the progress of the execution. When the state is “completed”, the batch is ready for download.

Manfred Calvo