PCTLearn: Difference between revisions

From Jstacs
Jump to navigationJump to search
(page created)
 
No edit summary
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
by Ralf Eggeling, Ivo Grosse, and Mikko Koivisto.
by Ralf Eggeling, Ivo Grosse, and Mikko Koivisto.
== Paper ==
If you use PCTLearn, please cite
R. Eggeling, I. Grosse, M. Koivisto. [https://link.springer.com/article/10.1007/s10994-018-5770-9 Algorithms for learning parsimonious context trees]. ''Machine Learning'', 2018; doi: 10.1007/s10994-018-5770-9
== Description ==
Parsimonious context trees, PCTs, provide a sparse parameterization of conditional probability distributions, but learning them from data is computationally hard due to the combinatorial explosion of the space of model structures as the number of predictor variables grows. Here, we propose new algorithmic ideas, which can ignificantly expedite the standard dynamic programming algorithm. Specifically, we introduce a memoization technique, which exploits regularities within the predictor variables by equating different contexts associated with the same data subset, and a bound-and-prune technique, which exploits regularities within the response variable by pruning parts of the search space based on score upper bounds. 
The software '''PCTLearn''' is a lightweight Java application that implements these ideas and can be used to learn a single PCT of user-specified maximal depth from data.
== Download ==
The application is available as a single runnable .jar [https://www.cs.helsinki.fi/u/eggeling/PCTLearn/PCTLearn.jar PCTLearn].
== Input ==
The application requires as input a plain text file, which can consist of basic Latin characters a-z and A-Z (case sensitive) and Arabic numerical 0-9.
The number of different characters in the input file determines the alphabet size for PCT optimization.
Each line in the input file is chopped into overlapping k-mers, where k to the desired maximal PCT depth + 1. The PCT is learned on these resulting k-mers with the convention that the last symbol in each k-mer denotes the response variable.
== Running PCTLearn ==
The application has one mandatory and various optional arguments.
A shorter list of arguments can be provided, in which case all missing arguments are considered to assume default values.
Run with
<code>java -jar PCTLearn.jar inputFile maximalDepth scoringFunction memoization pruning fineBound memoLimit lookaheadDepth</code>
where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
        <td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td><font color="green">inputFile</font></td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the input data. </td>
</tr>
<tr>
<td><font color="green">maximalDepth</font></td>
<td>Integer</td>
        <td>2</td>
<td>The maximal depth of the learned PCT.</td>
</tr>
<tr>
<td><font color="green">scoringFunction</font></td>
<td>String</td>
        <td>BIC</td>
<td>The used scoring function. Permitted values are "BIC" and "AIC".</td>
</tr>
<tr>
<td><font color="green">memoization</font></td>
<td>Boolean</td>
        <td>TRUE</td>
<td>Enabling memoization.</td>
</tr>
<tr>
<td><font color="green">pruning</font></td>
<td>Boolean</td>
        <td>TRUE</td>
<td>Enabling pruning.</td>
</tr>
<tr>
<td><font color="green">fineBound</font></td>
<td>Boolean</td>
        <td>TRUE</td>
<td>Use fine upper bound instead of coarse. Is ignored if pruning is set to FALSE.</td>
</tr>
<tr>
<td><font color="green">memoLimit</font></td>
<td>Integer</td>
        <td>1</td>
<td>Memoization limit that stops storing subtrees width given distance from the leaves. Is ignored if memoization is set to FALSE.</td>
</tr>
<tr>
<td><font color="green">lookaheadDepth</font></td>
<td>Integer</td>
        <td>1</td>
<td>The used lookahead depth. Is ignored if pruning is set to FALSE.</td>
</tr>
</table>
== Output ==
The tool writes some statistics about the optimization, such optimal score, number of visited node, and required running time to stdout.
It addition it creates (i) a graphViz file of the learned PCT structure and (ii) a file with conditional probability parameters (MLE) for each leaf.

Latest revision as of 17:00, 13 November 2018

by Ralf Eggeling, Ivo Grosse, and Mikko Koivisto.

Paper

If you use PCTLearn, please cite

R. Eggeling, I. Grosse, M. Koivisto. Algorithms for learning parsimonious context trees. Machine Learning, 2018; doi: 10.1007/s10994-018-5770-9


Description

Parsimonious context trees, PCTs, provide a sparse parameterization of conditional probability distributions, but learning them from data is computationally hard due to the combinatorial explosion of the space of model structures as the number of predictor variables grows. Here, we propose new algorithmic ideas, which can ignificantly expedite the standard dynamic programming algorithm. Specifically, we introduce a memoization technique, which exploits regularities within the predictor variables by equating different contexts associated with the same data subset, and a bound-and-prune technique, which exploits regularities within the response variable by pruning parts of the search space based on score upper bounds. The software PCTLearn is a lightweight Java application that implements these ideas and can be used to learn a single PCT of user-specified maximal depth from data.


Download

The application is available as a single runnable .jar PCTLearn.

Input

The application requires as input a plain text file, which can consist of basic Latin characters a-z and A-Z (case sensitive) and Arabic numerical 0-9. The number of different characters in the input file determines the alphabet size for PCT optimization. Each line in the input file is chopped into overlapping k-mers, where k to the desired maximal PCT depth + 1. The PCT is learned on these resulting k-mers with the convention that the last symbol in each k-mer denotes the response variable.

Running PCTLearn

The application has one mandatory and various optional arguments. A shorter list of arguments can be provided, in which case all missing arguments are considered to assume default values. Run with

java -jar PCTLearn.jar inputFile maximalDepth scoringFunction memoization pruning fineBound memoLimit lookaheadDepth

where the arguments have the following semantics:

name type default comment

inputFile String -- The location of a text file containing the input data.
maximalDepth Integer 2 The maximal depth of the learned PCT.
scoringFunction String BIC The used scoring function. Permitted values are "BIC" and "AIC".
memoization Boolean TRUE Enabling memoization.
pruning Boolean TRUE Enabling pruning.
fineBound Boolean TRUE Use fine upper bound instead of coarse. Is ignored if pruning is set to FALSE.
memoLimit Integer 1 Memoization limit that stops storing subtrees width given distance from the leaves. Is ignored if memoization is set to FALSE.
lookaheadDepth Integer 1 The used lookahead depth. Is ignored if pruning is set to FALSE.

Output

The tool writes some statistics about the optimization, such optimal score, number of visited node, and required running time to stdout.

It addition it creates (i) a graphViz file of the learned PCT structure and (ii) a file with conditional probability parameters (MLE) for each leaf.