PCTLearn

From Jstacs
Revision as of 13:06, 14 December 2017 by Eggeling (talk | contribs)
Jump to navigationJump to search

by Ralf Eggeling, Ivo Grosse, and Mikko Koivisto.

Description

Parsimonious context trees, PCTs, provide a sparse parameterization of conditional probability distributions, but learning them from data is computationally hard due to the combinatorial explosion of the space of model structures as the number of predictor variables grows. Here, we propose new algorithmic ideas, which can ignificantly expedite the standard dynamic programming algorithm. Specifically, we introduce a memoization technique, which exploits regularities within the predictor variables by equating different contexts associated with the same data subset, and a bound-and-prune technique, which exploits regularities within the response variable by pruning parts of the search space based on score upper bounds. The software PCTLearn is a lightweight Java application that learns a single PCT of user-specified maximal depth from data.


Download

The application is available as a single runnable .jar PCTLearn.

Input

The application requires as input a plain text file, which can consist of basic Latin characters a-z and A-Z (case sensitive) and Arabic numerical 0-9. The number of different characters in the input file determines the alphabet size for PCT optimization. Each line in the input file is chopped into overlapping k-mers, where k to the desired maximal PCT depth + 1. The PCT is learned on these resulting k-mers with the convention that the last symbol in each k-mer denotes the response variable.

Running PCTLearn

The application has one mandatory and various optional arguments. A shorter list of arguments can be provided, in which case all missing arguments are considered to assume default values. Run with

java -jar PCTLearn.jar inputFile maximalDepth scoringFunction memoization pruning fineBound memoLimit lookaheadDepth

where the arguments have the following semantics:

name type default comment

inputFile String -- The location of a text file containing the input data.
maximalDepth Integer 2 The maximal depth of the learned PCT.
scoringFunction String BIC The used scoring function. Permitted values are "BIC" and "AIC".
memoization Boolean TRUE Enabling memoization.
pruning Boolean TRUE Enabling pruning.
fineBound Boolean TRUE Use fine upper bound instead of coarse. Is ignored if pruning is set to FALSE.
memoLimit Integer 1 Memoization limit that stops storing subtrees width given distance from the leaves. Is ignored if memoization is set to FALSE.
lookaheadDepth Integer 1 The used lookahead depth. Is ignored if pruning is set to FALSE.

Output

The tool writes some statistics about the optimization, such optimal score, number of visited node, and required running time to stdout.

It addition it creates (i) a graphViz file of the learned PCT structure and (ii) a file with conditional probability parameters (MLE) for each leaf.