*************************************************
    README file for WSDGate v 0.05
*************************************************

The purpose of this document is to give the reader a brief
introduction to the WSDGate framework.

WSDGate is an end-to-end Supervised Word Sense Disambiguation (WSD) framework 
developed by making use of existing resources such as GATE (General 
Architecture for Text Engineering) and WEKA (Waikato Environment for Knowledge
Analysis). It also makes use of NSPGate, which is a GATE processing resource
that acts as a wrapper around the Ngram Statistics Package (NSP).

The aim of WSDGate is to facilitate batch mode experiments of supervised
WSD, using GATE and NSPGate for feature identification and extraction and 
WEKA for machine learning, to perform several cross-validation experiments.

A typical supervised WSD approach requires identification of the best set of
features and the best machine learning algorithms for a given set of data.
WSDGate is intended to facilitate exactly this requirement.

More background about the system is available in the following Intelligent 
Systems demo paper:

Mahesh Joshi, Serguei Pakhomov, Ted Pedersen, Richard Maclin and Christopher 
Chute. An End-to-end Supervised Target-Word Sense Disambiguation System. To 
appear in Proceedings of the Twenty-First National Conference on Artificial
Intelligence, Intelligent Systems Demonstrations, 2006 (AAAI-06).

The following sections provide background regarding a typical target-word 
disambiguation task described in the USAGE document, where we have several 
manually labeled instances of a set of ambiguous words and machine learning 
models are to be learnt for disambiguating their future instances.

In a typical supervised WSD task, two factors are important:

1. Choosing good features
2. Choosing good machine learning algorithms (with good parameters)

Below is a description of the various components used in the WSDGate system
with respect to their parameters and the features they provide for a WSD
task. In the GATE machine learning framework, features are essentially
identified from the annotations created on documents or attributes of created
annotations. Each of the following plugins creates different types of 
annotations (and attributes on them), which can be used as features by WSDGate.

------------------------------------------------------------------------------
Brief overview of what the plugins do and the features we get to use as of now
------------------------------------------------------------------------------

** NSP Wrapper ** 
-----------------
(PLUGIN FROM NSPGate, http://sourceforge.net/projects/nspgate/)

This plugin is a wrapper for the Ngram Statistics Package (NSP). Please refer
to its README document for explanation of its parameters. It produces
ngram annotations according to the parameters specified. The new annotations
that it creates are:

    1gram - for unigrams
    2gram - for bigrams
    3gram - for trigrams, and so on.

All these are created in the specified output annotation set, which is
a parameter taken by NSPGate. If this parameter is not specified, the
annotations are created in GATE's "default" annotation set, which does
not have any name.


** ANNIE **
-----------
(Existing plugin in GATE, we only use it) 

We use 3 components of ANNIE - Tokenizer, Sentence Splitter and Part of Speech
Tagger. Parameters for these are documented in the GATE documentation at
http://gate.ac.uk/sale/tao/index.html. The annotation types created by these 
components in the default annotation set are:

    Token, SpaceToken - by Tokenizer 
    Sentence - by Sentence Splitter
    "category" attribute on a Token annotation - by POS Tagger (only enhances 
    the Token annotations created by the Tokenizer by adding a "category" 
    attribute to them, no new annotations created)

  (Note: An "attribute" in XML terminology is something that gives additional
  information about an XML tag, e.g. in 
        
        <answer instance="1" senseid="river"/>
  
  "instance" and "senseid" are attributes of the "answer" tag.)

Below is information about the other GATE PRs (processing resources), which
do not create any annotations themselves but make use of the annotations
created by the PRs above, or integrate the functionality of several PRs.

** WSDGate PR **
----------------
(Plugin made available as a part of WSDGate)

This is a wrapper plugin that combines the functionality of all of the
above plugins and integrates the process of running WEKA classifiers
on the ARFF files produced, after applying pre-processing filters etc.
It generates WEKA output files containing output from WEKA classifiers.

----------
PARAMETERS
----------

"corpus"

The source corpus containing XML files for instances of an ambiguous word,
imported into the GATE environment.

"remove"

The frequency cutoff value for ngram features that is passed to the NSPGate PR.

"crossValCount"

The number of times the cross-validation experiment is to be repeated per
machine learning algorithm.

"ngrams"

The value of N for various Ngrams to use, separated by the pipe symbol (|).
For example to use unigrams and bigrams, this should be set to "1|2" without
the quotes.
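
As a minimal sketch (not WSDGate's own code), the pipe-separated value can be
split into the annotation type names listed in the NSP Wrapper section above
("1gram", "2gram", and so on). The function name is hypothetical.

```python
def ngram_annotation_types(ngrams_value):
    """Map an "ngrams" value such as "1|2" to annotation type names.

    Hypothetical helper: it only mirrors the naming scheme described in
    the NSP Wrapper section (1gram, 2gram, 3gram, ...).
    """
    sizes = [int(part) for part in ngrams_value.split("|") if part.strip()]
    return [f"{n}gram" for n in sizes]


unigrams_and_bigrams = ngram_annotation_types("1|2")
```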

"nontoken"

The file that contains regular expressions for tokens to be discarded. This is
passed to the NSPGate PR.

"score"
                
The score cutoff value to be used by the statistics module of the Ngram 
Statistics Package (NSP). This is passed to the NSPGate PR.

"stop"

The file containing a list of common stop words or functional words which
should be discarded from the list of features. This is passed to the NSPGate PR.

"token"

The file containing regular expressions which we want to be identified as 
tokens. This is passed to NSPGate PR.

"statModule"

The statistic module to be used by the statistic.pl program from NSP. This is
passed to NSPGate PR.

"datasetPath"

The path where WEKA dataset files in the ARFF format should be created.
            
"resultPath"

The path where the output of WEKA preprocessing filters and classifiers is to 
be stored.

"fileNamePrefix"

The prefix to be used while creating ARFF and output file names.

"wekaClassifiers"

The WEKA classifier(s) to use for performing cross-validation experiments, 
separated by the pipe symbol (|). For example, to run the Support Vector
Machine classifier and the naive Bayes classifier in WEKA, one would use
"weka.classifiers.functions.SMO|weka.classifiers.bayes.NaiveBayes" without the
quotes.

"wekaOptions"

Respective WEKA classifier options for each of the classifiers in the parameter
above, separated by the pipe symbol (|).
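
The pairing of "wekaClassifiers" with "wekaOptions" can be sketched as
follows. This is an illustration only: the function name is hypothetical, the
option strings are merely examples of WEKA options, and the check that both
lists have the same length is an assumption drawn from the word "respective"
above, not confirmed behaviour of WSDGate.

```python
def pair_classifiers_with_options(weka_classifiers, weka_options):
    """Pair each classifier name with its option string, by position.

    Assumption: one option string per classifier, in the same order.
    """
    classifiers = weka_classifiers.split("|")
    options = weka_options.split("|")
    if len(classifiers) != len(options):
        raise ValueError("expected one option string per classifier")
    return list(zip(classifiers, options))


pairs = pair_classifiers_with_options(
    "weka.classifiers.functions.SMO|weka.classifiers.bayes.NaiveBayes",
    "-C 1.0|-K",  # illustrative option strings only
)
```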

"wekaFilterWithOptions"

Any WEKA filter to use, with its options. Only one can be used at most.

"stringToNominalRange"

A list of attributes to convert from string to nominal type.

"mlConfigFilesDir"
                
Directory containing the Machine Learning XML configuration files. The files
are passed to the WSDMachineLearningPR, a new one for each file.

"mlConfigFilesFilter"
                
Wild card filters to choose any particular machine learning configuration 
files in the directory above. By default all XML files will be used.

"inputASName"
                
The input annotation set to use for extracting features.

"outputASName"

The output annotation set name to use for creating any new annotations, default 
is - "WSD Annotations" without the quotes.


** WSDMachineLearningPR **
--------------------------

This is a modified version of the machine learning PR from GATE. It adds
functionality to enable feature extraction from a "flexible distance" in the
sense that a feature position of -1 means "get the first feature to the left
of the ambiguous word, no matter how far it is", instead of the usual "get the 
first word to the left of the ambiguous word if it is a feature, else return 
empty". 
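
The difference between the two lookups can be sketched on a simplified model
in which a document is a list of (text, is_feature) pairs. This is not the
code of the PR itself; the function name and the data model are assumptions
made purely for illustration.

```python
def feature_at(tokens, target_index, position, floating):
    """Return the feature text at a relative position, or None if missing.

    tokens: list of (text, is_feature) pairs (a simplified stand-in for
    GATE annotations). position: non-zero offset from the target word.
    """
    if position == 0:
        raise ValueError("position must be non-zero in this sketch")
    step = -1 if position < 0 else 1
    if not floating:
        # Fixed distance: only the word exactly at the offset is considered.
        index = target_index + position
        if 0 <= index < len(tokens) and tokens[index][1]:
            return tokens[index][0]
        return None
    # Flexible distance: skip non-features until the requested occurrence.
    remaining = abs(position)
    index = target_index + step
    while 0 <= index < len(tokens):
        text, is_feature = tokens[index]
        if is_feature:
            remaining -= 1
            if remaining == 0:
                return text
        index += step
    return None  # document boundary reached: treated as a missing value


tokens = [("fertile", True), ("the", False), ("bank", False)]
fixed = feature_at(tokens, 2, -1, floating=False)
flexible = feature_at(tokens, 2, -1, floating=True)
```

Here the fixed lookup finds nothing, because "the" is not a feature, while
the flexible lookup skips over it and returns "fertile".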

----------
PARAMETERS 
----------

"document" 

The document to be processed, for which the features are to be extracted using
the annotations in the document.

"inputASName"

The name of the annotation set in the above document, to be used for input, 
that is for extracting features.

"configFileURL"

The URL for the Machine Learning configuration file that specifies what 
features are to be extracted from the above document.

"training" 

A boolean flag which is set to true for enabling training mode, else the PR
operates in testing mode, applying a previously generated model to the
input instances.


** SplitSval2Instances PR **
----------------------------

For efficiency reasons, it is best to have the input data files (that is, the
instances of an ambiguous word) in multiple XML files, one per instance
of the ambiguous word. This PR breaks up a single file in Senseval-2 format
into multiple files in GATE XML format, one per instance.

----------
PARAMETERS
----------

"document"

The source Senseval-2 file imported into the GATE environment.

"outputPath"

The directory where the output GATE XML files, one per instance, should be 
created.
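
The idea of splitting by instance can be sketched outside of GATE as follows.
This is only an illustration: the sample data is made up, the function name
is hypothetical, real Senseval-2 files are not guaranteed to be well-formed
XML, and the strings produced here are plain XML fragments, not the GATE XML
format that the PR writes to "outputPath".

```python
import xml.etree.ElementTree as ET

# Made-up sample in the general shape of a Senseval-2 lexical sample file.
SAMPLE = """<corpus lang="english">
<lexelt item="bank">
<instance id="bank.1">
<answer instance="bank.1" senseid="river"/>
<context>The fertile <head>bank</head> of the river.</context>
</instance>
<instance id="bank.2">
<answer instance="bank.2" senseid="money"/>
<context>She went to the <head>bank</head> for a loan.</context>
</instance>
</lexelt>
</corpus>"""


def split_instances(senseval2_xml):
    """Return a dict mapping each instance id to its own XML fragment."""
    root = ET.fromstring(senseval2_xml)
    return {
        instance.get("id"): ET.tostring(instance, encoding="unicode")
        for instance in root.iter("instance")
    }


fragments = split_instances(SAMPLE)
```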

==================
COMMAND LINE USAGE
==================

Note that all of the above components are GATE plugins and do not run
outside of GATE. To facilitate command line usage another Java class
has been developed that parses a configuration file containing parameters for
the plugins above and invokes these components after initializing GATE in 
non-UI mode. This class is gate.creole.wsd.WSDExperiments.

The parameters required by this Java class at the command line can be given in 
two ways. 

The first way is used when one already has multiple input XML files (one
per instance) of an ambiguous word. They are as follows (in the specified 
order below):

1. TERM_FILE 

This is a plain text file containing the list of ambiguous words to be
experimented upon. Note that these are also the directory names in the
root level input data directory. Refer to USAGE document for input data
directory structure.

2. OPTIONS_FILE 

This is a plain text configuration file containing parameter specification for
NSPGate and WSDGate PRs. The format is PARAMETER_NAME:=VALUE or
PARAMETER_NAME:=VALUE, depending upon whether the parameter is global for
all experiments or local for a single experiment respectively. Details of
the parameter names and the file format are in the sample configuration file
samples/configtemplate.

3. INPUTS_DIR 

This is the top level root directory containing one directory per ambiguous
word. The names of these directories should be exactly the same as those
listed in the TERM_FILE above. Each of these directories contains multiple
XML files (one per instance) for the corresponding ambiguous word.

4. OUTPUTS_DIR

This is the top level root directory where the WEKA output should be stored.
Similar to the INPUTS_DIR, one directory per ambiguous word is created in
this output directory and the output files for an ambiguous word are stored in 
the corresponding sub-directory.
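
Before starting a long run, the correspondence between TERM_FILE and
INPUTS_DIR can be checked with a short script such as the sketch below. It
assumes that TERM_FILE lists one ambiguous word per line (the USAGE document
is the authority on the actual layout); the function name is hypothetical.

```python
from pathlib import Path


def missing_term_dirs(term_file, inputs_dir):
    """List terms from TERM_FILE that have no directory under INPUTS_DIR.

    Assumption: TERM_FILE contains one ambiguous word per line.
    """
    lines = Path(term_file).read_text().splitlines()
    terms = [line.strip() for line in lines if line.strip()]
    return [term for term in terms if not (Path(inputs_dir) / term).is_dir()]
```

An empty result means that every listed word has a matching input directory.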


The second way to invoke gate.creole.wsd.WSDExperiments is when one has a
single Senseval-2 formatted file for an ambiguous word, containing all the
instances. The parameters to use in this case are as follows (in the given
order):

1. --singlefile

This is just a flag that indicates that what follows is the name of the
Senseval-2 formatted input file.

2. INPUT_FILE

This is the name of the Senseval-2 formatted input file for the ambiguous
word. Note that in this mode, only one word can be processed per invocation
of gate.creole.wsd.WSDExperiments.

3. OPTIONS_FILE

This is the same configuration file as described in the first method of
invoking gate.creole.wsd.WSDExperiments.
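
The two argument orders described above can be summarised in the following
sketch. Only the class name and the order of its arguments come from this
document; the function names are hypothetical, and the JVM command and
classpath needed to launch the class are deliberately left out because they
are not described here.

```python
MAIN_CLASS = "gate.creole.wsd.WSDExperiments"


def multi_file_args(term_file, options_file, inputs_dir, outputs_dir):
    """First way: one XML file per instance already exists."""
    return [MAIN_CLASS, term_file, options_file, inputs_dir, outputs_dir]


def single_file_args(input_file, options_file):
    """Second way: a single Senseval-2 formatted file for one word."""
    return [MAIN_CLASS, "--singlefile", input_file, options_file]


# Hypothetical file and directory names, for illustration only.
first_way = multi_file_args("terms.txt", "options.cfg", "inputs", "outputs")
second_way = single_file_args("bank.xml", "options.cfg")
```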


The manual creation of XML configuration files for feature extraction is a
tedious task. To automate part of this process, and also to create
an experimental directory structure, WSDGate provides a utility Perl script,
mkconfig.pl, which is described below.

mkconfig.pl
-----------

mkconfig.pl is the program that automatically creates experimental
directories, configuration files and scripts for running experiments. 

REQUIRED parameters to mkconfig.pl are:

    --configname  <config_name>

    This is the name given to the generated experiment. Usually this can be
    something that reflects the choice of features. For example, if the
    features are unigrams and bigrams in a window of 5 and POS tags in a
    window of 2, then one possible configuration name is "ub5p2". The name
    should not contain spaces, semicolons or the = sign.


    --javapath <full_path_to_java_binary>
    
    This is the full path to the Java virtual machine binary, e.g.
    /usr/local/j2sdk1.4.2_09/jre/bin/java

    
    --gatehome <full_path_to_GATE_base_dir>
    
    This is the full path to the home directory of the GATE installation
    on your machine, e.g. /usr/local/GATE3.0
    Note: If the GATE path contains a space, use double quotes to specify.

    
    --configtemplatefile <path_to_config_template>
    
    This is the path to the configuration template file (configtemplate) that 
    comes with the package and *WHICH YOU SHOULD CUSTOMIZE* for your set of 
    experiments. More information on how to customize the configuration
    template is available in the "configtemplate" file itself, which
    describes the format and explains each of the parameters. 
    
    ####################
    ** IMPORTANT NOTE **
    ####################
    You *MUST* update all the /FULL/PATH/... related parameters in the sample
    configuration template to point to directories and files on your machine,
    (you can comment out the ones that are not marked ** REQUIRED **).

    ############################
    ** ANOTHER IMPORTANT NOTE **
    ############################
    In the description below, the terminology used is as follows:

    **********************
    * Annotation = XML Tag
      
    e.g. in the string "<head>APC</head>", the annotation is "head" and it
    encloses the text "APC".

    ************************************
    * Attribute = Property of an XML Tag

    e.g. in the string "<answer id="12345" senseid="river"/>", the attributes
    are "id" and "senseid" and their values are "12345" and "river" 
    respectively. (The annotation is "answer", and it does not
    have any enclosed text, since the tag ends immediately).

    *******************************************************************
    * Feature = A property of an instance of an ambiguous word which is
                used by Machine Learning algorithms, either as an input
                for prediction of some other feature, or as the value to
                be predicted

    e.g. in the string 
    "<1gram string="fertile">fertile</1gram><head>bank</head>"
    the value "fertile" of the attribute "string" of the annotation "1gram"
    can be a feature. Also, simply the presence or absence of the "1gram"
    annotation around the word "fertile" can be a feature. So features
    are defined based on values of attributes or presence of annotations.

    ###################################
    ** END OF ANOTHER IMPORTANT NOTE **
    ###################################
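
    The Annotation and Attribute terminology above can be checked with any
    XML parser. The sketch below uses Python's standard xml.etree module,
    which is not part of WSDGate, on the two example strings from the note.

```python
import xml.etree.ElementTree as ET

# The two example strings from the terminology note.
head = ET.fromstring("<head>APC</head>")
answer = ET.fromstring('<answer id="12345" senseid="river"/>')

annotation_name = head.tag               # the annotation (XML tag) name
enclosed_text = head.text                # the text the annotation encloses
attribute_values = dict(answer.attrib)   # attribute name -> value
answer_text = answer.text                # None: the tag ends immediately
```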

    The following parameters determine the features that will be used in
    the experiment. These decide the contents of the feature specification
    XML file. Keeping the above terminology convention in mind, the annotations
    in the feature specification XML file will be simply called XML tags.

    
    --inst <instance_annotation_name>
    
    This is the annotation in your input data files that encloses the 
    instance i.e. the occurrence of the ambiguous terms in the data files.

    --class <class_feature_parameters>
    
    The feature specification XML file consists of a <DATASET> tag, inside
    which there are several <ATTRIBUTE> tags which specify the details
    about the features to be extracted. One (and only one) of the features 
    is the CLASS feature (one which is to be predicted, based on
    knowledge of other features). This command line argument takes the
    details about the CLASS feature.

    The class feature parameters are to be specified in the following format:

    There should be exactly 6 parameters separated by the comma (,) symbol.
    Where parameter values are not required, a comma should be repeated
    without any space after the previous comma, e.g. a,b,,-1,Y,emptyvalues. 
    There should be absolutely no space anywhere in the parameter 
    specification.
    
    NAME,ANNOTATION_TYPE,ATTRIBUTE,POSITION,FLOATING_FEATURE,VALUES

        NAME: Mandatory, cannot be empty.
        
        This is the name given to the feature, as it should be seen in 
        the WEKA ARFF file, e.g. a name for the class feature of the target 
        word can be "Meaning". These names should be strictly alpha-numeric 
        with only minus sign (-) and parentheses where needed.

              
        ANNOTATION_TYPE: Mandatory, cannot be empty.

        This is the annotation type in the input data file
        using which the class feature is to be extracted.
        
        ATTRIBUTE: Optional, when empty only presence or absence of
        ANNOTATION_TYPE above is recorded.
        
        This is the attribute of the above annotation from which 
        the value for the CLASS feature is to be extracted.

        POSITION: Mandatory, cannot be empty.
        
        This parameter should specify the relative position of
        ANNOTATION_TYPE with respect to the instance annotation. 

        FLOATING_FEATURE: Optional, when empty defaults to "N".
        
        This parameter specifies whether the physical
        position of ANNOTATION_TYPE should be treated as a "floating" position
        such that even if the ANNOTATION_TYPE is not present exactly at the
        specified position (where a different annotation might be present),
        the search should continue until the specified occurrence of
        ANNOTATION_TYPE is found. If the end of the document is reached, 
        then a missing value (?) is returned. The value to be specified for
        this parameter should be either 'Y' if floating position is
        desired or 'N' if floating position is not desired.

        VALUES: Optional, when empty it defaults to "novalues".
        
        This parameter can take one of the following 3 values,
            
        1.  "novalues" - A set of possible values for the feature need
            not be specified. This is used when the feature is binary, where
            only the presence or absence of an annotation is to be recorded.
            
        2.  "emptyvalues" - Useful for creating features of datatype "string"
            in the WEKA ARFF format, where the specification means that
            the set of possible values for this attribute is not known
            in advance, but should rather be decided after all the instances
            are known.

        3.  A file containing possible nominal values: This is useful for
            creation of WEKA nominal features where the set of possible
            values of the feature is known in advance.

        For the purpose of most WSD experiments, it is suitable to select
        "emptyvalues" which means that the attributes will be created as
        string attributes in the WEKA ARFF file and then they can be
        converted to nominal or other required types using WEKA filters.
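
    The rules for the --class value (6 comma-separated fields, no spaces,
    defaults for empty optional fields) can be sketched as below. This is
    not mkconfig.pl's own parsing code; the function name is hypothetical
    and the second example value is made up for illustration.

```python
CLASS_FIELDS = ("NAME", "ANNOTATION_TYPE", "ATTRIBUTE",
                "POSITION", "FLOATING_FEATURE", "VALUES")


def parse_class_spec(spec):
    """Split a --class value into named fields and apply the defaults."""
    if any(ch.isspace() for ch in spec):
        raise ValueError("the specification must not contain spaces")
    parts = spec.split(",")
    if len(parts) != len(CLASS_FIELDS):
        raise ValueError("expected exactly 6 comma-separated parameters")
    fields = dict(zip(CLASS_FIELDS, parts))
    for required in ("NAME", "ANNOTATION_TYPE", "POSITION"):
        if not fields[required]:
            raise ValueError(required + " is mandatory")
    # Defaults documented above for empty optional fields.
    fields["FLOATING_FEATURE"] = fields["FLOATING_FEATURE"] or "N"
    fields["VALUES"] = fields["VALUES"] or "novalues"
    return fields


example = parse_class_spec("a,b,,-1,Y,emptyvalues")
with_defaults = parse_class_spec("Meaning,answer,senseid,0,,")
```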


    --feat <feature_parameters>

    This argument specifies features other than the class feature in the 
    feature specification XML file. The difference from the class feature 
    parameters specification is that here we can have specification of 
    parameters for multiple attributes.

    The feature parameters are to be specified in the following format:

    There should be exactly 7 parameters for each feature, separated by the
    comma (,) symbol, and features should be separated from each other by the
    plus (+) symbol. Wherever parameter values are not required, a comma
    should be repeated without any space after the previous comma, e.g. 
        a,b,,-1:1,Y,N,emptyvalues
    There should be absolutely no space anywhere in the feature parameter
    specifications.
    
    NAME,ANNOTATION_TYPE,ATTRIBUTE,POSITION,ZERO,FLOATING_FEATURE,VALUES

        NAME: Mandatory, cannot be empty.
        
        This is the name given to the feature, as it should be seen in 
        the WEKA ARFF file, e.g. a name for the feature that is one position
        to the left of the target word can be "U(-1)". These names should be 
        strictly alpha-numeric with only minus sign (-) and parentheses where 
        needed.
              
        ANNOTATION_TYPE:  Mandatory, cannot be empty.
        
        This is the annotation type in the input file
        from which the feature is to be extracted. 

        ATTRIBUTE: Optional, when empty only presence or absence of 
        ANNOTATION_TYPE is recorded.
        
        This is the attribute of the ANNOTATION_TYPE above 
        from which the value for the feature is to be extracted. 
        
        A crucial part of this feature specification is therefore knowing
        which annotations and attributes are created by which components;
        only then can one use them as features by specifying the
        required details about them to the mkconfig.pl script. Note that
        an incorrect feature specification will lead to the creation of
        incorrect feature specification files. Since those annotations or
        attributes may not be present in the data file, no feature values
        will be extracted, and a dataset of missing values will be created
        instead. Annotations produced by the components that WSDGate uses
        are listed earlier in this document.

        POSITION:  Mandatory, cannot be empty.
        
        This parameter should specify the relative position of
        ANNOTATION_TYPE above with respect to the instance annotation. One 
        crucial difference of this parameter with
        respect to the corresponding class feature parameter is that this
        supports a RANGE of values. So one can specify that this feature 
        should be captured in a range of positions say -5 to 5. In such a
        case to keep the NAME of the feature unique, the NAME value
        is appended with this position information. The format for specifying
        a RANGE is 
        <lower_bound>:<upper_bound>, e.g. "-5:5" 
        IMPORTANT: See next two parameters that are related, and provide an 
        example.

        ZERO: Optional, defaults to "N" when empty.
        
        This parameter applies when a RANGE is specified for the
        POSITION parameter above. If the RANGE includes the position 0 (zero)
        then one might or might not want a feature for that position.
        Position 0 essentially means the instance annotation itself, i.e. the 
        target word. So, for example for Part of Speech (POS) features in a 
        window of 2 around the target word, one might or might not want the POS
        tag for the target term. This parameter can have 2 values: "Y" if one 
        wants the position 0 to be a feature, "N" if position 0 should not
        be a feature. 

        FLOATING_FEATURE: Optional, defaults to "N" when empty.
        
        This parameter specifies whether the physical
        position of ANNOTATION_TYPE should be treated as a "floating" position
        such that even if ANNOTATION_TYPE above is not present exactly at the
        specified position (where a different annotation might be present),
        the search should continue until the specified occurrence of the
        ANNOTATION_TYPE is found. If the end of the document is reached, then 
        a missing value is returned. 
        
        VALUES: Optional, defaults to "novalues" when empty.
        
        This parameter can take one of the following 3 values,
            
        1.  "novalues" - A set of possible values for the feature need
            not be specified.
            
        2.  "emptyvalues" - Useful for creating features of datatype "string"
            in the WEKA ARFF format, where the specification means that
            the set of possible values for this feature is not known
            in advance, but should rather be decided after all the instances
            are known.

        3.  A file containing possible nominal values: This is useful for
            creation of WEKA nominal features where the set of possible
            values of the feature is known in advance.

        For the purpose of most WSD experiments, it is suitable to select
        "emptyvalues" which means that the attributes will be created as
        string attributes in the WEKA ARFF file and then they can be
        converted to nominal or other required types using WEKA filters.

        An exception is the POS tag values, which we know in advance. A
        file pos_values.txt has been provided with the package. It contains
        a list of all the POS tags that can be marked up by the ANNIE POS
        tagger. This should be used as a VALUES parameter for POS attributes.
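
    The --feat rules (7 fields per feature, features joined by "+", RANGE
    positions, and the ZERO flag) can be sketched as below. This is not
    mkconfig.pl's own code and the function names are hypothetical. How
    mkconfig.pl appends the position to NAME is not specified here, so the
    sketch only computes the list of positions.

```python
FEAT_FIELDS = ("NAME", "ANNOTATION_TYPE", "ATTRIBUTE", "POSITION",
               "ZERO", "FLOATING_FEATURE", "VALUES")


def expand_positions(position, zero):
    """Expand "-5:5" style ranges; drop position 0 unless ZERO is "Y"."""
    if ":" not in position:
        return [int(position)]
    lower, upper = (int(bound) for bound in position.split(":"))
    positions = list(range(lower, upper + 1))
    if zero != "Y":
        positions = [p for p in positions if p != 0]
    return positions


def parse_feat_specs(spec):
    """Split a --feat value into one dict of named fields per feature."""
    features = []
    for one_feature in spec.split("+"):
        parts = one_feature.split(",")
        if len(parts) != len(FEAT_FIELDS):
            raise ValueError("expected exactly 7 parameters per feature")
        fields = dict(zip(FEAT_FIELDS, parts))
        # Defaults documented above for empty optional fields.
        fields["ZERO"] = fields["ZERO"] or "N"
        fields["FLOATING_FEATURE"] = fields["FLOATING_FEATURE"] or "N"
        fields["VALUES"] = fields["VALUES"] or "novalues"
        fields["POSITIONS"] = expand_positions(fields["POSITION"],
                                               fields["ZERO"])
        features.append(fields)
    return features


example = parse_feat_specs("a,b,,-1:1,Y,N,emptyvalues")
```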

Optional parameters to mkconfig.pl are:

    --engine <engine_class>
    
    This is a parameter specific to GATE and decides which machine learning
    engine should be used. Currently, if used, this should *ALWAYS* be set to
    gate.creole.wsd.WSDWekaWrapper and is therefore redundant. 


    --memory <heap_size>

    This *optional* argument can be used to modify the heap size that should
    be used by the java virtual machine, e.g. "--memory 1024M". The default
    heap size used to initialize the JVM is 400M.
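
    Putting the options together, the argument list for mkconfig.pl can be
    assembled as in the sketch below. The option names are the ones
    documented above; the function name is hypothetical, and the example
    values for --inst, --class and --feat are made up for illustration.
    Building the arguments as a list (for example, to hand to Python's
    subprocess module after "perl" and "mkconfig.pl") avoids quoting
    problems when the GATE path contains a space; running the script
    through "perl" in this way is an assumption.

```python
def mkconfig_args(configname, javapath, gatehome, configtemplatefile,
                  inst, class_spec, feat_spec, engine=None, memory=None):
    """Build the mkconfig.pl argument list from the documented options."""
    args = [
        "--configname", configname,
        "--javapath", javapath,
        "--gatehome", gatehome,
        "--configtemplatefile", configtemplatefile,
        "--inst", inst,
        "--class", class_spec,
        "--feat", feat_spec,
    ]
    if engine is not None:
        args += ["--engine", engine]
    if memory is not None:
        args += ["--memory", memory]
    return args


# Illustrative values only; adjust every path and specification.
args = mkconfig_args(
    configname="ub5p2",
    javapath="/usr/local/j2sdk1.4.2_09/jre/bin/java",
    gatehome="/usr/local/GATE3.0",
    configtemplatefile="samples/configtemplate",
    inst="head",
    class_spec="Meaning,answer,senseid,0,,emptyvalues",
    feat_spec="U,1gram,string,-5:5,N,Y,emptyvalues",
    memory="1024M",
)
```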


** Questions? **
----------------

Contact Mahesh Joshi (joshi031@d.umn.edu) or Ted Pedersen (tpederse@d.umn.edu).


** Copyright Notice **
----------------------

Copyright (C) 2005-06, 

Mahesh Joshi
University of Minnesota, Duluth

Ted Pedersen
University of Minnesota, Duluth

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.