******************************************
    USAGE document for WSDGate v 0.05
******************************************

This file shows, step by step, how to set up and run command line
interface based experiments with the WSDGate framework.

BEFORE READING THIS DOCUMENT, PLEASE READ THE README FILE.
=========================================================

##########################################
NOTE: THIS IS A MEMORY INTENSIVE TOOLKIT!!
##########################################

The minimum required memory for this toolkit to function properly
is 512MB. At least 1GB of physical memory is recommended for carrying out 
large scale experiments. The Java programs are run with a default heap size
of 400M. If you have more memory, you can use the optional argument
"--memory" of the mkconfig.pl script to specify the heap size that the Java 
virtual machine can use. While this parameter can also be used to decrease
the JVM heap size, doing so is not recommended since it will slow down
processing considerably, especially for large datasets.

For the purpose of this document, we will proceed with a running example
that deals with setting up a simple experiment using a small sample of
data contained in the samples/ directory of wsdgate. All paths mentioned
in the running example are relative to the base directory of wsdgate-v0.05. The
sample data is an extract from the abbreviation data created by Dr. Hongfang
Liu from the University of Maryland, Baltimore County. (Refer to her
publication "H. Liu, V. Teller and C. Friedman. Multi-Aspect Comparison Study 
of Supervised Word Sense Disambiguation. JAMIA 2004" at 
http://userpages.umbc.edu/~hfliu/article/liujamia_2004.pdf). The data
was split into one GATE XML document per instance of the ambiguous 
abbreviation using the SplitSval2 PR mentioned in the README.

STEP 1 - Making sure that GATE sees WSDGate framework
-----------------------------------------------------

After running the INSTALL.sh as mentioned in the INSTALL document,

* Open GATE
* Click on the menu item: File -> Manage CREOLE plugins
* The dialog box that comes up shows the list of plugins that GATE can see.
* The list should have the following entry:
    - WSD
* Each entry has 2 check-boxes in front of it. Selecting the first check box
  causes GATE to load that plugin for the current session. Selecting the
  second check box causes GATE to load the plugin for all subsequent sessions.
  Select both check boxes for the WSD entry and click on "OK".
* The GATE "Messages" tab in the main window should show a message saying
  "CREOLE plugin loaded: WSD". 

Successful execution of these steps ensures that GATE can communicate with
the new WSDGate plugin.


STEP 2 - Running mkconfig.pl
----------------------------

-----------------------------------------------------------------------
Expected structure of the directories containing the input data for WSD
-----------------------------------------------------------------------

The framework is meant for running "target word disambiguation" kind of tasks
where there is a fixed set of terms which we want to disambiguate. The
directory structure expected for this is as follows:

If there are 5 ambiguous words for which we have data, then the directories
are (indentation represents one level below in the hierarchy):

    root_data_dir/

        word1/
            word1 file(s)
            
        word2/
            word2 file(s)
            
        word3/
            word3 file(s)
            
        word4/
            word4 file(s)
            
        word5/
            word5 file(s)


where word1 through word5 are directories having the same name as the
ambiguous terms. The files inside each term directory are XML files containing
sense annotated instances for those terms. For efficiency reasons, it is
desirable to have one XML file per instance of the term. So if word1 has
100 instances then there should be 100 separate XML files for the term.

For the sample data provided, the top level root data directory is
samples/abbr. We have just one ambiguous abbreviation, APC, so there
is only one sub-directory under samples/abbr. Inside this directory, there
are 20 XML files for 20 instances of the ambiguous abbreviation. These
XML files are in GATE XML format and have been generated from Senseval-2 
format input XML files. However, the data is not required to be in any
particular XML format; it can be any valid XML containing the desired sense
annotations.
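
Before running mkconfig.pl it can help to confirm that your data directory
follows the layout described above. The following is only a sketch and is not
part of WSDGate: list_terms is a hypothetical helper name, and it assumes that
the instance files carry a ".xml" extension.

```shell
# Hypothetical helper (not shipped with WSDGate): print "<term> <count>"
# for every term directory directly under the given root data directory.
# Assumption: instance files end in ".xml".
list_terms() {
    for dir in "$1"/*/; do
        if [ -d "$dir" ]; then
            count=0
            for f in "$dir"*.xml; do
                # The glob stays literal when nothing matches, so test each hit.
                if [ -f "$f" ]; then
                    count=$((count + 1))
                fi
            done
            echo "$(basename "$dir") $count"
        fi
    done
}
```

For the sample data, running "list_terms ../samples/abbr" from the experiments
directory should report a single term, APC, with a count of 20, provided the
sample files are named with a ".xml" extension.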

The following note assumes that mkconfig.pl is run after creation of
a new "experiments" directory inside the extracted wsdgate-v0.05 root directory.
The current directory is this new "experiments" directory.

** Setting the various command line parameter values for mkconfig.pl

Let us assume that we are interested in disambiguating the given sample data
using the following features:

1. 5 unigrams to the left and right of the ambiguous acronym APC
2. 5 bigrams to the left and right of APC
3. 2 Part of Speech tags to the left and right of APC, including the POS tag
   of the ambiguous acronym itself.


--CONFIGNAME

Given our features, let us name our experimental configuration as "ub5p2".

==>> Therefore, the value for the --configname parameter is ub5p2 <<==


--JAVAPATH

On my machine this is /usr/bin/java; use the path to your own java binary.

==>> Therefore, the value for the --javapath parameter is /usr/bin/java <<==


--GATEHOME

On my machine this is /home/mahesh/GATE. Use the path specific to your machine.

==>> Therefore, the value for the --gatehome parameter is /home/mahesh/GATE <<==


--CONFIGTEMPLATE

**********************
*** VERY IMPORTANT ***
**********************
Don't forget to update all the parameters that have value
/FULL/PATH/TO/YOUR/... in the configtemplate file to match your directory 
and file locations.

Assuming I extracted wsdgate-v0.05.tar.gz in /home/mahesh/wsdgate-v0.05 and 
that the current directory is therefore /home/mahesh/wsdgate-v0.05/experiments,

==>> value for --configtemplate is ../samples/configtemplate <<==


##########################################################################
# TO UNDERSTAND THE PARAMETER VALUES AHEAD OF THIS, IT WILL BE USEFUL
# IF THE USER LOADS ONE OF THE SAMPLE XML FILES IN **GATE** IN THE  DOCUMENT
# VIEWER AND BRINGS UP THE ANNOTATION SETS TREEVIEW AND ANNOTATION LIST
##########################################################################

--INST

In the given sample data, the ambiguous instances of APC are marked using
the "head" annotations.

==>> Therefore, the value of --inst is head <<==


--CLASS

Let us name our class feature as "Sense".

So NAME is "Sense" (This class parameter is mandatory, cannot be empty)

In the case of our sample data, the class i.e. the sense of the
abbreviation is specified in the "answer" annotation. 

So ANNOTATION_TYPE is "answer" (This class parameter is mandatory, cannot be 
empty)

The attribute on "answer" annotation where the sense is present is
"senseid".

So ATTRIBUTE is "senseid" (This class parameter is optional, when empty only
the presence or absence of the ANNOTATION_TYPE will be recorded)

Our instance annotation is the "head" annotation. The "answer" annotation for 
any instance is the first annotation of that type *before* the "head" 
annotation. So its position is negative with respect to the instance
annotation. Since we want the *first* "answer" annotation to the left of the
"head" annotation, the position is -1.

So POSITION is -1 (This class parameter is mandatory, cannot be empty)

We know that the "answer" annotation is somewhere to the left of the "head" 
annotation. But its actual position depends upon the size of the context for 
that instance of APC. Hence the position of the "answer" tag with respect
to the "head" tag should be specified as floating. So FLOATING_FEATURE 
parameter should have a value of "Y" to indicate that our class feature
is a floating feature with respect to the instance annotation. 

So FLOATING_FEATURE is "Y" (This class parameter is optional, when empty it
defaults to "N")

Assuming we don't know in advance which senses exist for APC in the given
data, we set VALUES to "emptyvalues".

So VALUES is "emptyvalues" (This class parameter is optional, when empty it
defaults to "novalues")

So, finally our "--class" argument looks like this:

==>> --class Sense,answer,senseid,-1,Y,emptyvalues <<==
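
Since the six fields are positional, it is easy to misplace one. As a rough
sketch (check_class_spec is a hypothetical helper, not part of WSDGate), the
value can be split on commas and the three mandatory fields checked. The
defaults printed for empty optional fields are the ones stated above; how
mkconfig.pl itself treats empty fields is not verified here.

```shell
# Hypothetical helper: split a --class value into its six fields
# (NAME,ANNOTATION_TYPE,ATTRIBUTE,POSITION,FLOATING_FEATURE,VALUES)
# and make sure the mandatory ones are present.
check_class_spec() {
    IFS=',' read -r name type attribute position floating values <<EOF
$1
EOF
    if [ -z "$name" ] || [ -z "$type" ] || [ -z "$position" ]; then
        echo "NAME, ANNOTATION_TYPE and POSITION are mandatory" >&2
        return 1
    fi
    # Show the documented defaults for the optional fields left empty.
    echo "NAME=$name ANNOTATION_TYPE=$type ATTRIBUTE=$attribute POSITION=$position FLOATING_FEATURE=${floating:-N} VALUES=${values:-novalues}"
}

check_class_spec Sense,answer,senseid,-1,Y,emptyvalues
```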


--FEAT

We have three types of features - unigrams, bigrams and part of speech tags.

Let us name unigrams as U(x) where x is the position, bigrams as B(x) and
part of speech tags as POS(x).

So NAME will have values "U","B" and "POS" (This attribute parameter is 
mandatory, cannot be empty)

The annotation marked up by NSPGate for unigrams is "1gram", that for bigrams
is "2gram" and the annotation for POS tags is "Token" which is marked by
ANNIE POS Tagger PR in GATE.

So ANNOTATION_TYPE will have values "1gram", "2gram" and "Token" (This 
attribute parameter is mandatory, cannot be empty)


The attribute on 1gram and 2gram annotations which contains the n-gram
string is "string". For part of speech tags, it is "category" attribute
on the "Token" annotation that contains the POS tag value.

So ATTRIBUTE will have values "string", "string" and "category" (This attribute
parameter is optional, when empty only the presence or absence of the
ANNOTATION_TYPE is recorded as a feature)

We want position of the unigrams and bigrams in a window of 5 around the
abbreviation APC, i.e. in a range of -5 to 5. For part of speech tags
we want a window of 2, i.e. -2 to 2.

So POSITION will have values "-5:5", "-5:5" and "-2:2" (This attribute 
parameter is mandatory, cannot be empty)


For unigrams and bigrams, the zeroth position would just give us the ambiguous
acronym APC itself, so we skip the zeroth position for them. But for POS
features, we want to keep the POS tag of the ambiguous acronym. So we
want the zeroth position in that case.

So ZERO will have values "N", "N" and "Y" (This attribute parameter is
optional, when empty will default to "N")

Assuming we want floating unigrams and bigrams, but non-floating POS tags:

So the FLOATING_FEATURE will have values "Y", "Y" and "N" (This attribute
parameter is optional, when empty it defaults to "N")

We don't know the possible values of unigrams and bigrams in advance, but
we know the part of speech tags assigned by the ANNIE system. The list is
provided with WSDGate in the file samples/pos_values.txt in the base WSDGate 
directory.

So the VALUES parameter will be "emptyvalues", "emptyvalues" and
"../samples/pos_values.txt" (This attribute parameter is
optional, when empty will default to "novalues")


Combining all the parameters of the 3 types of features, the final
value of the "--feat" option is:

==>> --feat U,1gram,string,-5:5,N,Y,emptyvalues+B,2gram,string,-5:5,N,Y,emptyvalues+POS,Token,category,-2:2,Y,N,../samples/pos_values.txt <<==
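
Typing such a long value by hand is error prone. One possible way to assemble
it, shown only as a sketch, is to build each feature from its seven fields and
then join the features with "+". The helper names feat_spec and join_feats are
hypothetical and are not part of WSDGate.

```shell
# Hypothetical helpers for assembling the --feat value.
# One feature = 7 fields joined by ',' in this order:
# NAME,ANNOTATION_TYPE,ATTRIBUTE,POSITION,ZERO,FLOATING_FEATURE,VALUES
feat_spec() {
    if [ "$#" -ne 7 ]; then
        echo "feat_spec: expected 7 fields, got $#" >&2
        return 1
    fi
    # The subshell keeps the IFS change from leaking into the caller.
    ( IFS=','; printf '%s\n' "$*" )
}

# Several features are joined by '+'.
join_feats() {
    ( IFS='+'; printf '%s\n' "$*" )
}

FEAT=$(join_feats \
    "$(feat_spec U   1gram string   -5:5 N Y emptyvalues)" \
    "$(feat_spec B   2gram string   -5:5 N Y emptyvalues)" \
    "$(feat_spec POS Token category -2:2 Y N ../samples/pos_values.txt)")
echo "$FEAT"
```

The string printed should be identical to the --feat value shown above, and
can then be passed on the command line as --feat "$FEAT".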


Let us leave the optional parameters "--engine" and "--memory" at their 
default values.


This completes the description of the parameters of the mkconfig.pl script. So
finally, the sample command that you can use to create the experimental
setup is as follows (recall that the current directory is
/home/mahesh/wsdgate-v0.05/experiments):

# ../bin/mkconfig.pl --configname ub5p2 --javapath /usr/bin/java --gatehome /home/mahesh/GATE --configtemplatefile ../samples/configtemplate --inst head --class Sense,answer,senseid,-1,Y,emptyvalues --feat U,1gram,string,-5:5,N,Y,emptyvalues+B,2gram,string,-5:5,N,Y,emptyvalues+POS,Token,category,-2:2,Y,N,../samples/pos_values.txt

NOTE: You will need to replace the path to java and the home directory of GATE
with the correct paths on your machine.
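
To keep the machine specific paths in one place, the command can be wrapped in
a small shell function. This is only a sketch: run_mkconfig is a hypothetical
name, and the option names and values are copied verbatim from the sample
command above. By default it only prints the command, so that you can inspect
it before running it.

```shell
# Hypothetical wrapper around the sample command above.
# $1 = path to java, $2 = GATE home directory,
# $3 = "run" to execute the command; anything else only prints it.
run_mkconfig() {
    mode=${3:-print}
    set -- ../bin/mkconfig.pl \
        --configname ub5p2 \
        --javapath "$1" \
        --gatehome "$2" \
        --configtemplatefile ../samples/configtemplate \
        --inst head \
        --class Sense,answer,senseid,-1,Y,emptyvalues \
        --feat U,1gram,string,-5:5,N,Y,emptyvalues+B,2gram,string,-5:5,N,Y,emptyvalues+POS,Token,category,-2:2,Y,N,../samples/pos_values.txt
    if [ "$mode" = run ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Dry run: print the command with the paths used in this document.
run_mkconfig /usr/bin/java /home/mahesh/GATE
```

Once the printed command looks right, calling the function with "run" as the
third argument executes it from the current (experiments) directory.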

AFTER RUNNING THE COMMAND AS ABOVE:

Here's the directory structure (recursive w.r.t. current directory) that 
should be produced for the sample data provided (assuming UNIX like operating 
system). The config_name is "ub5p2".

        .:
        arffs
        mlconfigs
        results
        scripts

        ./arffs:
        ub5p2

        ./mlconfigs:
        ub5p2

        ./mlconfigs/ub5p2:
        ub5p2.xml <--------- Feature specification XML file (for *all* terms)

        ./results:
        ub5p2

        ./scripts:
        ub5p2

        ./scripts/ub5p2:
        APC.term <---------------- Term file containing the string "APC"
        runtests.APC.ub5p2.sh <--- Executable script to run APC experiments
        ub5p2.config <------------- Configuration file (for *all* terms)

----------------------------------------------
Note on experimental directory structure above
----------------------------------------------

Along with requiring the data in a certain directory structure, the automatic
script generation interface of WSDGate enforces a certain directory
structure on the experimental setup.

The automatic script generation is done using the script mkconfig.pl. One of
the parameters required by this script is the name of the experimental
configuration (say config_name); this can be any alphanumeric string.

Upon execution with the appropriate parameters, mkconfig.pl generates the
following set of directories and files in the *CURRENT WORKING DIRECTORY* i.e.
the directory from where the script was executed: (paths relative to
current working directory)

    ./arffs/config_name/ :: The raw ARFF files that are generated during the
    experiments are stored here. Inside this directory, a sub-directory for
    each ambiguous term is automatically created (upon executing the
    experiment, not after running the mkconfig.pl script).

    ./results/config_name/ :: The results of running the WEKA classifiers are
    stored as *.wekaoutput files inside the corresponding sub-directories for
    each term. Also, intermediate ARFF files produced upon application of WEKA
    filters are stored in this location (again, inside respective sub-
    directories for each term).

    ./mlconfigs/config_name/ :: The feature specification XML file required by
    the Machine Learning component is generated in this directory. Only one
    file is generated inside this directory.

    ./scripts/config_name/ :: This directory contains the experiment specific
    files and executable scripts that should be used for running the 
    experiments. One file with extension ".term" is simply a file specifying 
    the list of terms to be disambiguated. In the current setup, one such file 
    is created per term, containing just that one single term inside it. The 
    second file is a parameter configuration file which is created from a 
    template. Apart from this an executable script is generated for each 
    ambiguous term. This script contains the java command line to invoke the 
    class gate.creole.wsd.WSDExperiments discussed above.

    Note that in each of the above cases, config_name is a sub-directory
    inside the 4 directories - arffs, results, mlconfigs and scripts.
    Running mkconfig.pl with a new configuration name will therefore create
    a new set of files, keeping the previous ones intact.
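
A quick way to confirm that mkconfig.pl produced this layout is sketched
below. check_setup is a hypothetical helper, not part of WSDGate; it only
checks the four per-configuration paths described above and does not look at
the per-term files.

```shell
# Hypothetical helper: verify the per-configuration directories and files.
# $1 = directory from which mkconfig.pl was run, $2 = configuration name.
check_setup() {
    status=0
    for path in \
        "arffs/$2" \
        "results/$2" \
        "mlconfigs/$2/$2.xml" \
        "scripts/$2/$2.config"
    do
        if [ -e "$1/$path" ]; then
            echo "ok       $path"
        else
            echo "MISSING  $path"
            status=1
        fi
    done
    return $status
}
```

For the running example, "check_setup . ub5p2" run from the experiments
directory should report all four paths as ok.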


STEP 3 - Running actual experiments
-----------------------------------

Let us assume that you run the above command (after replacing the appropriate
paths) with your current working directory as 
/home/johnsmith/wsdgate-v0.05/experiments
(where johnsmith is your user name), and the working directory was empty
initially. Henceforward we will refer to the current working directory as
the "experiments" directory.

After running the command, your experiments directory should contain the
*exact* directory structure that was shown at the end of STEP 2 above 
(assuming a UNIX like system).

The script that is to be executed for running experiments related to the
abbreviation APC is ./scripts/ub5p2/runtests.APC.ub5p2.sh.

BEFORE EXECUTING THE SCRIPT,

** You need the perl binary in your PATH

** You need the Ngram Statistics Package (NSP) installed, with its count.pl
   and statistic.pl scripts in your PATH

** You need to update the CLASSPATH variable in your environment to include
   the weka.jar file that is included in your WEKA installation.

   * To do this on UNIX like systems, in BASH shell use the command
     
     export CLASSPATH=/FULL/PATH/TO/YOUR/WEKA_DIR/weka.jar

   * To do this on UNIX like systems, in CSH or TCSH shell use the command

     setenv CLASSPATH /FULL/PATH/TO/YOUR/WEKA_DIR/weka.jar

   * To do this on Windows systems, use the following command in the "cmd.exe"
     or "command.com" DOS prompt:

     set CLASSPATH=/FULL/PATH/TO/YOUR/WEKA_DIR/weka.jar

In all of the above commands, replace /FULL/PATH/TO/YOUR/WEKA_DIR with the
path to the WEKA installation on your machine.
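
The three prerequisites above can be checked in one go before starting a long
run. The following is a sketch for UNIX like systems only; check_prereqs is a
hypothetical helper, and it assumes that CLASSPATH entries are separated by
":" and that the WEKA jar is literally named weka.jar.

```shell
# Hypothetical helper: check that perl and the NSP scripts are in PATH and
# that CLASSPATH has an entry ending in weka.jar.
check_prereqs() {
    status=0
    for tool in perl count.pl statistic.pl; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found    $tool"
        else
            echo "MISSING  $tool (not in PATH)"
            status=1
        fi
    done
    # Wrap CLASSPATH in ':' so that an entry at the very end also matches.
    case ":${CLASSPATH:-}:" in
        *weka.jar:*) echo "found    weka.jar in CLASSPATH" ;;
        *)           echo "MISSING  weka.jar in CLASSPATH"; status=1 ;;
    esac
    return $status
}
```

Calling check_prereqs prints one line per prerequisite and returns a non-zero
status if anything is missing.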

Now execute the above shell script from the experiments directory as (assuming
UNIX like OS and the presence of bash shell):

# ./scripts/ub5p2/runtests.APC.ub5p2.sh

The experiment run is started and the STDOUT and STDERR are redirected to
the following files respectively:

./scripts/ub5p2/runtests.APC.ub5p2.sh.output
./scripts/ub5p2/runtests.APC.ub5p2.sh.error

Upon completion, you should have an empty error file and some output
in the output file. The directory structure of the experiments directory
should now look similar to the following (the random strings in between the 
file names will differ):

(NEW files and directories marked with "<---- NEW relevant_comments"):

# ls -R1
arffs
mlconfigs
results
scripts

./arffs:
ub5p2

./arffs/ub5p2:
APC <---- NEW directory for each ambiguous word

./arffs/ub5p2/APC:
APC.ub5p2.1132282003909.ub5p2.xml.arff <---- NEW arff file containing data

./mlconfigs:
ub5p2

./mlconfigs/ub5p2:
ub5p2.xml

./results:
ub5p2

./results/ub5p2:
APC <---- NEW directory for each ambiguous word

./results/ub5p2/APC:
APC.ub5p2.1132282003909.ub5p2.xml.1132282008408.weka.classifiers.functions.SMO.wekaoutput <---- NEW file, output of classifier (one per classifier)
APC.ub5p2.1132282003909.ub5p2.xml.1132282009569.weka.classifiers.bayes.NaiveBayes.wekaoutput <---- NEW file, output of classifier (one per classifier)
APC.ub5p2.1132282003909.ub5p2.xml.1132282010800.weka.classifiers.trees.J48.wekaoutput <---- NEW file, output of classifier (one per classifier)
APC.ub5p2.1132282003909.ub5p2.xml.filterout.arff <---- NEW arff file on which the classifiers are executed
APC.ub5p2.1132282003909.ub5p2.xml.s2n.arff <---- NEW intermediate arff
APC.ub5p2.1132282003909.ub5p2.xml.withdummy.arff <---- NEW intermediate arff

./scripts:
ub5p2

./scripts/ub5p2:
APC.term
runtests.APC.ub5p2.sh
runtests.APC.ub5p2.sh.error <---- NEW STDERR of the experiment script
runtests.APC.ub5p2.sh.output <---- NEW STDOUT of the experiment script
ub5p2.config


In the above listing, the *.wekaoutput files represent the results of the
cross-validation from WEKA on the dataset extracted using the specified
features. To see how to extract the summary of results from these files, read
STEP 4 below.
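
As a rough sanity check after a run, you can confirm that the error file is
empty and that at least one *.wekaoutput file was written for the term. The
helper below is only a sketch (check_run is a hypothetical name); the file
locations follow the listing above.

```shell
# Hypothetical helper: check the outcome of one experiment run.
# $1 = experiments directory, $2 = configuration name, $3 = term.
check_run() {
    script="$1/scripts/$2/runtests.$3.$2.sh"
    # -s is true only for a file that exists and is not empty.
    if [ -s "$script.error" ]; then
        echo "errors were reported, see $script.error"
        return 1
    fi
    count=0
    for f in "$1/results/$2/$3"/*.wekaoutput; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count classifier output file(s) for $3"
    [ "$count" -gt 0 ]
}
```

For the run shown above, "check_run . ub5p2 APC" would be expected to report
3 classifier output files, one each for SMO, NaiveBayes and J48.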


STEP 4 - Summarizing results
----------------------------

In the general case, there is one directory per ambiguous term inside the
results/<config_name> directory. Inside each of these directories there will
be one or more *.wekaoutput files which contain the WEKA output from the
selected classifiers.

To summarize, for each classifier, the results for all ambiguous terms, you
can use the script summarystats.pl. This extracts the accuracy, error,
training time and execution time for the experiments in comma separated
values format. The output is printed to STDOUT, and the user should redirect
it to the required file.

summarystats.pl takes at most 2 arguments (*in specified order*):

1. PATH TO RESULTS DIRECTORY

The first one is mandatory and it specifies the path to the results directory
containing the directories for each ambiguous term. 

2. CONFIGURATION SUFFIX (optional)

The second argument is optional and specifies any *suffix* that should be used 
for the configuration name that will be shown in the output. By default the 
configuration name specified to mkconfig.pl is used.

So for example, to run summarystats.pl from the experiments directory in
the current example, one can use the command:

# ../bin/summarystats.pl results/ub5p2 Jul12 > summary_July12.csv

This should use the string Jul12 as the suffix for the experimental
configuration name and store the output in the summary_July12.csv file.
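
If you have run experiments for several configurations, the same call can be
repeated for every directory under results/. The loop below is only a sketch:
summarize_all and the output file naming scheme are hypothetical, and the
script is invoked exactly as described above, with the results directory
first and the suffix second.

```shell
# Hypothetical helper: run the summary script once per configuration.
# $1 = path to summarystats.pl, $2 = top level results directory,
# $3 = suffix passed as the second argument.
# One CSV file per configuration is written to the current directory.
summarize_all() {
    for dir in "$2"/*/; do
        if [ -d "$dir" ]; then
            dir=${dir%/}                 # drop the trailing slash
            config=$(basename "$dir")
            "$1" "$dir" "$3" > "summary_${config}_$3.csv"
        fi
    done
}
```

From the experiments directory this could be called as
"summarize_all ../bin/summarystats.pl results Jul12", which for the running
example would write summary_ub5p2_Jul12.csv.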


** Copyright Notice **
----------------------

Copyright (C) 2005-06, 

Mahesh Joshi
University of Minnesota, Duluth

Ted Pedersen
University of Minnesota, Duluth

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.