*************************************************
README file for WSDGate v 0.05
*************************************************
The purpose of this document is to give the reader a brief
introduction to the WSDGate framework.
WSDGate is an end-to-end Supervised Word Sense Disambiguation (WSD) framework
developed by making use of existing resources such as GATE (General
Architecture for Text Engineering) and WEKA (Waikato Environment for Knowledge
Analysis). It also makes use of NSPGate, which is a GATE processing resource
that acts as a wrapper around the Ngram Statistics Package (NSP).
The aim of WSDGate is to facilitate batch mode experiments of supervised
WSD, using GATE and NSPGate for feature identification and extraction and
WEKA for machine learning, to perform several cross-validation experiments.
A typical supervised WSD approach requires identification of the best set of
features and the best machine learning algorithms for a given set of data.
WSDGate is intended to facilitate exactly this requirement.
More background about the system is available in the following Intelligent
Systems demo paper:
Mahesh Joshi, Serguei Pakhomov, Ted Pedersen, Richard Maclin and Christopher
Chute. An End-to-end Supervised Target-Word Sense Disambiguation System. To
appear in Proceedings of the Twenty-First National Conference on Artificial
Intelligence, Intelligent Systems Demonstrations, 2006 (AAAI-06).
The following sections provide background regarding a typical target-word
disambiguation task described in the USAGE document, where we have several
manually labeled instances of a set of ambiguous words and machine learning
models are to be learnt for disambiguating their future instances.
In a typical supervised WSD task, two factors are important:
1. Choosing good features
2. Choosing good machine learning algorithms (with good parameters)
Below is a description of the various components used in the WSDGate system
with respect to their parameters and the features they provide for a WSD
task. In the GATE machine learning framework, features are essentially
identified from the annotations created on documents or attributes of created
annotations. Each of the following plugins creates different types of
annotations (and attributes on them), which can be used as features by WSDGate.
------------------------------------------------------------------------------
Brief overview of what the plugins do and the features we get to use as of now
------------------------------------------------------------------------------
** NSP Wrapper **
-----------------
(PLUGIN FROM NSPGate, http://sourceforge.net/projects/nspgate/)
This plugin is a wrapper for the Ngram Statistics Package (NSP). Please refer
to its README document for explanation of its parameters. It produces
ngram annotations according to the parameters specified. The new annotations
that it creates are:
1gram - for unigrams
2gram - for bigrams
3gram - for trigrams, and so on.
All these are created in the specified output annotation set, which is
a parameter taken by NSPGate. In the absence of any specification for this
parameter, the annotations are created in the GATE "default" annotation set,
which does not have any name.
** ANNIE **
-----------
(Existing plugin in GATE, we only use it)
We use 3 components of ANNIE - Tokenizer, Sentence Splitter and Part of Speech
Tagger. Parameters for these are documented in the GATE documentation at
http://gate.ac.uk/sale/tao/index.html. The annotation types created by these
components in the default annotation set are:
Token, SpaceToken - by Tokenizer
Sentence - by Sentence Splitter
"category" attribute on a Token annotation - by POS Tagger (only enhances
the Token annotations created by the Tokenizer by adding a "category"
attribute to them, no new annotations created)
(Note: An "attribute" in XML terminology is something that gives additional
information about an XML tag, e.g. in <answer instance="..." senseid="..."/>,
"instance" and "senseid" are attributes of the "answer" tag.)
Below is information on the other GATE PRs, which do not create any annotations
but make use of the annotations created by the PRs above, or integrate the
functionality of several PRs.
** WSDGate PR **
----------------
(Plugin made available as a part of WSDGate)
This is a wrapper plugin that combines the functionality of all of the
above plugins and integrates the process of running WEKA classifiers
on the ARFF files produced, after applying pre-processing filters etc.
It generates WEKA output files containing output from WEKA classifiers.
----------
PARAMETERS
----------
"corpus"
The source corpus containing XML files for instances of an ambiguous word,
imported into the GATE environment.
"remove"
The frequency cutoff value for ngram features that is passed to the NSPGate PR.
"crossValCount"
The number of times the cross-validation experiment is to be repeated per
machine learning algorithm.
"ngrams"
The value of N for various Ngrams to use, separated by the pipe symbol (|).
For example to use unigrams and bigrams, this should be set to "1|2" without
the quotes.
"nontoken"
The file that contains regular expressions for tokens to be discarded. This is
passed to the NSPGate PR.
"score"
The score cutoff value to be used by the statistics module of the Ngram
Statistics Package (NSP). This is passed to the NSPGate PR.
"stop"
The file containing a list of common stop words or function words which should
be discarded from the list of features. This is passed to the NSPGate PR.
"token"
The file containing regular expressions which we want to be identified as
tokens. This is passed to NSPGate PR.
"statModule"
The statistic module to be used by the statistic.pl program from NSP. This is
passed to NSPGate PR.
"datasetPath"
The path where WEKA dataset files in the ARFF format should be created.
"resultPath"
The path where the output of WEKA preprocessing filters and classifiers is to
be stored.
"fileNamePrefix"
The prefix to be used while creating ARFF and output file names.
"wekaClassifiers"
The WEKA classifier(s) to use for performing cross-validation experiments,
separated by the pipe symbol (|). For example, to run the Support Vector
Machine classifier and the naive Bayes classifier in WEKA, one would use
"weka.classifiers.functions.SMO|weka.classifiers.bayes.NaiveBayes" without the
quotes.
"wekaOptions"
Respective WEKA classifier options for each of the classifiers in the parameter
above, separated by the pipe symbol (|).
"wekaFilterWithOptions"
Any WEKA filter to use, with its options. Only one can be used at most.
"stringToNominalRange"
A list of attributes to convert from string to nominal type.
"mlConfigFilesDir"
Directory containing the Machine Learning XML configuration files. The files
are passed to the WSDMachineLearningPR, a new one for each file.
"mlConfigFilesFilter"
Wild card filters to choose any particular machine learning configuration
files in the directory above. By default all XML files are used.
"inputASName"
The input annotation set to use for extracting features.
"outputASName"
The output annotation set name to use for creating any new annotations, default
is - "WSD Annotations" without the quotes.
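The pipe-separated parameters above ("ngrams", "wekaClassifiers" and
"wekaOptions") can be read as parallel lists. The following Python sketch
illustrates that reading only; it is not WSDGate code. The helper names are
hypothetical, and both the "-C 1.0" option string for SMO and the idea that an
empty option string is accepted for a classifier are assumptions made for the
example.

```python
def split_pipe_list(value):
    """Split a pipe-separated parameter value, keeping empty entries."""
    return value.split("|")


def pair_classifiers_with_options(weka_classifiers, weka_options):
    """Pair each classifier name with the option string at the same position."""
    classifiers = split_pipe_list(weka_classifiers)
    options = split_pipe_list(weka_options)
    if len(classifiers) != len(options):
        raise ValueError("expected one option string per classifier")
    return list(zip(classifiers, options))


# "1|2" selects unigrams and bigrams, as in the "ngrams" description.
ngram_sizes = [int(n) for n in split_pipe_list("1|2")]

# Assumption: SMO gets "-C 1.0", NaiveBayes gets an empty option string.
pairs = pair_classifiers_with_options(
    "weka.classifiers.functions.SMO|weka.classifiers.bayes.NaiveBayes",
    "-C 1.0|",
)

print(ngram_sizes)
for classifier, options in pairs:
    print(classifier, repr(options))
```

Keeping the two lists the same length makes it unambiguous which option string
belongs to which classifier.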
** WSDMachineLearningPR **
--------------------------
This is a modified version of the machine learning PR from GATE. It adds
functionality to enable feature extraction from a "flexible distance" in the
sense that a feature position of -1 means "get the first feature to the left
of the ambiguous word, no matter how far it is", instead of the usual "get the
first word to the left of the ambiguous word if it's a feature, else return
empty".
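The difference between the usual lookup and the "flexible distance" lookup can
be sketched in a few lines of Python. This is a simplified model, not the PR's
implementation: the left context is assumed to be a list of words ordered from
farthest to nearest, where each entry is either a feature value or None when
the word is not a feature. The function names are hypothetical.

```python
def fixed_left_feature(left_context, position):
    """Usual behaviour: look only at the word exactly |position| steps left."""
    index = len(left_context) + position  # position is negative, e.g. -1
    if index < 0:
        return None
    return left_context[index]


def flexible_left_feature(left_context, position):
    """Flexible distance: scan left and return the |position|-th feature found."""
    wanted = -position
    seen = 0
    for value in reversed(left_context):
        if value is not None:
            seen += 1
            if seen == wanted:
                return value
    return None  # ran out of context: treated as a missing value


# Two non-feature words sit between "fertile" and the ambiguous word.
left_context = ["fertile", None, None]

print(fixed_left_feature(left_context, -1))
print(flexible_left_feature(left_context, -1))
```

With this context, the fixed lookup at position -1 finds no feature, while the
flexible lookup keeps scanning and returns "fertile".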
----------
PARAMETERS
----------
"document"
The document to be processed, for which the features are to be extracted using
the annotations in the document.
"inputASName"
The name of the annotation set in the above document, to be used for input,
that is for extracting features.
"configFileURL"
The URL for the Machine Learning configuration file that specifies what
features are to be extracted from the above document.
"training"
A boolean flag which is set to true for enabling training mode, else the PR
operates in testing mode, applying a previously generated model to the
input instances.
** SplitSval2Instances PR **
----------------------------
For efficiency reasons, it is best to have the input data files (that is the
instances of an ambiguous word) in multiple xml files -- one per instance
of the ambiguous word. This PR helps in breaking up a single file that is in
Senseval-2 format, into multiple files, one per instance in GATE xml format.
----------
PARAMETERS
----------
"document"
The source Senseval-2 file imported into GATE environment.
"outputPath"
The directory where the output GATE xml files, one per instance should be
created.
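The idea of splitting one Senseval-2 file into one piece per instance can be
sketched in Python as below. This is only an illustration of the splitting
step: the sample data is made up, the function name is hypothetical, and the
sketch returns the raw instance elements as strings rather than writing files
in GATE xml format as the PR does.

```python
import xml.etree.ElementTree as ET

# Made-up sample in Senseval-2 lexical sample layout (illustration only).
SAMPLE = """<corpus lang="english">
<lexelt item="bank.n">
<instance id="bank.1">
<answer instance="bank.1" senseid="river"/>
<context>The fertile <head>bank</head> of the river.</context>
</instance>
<instance id="bank.2">
<answer instance="bank.2" senseid="money"/>
<context>She went to the <head>bank</head> to deposit a check.</context>
</instance>
</lexelt>
</corpus>"""


def split_instances(senseval2_xml):
    """Return a dict mapping each instance id to that instance's XML text."""
    root = ET.fromstring(senseval2_xml)
    pieces = {}
    for instance in root.iter("instance"):
        text = ET.tostring(instance, encoding="unicode").strip()
        pieces[instance.get("id")] = text
    return pieces


pieces = split_instances(SAMPLE)
for instance_id in pieces:
    print(instance_id)
```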
==================
COMMAND LINE USAGE
==================
Note that all of the above components are GATE plugins and do not run
outside of GATE. To facilitate command line usage another Java class
has been developed that parses a configuration file containing parameters for
the plugins above and invokes these components after initializing GATE in
non-UI mode. This class is gate.creole.wsd.WSDExperiments.
The parameters required for this Java class at command line can be given in
two ways.
The first way is used when one already has multiple input XML files (one
per instance) of an ambiguous word. They are as follows (in the specified
order below):
1. TERM_FILE
This is a plain text file containing the list of ambiguous words to be
experimented upon. Note that these are also the directory names in the
root level input data directory. Refer to USAGE document for input data
directory structure.
2. OPTIONS_FILE
This is a plain text configuration file containing parameter specification for
NSPGate and WSDGate PRs. Each entry has the form PARAMETER_NAME:=VALUE; a
parameter is either global for all experiments or local to a single
experiment. Details of the parameter names and the file format, including how
global and local parameters are written, are in the sample configuration file
samples/configtemplate.
3. INPUTS_DIR
This is the top level root directory containing one directory per ambiguous
word. The names of these directories should be exactly the same as those
listed in the TERM_FILE above. Each of these directories contains multiple
XML files (one per instance) for the corresponding ambiguous word.
4. OUTPUTS_DIR
This is the top level root directory where the WEKA output should be stored.
Similar to the INPUTS_DIR, one directory per ambiguous word is created in
this output directory and the output files for an ambiguous word are stored in
the corresponding sub-directory.
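The PARAMETER_NAME:=VALUE entries of the OPTIONS_FILE described above can be
read with a few lines of Python. This sketch is not the parser used by
WSDGate: it ignores the distinction between global and local parameters, it
skips every line that does not contain ":=", and the parameter names in the
sample are borrowed from the WSDGate PR parameter list on the assumption that
the configuration file uses the same names. The samples/configtemplate file
remains the authority on the real format.

```python
def parse_options(text):
    """Collect PARAMETER_NAME:=VALUE lines into a dict; skip other lines."""
    options = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":=")
        if not separator:
            continue  # not a parameter line
        options[name.strip()] = value.strip()
    return options


# Assumed sample content; names taken from the WSDGate PR parameter list.
sample = "ngrams:=1|2\nremove:=2\ncrossValCount:=10\n"

options = parse_options(sample)
print(options)
```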
The second way to invoke gate.creole.wsd.WSDExperiments is when one has a
single Senseval-2 formatted file for an ambiguous word, containing all the
instances. The parameters to use in this case are as follows (in the given
order):
1. --singlefile
This is just a flag that indicates that what follows is the name of the
Senseval-2 formatted input file.
2. INPUT_FILE
This is the name of the Senseval-2 formatted input file for the ambiguous
word. Note that in this mode, only one word can be processed per invocation
of gate.creole.wsd.WSDExperiments.
3. OPTIONS_FILE
This is the same configuration file as described in the first method of
invoking gate.creole.wsd.WSDExperiments.
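The two argument orders described above can be summarized with a small Python
sketch that maps a list of command line arguments to the documented names. It
only mirrors the ordering given in this section; it is not the argument
handling of gate.creole.wsd.WSDExperiments, and the function name and the
"mode" labels are hypothetical.

```python
def interpret_arguments(argv):
    """Map positional arguments to names for the two documented modes."""
    if argv and argv[0] == "--singlefile":
        if len(argv) != 3:
            raise ValueError("expected: --singlefile INPUT_FILE OPTIONS_FILE")
        return {
            "mode": "singlefile",
            "INPUT_FILE": argv[1],
            "OPTIONS_FILE": argv[2],
        }
    if len(argv) != 4:
        raise ValueError(
            "expected: TERM_FILE OPTIONS_FILE INPUTS_DIR OUTPUTS_DIR"
        )
    names = ["TERM_FILE", "OPTIONS_FILE", "INPUTS_DIR", "OUTPUTS_DIR"]
    result = dict(zip(names, argv))
    result["mode"] = "multifile"
    return result


# File and directory names below are placeholders.
print(interpret_arguments(["terms.txt", "options.txt", "inputs", "outputs"]))
print(interpret_arguments(["--singlefile", "bank.xml", "options.txt"]))
```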
The manual creation of XML configuration files for feature extraction is a
tedious task. To automate part of the process and also to create
an experimental directory structure, WSDGate provides a utility Perl script,
mkconfig.pl, which is described below.
mkconfig.pl
-----------
mkconfig.pl is the program that automatically creates experimental
directories, configuration files and scripts for running experiments.
REQUIRED parameters to mkconfig.pl are:
--configname
This is the name given to the generated experiment. Usually this can be
something that reflects the choice of features. For example, if the features
are unigrams and bigrams in a window of 5 and POS tags in a window of 2, then
one possible configuration name is "ub5p2". The name should not
contain spaces, semi-colons or the = sign.
--javapath
This is the full path to the Java virtual machine binary, e.g.
/usr/local/j2sdk1.4.2_09/jre/bin/java
--gatehome
This is the full path to the home directory of the GATE installation
on your machine, e.g. /usr/local/GATE3.0
Note: If the GATE path contains a space, use double quotes to specify.
--configtemplatefile
This is the path to the configuration template file (configtemplate) that
comes with the package and *WHICH YOU SHOULD CUSTOMIZE* for your set of
experiments. More information on how to customize the configuration
template is available in the "configtemplate" file itself, which describes
the format and explains each of the parameters.
####################
** IMPORTANT NOTE **
####################
You *MUST* update all the /FULL/PATH/... related parameters in the sample
configuration template to point to directories and files on your machine,
(you can comment out the ones that are not marked ** REQUIRED **).
############################
** ANOTHER IMPORTANT NOTE **
############################
In the description below, the terminology used is as follows:
**********************
* Annotation = XML Tag
e.g. in the string "<head>APC</head>", the annotation is "head" and it
encloses the text "APC".
************************************
* Attribute = Property of an XML Tag
e.g. in the string "<answer id="12345" senseid="river"/>", the attributes
are "id" and "senseid" and their values are "12345" and "river"
respectively. (The annotation is "answer", and it does not
have any enclosed text, since the tag ends immediately).
****************************************************************
* Feature = A property of an instance of an ambiguous word which is
used by Machine Learning algorithms, either as
input for prediction of some feature, or as the value to be
predicted
e.g. in the string
"<1gram string="fertile">fertile</1gram> bank"
the value "fertile" of the attribute "string" of the annotation "1gram"
can be a feature. Also, simply the presence or absence of the "1gram"
annotation around the word "fertile" can be a feature. So features
are defined based on values of attributes or presence of annotations.
###################################
** END OF ANOTHER IMPORTANT NOTE **
###################################
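The terminology above can be checked against a standard XML parser. The short
Python sketch below uses the examples from the note (an "answer" tag with
attributes "id" and "senseid", and a "head" tag enclosing "APC"); it is only an
illustration of the three terms and has nothing to do with how GATE stores
annotations internally. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

answer = ET.fromstring('<answer id="12345" senseid="river"/>')
head = ET.fromstring("<head>APC</head>")

# Annotation = XML tag name.
print(answer.tag)
print(head.tag)

# Attribute = property of the tag; "answer" has no enclosed text.
print(answer.attrib)
print(answer.text)

# Feature = either an attribute value or the mere presence of an annotation.
sense_feature = answer.get("senseid")  # value-based feature
head_present = head.tag == "head"      # presence-based feature
print(sense_feature, head_present)
```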
The following parameters determine the features that will be used in
the experiment. These decide the contents of the feature specification
XML file. Keeping the above terminology convention in mind, the annotations
in the feature specification XML file will be simply called XML tags.
--inst
This is the annotation in your input data files that encloses the
instance i.e. the occurrence of the ambiguous terms in the data files.
--class
The feature specification XML file consists of an enclosing tag, inside
which there are several nested tags which specify the details
about the features to be extracted. One (and only one) of the features
is the CLASS feature (one which is to be predicted, based on
knowledge of other features). This command line argument takes the
details about the CLASS feature.
The class feature parameters are to be specified in the following format:
There should be exactly 6 parameters separated by the comma (,) symbol. Where
a parameter value is not required, a comma should be repeated
without any space after the previous comma, e.g. a,b,,-1,Y,emptyvalues.
There should be absolutely no space anywhere in the parameter
specification.
NAME,ANNOTATION_TYPE,ATTRIBUTE,POSITION,FLOATING_FEATURE,VALUES
NAME: Mandatory, cannot be empty.
This is the name given to the feature, as it should be seen in
the WEKA ARFF file, e.g. a name for the class feature of the target
word can be "Meaning". These names should be strictly alpha-numeric
with only minus sign (-) and parentheses where needed.
ANNOTATION_TYPE: Mandatory, cannot be empty.
This is the annotation type in the input data file
using which the class feature is to be extracted.
ATTRIBUTE: Optional, when empty only presence or absence of
ANNOTATION_TYPE above is recorded.
This is the attribute of the above annotation from which
the value for the CLASS feature is to be extracted.
POSITION: Mandatory, cannot be empty.
This parameter should specify the relative position of
ANNOTATION_TYPE with respect to the instance annotation.
FLOATING_FEATURE: Optional, when empty defaults to "N".
This parameter specifies whether the physical
position of ANNOTATION_TYPE should be treated as a "floating" position
such that even if the ANNOTATION_TYPE is not present exactly at the
specified position (where a different annotation might be present),
the search should continue until the specified occurrence of
ANNOTATION_TYPE is found. If the end of the document is reached,
then a missing value (?) is returned. The value to be specified for
this parameter should be either 'Y' if floating position is
desired or 'N' if floating position is not desired.
VALUES: Optional, when empty it defaults to "novalues".
This parameter can take one of the following 3 values,
1. "novalues" - A set of possible values for the feature need
not be specified. This is used when the feature is binary, where
only the presence or absence of an annotation is to be recorded.
2. "emptyvalues" - Useful for creating features of datatype "string"
in the WEKA ARFF format, where the specification means that
the set of possible values for this attribute is not known
in advance, but should rather be decided after all the instances
are known.
3. A file containing possible nominal values: This is useful for
creation of WEKA nominal features where the set of possible
values of the feature is known in advance.
For the purpose of most WSD experiments, it is suitable to select
"emptyvalues" which means that the attributes will be created as
string attributes in the WEKA ARFF file and then they can be
converted to nominal or other required types using WEKA filters.
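The six-field format of the --class argument can be made concrete with a small
Python sketch that splits a specification and fills in the documented
defaults. It is not the parsing code of mkconfig.pl (which is a Perl script);
the function name is hypothetical, and the second example value
("Meaning,answer,senseid,0,,") is an illustrative guess at a plausible class
specification, not one taken from the package.

```python
CLASS_FIELDS = [
    "NAME", "ANNOTATION_TYPE", "ATTRIBUTE",
    "POSITION", "FLOATING_FEATURE", "VALUES",
]


def parse_class_spec(spec):
    """Split a --class value into its 6 fields and apply documented defaults."""
    if " " in spec:
        raise ValueError("no spaces are allowed in the specification")
    parts = spec.split(",")
    if len(parts) != len(CLASS_FIELDS):
        raise ValueError("expected exactly 6 comma-separated parameters")
    fields = dict(zip(CLASS_FIELDS, parts))
    for required in ("NAME", "ANNOTATION_TYPE", "POSITION"):
        if not fields[required]:
            raise ValueError(required + " is mandatory")
    fields["POSITION"] = int(fields["POSITION"])
    fields["FLOATING_FEATURE"] = fields["FLOATING_FEATURE"] or "N"
    if fields["FLOATING_FEATURE"] not in ("Y", "N"):
        raise ValueError("FLOATING_FEATURE must be Y or N")
    fields["VALUES"] = fields["VALUES"] or "novalues"
    return fields


# Example from the text above.
print(parse_class_spec("a,b,,-1,Y,emptyvalues"))
# Illustrative assumption: sense taken from the "senseid" attribute.
print(parse_class_spec("Meaning,answer,senseid,0,,"))
```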
--feat
This argument specifies features other than the class feature in the
feature specification XML file. The difference from the class feature
parameters specification is that here we can have specification of
parameters for multiple attributes.
The feature parameters are to be specified in the following format:
There should be exactly 7 parameters for each feature, separated by
comma (,) symbol, and every feature should be separated by plus (+)
symbol. Wherever a parameter value is not required, a comma should be
repeated without any space after the previous comma e.g.
a,b,,-1:1,Y,N,emptyvalues.
There should be absolutely no space in between the feature parameter
specifications.
NAME,ANNOTATION_TYPE,ATTRIBUTE,POSITION,ZERO,FLOATING_FEATURE,VALUES
NAME: Mandatory, cannot be empty.
This is the name given to the feature, as it should be seen in
the WEKA ARFF file, e.g. a name for the feature that is one position
to the left of the target word can be "U(-1)". These names should be
strictly alpha-numeric with only minus sign (-) and parentheses where
needed.
ANNOTATION_TYPE: Mandatory, cannot be empty.
This is the annotation type in the input file
from which the feature is to be extracted.
ATTRIBUTE: Optional, when empty only presence or absence of
ANNOTATION_TYPE is recorded.
This is the attribute of the ANNOTATION_TYPE above
from which the value for the feature is to be extracted.
So a crucial part of this feature specification is knowing
what annotations and attributes are created by which components
and only then one is able to use them as features by specifying the
required details about them to the mkconfig.pl file. Note that
incorrect feature specification will lead to creation of incorrect
feature specification files, and since those annotations or attributes
may not be present in the data file, no feature values will be
extracted, instead a missing values dataset will be created.
Annotations produced by components that WSDGate uses are listed in
the plugin overview earlier in this README.
POSITION: Mandatory, cannot be empty.
This parameter should specify the relative position of
ANNOTATION_TYPE above with respect to the instance annotation. One
crucial difference of this parameter with
respect to the corresponding class feature parameter is that this
supports a RANGE of values. So one can specify that this feature
should be captured in a range of positions say -5 to 5. In such a
case to keep the NAME of the feature unique, the NAME value
is appended with this position information. The format for specifying
a RANGE is
START:END, e.g. "-5:5"
IMPORTANT: See next two parameters that are related, and provide an
example.
ZERO: Optional, defaults to "N" when empty.
This parameter applies when a RANGE is specified for the
POSITION parameter above. If the RANGE includes the position 0 (zero)
then one might or might not want a feature for that position.
Position 0 essentially means the instance annotation itself, i.e. the
target word. So, for example for Part of Speech (POS) features in a
window of 2 around the target word, one might or might not want the POS
tag for the target term. This parameter can have 2 values: "Y" if one
wants the position 0 to be a feature, "N" if position 0 should not
be a feature.
FLOATING_FEATURE: Optional, defaults to "N" when empty.
This parameter specifies whether the physical
position of ANNOTATION_TYPE should be treated as a "floating" position
such that even if ANNOTATION_TYPE above is not present exactly at the
specified position (where a different annotation might be present),
the search should continue until the specified occurrence of the
ANNOTATION_TYPE is found. If the end of the document is reached, then
a missing value is returned.
VALUES: Optional, defaults to "novalues" when empty.
This parameter can take one of the following 3 values,
1. "novalues" - A set of possible values for the feature need
not be specified.
2. "emptyvalues" - Useful for creating features of datatype "string"
in the WEKA ARFF format, where the specification means that
the set of possible values for this feature is not known
in advance, but should rather be decided after all the instances
are known.
3. A file containing possible nominal values: This is useful for
creation of WEKA nominal features where the set of possible
values of the feature is known in advance.
For the purpose of most WSD experiments, it is suitable to select
"emptyvalues" which means that the attributes will be created as
string attributes in the WEKA ARFF file and then they can be
converted to nominal or other required types using WEKA filters.
An exception is the POS tag values, which we know in advance. A
file pos_values.txt has been provided with the package. It contains
a list of all the POS tags that can be marked up by the ANNIE POS
tagger. This should be used as a VALUES parameter for POS attributes.
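The interaction between POSITION ranges and the ZERO parameter in the --feat
argument can be sketched in Python as follows. This is a minimal sketch and not
the logic of mkconfig.pl: the function names are hypothetical, the sketch does
not generate the position-suffixed feature names (the exact naming scheme is
not reproduced here), and the second feature in the example, a POS feature on
the "category" attribute of "Token" using pos_values.txt, is an illustrative
combination of names mentioned in this document.

```python
FEATURE_FIELDS = [
    "NAME", "ANNOTATION_TYPE", "ATTRIBUTE", "POSITION",
    "ZERO", "FLOATING_FEATURE", "VALUES",
]


def expand_positions(position, zero):
    """Expand "-2:2" into positions, dropping 0 unless zero is "Y"."""
    if ":" not in position:
        return [int(position)]  # single position: ZERO does not apply
    start, end = (int(p) for p in position.split(":"))
    return [p for p in range(start, end + 1) if p != 0 or zero == "Y"]


def parse_feature_specs(spec):
    """Split a --feat value on "+" and "," and apply documented defaults."""
    features = []
    for one in spec.split("+"):
        parts = one.split(",")
        if len(parts) != len(FEATURE_FIELDS):
            raise ValueError("expected exactly 7 parameters per feature")
        fields = dict(zip(FEATURE_FIELDS, parts))
        fields["ZERO"] = fields["ZERO"] or "N"
        fields["FLOATING_FEATURE"] = fields["FLOATING_FEATURE"] or "N"
        fields["VALUES"] = fields["VALUES"] or "novalues"
        fields["POSITIONS"] = expand_positions(
            fields["POSITION"], fields["ZERO"]
        )
        features.append(fields)
    return features


features = parse_feature_specs(
    "a,b,,-1:1,Y,N,emptyvalues+POS,Token,category,-2:2,,,pos_values.txt"
)
for feature in features:
    print(feature["NAME"], feature["POSITIONS"])
```

In the example, the first feature keeps position 0 because ZERO is "Y", while
the POS feature leaves ZERO empty and therefore skips the target word itself.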
Optional parameters to mkconfig.pl are:
--engine
This is a parameter specific to GATE and decides which machine learning
engine should be used. Currently, if used this should *ALWAYS* be set to
gate.creole.wsd.WSDWekaWrapper and is therefore redundant.
--memory
This *optional* argument can be used to modify the heap size that should
be used by the java virtual machine, e.g. "--memory 1024M". The default
heap size used to initialize the JVM is 400M.
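The options described above can be assembled into a single mkconfig.pl
invocation. The Python sketch below only builds the argument list; it does not
run anything. Running the script through "perl" is an assumption, as are the
example --inst, --class and --feat values, which are illustrative and should be
replaced with values that match your data. The paths are the examples given
earlier in this document. The function name is hypothetical.

```python
def mkconfig_command(configname, javapath, gatehome, configtemplatefile,
                     inst, class_spec, feat_spec, memory=None):
    """Build an argument list for mkconfig.pl from the documented options."""
    for forbidden in (" ", ";", "="):
        if forbidden in configname:
            raise ValueError("configname must not contain spaces, ; or =")
    command = [
        "perl", "mkconfig.pl",  # assumption: script is run through perl
        "--configname", configname,
        "--javapath", javapath,
        "--gatehome", gatehome,
        "--configtemplatefile", configtemplatefile,
        "--inst", inst,
        "--class", class_spec,
        "--feat", feat_spec,
    ]
    if memory is not None:
        command += ["--memory", memory]
    return command


command = mkconfig_command(
    configname="u5p2",
    javapath="/usr/local/j2sdk1.4.2_09/jre/bin/java",
    gatehome="/usr/local/GATE3.0",
    configtemplatefile="samples/configtemplate",
    inst="head",                                          # illustrative
    class_spec="Meaning,answer,senseid,0,N,emptyvalues",  # illustrative
    feat_spec="U,1gram,string,-5:5,N,Y,emptyvalues"
              "+POS,Token,category,-2:2,N,N,pos_values.txt",
    memory="1024M",
)
print(" ".join(command))
```

Building the command as a list rather than a single string means a GATE path
that contains a space stays one argument if the list is later handed to a
process launcher without going through a shell.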
** Questions? **
----------------
Contact Mahesh Joshi (joshi031@d.umn.edu) or Ted Pedersen (tpederse@d.umn.edu).
** Copyright Notice **
----------------------
Copyright (C) 2005-06,
Mahesh Joshi
University of Minnesota, Duluth
Ted Pedersen
University of Minnesota, Duluth
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.