*************************************************
        README file for WSDGate v 0.05
*************************************************

The purpose of this document is to give the reader a brief introduction to
the WSDGate framework.

WSDGate is an end-to-end supervised Word Sense Disambiguation (WSD)
framework built on existing resources such as GATE (General Architecture
for Text Engineering) and WEKA (Waikato Environment for Knowledge
Analysis). It also makes use of NSPGate, which is a GATE processing
resource that acts as a wrapper around the Ngram Statistics Package (NSP).

The aim of WSDGate is to facilitate batch mode experiments in supervised
WSD, using GATE and NSPGate for feature identification and extraction and
WEKA for machine learning, so that several cross-validation experiments
can be performed. A typical supervised WSD approach requires identifying
the best set of features and the best machine learning algorithms for a
given set of data. WSDGate is intended to facilitate exactly this
requirement.

More background about the system is available in the following
Intelligent Systems demo paper:

Mahesh Joshi, Serguei Pakhomov, Ted Pedersen, Richard Maclin and
Christopher Chute. An End-to-end Supervised Target-Word Sense
Disambiguation System. To appear in Proceedings of the Twenty-First
National Conference on Artificial Intelligence, Intelligent Systems
Demonstrations, 2006 (AAAI-06).

The following sections provide background on the typical target-word
disambiguation task described in the USAGE document, where we have several
manually labeled instances of a set of ambiguous words and machine
learning models are to be learnt for disambiguating their future
instances.

In a typical supervised WSD task, two factors are important:

1. Choosing good features
2. Choosing good machine learning algorithms (with good parameters)

Below is a description of the various components used in the WSDGate
system, with respect to their parameters and the features they provide for
a WSD task. In the GATE machine learning framework, features are
identified from the annotations created on documents, or from the
attributes of those annotations. Each of the following plugins creates
different types of annotations (and attributes on them), which can be used
as features by WSDGate.

------------------------------------------------------------------------------
Brief overview of what the plugins do and the features we get to use as of now
------------------------------------------------------------------------------

** NSP Wrapper **
-----------------
(PLUGIN FROM NSPGate, http://sourceforge.net/projects/nspgate/)

This plugin is a wrapper for the Ngram Statistics Package (NSP). Please
refer to its README document for an explanation of its parameters. It
produces ngram annotations according to the parameters specified. The new
annotations that it creates are:

1gram - for unigrams
2gram - for bigrams
3gram - for trigrams, and so on.

All these are created in the specified output annotation set, which is a
parameter taken by NSPGate. If this parameter is not specified, the
annotations go into the GATE "default" annotation set, which does not have
any name.

** ANNIE **
-----------
(Existing plugin in GATE, we only use it)

We use 3 components of ANNIE: Tokenizer, Sentence Splitter and Part of
Speech Tagger. Parameters for these are documented in the GATE
documentation at http://gate.ac.uk/sale/tao/index.html.
The annotation types created by these components in the default annotation
set are:

Token, SpaceToken - by Tokenizer
Sentence - by Sentence Splitter
"category" attribute on a Token annotation - by POS Tagger
  (only enhances the Token annotations created by the Tokenizer by adding
  a "category" attribute to them, no new annotations created)

(Note: An "attribute" in XML terminology is something that gives
additional information about an XML tag, e.g. in
<answer instance="..." senseid="..."/>, "instance" and "senseid" are
attributes of the "answer" tag.)

The GATE PRs described below do not create any annotations themselves.
They make use of the annotations created by the PRs above, or integrate
the functionality of several PRs.

** WSDGate PR **
----------------
(Plugin made available as a part of WSDGate)

This is a wrapper plugin that combines the functionality of all of the
above plugins and integrates the process of running WEKA classifiers on
the ARFF files produced, after applying pre-processing filters etc. It
generates WEKA output files containing output from WEKA classifiers.

----------
PARAMETERS
----------

"corpus"
    The source corpus containing XML files for instances of an ambiguous
    word, imported into the GATE environment.

"remove"
    The frequency cutoff value for ngram features that is passed to the
    NSPGate PR.

"crossValCount"
    The number of times the cross-validation experiment is to be repeated
    per machine learning algorithm.

"ngrams"
    The values of N for the Ngrams to use, separated by the pipe symbol
    (|). For example, to use unigrams and bigrams, this should be set to
    "1|2" without the quotes.

"nontoken"
    The file that contains regular expressions for tokens to be discarded.
    This is passed to the NSPGate PR.

"score"
    The score cutoff value to be used by the statistics module of the
    Ngram Statistics Package (NSP). This is passed to the NSPGate PR.

"stop"
    The file containing the list of common stop words or function words
    which should be discarded from the list of features. This is passed
    to the NSPGate PR.

"token"
    The file containing regular expressions which we want to be
    identified as tokens. This is passed to the NSPGate PR.

"statModule"
    The statistic module to be used by the statistic.pl program from NSP.
    This is passed to the NSPGate PR.

"datasetPath"
    The path where WEKA dataset files in the ARFF format should be
    created.

"resultPath"
    The path where the output of WEKA preprocessing filters and
    classifiers is to be stored.

"fileNamePrefix"
    The prefix to be used while creating ARFF and output file names.

"wekaClassifiers"
    The WEKA classifier(s) to use for performing cross-validation
    experiments, separated by the pipe symbol (|). For example, to run
    the Support Vector Machine classifier and the naive Bayes classifier
    in WEKA, one would use
    "weka.classifiers.functions.SMO|weka.classifiers.bayes.NaiveBayes"
    without the quotes.

"wekaOptions"
    Respective WEKA classifier options for each of the classifiers in the
    parameter above, separated by the pipe symbol (|).

"wekaFilterWithOptions"
    Any WEKA filter to use, with its options. At most one filter can be
    used.

"stringToNominalRange"
    A list of attributes to convert from string to nominal type.

"mlConfigFilesDir"
    Directory containing the Machine Learning XML configuration files.
    Each file is passed to a new WSDMachineLearningPR, one per file.

"mlConfigFilesFilter"
    Wild card filters to choose any particular machine learning
    configuration files in the directory above. By default all XML files
    will be used.

"inputASName"
    The input annotation set to use for extracting features.

"outputASName"
    The output annotation set name to use for creating any new
    annotations. The default is "WSD Annotations" without the quotes.

** WSDMachineLearningPR **
--------------------------

This is a modified version of the machine learning PR from GATE.
It adds functionality to enable feature extraction from a "flexible
distance", in the sense that a feature position of -1 means "get the first
feature to the left of the ambiguous word, no matter how far it is",
instead of the usual "get the first word to the left of the ambiguous word
if it is a feature, else return empty".

----------
PARAMETERS
----------

"document"
    The document to be processed, for which the features are to be
    extracted using the annotations in the document.

"inputASName"
    The name of the annotation set in the above document to be used for
    input, that is, for extracting features.

"configFileURL"
    The URL of the Machine Learning configuration file that specifies
    what features are to be extracted from the above document.

"training"
    A boolean flag which is set to true to enable training mode.
    Otherwise the PR operates in testing mode, applying a previously
    generated model to the input instances.

** SplitSval2Instances PR **
----------------------------

For efficiency reasons, it is best to have the input data files (that is,
the instances of an ambiguous word) in multiple XML files -- one per
instance of the ambiguous word. This PR helps in breaking up a single file
that is in Senseval-2 format into multiple files in GATE XML format, one
per instance.

----------
PARAMETERS
----------

"document"
    The source Senseval-2 file imported into the GATE environment.

"outputPath"
    The directory where the output GATE XML files, one per instance,
    should be created.

==================
COMMAND LINE USAGE
==================

Note that all of the above components are GATE plugins and do not run
outside of GATE. To facilitate command line usage, another Java class has
been developed that parses a configuration file containing parameters for
the plugins above and invokes these components after initializing GATE in
non-UI mode. This class is gate.creole.wsd.WSDExperiments.

The parameters required by this Java class at the command line can be
given in two ways.
The first way is used when one already has multiple input XML files (one
per instance) of an ambiguous word. The parameters are as follows (in the
order given below):

1. TERM_FILE
    This is a plain text file containing the list of ambiguous words to
    be experimented upon. Note that these are also the directory names in
    the root level input data directory. Refer to the USAGE document for
    the input data directory structure.

2. OPTIONS_FILE
    This is a plain text configuration file containing parameter
    specification for NSPGate and WSDGate PRs. The format is
    PARAMETER_NAME:=VALUE or PARAMETER_NAME:=VALUE, depending upon
    whether the parameter is global for all experiments or local for a
    single experiment respectively. Details of the parameter names and
    the file format are in the sample configuration file
    samples/configtemplate.

3. INPUTS_DIR
    This is the top level root directory containing one directory per
    ambiguous word. The names of these directories should be exactly the
    same as those listed in the TERM_FILE above. Each of these
    directories contains multiple XML files (one per instance) for the
    corresponding ambiguous word.

4. OUTPUTS_DIR
    This is the top level root directory where the WEKA output should be
    stored. Similar to the INPUTS_DIR, one directory per ambiguous word
    is created in this output directory and the output files for an
    ambiguous word are stored in the corresponding sub-directory.

The second way to invoke gate.creole.wsd.WSDExperiments is used when one
has a single Senseval-2 formatted file for an ambiguous word, containing
all the instances. The parameters to use in this case are as follows (in
the given order):

1. --singlefile
    This is just a flag indicating that what follows is the name of the
    Senseval-2 formatted input file.

2. INPUT_FILE
    This is the name of the Senseval-2 formatted input file for the
    ambiguous word. Note that in this mode, only one word can be
    processed per invocation of gate.creole.wsd.WSDExperiments.

3. OPTIONS_FILE
    This is the same configuration file as described in the first method
    of invoking gate.creole.wsd.WSDExperiments.

The manual creation of XML configuration files for feature extraction is
a tedious task. To automate part of the process, and also to create an
experimental directory structure, WSDGate provides a utility Perl script,
mkconfig.pl. It is described below.

mkconfig.pl
-----------

mkconfig.pl is the program that automatically creates experimental
directories, configuration files and scripts for running experiments.

REQUIRED parameters to mkconfig.pl are:

--configname
    This is the name given to the generated experiment. Usually this can
    be something that reflects the choice of features, e.g. if the
    features are unigrams and bigrams in a window of 5 and POS tags in a
    window of 2, then one possible configuration name is "ub5p2". The
    name should not contain spaces, semi-colons or the = sign.

--javapath
    This is the full path to the Java virtual machine binary, e.g.
    /usr/local/j2sdk1.4.2_09/jre/bin/java

--gatehome
    This is the full path to the home directory of the GATE installation
    on your machine, e.g. /usr/local/GATE3.0
    Note: If the GATE path contains a space, use double quotes to specify
    it.

--configtemplatefile
    This is the path to the configuration template file (configtemplate)
    that comes with the package and *WHICH YOU SHOULD CUSTOMIZE* for your
    set of experiments. More information on how to customize the
    configuration template is available in the "configtemplate" file
    itself, where the format is described and each of the parameters is
    explained.

####################
** IMPORTANT NOTE **
####################

You *MUST* update all the /FULL/PATH/... related parameters in the sample
configuration template to point to directories and files on your machine
(you can comment out the ones that are not marked ** REQUIRED **).
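As an illustration of the required mkconfig.pl options listed above, here is a minimal Python sketch that checks the --configname rule (no spaces, semi-colons or = sign) and collects the four required options into an argument list. The helper names is_valid_configname and required_mkconfig_args are hypothetical and not part of WSDGate, and the paths are the example values from this document. The feature options (--inst, --class, --feat) are deliberately left out of the list.

```python
# Characters that the README says must not appear in a configuration name.
FORBIDDEN_CHARS = set(" ;=")


def is_valid_configname(name):
    """Return True if name is non-empty and has no space, ';' or '='."""
    return bool(name) and not (set(name) & FORBIDDEN_CHARS)


def required_mkconfig_args(configname, javapath, gatehome, configtemplatefile):
    """Collect the four required options as an argument list.

    Hypothetical helper: it only arranges the documented option names and
    does not run mkconfig.pl. The feature options (--inst, --class, --feat)
    still have to be added by the caller.
    """
    if not is_valid_configname(configname):
        raise ValueError("invalid configuration name: %r" % configname)
    return [
        "--configname", configname,
        "--javapath", javapath,
        "--gatehome", gatehome,
        "--configtemplatefile", configtemplatefile,
    ]


# Example values taken from this README.
args = required_mkconfig_args(
    "ub5p2",
    "/usr/local/j2sdk1.4.2_09/jre/bin/java",
    "/usr/local/GATE3.0",
    "samples/configtemplate",
)
print(args)
```

Keeping the options as a list, rather than as one shell string, means that each value stays a single argument even if it contains a space.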
############################
** ANOTHER IMPORTANT NOTE **
############################

In the description below, the terminology used is as follows:

**********************
* Annotation = XML Tag

e.g. in the string "<head>APC</head>", the annotation is "head" and it
encloses the text "APC".

************************************
* Attribute = Property of an XML Tag

e.g. in the string "<answer id="12345" senseid="river"/>", the attributes
are "id" and "senseid" and their values are "12345" and "river"
respectively. (The annotation is "answer", and it does not have any
enclosed text, since the tag ends immediately.)

****************************************************************
* Feature = A property of an instance of ambiguous word which is
  used by Machine Learning algorithms, either for prediction of
  some feature, or as the value to be predicted

e.g. in the string "<1gram string="fertile">fertile</1gram> bank", the
value "fertile" of the attribute "string" of the annotation "1gram" can be
a feature. Also, simply the presence or absence of the "1gram" annotation
around the word "fertile" can be a feature. So features are defined based
on values of attributes or presence of annotations.

###################################
** END OF ANOTHER IMPORTANT NOTE **
###################################

The following parameters determine the features that will be used in the
experiment. These decide the contents of the feature specification XML
file. To keep the above terminology convention, the annotations in the
feature specification XML file itself will simply be called XML tags.

--inst
    This is the annotation in your input data files that encloses the
    instance, i.e. the occurrence of the ambiguous term in the data
    files.

--class
    The feature specification XML file consists of an enclosing tag,
    inside which there are several tags that specify the details of the
    features to be extracted. One (and only one) of the features is the
    CLASS feature (the one which is to be predicted, based on knowledge
    of the other features).
    This command line argument takes the details about the CLASS feature.
    The class feature parameters are to be specified in the following
    format:

    There should be exactly 6 parameters separated by the comma (,)
    symbol. Where a parameter value is not required, a comma should be
    repeated without any space after the previous comma, e.g.
    a,b,,-1,Y,emptyvalues. There should be absolutely no space anywhere
    in the parameter specification.

    NAME,ANNOTATION_TYPE,ATTRIBUTE,POSITION,FLOATING_FEATURE,VALUES

    NAME: Mandatory, cannot be empty.
    This is the name given to the feature, as it should be seen in the
    WEKA ARFF file, e.g. a name for the class feature of the target word
    can be "Meaning". These names should be strictly alpha-numeric, with
    only the minus sign (-) and parentheses where needed.

    ANNOTATION_TYPE: Mandatory, cannot be empty.
    This is the annotation type in the input data file from which the
    class feature is to be extracted.

    ATTRIBUTE: Optional, when empty only presence or absence of
    ANNOTATION_TYPE above is recorded.
    This is the attribute of the above annotation from which the value
    for the CLASS feature is to be extracted.

    POSITION: Mandatory, cannot be empty.
    This parameter should specify the relative position of
    ANNOTATION_TYPE with respect to the instance annotation.

    FLOATING_FEATURE: Optional, when empty defaults to "N".
    This parameter specifies whether the physical position of
    ANNOTATION_TYPE should be treated as a "floating" position, such that
    even if the ANNOTATION_TYPE is not present exactly at the specified
    position (where a different annotation might be present), the search
    should continue until the specified occurrence of ANNOTATION_TYPE is
    found. If the end of the document is reached, then a missing value
    (?) is returned. The value to be specified for this parameter should
    be either 'Y' if floating position is desired or 'N' if floating
    position is not desired.

    VALUES: Optional, when empty it defaults to "novalues".
    This parameter can take one of the following 3 values:

    1. "novalues" - A set of possible values for the feature need not be
       specified. This is used when the feature is binary, where only the
       presence or absence of an annotation is to be recorded.

    2. "emptyvalues" - Useful for creating features of datatype "string"
       in the WEKA ARFF format, where the specification means that the
       set of possible values for this attribute is not known in advance,
       but should rather be decided after all the instances are known.

    3. A file containing possible nominal values: This is useful for
       creation of WEKA nominal features where the set of possible values
       of the feature is known in advance.

    For the purpose of most WSD experiments, it is suitable to select
    "emptyvalues", which means that the attributes will be created as
    string attributes in the WEKA ARFF file and can then be converted to
    nominal or other required types using WEKA filters.

--feat
    This argument specifies features other than the class feature in the
    feature specification XML file. The difference from the class feature
    parameter specification is that here we can specify parameters for
    multiple features. The feature parameters are to be specified in the
    following format:

    There should be exactly 7 parameters for each feature, separated by
    the comma (,) symbol, and every feature should be separated by the
    plus (+) symbol. Wherever a parameter value is not required, a comma
    should be repeated without any space after the previous comma, e.g.
    a,b,,-1:1,Y,N,emptyvalues. There should be absolutely no space
    anywhere in the feature parameter specifications.

    NAME,ANNOTATION_TYPE,ATTRIBUTE,POSITION,ZERO,FLOATING_FEATURE,VALUES

    NAME: Mandatory, cannot be empty.
    This is the name given to the feature, as it should be seen in the
    WEKA ARFF file, e.g. a name for the feature that is one position to
    the left of the target word can be "U(-1)".
    These names should be strictly alpha-numeric, with only the minus
    sign (-) and parentheses where needed.

    ANNOTATION_TYPE: Mandatory, cannot be empty.
    This is the annotation type in the input file from which the feature
    is to be extracted.

    ATTRIBUTE: Optional, when empty only presence or absence of
    ANNOTATION_TYPE is recorded.
    This is the attribute of the ANNOTATION_TYPE above from which the
    value for the feature is to be extracted.

    A crucial part of this feature specification is therefore knowing
    which annotations and attributes are created by which components.
    Only then can one use them as features by specifying the required
    details about them to mkconfig.pl. Note that an incorrect feature
    specification will lead to the creation of incorrect feature
    specification files, and since those annotations or attributes may
    not be present in the data file, no feature values will be extracted.
    Instead, a dataset of missing values will be created. Annotations
    produced by the components that WSDGate uses are listed earlier in
    this README.

    POSITION: Mandatory, cannot be empty.
    This parameter should specify the relative position of
    ANNOTATION_TYPE above with respect to the instance annotation. One
    crucial difference between this parameter and the corresponding class
    feature parameter is that this one supports a RANGE of values. So one
    can specify that this feature should be captured in a range of
    positions, say -5 to 5. In such a case, to keep the NAME of the
    feature unique, the position information is appended to the NAME
    value. The format for specifying a RANGE is <start>:<end>, e.g.
    "-5:5".
    IMPORTANT: See the next two parameters, which are related to this
    one.

    ZERO: Optional, defaults to "N" when empty.
    This parameter applies when a RANGE is specified for the POSITION
    parameter above. If the RANGE includes the position 0 (zero), then
    one might or might not want a feature for that position. Position 0
    essentially means the instance annotation itself, i.e. the target
    word.
    So, for example, for Part of Speech (POS) features in a window of 2
    around the target word, one might or might not want the POS tag for
    the target term. This parameter can have 2 values: "Y" if position 0
    should be a feature, "N" if position 0 should not be a feature.

    FLOATING_FEATURE: Optional, defaults to "N" when empty.
    This parameter specifies whether the physical position of
    ANNOTATION_TYPE should be treated as a "floating" position, such that
    even if ANNOTATION_TYPE above is not present exactly at the specified
    position (where a different annotation might be present), the search
    should continue until the specified occurrence of the ANNOTATION_TYPE
    is found. If the end of the document is reached, then a missing value
    is returned.

    VALUES: Optional, defaults to "novalues" when empty.
    This parameter can take one of the following 3 values:

    1. "novalues" - A set of possible values for the feature need not be
       specified.

    2. "emptyvalues" - Useful for creating features of datatype "string"
       in the WEKA ARFF format, where the specification means that the
       set of possible values for this feature is not known in advance,
       but should rather be decided after all the instances are known.

    3. A file containing possible nominal values: This is useful for
       creation of WEKA nominal features where the set of possible values
       of the feature is known in advance.

    For the purpose of most WSD experiments, it is suitable to select
    "emptyvalues", which means that the attributes will be created as
    string attributes in the WEKA ARFF file and can then be converted to
    nominal or other required types using WEKA filters.

    An exception is the POS tag values, which we know in advance. A file
    pos_values.txt has been provided with the package. It contains a list
    of all the POS tags that can be marked up by the ANNIE POS tagger.
    This should be used as the VALUES parameter for POS attributes.
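To make the --feat format described above concrete, the following Python sketch splits a specification into its 7 fields, applies the documented defaults (ZERO "N", FLOATING_FEATURE "N", VALUES "novalues") and expands a POSITION range into one feature per position. It is only an illustration and not the code used by mkconfig.pl. The function name parse_feat is hypothetical, and the way the position is appended to the NAME, here "NAME(position)", is an assumption, since this document only says that the position information is appended.

```python
def parse_feat(spec):
    """Parse a --feat value into a list of dicts, one per feature position.

    Illustrative sketch only; naming expanded features "NAME(pos)" is an
    assumption, not the documented behaviour of mkconfig.pl.
    """
    features = []
    for part in spec.split("+"):          # features are separated by '+'
        fields = part.split(",")          # parameters are separated by ','
        if len(fields) != 7:
            raise ValueError("expected exactly 7 parameters: %r" % part)
        name, ann_type, attribute, position, zero, floating, values = fields
        if not name or not ann_type or not position:
            raise ValueError(
                "NAME, ANNOTATION_TYPE and POSITION are mandatory: %r" % part)

        # Documented defaults for the optional parameters.
        zero = zero or "N"
        floating = floating or "N"
        values = values or "novalues"

        if ":" in position:
            # RANGE such as "-5:5"; position 0 is kept only when ZERO is "Y".
            start, end = (int(p) for p in position.split(":"))
            positions = [p for p in range(start, end + 1)
                         if p != 0 or zero == "Y"]
            names = ["%s(%d)" % (name, p) for p in positions]
        else:
            positions = [int(position)]
            names = [name]

        for feat_name, pos in zip(names, positions):
            features.append({
                "name": feat_name,
                "annotation_type": ann_type,
                "attribute": attribute or None,   # None: presence/absence only
                "position": pos,
                "floating": floating,
                "values": values,
            })
    return features


# Unigram strings in a window of 2 (no position 0, floating), plus POS tags
# in a window of 1 including the target word.
spec = ("U,1gram,string,-2:2,,Y,emptyvalues"
        "+P,Token,category,-1:1,Y,,pos_values.txt")
for feature in parse_feat(spec):
    print(feature["name"], feature["position"], feature["values"])
```

The example specification uses only annotation types and attributes named in this document ("1gram" with "string", "Token" with "category") and yields four unigram features and three POS features.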
Optional parameters to mkconfig.pl are:

--engine
    This is a parameter specific to GATE and decides which machine
    learning engine should be used. Currently, if used, this should
    *ALWAYS* be set to gate.creole.wsd.WSDWekaWrapper and is therefore
    redundant.

--memory
    This *optional* argument can be used to modify the heap size that
    should be used by the Java virtual machine, e.g. "--memory 1024M".
    The default heap size used to initialize the JVM is 400M.

** Questions? **
----------------

Contact Mahesh Joshi (joshi031@d.umn.edu) or Ted Pedersen
(tpederse@d.umn.edu).

** Copyright Notice **
----------------------

Copyright (C) 2005-06,

Mahesh Joshi
University of Minnesota, Duluth

Ted Pedersen
University of Minnesota, Duluth

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.