******************************************
USAGE document for WSDGate v 0.05
******************************************

The purpose of this file is to show, step by step, the process of setting
up and running command line interface based experiments with the WSDGate
framework.

BEFORE READING THIS DOCUMENT, PLEASE READ THE README FILE.
=========================================================

##########################################
NOTE: THIS IS A MEMORY INTENSIVE TOOLKIT!!
##########################################

The minimum required memory for this toolkit to function properly is
512MB. At least 1GB of physical memory is recommended for carrying out
large scale experiments. The java programs are run with a default heap
size of 400M. If you have more memory, you can use the optional argument
"--memory" of the mkconfig.pl script to specify the heap size that the
java virtual machine can use. While this parameter can also be used to
decrease the JVM heap size, doing so is not recommended since it will
slow down the process to a great extent, especially for large datasets.

For the purpose of this document, we will proceed with a running example
that deals with setting up a simple experiment using a small sample of
data contained in the samples/ directory of wsdgate. All paths mentioned
in the running example are relative to the base directory of
wsdgate-v0.05.

The sample data is an extract from the abbreviation data created by
Dr. Hongfang Liu from the University of Maryland, Baltimore County.
(Refer to her publication "H. Liu, V. Teller and C. Friedman. Multi-Aspect
Comparison Study of Supervised Word Sense Disambiguation. JAMIA 2004" at
http://userpages.umbc.edu/~hfliu/article/liujamia_2004.pdf). The data was
split into one GATE XML document per instance of the ambiguous
abbreviation using the SplitSval2 PR mentioned in the README.

STEP 1 - Making sure that GATE sees WSDGate framework
-----------------------------------------------------

After running the INSTALL.sh as mentioned in the INSTALL document,

* Open GATE
* Click on the menu item: File -> Manage CREOLE plugins
* The dialog box that comes up shows the list of plugins that GATE can see.
* The list should have the following entry:
    - WSD
* Each entry has 2 check-boxes in front of it. Selecting the first check
  box causes GATE to load that plugin for the current session. Selecting
  the second check box causes GATE to load the plugin for all subsequent
  sessions. Select both check boxes for the above component and click on
  "OK".
* The GATE "Messages" tab in the main window should show a message saying
  "CREOLE plugin loaded: WSD".

Successful execution of these steps ensures that GATE can talk with the
new WSDGate plugin.

STEP 2 - Running mkconfig.pl
----------------------------

-----------------------------------------------------------------------
Expected structure of the directories containing the input data for WSD
-----------------------------------------------------------------------

The framework is meant for running "target word disambiguation" kind of
tasks, where there is a fixed set of terms which we want to disambiguate.
The directory structure expected for this is as follows. If there are 5
ambiguous words for which we have data, then the directories are
(indentation represents one level below in the hierarchy):

root_data_dir/
    word1/
        word1 file(s)
    word2/
        word2 file(s)
    word3/
        word3 file(s)
    word4/
        word4 file(s)
    word5/
        word5 file(s)

where word1 through word5 are directories having the same name as the
ambiguous terms. The files inside each term directory are XML files
containing sense annotated instances for those terms. For efficiency
reasons, it is desirable to have one XML file per instance of the term.
So if word1 has 100 instances, then there should be 100 separate XML
files for the term.

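The layout described above can be checked before running any experiment.
The following is only a minimal sketch and is not part of WSDGate: the
function name check_data_layout is hypothetical, and it assumes that the
instance files carry a ".xml" extension, which this document does not
state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with WSDGate).
# Prints "<term><TAB><number of .xml files>" for every term directory
# directly under the given root data directory.
# Returns non-zero if there are no term directories at all, or if some
# term directory contains no .xml file (assumed extension).
check_data_layout() {
    local root="$1" status=0 found=0 term count
    for term in "$root"/*/; do
        [ -d "$term" ] || continue      # glob matched nothing
        found=1
        count=$(find "$term" -maxdepth 1 -type f -name '*.xml' | wc -l)
        count=$((count))                # drop padding some wc versions add
        printf '%s\t%d\n' "$(basename "$term")" "$count"
        [ "$count" -gt 0 ] || status=1
    done
    [ "$found" -eq 1 ] || status=1
    return "$status"
}
```

For the running example, this could be called from the base directory as
"check_data_layout samples/abbr".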
For the sample data provided, the top level root data directory is
samples/abbr. We have just one ambiguous abbreviation, APC, so there is
only one sub-directory under samples/abbr. Inside this directory, there
are 20 XML files for 20 instances of the ambiguous abbreviation. While
these XML files are in GATE XML format and have been generated using
Senseval-2 format input XML files, the data is not required to be in any
particular XML format. It can be any valid XML containing the desired
sense annotations.

The following note assumes that mkconfig.pl is run after creation of a
new "experiments" directory inside the extracted wsdgate-v0.05 root
directory. The current directory is this new "experiments" directory.

** Setting the various command line parameter values for mkconfig.pl

Let us assume that we are interested in disambiguating the given sample
data using the following features:

1. 5 unigrams to the left and right of the ambiguous acronym APC
2. 5 bigrams to the left and right of APC
3. 2 Part of Speech tags to the left and right of APC, including the POS
   tag of the ambiguous acronym itself.

--CONFIGNAME

Given our features, let us name our experimental configuration "ub5p2".

==>> Therefore, the value for the --configname parameter is ub5p2 <<==

--JAVAPATH

For my machine, this is /usr/bin/java. You should use the path to your
java binary.

==>> Therefore, the value for the --javapath parameter is /usr/bin/java <<==

--GATEHOME

For my machine, this is /home/mahesh/GATE. Use the path specific to your
machine.

==>> Therefore, the value for the --gatehome parameter is /home/mahesh/GATE <<==

--CONFIGTEMPLATE

**********************
*** VERY IMPORTANT ***
**********************

Don't forget to update all the parameters that have value
/FULL/PATH/TO/YOUR/... in the configtemplate file to match your directory
and file locations.

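One way to confirm that no placeholder has been left behind in the
template is to search it for the placeholder prefix. The following is a
minimal sketch, not part of WSDGate; the function name
check_template_placeholders is hypothetical, and the only thing taken
from this document is the literal prefix /FULL/PATH/TO/YOUR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with WSDGate).
# Lists (with line numbers) every line of the given template file that
# still contains the placeholder prefix /FULL/PATH/TO/YOUR.
# Returns 0 if no placeholder remains, 1 if some remain, and 2 if the
# file cannot be read.
check_template_placeholders() {
    local template="$1"
    if [ ! -r "$template" ]; then
        echo "cannot read template: $template" >&2
        return 2
    fi
    if grep -n '/FULL/PATH/TO/YOUR' "$template"; then
        echo "update the lines listed above before running mkconfig.pl" >&2
        return 1
    fi
    return 0
}
```

From the "experiments" directory of the running example, this could be
called as "check_template_placeholders ../samples/configtemplate".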
Assuming I extracted wsdgate-v0.05.tar.gz in /home/mahesh/wsdgate-v0.05
and that the current directory is therefore
/home/mahesh/wsdgate-v0.05/experiments,

==>> value for --configtemplate is ../samples/configtemplate <<==

##########################################################################
# TO UNDERSTAND THE PARAMETER VALUES AHEAD OF THIS, IT WILL BE USEFUL
# IF THE USER LOADS ONE OF THE SAMPLE XML FILES IN **GATE** IN THE DOCUMENT
# VIEWER AND BRINGS UP THE ANNOTATION SETS TREEVIEW AND ANNOTATION LIST
##########################################################################

--INST

In the given sample data, the ambiguous instances of APC are marked using
the "head" annotations.

==>> Therefore, the value of --inst is head <<==

--CLASS

Let us name our class feature "Sense".

So NAME is "Sense"
(This class parameter is mandatory, cannot be empty)

In the case of our sample data, the class, i.e. the sense of the
abbreviation, is specified in the "answer" annotation.

So ANNOTATION_TYPE is "answer"
(This class parameter is mandatory, cannot be empty)

The attribute on the "answer" annotation where the sense is present is
"senseid".

So ATTRIBUTE is "senseid"
(This class parameter is optional, when empty only the presence or
absence of the ANNOTATION_TYPE will be recorded)

Our instance annotation is the "head" annotation. The "answer" annotation
for any instance is the first annotation of that type *before* the "head"
annotation. So its position is negative with respect to the instance
annotation. Since we want the *first* "answer" annotation to the left of
the "head" annotation, the position is -1.

So POSITION is -1
(This class parameter is mandatory, cannot be empty)

We know that the "answer" annotation is somewhere to the left of the
"head" annotation. But its actual position depends upon the size of the
context for that instance of APC. Hence the position of the "answer" tag
with respect to the "head" tag should be specified as floating.
So the FLOATING_FEATURE parameter should have a value of "Y" to indicate
that our class feature is a floating feature with respect to the instance
annotation.

So FLOATING_FEATURE is "Y"
(This class parameter is optional, when empty it defaults to "N")

Assuming we don't know in advance what senses exist for APC in the given
data, we set VALUES to "emptyvalues".

So VALUES is "emptyvalues"
(This class parameter is optional, when empty it defaults to "novalues")

So, finally our "--class" argument looks like this:

==>> --class Sense,answer,senseid,-1,Y,emptyvalues <<==

--FEAT

We have three types of features - unigrams, bigrams and part of speech
tags. Let us name unigrams as U(x) where x is the position, bigrams as
B(x) and part of speech tags as POS(x).

So NAME will have values "U", "B" and "POS"
(This attribute parameter is mandatory, cannot be empty)

The annotation marked up by NSPGate for unigrams is "1gram", that for
bigrams is "2gram", and the annotation for POS tags is "Token", which is
marked by the ANNIE POS Tagger PR in GATE.

So ANNOTATION_TYPE will have values "1gram", "2gram" and "Token"
(This attribute parameter is mandatory, cannot be empty)

The attribute on 1gram and 2gram annotations which contains the n-gram
string is "string". For part of speech tags, it is the "category"
attribute on the "Token" annotation that contains the POS tag value.

So ATTRIBUTE will have values "string", "string" and "category"
(This attribute parameter is optional, when empty only the presence or
absence of the ANNOTATION_TYPE is recorded as a feature)

We want the position of the unigrams and bigrams in a window of 5 around
the abbreviation APC, i.e. in a range of -5 to 5. For part of speech tags
we want a window of 2, i.e. -2 to 2.

So POSITION will have values "-5:5", "-5:5" and "-2:2"
(This attribute parameter is mandatory, cannot be empty)

For unigrams and bigrams, including the 0th position will get the
ambiguous acronym APC. Instead, we skip the zeroth position for them.
But for POS features, we want to keep the POS tag of the ambiguous
acronym. So we want the zeroth position in that case.

So ZERO will have values "N", "N" and "Y"
(This attribute parameter is optional, when empty will default to "N")

Assuming we want floating unigrams and bigrams, but non-floating POS
tags:

So the FLOATING_FEATURE will have values "Y", "Y" and "N"
(This attribute parameter is optional, when empty it defaults to "N")

We don't know the possible values of unigrams and bigrams in advance, but
we know the part of speech tags assigned by the ANNIE system. The list is
provided with WSDGate in the file samples/pos_values.txt in the base
WSDGate directory.

So the VALUES parameter will be "emptyvalues", "emptyvalues" and
"../samples/pos_values.txt"
(This attribute parameter is optional, when empty will default to
"novalues")

Combining all the parameters of the 3 types of features, the final value
of the "--feat" option is:

==>> --feat U,1gram,string,-5:5,N,Y,emptyvalues+B,2gram,string,-5:5,N,Y,emptyvalues+POS,Token,category,-2:2,Y,N,../samples/pos_values.txt <<==

Let us keep the optional parameters "--engine" and "--memory" at their
default values.

This completes the description of the parameters of the mkconfig.pl
script. So finally, the sample command that you can use for creation of
the experimental setup is as follows (recall that the current directory
is /home/mahesh/wsdgate-v0.05/experiments):

# ../bin/mkconfig.pl --configname ub5p2 --javapath /usr/bin/java --gatehome /home/mahesh/GATE --configtemplatefile ../samples/configtemplate --inst head --class Sense,answer,senseid,-1,Y,emptyvalues --feat U,1gram,string,-5:5,N,Y,emptyvalues+B,2gram,string,-5:5,N,Y,emptyvalues+POS,Token,category,-2:2,Y,N,../samples/pos_values.txt

NOTE: You will need to replace the path to java and the home directory of
GATE with the correct paths on your machine.

AFTER RUNNING THE COMMAND AS ABOVE:

Here's the directory structure (recursive w.r.t.
current directory) that should be produced for the sample data provided
(assuming UNIX like operating system). The config_name is "ub5p2".

.:
arffs
mlconfigs
results
scripts

./arffs:
ub5p2

./mlconfigs:
ub5p2

./mlconfigs/ub5p2:
ub5p2.xml <--------- Feature specification XML file (for *all* terms)

./results:
ub5p2

./scripts:
ub5p2

./scripts/ub5p2:
APC.term <---------------- Term file containing the string "APC"
runtests.APC.ub5p2.sh <--- Executable script to run APC experiments
ub5p2.config <------------- Configuration file (for *all* terms)

----------------------------------------------
Note on experimental directory structure above
----------------------------------------------

Along with requiring the data in a certain directory structure, the
automatic script generation interface of WSDGate enforces a certain
directory structure on the experimental setup. The automatic script
generation is done using the script mkconfig.pl. One of the parameters
required by this script is the name of the experimental configuration
(say config_name), which can be any alphanumeric string. Upon execution
with the appropriate parameters, mkconfig.pl generates the following set
of directories and files in the *CURRENT WORKING DIRECTORY*, i.e. the
directory from where the script was executed:

(paths relative to current working directory)

./arffs/config_name/ ::
    The raw ARFF files that are generated during the experiments are
    stored here. Inside this directory, a sub-directory for each
    ambiguous term is automatically created (upon executing the
    experiment, not after running the mkconfig.pl script).

./results/config_name/ ::
    The results of running the WEKA classifiers are stored as
    *.wekaoutput files inside the corresponding sub-directories for each
    term. Also, intermediate ARFF files produced upon application of WEKA
    filters are stored in this location (again, inside respective
    sub-directories for each term).

./mlconfigs/config_name/ ::
    The feature specification XML file required by the Machine Learning
    component is generated in this directory. Only one file is generated
    inside this directory.

./scripts/config_name/ ::
    This directory contains the experiment specific files and executable
    scripts that should be used for running the experiments. One file
    with extension ".term" is simply a file specifying the list of terms
    to be disambiguated. In the current setup, one such file is created
    per term, containing just that one single term inside it. The second
    file is a parameter configuration file which is created from a
    template. Apart from this, an executable script is generated for each
    ambiguous term. This script contains the java command line to invoke
    the class gate.creole.wsd.WSDExperiments discussed above.

Note that in each of the above cases, config_name is a sub-directory
inside the 4 directories - arffs, results, mlconfigs and scripts. Running
mkconfig.pl with a new configuration name will therefore create a new set
of files, keeping the previous ones intact.

STEP 3 - Running actual experiments
-----------------------------------

Let us assume that you run the above command (after replacing the
appropriate paths) with your current working directory as
/home/johnsmith/wsdgate-v0.05/experiments (where johnsmith is your user
name), and that the working directory was empty initially. Henceforward
we will refer to the current working directory as the "experiments"
directory.

After running the command, your experiments directory should contain the
*exact* directory structure that was shown at the end of STEP 2 above
(assuming a UNIX like system).

The script that is to be executed for running experiments related to the
abbreviation APC is ./scripts/ub5p2/runtests.APC.ub5p2.sh.

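The presence of the generated files can be confirmed with a short check.
The following is a minimal sketch, not part of WSDGate: the function name
check_generated_setup is hypothetical, and the file name patterns are an
assumption generalized from the single "ub5p2" / "APC" listing shown at
the end of STEP 2.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with WSDGate).
# Arguments: experiments directory, configuration name, ambiguous term.
# Reports every expected path that is missing and returns non-zero if
# any is missing. The name patterns follow the ub5p2/APC listing in
# STEP 2 and are assumed to hold for other names as well.
check_generated_setup() {
    local base="$1" config="$2" term="$3" missing=0 path
    for path in \
        "arffs/$config" \
        "results/$config" \
        "mlconfigs/$config/$config.xml" \
        "scripts/$config/$term.term" \
        "scripts/$config/runtests.$term.$config.sh" \
        "scripts/$config/$config.config"
    do
        if [ ! -e "$base/$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the running example, this could be called from the experiments
directory as "check_generated_setup . ub5p2 APC".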
BEFORE EXECUTING THE SCRIPT,

** You need the perl binary in your PATH

** You need the Ngram Statistics Package installed, with the count.pl and
   statistic.pl scripts from NSP in your PATH

** You need to update the CLASSPATH variable in your environment to
   include the weka.jar file that is included in your WEKA installation.

   * To do this on UNIX like systems, in BASH shell use the command

     export CLASSPATH=/FULL/PATH/TO/YOUR/WEKA_DIR/weka.jar

   * To do this on UNIX like systems, in CSH or TCSH shell use the
     command

     setenv CLASSPATH /FULL/PATH/TO/YOUR/WEKA_DIR/weka.jar

   * To do this on Windows systems, use the following command in the
     "cmd.exe" or "command.com" DOS prompt:

     set CLASSPATH=/FULL/PATH/TO/YOUR/WEKA_DIR/weka.jar

   In all of the above commands, replace /FULL/PATH/TO/YOUR/WEKA_DIR with
   the path to the WEKA installation specific to your machine.

Now execute the above shell script from the experiments directory as
follows (assuming UNIX like OS and the presence of bash shell):

# ./scripts/ub5p2/runtests.APC.ub5p2.sh

The experiment run is started, and the STDOUT and STDERR are redirected
to the following files respectively:

./scripts/ub5p2/runtests.APC.ub5p2.sh.output
./scripts/ub5p2/runtests.APC.ub5p2.sh.error

Upon completion, you should have an empty error file and some output in
the output file.

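The prerequisites and the completion check described above can be
sketched as two small shell functions. This is only an illustration, not
part of WSDGate: the function names check_prerequisites and
check_run_logs are hypothetical, and the CLASSPATH test merely looks for
the string "weka.jar" without verifying that the file exists.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with WSDGate).

# Checks that perl, count.pl and statistic.pl are found in PATH and that
# CLASSPATH mentions weka.jar. Returns non-zero if any check fails.
check_prerequisites() {
    local status=0 tool
    for tool in perl count.pl statistic.pl; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found in PATH: $tool" >&2
            status=1
        fi
    done
    case "${CLASSPATH:-}" in
        *weka.jar*) ;;
        *) echo "CLASSPATH does not mention weka.jar" >&2; status=1 ;;
    esac
    return "$status"
}

# Argument: path to the experiment script that was run.
# Returns 0 only if <script>.error is empty and <script>.output is not.
check_run_logs() {
    local script="$1"
    local out="$script.output" err="$script.error"
    if [ ! -e "$out" ] || [ ! -e "$err" ]; then
        echo "log files not found; has the script been run?" >&2
        return 2
    fi
    if [ -s "$err" ]; then
        echo "errors were written to $err" >&2
        return 1
    fi
    if [ ! -s "$out" ]; then
        echo "no output was written to $out" >&2
        return 1
    fi
    return 0
}
```

For the running example, the second function could be called from the
experiments directory as
"check_run_logs ./scripts/ub5p2/runtests.APC.ub5p2.sh".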
The directory structure of the experiments directory should now look
similar to the following (the random strings in between the file names
will differ):

(NEW files and directories marked with "<---- NEW relevant_comments"):

# ls -R1
arffs
mlconfigs
results
scripts

./arffs:
ub5p2

./arffs/ub5p2:
APC <---- NEW directory for each ambiguous word

./arffs/ub5p2/APC:
APC.ub5p2.1132282003909.ub5p2.xml.arff <---- NEW arff file containing data

./mlconfigs:
ub5p2

./mlconfigs/ub5p2:
ub5p2.xml

./results:
ub5p2

./results/ub5p2:
APC <---- NEW directory for each ambiguous word

./results/ub5p2/APC:
APC.ub5p2.1132282003909.ub5p2.xml.1132282008408.weka.classifiers.functions.SMO.wekaoutput <---- NEW file, output of classifier (one per classifier)
APC.ub5p2.1132282003909.ub5p2.xml.1132282009569.weka.classifiers.bayes.NaiveBayes.wekaoutput <---- NEW file, output of classifier (one per classifier)
APC.ub5p2.1132282003909.ub5p2.xml.1132282010800.weka.classifiers.trees.J48.wekaoutput <---- NEW file, output of classifier (one per classifier)
APC.ub5p2.1132282003909.ub5p2.xml.filterout.arff <---- NEW arff file on which the classifiers are executed
APC.ub5p2.1132282003909.ub5p2.xml.s2n.arff <---- NEW intermediate arff
APC.ub5p2.1132282003909.ub5p2.xml.withdummy.arff <---- NEW intermediate arff

./scripts:
ub5p2

./scripts/ub5p2:
APC.term
runtests.APC.ub5p2.sh
runtests.APC.ub5p2.sh.error <---- NEW STDERR of the experiment script
runtests.APC.ub5p2.sh.output <---- NEW STDOUT of the experiment script
ub5p2.config

In the above listing, the *.wekaoutput files represent the results of the
cross-validation from WEKA on the dataset extracted using the specified
features. To see how to extract the summary of results from these files,
read STEP 4 below.

STEP 4 - Summarizing results
----------------------------

In the general case, there is one directory per ambiguous term inside the
results/ directory.
Inside each of these directories there will be one or more *.wekaoutput
files which contain the WEKA output from the selected classifiers. To
summarize the results for all ambiguous terms, for each classifier, you
can use the script summarystats.pl. This extracts the accuracy, error,
training time and execution time for the experiments in comma separated
values format. This output is printed to STDOUT and the user should
redirect it to the required file.

summarystats.pl takes at most 2 arguments (*in specified order*):

1. PATH TO RESULTS DIRECTORY

   The first one is mandatory and it specifies the path to the results
   directory containing the directories for each ambiguous term.

2. CONFIGURATION SUFFIX (optional)

   The second argument specifies any *suffix* that should be used for the
   configuration name that will be shown in the output. By default the
   configuration name specified to mkconfig.pl is used.

So for example, to run summarystats.pl from the experiments directory in
the current example, one can use the command:

# ../bin/summarystats.pl results/ub5p2 Jul12 > summary_Jul12.csv

This should use the string Jul12 as the suffix for the experimental
configuration name and store the output in the summary_Jul12.csv file.

** Copyright Notice **
----------------------

Copyright (C) 2005-06,

Mahesh Joshi
University of Minnesota, Duluth

Ted Pedersen
University of Minnesota, Duluth

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.