COM3110/6150 Text Processing (2010/11)
Assignment: Document Retrieval
Task in brief: To implement and test a basic document retrieval system.
Submission: Your assignment work is to be submitted electronically using MOLE. Precise
instructions for what files to submit are given later in this document.
SUBMISSION DEADLINE: 3pm, Monday, Week 9 (22 November, 2010)
Penalties: Standard departmental penalties apply for late hand-in and for plagiarism.
Materials Provided
Download the file TP_Assignment_Files.zip from the module homepage, which unzips to
give a folder containing the following data and code files, for use in the assignment:
data files: documents.txt, queries.txt, cacm_gold_std.txt, stoplist.txt,
example_results_file.txt
code files: Collection.py, eval_ir.py
The file documents.txt contains a collection of documents which record publications in the
CACM (Communications of the Association for Computing Machinery). Each document is a
short record of a CACM paper, including its title, author(s), and abstract, although one or
other of these (especially the abstract) may be absent for a given document. The file queries.txt
contains a set of IR queries for use against this collection. (These are `old-style' queries, where
users might write an entire paragraph describing their interest.) The file cacm_gold_std.txt
is a `gold standard' identifying the documents that have been judged relevant to each query.
These files constitute a standard test set that has been used for evaluating IR systems (although
it is now somewhat dated, not least by being very small by modern standards).
Code files: If you inspect the files documents.txt and queries.txt, you will see that they
have a common format, where each document or query comes enclosed within (XML-style)
open and close document tags, which also specify a numeric identifier for the document/query.
The Python module Collection.py provides a Collection class for convenient access to documents/queries in the
manner of a simple iteration. In particular, if we create an instance of this class (supplying the
name of the file containing the document collection or query set), then the .docs() method
of that instance returns an iterator that will successively return the documents/queries, as
illustrated in the following code example:
import Collection

file = ...
collection = Collection.Collection(file)
for doc in collection.docs():
    print "ID: ", doc.docid
    print doc.lines[0]
Here, the document is returned as an instance of a Document class, which has two attributes:
an attribute docid whose (integer) value is the numeric identifier of the document (or query),
and an attribute lines whose value is a list of strings for the lines of text in the document.
For instance, the example code above prints the identifier and first line of each document in
the collection. You are free to use the Collection class as part of your system (but this is
optional).
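As an illustration of how the lines of a document (or query) might be turned into a term-frequency representation, the following is a minimal sketch. The tokenisation choices here (lowercasing, alphabetic tokens only) are assumptions for illustration; the assignment leaves these decisions to you.

```python
import re
from collections import Counter

def term_counts(lines, stopwords=None):
    """Count term occurrences in a document given as a list of text lines.

    Tokenisation here (lowercased runs of letters) is an illustrative
    choice, not a requirement of the assignment.
    """
    counts = Counter()
    for line in lines:
        for term in re.findall(r"[a-z]+", line.lower()):
            if stopwords is None or term not in stopwords:
                counts[term] += 1
    return counts

# Toy example (not taken from documents.txt):
doc_lines = ["Parsing of context-free languages", "A survey of parsing methods"]
print(term_counts(doc_lines, stopwords={"of", "a"}))
```

A dictionary of this kind, one per document, is a convenient starting point for building the term vectors required by the vector space model.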
The Python script eval_ir.py calculates system performance scores. This requires your system
to produce a `results' file in a standard format, indicating the documents it has returned
for each query. Execute the script with its `help' option (-h) for instructions on using the
script, and on the required format of the results file (which is also exemplified by the data file
example_results_file.txt).
Task Description
Your task is to implement a document retrieval system, based on the vector space model, and
to evaluate its performance over the CACM test collection, under the alternative parameter
settings that arise from the following choices:
stoplist: whether a stoplist is used or not (to exclude less useful terms)
term weighting: whether just term frequency is used, or the TF.IDF approach
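The TF.IDF option can be sketched as follows. The particular variant shown, idf(t) = log(N / df(t)) with raw term frequency, is only one common choice; the log base and any smoothing are decisions left to you.

```python
import math
from collections import Counter

def tfidf_vectors(doc_term_counts):
    """Turn per-document raw term counts into TF.IDF-weighted vectors.

    Uses idf(t) = log(N / df(t)); other variants (e.g. adding 1, or
    smoothing df) are equally valid choices for the assignment.
    """
    n_docs = len(doc_term_counts)
    df = Counter()
    for counts in doc_term_counts:
        df.update(counts.keys())          # document frequency: one per document
    idf = {t: math.log(n_docs / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in counts.items()}
            for counts in doc_term_counts]

# Toy collection of three 'documents' (illustrative, not CACM data):
docs = [Counter({"sorting": 2, "algorithm": 1}),
        Counter({"parsing": 1, "algorithm": 1}),
        Counter({"sorting": 1})]
vectors = tfidf_vectors(docs)
print(vectors[1]["parsing"])   # the rarest term gets the largest idf, log(3/1)
```

Note that a term appearing in every document receives idf = log(1) = 0 under this variant, which is one motivation for also experimenting with the stoplist setting.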
As an optional `extra', you might also try using stemming in your system. (A Python version
of the Porter stemmer is available from: tartarus.org/martin/PorterStemmer).
You should implement your retrieval system in Python, and you should ensure that it will run
under a Linux/Unix environment (such as that provided by the department), as it will be tested
under such an environment when your work is marked.
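To rank documents under the vector space model, the standard measure is cosine similarity between the query vector and each document vector. A minimal sketch, using dictionaries as sparse vectors (the toy weights below are illustrative, not CACM values):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0                       # empty vector: no similarity
    return dot / (norm_q * norm_d)

# Rank a toy collection of two documents against a toy query:
docs = {1: {"sorting": 1.5, "algorithm": 0.4},
        2: {"parsing": 1.1, "algorithm": 0.4}}
query = {"sorting": 1.0}
ranking = sorted(docs, key=lambda docid: cosine(query, docs[docid]), reverse=True)
print(ranking)   # document 1 mentions 'sorting', so it ranks first
```

A ranking of this kind, computed per query, is what your system would write out (in the format required by eval_ir.py) for scoring against the gold standard.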