WMEB is an unsupervised bootstrapping algorithm for automatically
extracting semantic lexicons from raw biomedical literature. Previous
approaches suffer from semantic drift, where a lexicon's meaning shifts
during bootstrapping. WMEB prevents semantic drift by extracting multiple competing classes simultaneously and
exploiting statistical measures of association strength.
Extensions of WMEB use bagging and distributional similarity techniques to further detect
and prevent semantic drift.
The systems are domain-independent and significantly outperform previous
approaches. See
[ALTA08],
[ACL09] and
[PhD Thesis] for details.
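As a rough illustration of the competing-classes idea only, the sketch below shows one mutual-exclusion step: each category proposes its highest-scoring candidate terms, and any term claimed by more than one category is discarded. This is not the published WMEB implementation. The function name `mutual_exclusion_step`, the `top_k` parameter, and the example scores are all hypothetical, and the scores stand in for whatever association-strength measure is used.

```python
from collections import defaultdict


def mutual_exclusion_step(candidates, top_k=2):
    """One illustrative bootstrapping step with mutual exclusion.

    candidates: dict mapping category -> {term: score}, where the score
    is a placeholder for an association-strength measure (assumption).
    Returns a dict mapping category -> list of accepted terms.
    """
    # Each category proposes its top_k highest-scoring terms
    # (ties broken alphabetically so the result is deterministic).
    proposals = {
        cat: sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        for cat, scores in candidates.items()
    }

    # Count how many categories proposed each term.
    claims = defaultdict(int)
    for terms in proposals.values():
        for term in terms:
            claims[term] += 1

    # Accept a term only if exactly one category claimed it.
    return {
        cat: [t for t in terms if claims[t] == 1]
        for cat, terms in proposals.items()
    }


# Hypothetical scores for two competing categories.
example = {
    "GENE": {"p53": 0.9, "brca1": 0.8, "aspirin": 0.1},
    "DRUG": {"aspirin": 0.9, "p53": 0.85, "ibuprofen": 0.7},
}
accepted = mutual_exclusion_step(example, top_k=2)
```

Here `p53` is proposed by both categories and is therefore dropped from both, which is the sense in which competing categories constrain one another.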
NegFinder is an unsupervised algorithm for automatically detecting
competing categories during bootstrapping. The discovered negative
categories are then exploited to reduce semantic drift. Prior to this
work, WMEB required a domain expert to manually craft negative
competing categories.
NegFinder exploits the agglomerative process of
hierarchical clustering to efficiently detect drifting categories.
State-of-the-art results were published in
[EMNLP10].
RGB is the first bootstrapping algorithm that automatically discovers
open relationships between the target semantic categories. By
simultaneously extracting lexicons and their open relations, the
necessity of manually crafted category and relationship constraints is
removed. State-of-the-art results will be presented at ACL 2011.
As part of my final undergraduate year, I developed new Association Rule (AR) mining algorithms to extract gene
relationships from microarray data.
At the time, existing microarray data mining
techniques suffered from two major weaknesses: they restricted the number of genes
that could be included in the analysis, and they assumed that only common
gene relationships are of interest.
My research addressed both weaknesses. More
specifically, I developed the first comprehensive AR algorithm, MaxConf,
which can mine dense data with no support threshold (traditionally used to
prune uncommon relationships), by incorporating new confidence threshold
properties.
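To make the idea of mining by confidence alone concrete, the following sketch enumerates pairwise rules A -> B and keeps those whose confidence meets a threshold, with no support filter. It is a brute-force illustration, not MaxConf: the pruning properties that make MaxConf practical on dense data are not reproduced here. The function name `confident_pair_rules`, the item labels, and the toy transactions are hypothetical.

```python
from itertools import permutations


def confident_pair_rules(transactions, min_conf):
    """Return (antecedent, consequent, confidence) for every ordered
    item pair whose confidence is at least min_conf.

    transactions: list of sets of items (e.g. discretised gene states).
    No support threshold is applied, so rare rules are kept.
    """
    items = sorted(set().union(*transactions))
    # Number of transactions containing each item.
    count = {i: sum(1 for t in transactions if i in t) for i in items}

    rules = []
    for a, b in permutations(items, 2):
        both = sum(1 for t in transactions if a in t and b in t)
        conf = both / count[a]  # confidence(a -> b) = P(b | a)
        if conf >= min_conf:
            rules.append((a, b, conf))
    return rules


# Toy data: g4 and g5 co-occur in only one of six transactions.
toy = [
    {"g1", "g2"},
    {"g1", "g2", "g3"},
    {"g3"},
    {"g3"},
    {"g3"},
    {"g4", "g5"},
]
rules = confident_pair_rules(toy, min_conf=0.9)
```

The rule `g4 -> g5` holds in a single transaction, so a typical support threshold would prune it, yet its confidence is 1.0 and it is retained here.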
Some of the work and data are open-sourced and available on request.