[Xapian-discuss] Full Text Searching a Relational Model

yarong at xennexinc.com yarong at xennexinc.com
Thu Jan 24 12:40:45 GMT 2008


Hi,

(Warning, not for the weak-hearted)

I'm currently working on a project where we have a large and complex data
model, related to Genomics. We are trying to build a search engine that
provides "full text" and "field-based text" searches for our customer base
(mostly academic research), and are evaluating different tools for this
purpose.

As a starting point, we have, as an example, a set of objects (stored in
tables as a relational model):
Gene [ID, Symbol, Description]
Article - M:M with Gene [ID, Title]
Disease - M:M with Gene [ID, Name]
Author - M:M with Article [ID, Name]
(Note: M:M tables exist, just link IDs)
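
For concreteness, the tables above could be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table and column names are assumptions made for the example, since the post only lists the entity fields and says the M:M tables link IDs.

```python
import sqlite3

# Hypothetical table and column names; only the entity fields come from the post.
SCHEMA = """
CREATE TABLE gene    (id INTEGER PRIMARY KEY, symbol TEXT, description TEXT);
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE disease (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE author  (id INTEGER PRIMARY KEY, name TEXT);
-- M:M link tables hold nothing but pairs of IDs.
CREATE TABLE gene_article   (gene_id INTEGER, article_id INTEGER);
CREATE TABLE gene_disease   (gene_id INTEGER, disease_id INTEGER);
CREATE TABLE article_author (article_id INTEGER, author_id INTEGER);
"""


def create_db():
    """Create an empty in-memory database with the sketched schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```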

An example instance of the model would be (hierarchical, with relations
handled by duplication):

  Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor]
    Article [ID=1, Title=EGFR mutations in lung cancer: correlation with
clinical response to gefitinib therapy]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=2, Name=J. Watson]
    Article [ID=2, Title=Proteomics analysis of epidermal protein kinases
by target class-selective prefractionation and tandem mass
spectrometry]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=3, Name=M. Roberts]
    Disease [ID=1, Name=Epidermal sluffing]

  Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase]
    Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine
hydrolase: implications for the three-dimensional structure]
      Author [ID=4, Name=B. Cohen]
      Author [ID=5, Name=L. Alexander]
    Article [ID=2, Title=Proteomics analysis of epidermal protein kinases
by target class-selective prefractionation and tandem mass
spectrometry]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=3, Name=M. Roberts]

Note the IDs in the objects above: they convey the relations in the
hierarchical model.
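
The duplication described above can be sketched as a flattening step that walks the link tables and emits one list of text entries per gene. This is only an illustration: the in-memory dictionaries and the flatten() helper are hypothetical stand-ins for the real tables, and the (source, field, text) tuple layout is an assumption.

```python
# Sample rows from the post, held in plain dicts keyed by ID.
GENES = {1: ("EGFR", "epidermal growth factor receptor"),
         2: ("AHCY", "S-adenosylhomocysteine hydrolase")}
ARTICLES = {
    1: "EGFR mutations in lung cancer: correlation with clinical "
       "response to gefitinib therapy",
    2: "Proteomics analysis of epidermal protein kinases by target "
       "class-selective prefractionation and tandem mass spectrometry",
    3: "Limited proteolysis of S-adenosylhomocysteine hydrolase: "
       "implications for the three-dimensional structure",
}
DISEASES = {1: "Epidermal sluffing"}
AUTHORS = {1: "H. Michaelson", 2: "J. Watson", 3: "M. Roberts",
           4: "B. Cohen", 5: "L. Alexander"}

# M:M link tables as lists of ID pairs.
GENE_ARTICLE = [(1, 1), (1, 2), (2, 3), (2, 2)]
GENE_DISEASE = [(1, 1)]
ARTICLE_AUTHOR = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 4), (3, 5)]


def flatten(gene_id):
    """Return (source, field, text) entries for one gene.

    Rows shared between genes (such as Article 2) are duplicated under
    every gene they are linked to.
    """
    symbol, description = GENES[gene_id]
    entries = [("Gene", "Symbol", symbol),
               ("Gene", "Description", description)]
    for g_id, a_id in GENE_ARTICLE:
        if g_id != gene_id:
            continue
        entries.append((f"Article ID#{a_id}", "Title", ARTICLES[a_id]))
        for art_id, au_id in ARTICLE_AUTHOR:
            if art_id == a_id:
                entries.append((f"Author ID#{au_id}", "Name", AUTHORS[au_id]))
    for g_id, d_id in GENE_DISEASE:
        if g_id == gene_id:
            entries.append((f"Disease ID#{d_id}", "Name", DISEASES[d_id]))
    return entries
```

Each per-gene list could then be handed to whichever indexer is chosen as one "document" per gene.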

In our full-text search, we would like to allow users to search ANY
textual field for any string, for instance the term "epidermal", and to
display the list of genes that have any associated data containing that
term (ranked, of course).
Our list of results would be something like:

EGFR
  Found in Description (epidermal growth factor receptor)
  Found in Article ID#2, in Title (Proteomics analysis of epidermal
protein kinases by target class-selective prefractionation and tandem
mass spectrometry)
  Found in Disease ID#1, in Name (Epidermal sluffing)

AHCY
  Found in Article ID#2, in Title (Proteomics analysis of epidermal
protein kinases by target class-selective prefractionation and tandem
mass spectrometry)

Note that the results retain a hierarchical view of our Genes (being
Gene-centric, we're pretty much framing the question as "find this term
in information related to those genes"). Also note that Article ID #2 has
an M:M relation with both Gene ID 2 (AHCY) and Gene ID 1 (EGFR), and only
because of that is AHCY considered a gene that has "epidermal" in its
annotations.
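
The gene-centric lookup described above might be sketched like this. It is a naive in-memory scan rather than a real index, meant only to show the grouping of hits by gene; the DOCS structure and the search() helper are hypothetical, and matching is a case-insensitive whole-word comparison.

```python
import re

ARTICLE_1 = ("EGFR mutations in lung cancer: correlation with clinical "
             "response to gefitinib therapy")
ARTICLE_2 = ("Proteomics analysis of epidermal protein kinases by target "
             "class-selective prefractionation and tandem mass spectrometry")
ARTICLE_3 = ("Limited proteolysis of S-adenosylhomocysteine hydrolase: "
             "implications for the three-dimensional structure")

# Hypothetical per-gene entries: (source, field, text). Article 2 appears
# under both genes because it is linked to both.
DOCS = {
    "EGFR": [
        ("Gene", "Description", "epidermal growth factor receptor"),
        ("Article ID#1", "Title", ARTICLE_1),
        ("Article ID#2", "Title", ARTICLE_2),
        ("Disease ID#1", "Name", "Epidermal sluffing"),
    ],
    "AHCY": [
        ("Gene", "Description", "S-adenosylhomocysteine hydrolase"),
        ("Article ID#3", "Title", ARTICLE_3),
        ("Article ID#2", "Title", ARTICLE_2),
    ],
}


def search(term, docs=DOCS):
    """Return {gene symbol: [matching entries]} for a single-word term."""
    term = term.lower()
    results = {}
    for symbol, entries in docs.items():
        hits = [entry for entry in entries
                if term in re.findall(r"\w+", entry[2].lower())]
        if hits:
            results[symbol] = hits
    return results


# Print the hits grouped under each gene.
for symbol, hits in search("epidermal").items():
    print(symbol)
    for source, field, text in hits:
        where = field if source == "Gene" else f"{source}, in {field}"
        print(f"  Found in {where} ({text})")
```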

Obviously, we'd like to rank hits by the field's location in the
hierarchy (a term in a gene name scores higher than the same term in the
name of an author of an article related to a gene) and by number of hits
(the number of times a term is found in data related to that gene: 3 in
the case of EGFR above).
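
One simple way to combine the two criteria is a weighted sum: each hit contributes the weight of the level it was found at. The sketch below assumes this scheme; the weight values and the rank() helper are purely illustrative and not taken from any particular search engine.

```python
# Hypothetical weights: the closer a field is to the gene, the higher it scores.
FIELD_WEIGHTS = {"gene": 3.0, "article": 2.0, "disease": 2.0, "author": 1.0}


def score(hit_levels, weights=FIELD_WEIGHTS):
    """Sum the weight of every hit, so more hits and higher levels both help."""
    return sum(weights[level] for level in hit_levels)


def rank(hits_by_gene, weights=FIELD_WEIGHTS):
    """Return (score, symbol) pairs, best score first, ties broken by symbol."""
    scored = [(score(levels, weights), symbol)
              for symbol, levels in hits_by_gene.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hits for "epidermal" from the example: EGFR has three, AHCY has one.
HITS = {"AHCY": ["article"], "EGFR": ["gene", "article", "disease"]}
print(rank(HITS))
```

With these weights EGFR outranks AHCY both because it has more hits and because one of them is in the gene's own description.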

Ideas for how to take on this challenge? Implementation? Tools?

Thanks!
Yaron Golan



