[Xapian-discuss] Tag-based filesystem with xapian, advice?

Olly Betts olly at survex.com
Mon Mar 2 05:27:07 GMT 2009


On Sat, Feb 28, 2009 at 10:21:15PM +0100, Karel Marissens wrote:
> Now my 1st question is, what is the path of a file? The content of a
> document? A term? A value? I need to be able to use the path when  
> searching as I need to be able to limit the file-results to files in a  
> certain directory. Thus for example, only files that have a path of
> /photo's/2008/*. Or do I have to work with a relevance-set or something?

I would put the path in the document data for reading when you get
results, and also index all the directories which the file is in as
terms (e.g. P/photo's and P/photo's/2008 for a file in /photo's/2008).
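As a rough sketch of how those directory terms could be generated, here
is a small helper using only the standard library. The name
directory_terms is hypothetical (it is not part of Xapian), and it
assumes the path is absolute, already normalised (no "//" or trailing
"/"), and names a file rather than a directory:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not a Xapian API: returns one "P"-prefixed term for
// each directory that contains the file at `path`.
// Assumes `path` is absolute, normalised, and names a file, so the final
// component is not itself turned into a directory term.
std::vector<std::string> directory_terms(const std::string& path) {
    assert(!path.empty() && path[0] == '/');
    std::vector<std::string> terms;
    // Every '/' after the leading one marks the end of a directory name.
    for (std::string::size_type i = path.find('/', 1);
         i != std::string::npos;
         i = path.find('/', i + 1)) {
        terms.push_back("P" + path.substr(0, i));
    }
    return terms;
}
```

Each returned string could then be passed to Xapian::Document::add_term(),
with the full path stored via Xapian::Document::set_data(). A file directly
in "/" gets no directory terms under this sketch.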

> I tried using the path as a tag itself, but when I do a query for
> "/photo's/2008/*", it is automatically translated to 2 separated terms I
> think? (a file tagged as 2008 also showed up for example)

I don't think you want to use QueryParser here - just build your Query
objects up by hand.

If you want to allow "free text queries" after +FIND, then you can parse
that part with QueryParser and then filter the result using the
appropriate "P"-prefixed term, e.g. in C++:

    Xapian::Query q = qp.parse_query(query_string);
    q = Xapian::Query(Xapian::Query::OP_FILTER, q, Xapian::Query("P/photo's"));

> My 2nd question is, what is the easiest way to get a list of all the
> tags associated with files in the resultset? I want to have a list of  
> all tags associated with files in /photo's/2008. One method would be  
> to do a search for all files in /photo's/2008, or any subdirectory,  
> loop all the results, and per document, loop the terms associated with  
> it and add these to a list.

You can add all documents in the MSet to an RSet and use
Enquire::get_eset() to get a set of all the terms in all the documents.
That's not so different to what you describe, though Xapian does most
of the work for you, including eliminating duplicates.

If you just want the "tag" terms (and not P/photo's, etc), you can use
an "ExpandDecider" to only pick out those.
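The test such a decider needs can be sketched as a plain predicate. The
name is_tag_term is hypothetical, and the sketch assumes that tags are
indexed without a prefix and never start with "P/", so that only the
directory terms do:

```cpp
#include <string>

// Hypothetical predicate, not a Xapian API: true for terms that should be
// reported as tags, false for "P/..." directory terms.
// Assumes tags are unprefixed and never begin with "P/".
bool is_tag_term(const std::string& term) {
    return term.compare(0, 2, "P/") != 0;
}
```

A subclass of Xapian::ExpandDecider could return this from its
operator()(const std::string&) const. If tags were given their own term
prefix instead, the test would check for that prefix rather than
excluding "P/".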

I'd suggest for efficiency that you might want to consider adding a
special case for "/" and use Database::allterms_begin() to iterate over
all the terms in the database.

> My 3rd question is how I can get ALL results? get_mset() requires a
> maximum number of results. Do I just set it to an extremely big number
> and see it as a safety-limitation that shouldn't be reached?

If you can handle result sets of any size, just pass db.get_doccount() -
there can't be more matching documents than there are documents in the
database.

I'll add a note to the documentation comment for Enquire::get_mset() as
this has been asked a few times before.

Cheers,
    Olly


