[Xapian-discuss] Tag-based filesystem with xapian, advice?

Karel Marissens karel.marissens at gmail.com
Mon Mar 2 21:50:34 GMT 2009


Olly, thank you for your answers. I haven't had time to test it all
out yet, but I looked at the API for your answer to my 2nd question
and it's not entirely clear to me yet.

First of all, how do I go from an MSet to an RSet? Is there a built-in  
method I'm overlooking?
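
For reference, the step in question could look roughly like the sketch below with the Python bindings. It assumes, in line with Olly's reply, that the documents are added to the RSet one by one. `fill_rset` is a hypothetical helper name, and the objects are passed in so the snippet does not depend on a particular binding version.

```python
def fill_rset(mset, rset):
    """Add every document matched in `mset` to `rset` and return `rset`.

    Assumption: `mset` yields items with a `docid` attribute (as MSet
    items do in the Python bindings) and `rset` offers
    `add_document(docid)` (as xapian.RSet does).
    """
    for match in mset:
        rset.add_document(match.docid)
    return rset
```

In real use this would presumably be called as `fill_rset(mset, xapian.RSet())`, with the result handed to `enquire.get_eset(max_terms, rset)`, where `max_terms` is a placeholder for however many terms are wanted.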

As you guessed, I want to eliminate path terms in my taglists. I see  
the ExpandDecider can accept a term to ignore, so I need to loop over  
all terms, check for a '/' and if found add the term to the decider?  
Or is the last paragraph of your answer on this question not directly  
related? I'm a little confused.
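
To make the question concrete: an expand decider is called once per candidate term and returns whether that term should be kept, so the check for path terms can live in the decider itself. Below is a minimal sketch of just that check, assuming path terms carry the "P" prefix from Olly's indexing suggestion (so they start with "P/") and that no tag starts that way. `is_tag_term` and `PATH_PREFIX` are hypothetical names.

```python
PATH_PREFIX = "P/"  # assumption: directory terms look like "P/photo's/2008"


def is_tag_term(term):
    """Return True for terms that should be offered as tags."""
    if isinstance(term, bytes):
        # Some binding versions hand terms over as bytes; normalise to str.
        term = term.decode("utf-8")
    return not term.startswith(PATH_PREFIX)
```

With the bindings, this test would sit in the `__call__` method of a `xapian.ExpandDecider` subclass that is passed to `Enquire.get_eset()`.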

I'm also curious how you think this search for available tags would
perform compared to an RDBMS solution. In an RDBMS solution with 3
tables (files, files_tags and tags), there could be a LIKE '/path/%'
on the files table to find relevant files (I believe an index can be
used for the LIKE), a join with the files_tags table, a join with the
tags table and finally a GROUP BY on the found tags. But I have no
idea whether that is more or less performant than the Xapian way.

Another requirement that I (probably) have is to be able to add a term
to the database without actually adding it to a file yet. Is this
possible? Of course, I can always use an empty document which has all
terms (except paths)...

Lastly, do you have any idea if there's Python documentation similar
to the API documentation for C++? (see link below) Or can it be
generated somehow? I did find the Python bindings page and everything
seems to be about the same as the C++ API, but still it would be
handy...
http://xapian.org/docs/apidoc/html/classes.html

Thanks in advance,

Karel

On 02 Mar 2009, at 06:27, Olly Betts wrote:

> On Sat, Feb 28, 2009 at 10:21:15PM +0100, Karel Marissens wrote:
>> Now my 1st question is, what is the path of a file? The content of a
>> document? A term? A value? I need to be able to use the path when
>> searching as I need to be able to limit the file-results to files in
>> a certain directory. Thus for example, only files that have a path of
>> /photo's/2008/*. Or do I have to work with a relevance-set or
>> something?
>
> I would put the path in the document data for reading when you get
> results, and also index all the directories which the file is in as
> terms (e.g. P/photo's and P/photo's/2008 for a file in /photo's/2008).
>
>> I tried using the path as a tag itself, but when I do a query for
>> "/photo's/2008/*", it is automatically translated to 2 separate
>> terms I think? (a file tagged as 2008 also showed up for example)
>
> I don't think you want to use QueryParser here - just build your Query
> objects up by hand.
>
> If you want to allow "free text queries" after +FIND, then you can
> parse that part with QueryParser and then filter the result using the
> appropriate "P"-prefixed term, e.g. in C++:
>
>    Xapian::Query q = qp.parse_query(query_string);
>    q = Xapian::Query(Xapian::Query::OP_FILTER, q,
>                      Xapian::Query("P/photo's"));
>
>> My 2nd question is, what is the easiest way to get a list of all the
>> tags associated with files in the resultset? I want to have a list of
>> all tags associated with files in /photo's/2008. One method would be
>> to do a search for all files in /photo's/2008, or any subdirectory,
>> loop over all the results, and per document, loop over the terms
>> associated with it and add these to a list.
>
> You can add all documents in the MSet to an RSet and use
> Enquire::get_eset() to get a set of all the terms in all the
> documents.
> That's not so different to what you describe, though Xapian does most
> of the work for you, including eliminating duplicates.
>
> If you just want the "tag" terms (and not P/photo's, etc), you can use
> an "ExpandDecider" to only pick out those.
>
> I'd suggest for efficiency that you might want to consider adding a
> special case for "/" and use Database::allterms_begin() to iterate
> over all the terms in the database.
>
>> My 3rd question is how I can get ALL results? get_mset() requires a
>> maximum number of results. Do I just set it to an extremely big
>> number and see it as a safety-limitation that shouldn't be reached?
>
> If you can handle result sets of any size, just pass
> db.get_doccount() - there can't be more matching documents than there
> are documents in the database.
>
> I'll add a note to the documentation comment for Enquire::get_mset()
> as this has been asked a few times before.
>
> Cheers,
>    Olly
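
The directory-term scheme from the quoted reply (P/photo's and P/photo's/2008 for a file in /photo's/2008) could be generated along these lines. This is only a sketch: `directory_terms` is a hypothetical helper, the file name in the docstring is made up for illustration, and absolute '/'-separated paths are assumed.

```python
def directory_terms(path, prefix="P"):
    """Return one prefixed term per directory that contains `path`.

    For example, "/photo's/2008/img.jpg" yields
    ["P/photo's", "P/photo's/2008"].
    """
    parts = [p for p in path.split("/") if p]
    terms = []
    # Every component except the last (the file name) is a directory.
    for depth in range(1, len(parts)):
        terms.append(prefix + "/" + "/".join(parts[:depth]))
    return terms
```

Each returned term would then be added to the document with `doc.add_term(term)`, and the full path stored with `doc.set_data(path)` so it can be read back from the results.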



