[Xapian-discuss] Tag-based filesystem with xapian, advice?

Karel Marissens karel.marissens at gmail.com
Sat Feb 28 21:21:15 GMT 2009


Hi.

For my thesis, I'm working on a combination of a hierarchical and
tag-based filesystem written in Python. I'm using FUSE to implement the
filesystem. Now I am thinking about using Xapian, but I could use some
advice.

Before I go into my questions, I'll explain the idea of the system
(questions after the horizontal line). The idea is that there's a
(hidden) directory using a hierarchical filesystem. I then virtually
replicate this directory at a logical place (say, the home folder of a
user) using FUSE. It can be used exactly as a user normally would use a
directory, but I add extra functionality: tags. Every file can have
several tags (keywords) associated with it. An image of the Christmas
tree, for example, might be hierarchically located in
/photo's/2008/christmas and be tagged as "tree, christmas, photo, 2008".

How tags are added to files etc. is not important here. What is  
important is that the association of a file with several tags will be  
saved in a database, as this information needs to be searchable.

Every directory in the hierarchy will have a special folder: +FIND.
When one goes to /photo's/2008/+FIND, all tags associated with files
in the directory /photo's/2008, or any of its subdirectories, will be
visible as subdirectories. Opening such a subdirectory shows a list of
the tags (again in the form of subdirectories) that can be combined
with it. So /photo's/2008/+FIND/christmas will show all tags
associated with files in the directory /photo's/2008, or any of its
subdirectories, which are tagged as christmas.

At any moment, the user can go into the special subdirectory +FILES to
see a list of all the files that match the selection.
/photo's/2008/+FIND/christmas/tree/+FILES will thus show all files in
/photo's/2008, or any of its subdirectories, which are tagged as both
christmas and tree.
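
To make the lookup scheme above concrete, here is a minimal sketch of how
such a virtual path could be split into the real directory, the selected
tags, and a flag saying whether the file list was requested. The function
name and the returned tuple are hypothetical, chosen only for
illustration; only the +FIND and +FILES names come from the description.

```python
FIND = "+FIND"
FILES = "+FILES"


def parse_virtual_path(path):
    """Split a virtual path into (base_dir, tags, list_files).

    Everything before +FIND is the real directory, the components after
    it are the selected tags, and a trailing +FILES asks for the files
    that match the selection.
    """
    parts = [p for p in path.split("/") if p]
    if FIND not in parts:
        # An ordinary path: no tag selection is involved.
        return "/" + "/".join(parts), [], False
    at = parts.index(FIND)
    base_dir = "/" + "/".join(parts[:at])
    tags = parts[at + 1:]
    list_files = bool(tags) and tags[-1] == FILES
    if list_files:
        tags = tags[:-1]
    return base_dir, tags, list_files
```

For the example path, parse_virtual_path("/photo's/2008/+FIND/christmas/tree/+FILES")
yields the directory /photo's/2008, the tags christmas and tree, and a
request for the file list.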

----------------------------------------------------------------------------------------------------

So, while searching for the best way to save all the needed
information in a database and retrieve it again, I stumbled upon
Xapian. I read the information pages and the whole API, looked at the
few examples I could find and did some small tests. I will only use
the boolean search functionality, as I have no need to "guess" which
file is most relevant; I just need to show them all.

Now my first question is: how should the path of a file be stored? As
the content (data) of a document? A term? A value? I need to be able to
use the path when searching, as I need to be able to limit the results
to files in a certain directory, for example only files whose path
matches /photo's/2008/*. Or do I have to work with a relevance set or
something?

I tried using the path as a tag itself, but when I do a query for
"/photo's/2008/*", I think it is automatically split into two separate
terms (a file tagged as 2008 also showed up, for example).
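
One possible approach to the path question, sketched here under
assumptions rather than taken from the Xapian documentation, is to index
one boolean term per ancestor directory of each file. A search limited to
/photo's/2008 then becomes an exact match on a single term, and no
wildcard is needed. The prefix "XDIR" and the function name are
hypothetical. The terms would have to be added to the document and turned
into query objects directly, not passed through the query parser as a
string, since the parser presumably splits the path at the punctuation.

```python
import posixpath

DIR_PREFIX = "XDIR"  # hypothetical term prefix, not defined by Xapian


def directory_terms(file_path):
    """Return one term per ancestor directory of an absolute file path."""
    terms = []
    directory = posixpath.dirname(file_path)
    while True:
        terms.append(DIR_PREFIX + directory)
        # Stop at the root (or at "" if a relative path slipped in).
        if directory in ("/", ""):
            break
        directory = posixpath.dirname(directory)
    return terms
```

For /photo's/2008/christmas/tree.jpg this produces terms for
/photo's/2008/christmas, /photo's/2008, /photo's and /, so the file is
found by a search restricted to any of those directories. Note that
Xapian limits the length of a term, so very long paths may need to be
shortened or hashed.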

My second question is: what is the easiest way to get a list of all
the tags associated with the files in the result set? I want to have a
list of all tags associated with files in /photo's/2008. One method
would be to do a search for all files in /photo's/2008, or any
subdirectory, loop over all the results and, per document, loop over
the terms associated with it and add these to a list.
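
The looping method just described can be sketched as follows. A plain
dictionary stands in for the Xapian database here (an assumption made
only to keep the example self-contained), and both function names are
hypothetical. With Xapian, the inner loop would run over the term list
of each matching document instead.

```python
def files_under(index, directory):
    """Yield (path, tags) for every file in directory or below it."""
    prefix = directory.rstrip("/") + "/"
    for path, tags in index.items():
        if path.startswith(prefix):
            yield path, tags


def tags_in_results(results):
    """Collect the distinct tags of all (path, tags) results, sorted."""
    found = set()
    for _path, tags in results:
        found.update(tags)
    return sorted(found)


# In-memory stand-in for the database: file path -> set of tags.
index = {
    "/photo's/2008/christmas/tree.jpg": {"tree", "christmas", "photo", "2008"},
    "/photo's/2008/summer/beach.jpg": {"beach", "photo", "2008"},
    "/photo's/2007/party.jpg": {"party", "photo", "2007"},
}

tags_2008 = tags_in_results(files_under(index, "/photo's/2008"))
```

Using a set avoids duplicates when many files share the same tag, which
a plain list would not.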

My third question is: how can I get ALL results? get_mset() requires a
maximum number of results. Do I just set it to an extremely big number
and treat it as a safety limit that should never be reached?
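
An alternative to an arbitrary big number, sketched here as an
assumption and not as the recommended Xapian idiom, is to use the number
of documents in the database as the maximum, since a query can never
match more documents than exist. The helper name is hypothetical; it
expects objects that behave like a Xapian database (get_doccount()) and
enquire object (get_mset(first, maxitems)).

```python
def all_matches(database, enquire):
    """Fetch every match by using the document count as the upper bound."""
    # A query can never match more documents than the database contains.
    return enquire.get_mset(0, database.get_doccount())
```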

----------------------------------------------------------------------------------------------------

To sum it all up:
1) Where do I store the path of a file?
2) How do I get a list of all terms associated with the documents in
the result set?
3) How do I get ALL results, not a limited number?

Thanks in advance for any advice!

Karel

