[Xapian-devel] Added a python example to the community page
aarsh shah
aarshkshah1992 at gmail.com
Fri Feb 8 16:19:07 GMT 2013
Hey James,
Hope you're doing fine :) Thanks for your detailed feedback.
I'm really sorry for the shabby code; I had just begun to get acquainted
with Xapian when I wrote this. To improve the example according to
the points you mentioned, here's what I'd like to do:
1.) I had actually thought of just ignoring the first word of every
sentence, but then realized that this would eliminate some genuine proper
nouns as well. There are two ways I can find the proper nouns:
a.) Use the Named Entity Recognition provided by NLTK, the Python NLP
library. It will directly pinpoint the proper nouns in a sentence for me.
But then anyone who wants to run the example will have to install NLTK as
well as its associated corpora. However, its performance is extremely
good.
b.) Check whether the first word of a sentence is a standard dictionary
word (using something like PyEnchant to speed up the process); if it is,
I won't consider it a proper noun. PyEnchant is relatively simple to
install, but this method will make mistakes depending on the words present
in the standard dictionary, because some dictionaries include a lot of
proper nouns.
2.) I agree. I just read the code of xapian.TermGenerator.index_text()
and realized that it already does a lot of processing, such as
tokenization, stemming etc. So what I'll now do is (because I only want to
index a single word at a time) first produce the unstemmed/stemmed form of
the words (by directly using the Xapian::Stem object), depending on the
stemming strategy, which I'll now take from the command line (and then use
TermGenerator.set_stemming_strategy, as this will help make my example a
good illustration of the various stemming strategies we provide), combine
them with the prefix (which will again be selected by the user), and then,
instead of using index_text(), I'll directly use the
add_term(term, wdf_inc) function of my Document object. I don't need to use
the Document.add_posting() function, as our example does not need phrase
searching.
3.) I'll work on the code to include support for sentences broken across
lines. I just somehow didn't do this when I wrote the example.
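
Points 1(b) and 3 above can be sketched together in plain Python. This is
only a rough illustration, not the code from the example: the function
names iter_sentences and candidate_proper_nouns are made up here, the
sentence splitter is a naive regular expression (it will trip over
abbreviations such as "e.g."), and the dictionary lookup is passed in as a
callable so that the sketch runs without PyEnchant installed.

```python
import re

# Hypothetical helper names; nothing here comes from the original example.


def iter_sentences(lines):
    """Yield sentences from an iterable of lines.

    Text is carried over from one line to the next until a sentence
    terminator (., ! or ?) is seen, so sentences may span line breaks.
    """
    pending = ""
    for line in lines:
        text = (pending + " " + line.strip()).strip()
        # Split after a terminator that is followed by whitespace.
        pieces = re.split(r"(?<=[.!?])\s+", text)
        # Every piece except the last is known to be a complete sentence.
        for sentence in pieces[:-1]:
            yield sentence
        last = pieces[-1]
        if last.endswith((".", "!", "?")):
            yield last
            pending = ""
        else:
            pending = last
    if pending:
        # Unterminated text at the end of the input.
        yield pending


def candidate_proper_nouns(sentence, is_dictionary_word):
    """Return capitalised words that may be proper nouns.

    The first word of the sentence is skipped only when its lower-cased
    form is a dictionary word; other capitalised words are always kept.
    """
    words = re.findall(r"[A-Za-z']+", sentence)
    found = []
    for position, word in enumerate(words):
        if not word[0].isupper():
            continue
        if position == 0 and is_dictionary_word(word.lower()):
            continue
        found.append(word)
    return found
```

With PyEnchant installed, enchant.Dict("en_US").check could be passed as
is_dictionary_word; how it treats lower-cased proper nouns depends on the
dictionary in use, which is exactly the weakness mentioned in 1(b).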
The modified code will now show various aspects of Xapian, such as stemming
strategies, specifying the prefix for terms, and some methods of the
Document object.
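
A minimal sketch of the term-building step described in point 2, under
several assumptions. The strategy names ("none", "some", "all", "all_z")
are hypothetical command-line values, build_terms and index_word are
made-up helper names, and the term layout follows my understanding of the
TermGenerator conventions (stemmed terms carry a "Z" prefix under STEM_SOME
and STEM_ALL_Z, but not under STEM_ALL). The stemmer and the document are
passed in, so the sketch itself does not import xapian.

```python
def build_terms(word, prefix, strategy, stem):
    """Return the list of terms to index for a single word.

    `stem` is any callable mapping a lower-cased str to its stem (str).
    `strategy` is one of the hypothetical names "none", "some", "all"
    or "all_z".
    """
    word = word.lower()
    plain = prefix + word
    if strategy == "none":
        return [plain]
    stemmed = stem(word)
    if strategy == "some":
        # Unstemmed term plus a Z-prefixed stemmed term.
        return [plain, "Z" + prefix + stemmed]
    if strategy == "all":
        # Stemmed term only, without the Z marker.
        return [prefix + stemmed]
    if strategy == "all_z":
        # Stemmed term only, with the Z marker.
        return ["Z" + prefix + stemmed]
    raise ValueError("unknown stemming strategy: %r" % (strategy,))


def index_word(doc, word, prefix, strategy, stem, wdf_inc=1):
    """Add the terms for `word` to `doc` using add_term(term, wdf_inc).

    `doc` only needs an add_term method, as xapian.Document has.
    No positional information is stored, so phrase searching on these
    terms will not work.
    """
    for term in build_terms(word, prefix, strategy, stem):
        doc.add_term(term, wdf_inc)
```

With the real bindings, stem would be an xapian.Stem("english") object and
doc an xapian.Document(). Depending on the bindings version the stemmer may
return bytes rather than str, in which case the result would need decoding
before being concatenated with the prefix.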
Please let me know what you think, and thank you so much for your time :)
-Regards
-Aarsh
On Fri, Feb 8, 2013 at 3:11 AM, James Aylett <james-xapian at tartarus.org> wrote:
> On 27 Jan 2013, at 20:09, aarsh shah <aarshkshah1992 at gmail.com> wrote:
>
> > Hey guys, I have added a python indexer example to the SampleCode page of
> > our wiki. Please do have a look. The code can also be found here:
> >
> > https://github.com/aarshkshah1992/xapian/blob/efcf443527b74326119bbc0935fc41a002ce60db/xapian-bindings/python/docs/examples/simpleindexgrep.py/
>
> Aarsh — what are you actually trying to do here? Because what your
> comments say you're doing isn't what the code does. Three problems:
>
> 1) English uses capitals at the start of sentences, so you're actually
> just indexing more or less everything
>
> 2) you're running xapian.TermGenerator.index_text() on single words, which
> isn't really what it's designed to do (it has its own word-splitting
> algorithm)
>
> 3) you don't support sentences broken across lines, which doesn't match
> the majority of use cases — although you may have a particular one in mind
>
> Does what you're trying to do show how to use an aspect of Xapian that we
> don't already show in the existing examples? Or at least show it more
> clearly?
>
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org
>
>