[Xapian-discuss] bigrams and co-occurrence matrix

Ying Liu liux0395 at umn.edu
Tue Oct 27 17:29:13 GMT 2009


Hi Yung-chung,

I downloaded libunicode-0.4.tar.gz and ran your example code. It
works. Thanks!

I couldn't understand from the methods you gave how indexing and
searching work with this module. If I want to find the bigrams of 10000
documents, I can set $tknzr->ngram_size(2), but how should I call
$tknzr->tokenize()? My understanding so far is that I first read in the
10000 documents and save them as Search::Xapian::Document objects $doc1
to $doc10000, store those in a Search::Xapian::Database, and then call
$tknzr->tokenize($db). Should I cut the texts into chunks first with
$tknzr->segment()?

In this module, I don't see a method that adjusts the window size.
The goal here is to build a co-occurrence matrix, so the windowing
operation is very important. Do you have any suggestions about this, or
about building a co-occurrence matrix with Xapian in general? I am a
newbie to Xapian, so my understanding of your module might be wrong.
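
For concreteness, the windowed co-occurrence counting asked about here can
be sketched as follows. This is a plain Python illustration only, not
Xapian or cjk-tokenizer code. The function name, the whitespace
tokenization, the toy documents, and the window semantics (a window of 2
pairs only adjacent tokens, i.e. plain bigrams; a window of 3 also pairs
tokens with one token between them) are all assumptions made for the
example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count ordered pairs (a, b) where b follows a within `window` tokens.

    window=2 pairs only adjacent tokens (plain bigrams);
    window=3 also pairs tokens that have one token between them, and so on.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair the token at position i with each of the next (window - 1) tokens.
        for right in tokens[i + 1 : i + window]:
            counts[(left, right)] += 1
    return counts


# Hypothetical toy corpus; real documents would need proper tokenization.
docs = ["the cat sat on the mat", "the cat ate"]

# Sparse co-occurrence "matrix": keys are (row term, column term) pairs.
matrix = Counter()
for text in docs:
    matrix.update(cooccurrence_counts(text.lower().split(), window=3))
```

Keeping the matrix as a sparse mapping from term pairs to counts avoids
allocating a full vocabulary-by-vocabulary table, which matters once the
vocabulary of 10000 documents is involved.
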

Thank you,
Ying




☼ 林永忠 ☼ (Yung-chung Lin) wrote:
> Hi Ying,
>
> You may download libunicode from here:
> http://ftp.gnome.org/pub/gnome/sources/libunicode/0.4/
>
> Best,
> Yung-chung Lin
>
> 2009/10/27 Ying Liu <liux0395 at umn.edu>
>
>     Hi Yung-chung,
>
>     Thanks for your reply. I downloaded the cjk-tokenizer from CPAN at
>     http://search.cpan.org/~xern/Lingua-CJK-Tokenizer-0.01/lib/Lingua/CJK/Tokenizer.pm.
>     It has a prerequisite, libunicode by Tom Tromey, which I couldn't
>     find on CPAN. What should I install to make the cjk-tokenizer
>     module work?
>
>     Thanks,
>     Ying
>
>
>     ☼ 林永忠 ☼ (Yung-chung Lin) wrote:
>
>         Hi Ying,
>
>         You may want to check http://code.google.com/p/cjk-tokenizer/
>         A Perl binding is also included.
>
>         Best,
>         Yung-chung Lin
>
>
>         2009/10/26 Ying Liu <liux0395 at umn.edu>
>
>
>            Hello all,
>
>            I want to work out a solution for counting bigrams and
>            creating a co-occurrence matrix with the Xapian Perl modules.
>            Checking the archived emails, I found some discussions about
>            CJK tokens, but I am only working on English documents. My
>            immediate goals are to learn how Xapian does bigrams and how
>            it can do that with windowing, like NSP does with the
>            --window option. Has anyone worked on this before? Do you
>            have any suggestions?
>
>            Thank you,
>            Ying
>
>
>            _______________________________________________
>            Xapian-discuss mailing list
>            Xapian-discuss at lists.xapian.org
>            http://lists.xapian.org/mailman/listinfo/xapian-discuss
>
>
>
>
>
>



