[Xapian-discuss] bigrams and co-occurrence matrix
Ying Liu
liux0395 at umn.edu
Tue Oct 27 17:29:13 GMT 2009
Hi Yung-chung,
I downloaded libunicode-0.4.tar.gz and ran your example code. It
works. Thanks!
From the methods you list, I can't work out how indexing and searching
fit together. To find the bigrams of 10000 documents, I can set
$tknzr->ngram_size(2), but what should I pass to $tknzr->tokenize()? My
understanding so far is that I read in the 10000 documents, save them as
Search::Xapian::Document objects $doc1 to $doc10000, store those objects
in a Search::Xapian::Database, and then call $tknzr->tokenize($db).
Should I first cut the texts into chunks with $tknzr->segment()?
I don't see a method in this module that adjusts the window size.
The goal here is to build a co-occurrence matrix, so windowing is very
important. Do you have any suggestions for this, or for building a
co-occurrence matrix with Xapian in general? I am a newbie to Xapian,
so my understanding of your module might be wrong.
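To make the goal concrete, here is a minimal sketch of the windowed
counting I have in mind, written in plain Python and independent of both
Xapian and your tokenizer. The function names and the whitespace
tokenization are only placeholders of mine. I am also assuming that a
window of N means each token is paired with each of the N - 1 tokens that
follow it, which is how I understand NSP's --window option; with window=2
this reduces to ordinary bigrams.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count ordered pairs (left, right) where `right` follows `left`
    within a window of `window` tokens (the window includes `left`)."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # The slice holds the next (window - 1) tokens after position i.
        for right in tokens[i + 1 : i + window]:
            counts[(left, right)] += 1
    return counts


def corpus_counts(documents, window=2):
    """Sum the pair counts over an iterable of document strings.
    Lowercasing and whitespace splitting are a placeholder tokenizer."""
    total = Counter()
    for text in documents:
        total.update(cooccurrence_counts(text.lower().split(), window))
    return total
```

The resulting Counter is a sparse co-occurrence matrix keyed by
(row term, column term). What I can't see is how to get the same counts
out of Xapian or the tokenizer module.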
Thank you,
Ying
☼ 林永忠 ☼ (Yung-chung Lin) wrote:
> Hi Ying,
>
> You may download libunicode from here:
> http://ftp.gnome.org/pub/gnome/sources/libunicode/0.4/
>
> Best,
> Yung-chung Lin
>
> 2009/10/27 Ying Liu <liux0395 at umn.edu>
>
> Hi Yung-chung,
>
> Thanks for your reply. I downloaded the cjk-tokenizer from CPAN at
> http://search.cpan.org/~xern/Lingua-CJK-Tokenizer-0.01/lib/Lingua/CJK/Tokenizer.pm
> It has a prerequisite, libunicode by Tom Tromey, which I can't find
> on CPAN. What should I install to make the cjk-tokenizer
> module work?
>
> Thanks,
> Ying
>
>
> ☼ 林永忠 ☼ (Yung-chung Lin) wrote:
>
> Hi Ying,
>
> You may check this: http://code.google.com/p/cjk-tokenizer/
> A Perl binding is also included.
>
> Best,
> Yung-chung Lin
>
>
> 2009/10/26 Ying Liu <liux0395 at umn.edu>
>
>
> Hello all,
>
> I want to work out a way to count bigrams and create a
> co-occurrence matrix with the Xapian Perl modules. Checking the
> archived emails, I found some discussions about CJK tokens, but I am
> only working with English documents. My immediate questions are how
> Xapian handles bigrams and whether it can do so with windowing, as
> NSP does with its --window option. Has anyone worked on this before?
> Do you have any suggestions?
>
> Thank you,
> Ying
>
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss