Here’s how you do full text searching across multiple different columns in a multilingual environment on Oracle 9 and better, quick and easy. I’m more of an MySQL guy, so you’ll have to excuse me if my Oracle lingo isn’t up to scratch.
One time setup, as system
This has to be execute by a privileged Oracle user. If there is an error message, it’s probably because CTX_DDL has not been installed and you’ll need to talk to a skilled DBA or figure out how to do it yourself.
GRANT EXECUTE ON CTX_DDL TO mydb;
Lexer setup
One the step above is done, you can do everything as the Oracle user ID you normally work under.
This step sets up the ‘global_lexer’ which determine how Oracle understands text (e.g. that oxen is the plural of ox in the English locale). If you are only working in English I believe you can just drop all the French references.
begin
ctx_ddl.drop_preference('global_lexer');
end;
/
begin
ctx_ddl.drop_preference('english_lexer');
end;
/
begin
ctx_ddl.drop_preference('french_lexer');
end;
/
begin
ctx_ddl.create_preference('english_lexer','basic_lexer');
ctx_ddl.set_attribute('english_lexer','index_themes','yes');
ctx_ddl.set_attribute('english_lexer','printjoins','$_#@*&^%/\()');
ctx_ddl.set_attribute('english_lexer', 'skipjoins', '-');
ctx_ddl.create_preference('french_lexer','basic_lexer');
ctx_ddl.set_attribute('french_lexer','index_themes','no');
ctx_ddl.set_attribute('french_lexer','base_letter','yes');
end;
/
exec ctx_ddl.create_preference('global_lexer','multi_lexer') ;
begin
ctx_ddl.add_sub_lexer('global_lexer', 'french', 'french_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');
end;
/
Table setup
If you’re doing a multilingual set up, you’ll need a field in your table that specifies the language. One can never be quite sure about how things are done in Oracle, but the values EN and FR seem to do the right thing. Our table looks something like this (we’re just showing the important stuff here):
CREATE TABLE Post
(
search CLOB,
subject NVARCHAR2(256) NOT NULL,
body CLOB,
lang VARCHAR2(6) NOT NULL
);
The subject and body fields have the data we want to search. Because Oracle doesn’t let you search multiple fields, we concatenate them into search at UPDATE/INSERT time. lang is storing the language code for this particular row, i.e. EN or FR.
Next we must set up a trigger that will maintain the search field for you:
CREATE OR REPLACE TRIGGER post_text_trigger
BEFORE UPDATE OR INSERT ON post
FOR EACH ROW
DECLARE
a CLOB;
BEGIN
a := :NEW.subject || ' ';
a := a || :NEW.body;
:NEW.search := a;
END post_text_trigger ;
/
Now you may be asking yourself: why is that ‘a’ assignment all over the place? Simple: who the hell knows, it’s Oracle.
After that we create a new INDEX on the table to do the full text searching:
CREATE INDEX post_text ON Post(search)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('LEXER global_lexer STOPLIST ctxsys.default_stoplist LANGUAGE COLUMN lang');
Maintaining the Index
Unlike what you might expect, Oracle doesn’t magically keep the full text search index up-to-date. This is something you have to maintain on your own. If you were to execute a search, you wouldn’t find anything right now.
To bring the index up to date, use the following command:
EXEC CTX_DDL.SYNC_INDEX('post_text');
In our implementation we just run that command from a cron script every 30 minutes. There’s also a way to do this in Oracle I understand, but we’re quite comfortable with UNIX commands. How often you want to run will vary depending on how much text you have and how often you update it.
You also have another option: create a trigger that will update the index every time a row is modified. There may be performance issues involved with this, but if you want to try it, here’s the magic:
CREATE OR REPLACE TRIGGER post_text_trigger2
AFTER INSERT OR UPDATE OR DELETE ON post
DECLARE
v_job NUMBER;
BEGIN
IF deleting THEN
DBMS_JOB.SUBMIT(v_job, 'ctx_ddl.optimize_index(''post_text'',''FULL'');', SYSDATE);
ELSE
DBMS_JOB.SUBMIT(v_job, 'ctx_ddl.sync_index(''post_text'');', SYSDATE);
END IF;
END;
/
Doing searches
Oracle provides all sorts of various methods, obtusely documented. See the references below for more. What you probably want to do is, well, look for stuff. Here’s what we did
First, we convert the search string into a safe list words — no punctuation, etc.. The we create a search string that looks something like the following:
SELECT * FROM Post WHERE contains(search, '${mutt} AND ${and} AND ${jeff}') > 0;
Note the {}: this stops the word ‘and’ being searched from from being recognized as a Oracle Keyword
Further reading
There’s lots of info here and the first reference in particular told me most of the information I needed to know.