David Janes' Code Weblog

November 15, 2011

Python, Sqlite3, FTS3 & MacOSX

db,python,tips · admin · 8:49 am ·

If you can’t load FTS3 on a Mac, this post on StackOverflow tells you how to solve the problem.

The final bit of magic is in:

from pysqlite2 import dbapi2 as sqlite3

March 26, 2009

How to: create a MySQL database and user

db · David Janes · 6:02 am ·

I always seem to forget this:

$ mysql -u root -p
Enter password: enter mysql-root-user passwd
mysql> CREATE DATABASE DATABASENAME
mysql> GRANT ALL PRIVILEGES ON DBNAME.* to "DBUSER"@"localhost" IDENTIFIED BY "DBPASSWORD";
mysql> FLUSH PRIVILEGES;
mysql> EXIT
$

December 29, 2008

Interesting links from the last month

db,ideas,semantic web · David Janes · 9:29 am ·
  • Aspena web server for highly extensible Python-based publication, application, and hybrid websites. As a potential alternative to Python’s builtin HTTPServer. MIT license.
  • V8V8 is Google’s open source JavaScript engine; written in C++; can run standalone, or can be embedded into any C++ application. I am very excited by this, as allowing users to send code to the server to execute Javascript is an amazingly powerful idea. If anyone knows of a Python wrapper, let me know please. New BSD license.
  • KomodoEdit (a testimonial) – I am going to try this out, though vi/vim will always be my first love (JJ also has an article on using ctags).
  • Virtuoso - an innovative Universal Server platform that delivers an enterprise level Data Integration and Management solution for SQL, RDF, XML, Web Services, and Business Processes. There’s way to much bla bla bla in that sentence, but apparently this is really sweet at handling SPARQL/RDF triples. Kingsley Idehen writes extensively about this on his blog (e.g.).
  • Drizzlea database optimized for Cloud and Net applications. Way too early to commit to this yet. See The New MySQL Landscape for more interesting going ons.
  • AuthKitauthentication and authorization toolkit for WSGI applications and frameworks.
  • Geodjangoa world-class geographic web framework. Lots of great ideas and pointers to libraries in here, even if you’re not planning to use this itself.
  • Disco – an open-source implementation of the Map-Reduce framework for distributed computing. The Disco core is written in Erlang, a functional language that is designed for building robust fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks often only in tens of lines of code. Here’s a blog post about the same, with references to vs. Hadoop.
  • On (Python) packaging. Debating distutil, easy_install and pip.

December 6, 2008

All your Base are belong to us

db,freebase,semantic web · David Janes · 6:26 am ·

Freebase is a user-editable, user-extensible structured database, a sort of one-stop shop semantic web/Wikipedia application. I started playing with Freebase about a year ago and the application has made significant strides over that period, especially in the usability department. Freebase also provides a very nice API which I’m using in GenX, with the caveat that it’s currently almost useless because of query timeouts.

I just came across the following page on Freebase: http://vancouver.freebase.com/. This page is what Freebase calls a Base, which is a collection of Tables/Views, which are things like “Vancouver Bloggers“, “Mayoral Candidates 2008” and so forth. A Table/View is a list of Topics, which are basically the equivalent of a Wikipedia page. Get all that? It makes sense after a while

A few observations:

  • Why have I written Table/View above? Because in some places it’s called a Table and other places it’s called a View. Which is it? I’m guessing View but it’s still not 100% clear.
  • I decided to create our own Toronto Base especially for the TorCamp community. Given that you get your own top-level domain name there’s somewhat of an incentive to be a first-mover on this
  • When you create a Base, it provides a list of suggested Views that can be added. Nice. Unfortunately, it added each View twice. I then had to go delete the duplicate View manually. Not so nice. And then even though I’ve deleted the View it still shows up on a detail page. Sigh.
  • On thus plus side, this is all done in a nice-Ajaxy way
  • It’s really not at all obvious how you create a new View. Really not obvious. Here’s the documentation.
  • My initial opinion was that Views seem to be copies, not references: this turns out to be a wrong assumption on my part. Views are in fact (if I got this right) the results of a query on the Freebase db. This means that as more Topics match the View query, they’ll automatically show up. The query is a copy, not a reference, but this is a good thing.
  • The implication is that it’s difficult to create a View that is an arbitrary “bag” of topics. For example, if I want to create a Toronto Bloggers View, I have to actually make sure that all the Topics that will show up are marked with some attribute that can be matched to give them a Toronto-bloggerness quality.

November 24, 2008

Database roundup

db,ideas,python,semantic web · David Janes · 7:27 am ·

Here’s a few things I was reading about over the weekend.

SQLAlchemy

SQLAlchemy is a full-featured Design Pattern-heavy pythonic database ORM. I am totally going to use this for my next Python SQL database project and may even do some playing with old datasets (using the reflection features, yum) soon. If you are considering doing SQL work on your next Python project, don’t even bother with the usual PEP 249 stuff, start with this.

Note that if you’re working with Django it handles the DB in its own way so SQLAlchemy may be of limited utility.

CouchDB

CouchDB “is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”. I couldn’t have written that more succently myself, so I didn’t. I qualified the paragraph above on SQLAlchemy that I’m going to use that for my next SQL project because I’m really biting at the bit to try CouchDB out. The CouchDB design philosophy – a REST API a returning lists of JSON-objects – reflects my current design paradigm very closely, and the only question I have is whether in practically scales to millions of rows.

A caveat that it’s written in the-cool-nerds-are-doing-it language Erlang, but because you don’t have to interact with that it should be OK for us mortals.

CouchDB is about to officially become a “top level” Apache project, though none of the documentation on the Apache.org site reflects this yet.

Virtuoso

Virtuoso is a “high-performance object-relational SQL database”. It apparently can perform well. As I came across through the Planet RDF aggregator, this may be something you want to look into if you’re working on an RDF/SPARQL project.

Amazon Web Services Hosted Data Sets

That’s a mouthfull, isn’t it? Amazon is offering to host public datasets on EC2 for free. What’s the catch? It will host the data, but you have to pay for the computing resources to use that data in the normal EC2 manner. Still, if you’re using a large public dataset and you’re already EC2-friendly, you might want to consider this program. An even more interesting thought occurs (though I’m not sure if it will fly): if you’re using large amounts of your own data on EC2, you may want to offer it up as a free resource.

There’s more on this on by Lidija Davis on Read/Write Web.

October 27, 2008

How to do multi-column multilingual full text searching in Oracle

db · · David Janes · 7:43 pm ·

Here’s how you do full text searching across multiple different columns in a multilingual environment on Oracle 9 and better, quick and easy. I’m more of an MySQL guy, so you’ll have to excuse me if my Oracle lingo isn’t up to scratch.

One time setup, as system

This has to be execute by a privileged Oracle user. If there is an error message, it’s probably because CTX_DDL has not been installed and you’ll need to talk to a skilled DBA or figure out how to do it yourself.

GRANT EXECUTE ON CTX_DDL TO mydb;

Lexer setup

One the step above is done, you can do everything as the Oracle user ID you normally work under.

This step sets up the ‘global_lexer’ which determine how Oracle understands text (e.g. that oxen is the plural of ox in the English locale). If you are only working in English I believe you can just drop all the French references.

begin
ctx_ddl.drop_preference('global_lexer');
end;
/

begin
ctx_ddl.drop_preference('english_lexer');
end;
/

begin
ctx_ddl.drop_preference('french_lexer');
end;
/

begin
ctx_ddl.create_preference('english_lexer','basic_lexer');
ctx_ddl.set_attribute('english_lexer','index_themes','yes');
ctx_ddl.set_attribute('english_lexer','printjoins','$_#@*&^%/\()');
ctx_ddl.set_attribute('english_lexer', 'skipjoins', '-');
ctx_ddl.create_preference('french_lexer','basic_lexer');
ctx_ddl.set_attribute('french_lexer','index_themes','no');
ctx_ddl.set_attribute('french_lexer','base_letter','yes');
end;
/

exec ctx_ddl.create_preference('global_lexer','multi_lexer') ;

begin
ctx_ddl.add_sub_lexer('global_lexer', 'french', 'french_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');
end;
/

Table setup

If you’re doing a multilingual set up, you’ll need a field in your table that specifies the language. One can never be quite sure about how things are done in Oracle, but the values EN and FR seem to do the right thing. Our table looks something like this (we’re just showing the important stuff here):

CREATE TABLE Post
(
    search    CLOB,
    subject   NVARCHAR2(256)  NOT NULL,
    body      CLOB,
    lang      VARCHAR2(6)    NOT NULL
);

The subject and body fields have the data we want to search. Because Oracle doesn’t let you search multiple fields, we concatenate them into search at UPDATE/INSERT time. lang is storing the language code for this particular row, i.e. EN or FR.

Next we must set up a trigger that will maintain the search field for you:

CREATE OR REPLACE TRIGGER post_text_trigger
BEFORE UPDATE OR INSERT ON post
FOR EACH ROW
DECLARE
    a CLOB;
BEGIN
    a := :NEW.subject || ' ';
    a := a || :NEW.body;
    :NEW.search := a;
END post_text_trigger ;
/

Now you may be asking yourself: why is that ‘a’ assignment all over the place? Simple: who the hell knows, it’s Oracle.

After that we create a new INDEX on the table to do the full text searching:

CREATE INDEX post_text ON Post(search)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('LEXER global_lexer STOPLIST ctxsys.default_stoplist LANGUAGE COLUMN lang');

Maintaining the Index

Unlike what you might expect, Oracle doesn’t magically keep the full text search index up-to-date. This is something you have to maintain on your own. If you were to execute a search, you wouldn’t find anything right now.

To bring the index up to date, use the following command:

EXEC CTX_DDL.SYNC_INDEX('post_text');

In our implementation we just run that command from a cron script every 30 minutes. There’s also a way to do this in Oracle I understand, but we’re quite comfortable with UNIX commands. How often you want to run will vary depending on how much text you have and how often you update it.

You also have another option: create a trigger that will update the index every time a row is modified. There may be performance issues involved with this, but if you want to try it, here’s the magic:

CREATE OR REPLACE TRIGGER post_text_trigger2
AFTER INSERT OR UPDATE OR DELETE ON post
DECLARE
    v_job NUMBER;
BEGIN
    IF deleting THEN
        DBMS_JOB.SUBMIT(v_job, 'ctx_ddl.optimize_index(''post_text'',''FULL'');', SYSDATE);
    ELSE
        DBMS_JOB.SUBMIT(v_job, 'ctx_ddl.sync_index(''post_text'');', SYSDATE);
    END IF;
END;
/

Doing searches

Oracle provides all sorts of various methods, obtusely documented. See the references below for more. What you probably want to do is, well, look for stuff. Here’s what we did

First, we convert the search string into a safe list words — no punctuation, etc.. The we create a search string that looks something like the following:

SELECT * FROM Post WHERE contains(search, '${mutt} AND ${and} AND ${jeff}') > 0;

Note the {}: this stops the word ‘and’ being searched from from being recognized as a Oracle Keyword

Further reading

There’s lots of info here and the first reference in particular told me most of the information I needed to know.

Powered by WordPress

Switch to our mobile site