Archive for the ‘Autonomy’ category

Autonomy Search Developer Starter

January 17th, 2008

So you’ve got an Autonomy IDOL in hand, and you’ve been asked to build a search application around it.  Here are some thoughts on getting started.

Let’s assume you’ve got the content in.  In a later post I’ll cover some of the fetches/connectors that you have access to, and what you can do with them.  For now, let’s start with a simple query.  Assume that the IDOL is installed on a server named search, listening on port 9000.  Open your browser to:

http://search:9000/action=query&text=*

An installed IDOL listens on many ports: the default port of 9000 is where the IDOL Proxy Service sits and listens.  The response for an “action=query” is to return results that match the “text=” query.  By default, the response will contain up to 6 records, showing the default fields for each record (usually a small subset of the metadata, and none of the content), and will look something like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">
  <action>QUERY</action>
  <response>SUCCESS</response>
  <responsedata>
     <autn:numhits>6</autn:numhits>
     <autn:hit>
       <autn:reference>test_document.doc</autn:reference>
       <autn:id>1</autn:id>
       <autn:section>0</autn:section>
       <autn:weight>96.00</autn:weight>
       <autn:database>News</autn:database>
     </autn:hit>
    ...
  </responsedata>
</autnresponse>

Lesson #1: All meaningful Autonomy interaction is through URLs, and the response is typically in XML.  Some simple C# code to handle the response above would look like:

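using System.Xml;

// Load the ACI response straight from the query URL.
XmlDocument xml = new XmlDocument();
xml.Load("http://search:9000/?action=query&text=*");

// The hit elements live in the "autn" namespace, so register the prefix for XPath.
XmlNamespaceManager nsmgr = new XmlNamespaceManager(xml.NameTable);
nsmgr.AddNamespace("autn", "http://schemas.autonomy.com/aci/");

// The document root is <autnresponse>, so the path has to start there.
XmlNode node = xml.SelectSingleNode("/autnresponse/responsedata/autn:hit[1]/autn:reference", nsmgr);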
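For readers who would rather prototype outside .NET, here is a minimal Python sketch of the same extraction.  It parses an inline copy of the sample response shown earlier (trimmed to one hit) instead of calling a live server, and the function name parse_hits is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample response from this post, trimmed to a single hit.
SAMPLE = """<?xml version="1.0" encoding="ISO-8859-1" ?>
<autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">
  <action>QUERY</action>
  <response>SUCCESS</response>
  <responsedata>
    <autn:numhits>1</autn:numhits>
    <autn:hit>
      <autn:reference>test_document.doc</autn:reference>
      <autn:id>1</autn:id>
      <autn:section>0</autn:section>
      <autn:weight>96.00</autn:weight>
      <autn:database>News</autn:database>
    </autn:hit>
  </responsedata>
</autnresponse>"""

# Prefix-to-URI map so the "autn:" prefix can be used in search paths.
NS = {"autn": "http://schemas.autonomy.com/aci/"}


def parse_hits(xml_bytes):
    """Return one dict per <autn:hit> (hypothetical helper, not an Autonomy API)."""
    root = ET.fromstring(xml_bytes)  # root element is <autnresponse>
    if root.findtext("response") != "SUCCESS":
        raise ValueError("IDOL did not report SUCCESS")
    hits = []
    for hit in root.findall("responsedata/autn:hit", NS):
        hits.append({
            "reference": hit.findtext("autn:reference", namespaces=NS),
            "weight": float(hit.findtext("autn:weight", namespaces=NS)),
            "database": hit.findtext("autn:database", namespaces=NS),
        })
    return hits


# Pass bytes so the parser honours the ISO-8859-1 declaration.
hits = parse_hits(SAMPLE.encode("iso-8859-1"))
print(hits[0]["reference"])  # prints test_document.doc
```

Against a real server you would feed the bytes returned by the query URL into parse_hits instead of SAMPLE.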

The next step is to figure out how to issue queries that are more meaningful than text=*.  For that, we turn to Autonomy’s built-in help page.  You access it by (you guessed it) going to a URL:

http://search:9000/action=help

Lesson #2: Always have the help URL open on a monitor.  The HTML help that is displayed is the single best resource for questions; and ironically, it is not searchable.  Non-searchable help?  From a search company?  Yes.  Perhaps that was left intentionally as a challenge to the buyer to set up their first source…  In any case, your first friend will be the Query node, where you can find all sorts of helpful information on how to build the specific query you’re looking for.  Remember, unless your users are technical, it will most likely be your responsibility to “query cook”: accepting simplified input from your users and creating the complex URL that Autonomy needs.
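To make “query cooking” concrete, here is a minimal Python sketch that turns simple user input into a query URL.  The function name cook_query is made up, the host and port are the ones assumed earlier in this post, and the maxresults and databasematch parameters are assumptions on my part, so check the Query node of the help page for the exact names your IDOL version accepts.

```python
from urllib.parse import quote, urlencode


def cook_query(user_text, host="search", port=9000, max_results=None, databases=None):
    """Build an action=query URL from plain user input (hypothetical helper)."""
    # Fall back to the match-everything query when the user typed nothing.
    params = {"action": "query", "text": user_text.strip() or "*"}
    if max_results is not None:
        params["maxresults"] = int(max_results)  # assumed parameter name
    if databases:
        params["databasematch"] = ",".join(databases)  # assumed parameter name
    # quote() encodes spaces as %20 rather than "+"; "*" is left readable.
    query_string = urlencode(params, safe="*", quote_via=quote)
    return "http://%s:%d/?%s" % (host, port, query_string)


print(cook_query("annual report", max_results=10, databases=["News"]))
# prints http://search:9000/?action=query&text=annual%20report&maxresults=10&databasematch=News
print(cook_query(""))
# prints http://search:9000/?action=query&text=*
```

Keeping the cooking in one function gives you a single place to add escaping, defaults, and whatever extra parameters the help page turns up.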

In future posts, I’ll look at some of the specifics of the query URL, and how to see the impact in the logs.

Using LogParser to quantize Autonomy search logs

January 15th, 2008

I explained in a previous post how you can use the Microsoft Log Parser to dice up Autonomy IDOL search logs.  If you’ve exhausted the typical checks for performance problems in your IDOL installation, it might help to narrow down when the problems occur, and look for clusters of slow performance.  That’s a great opportunity to use the Log Parser again.

The first step is to level the playing field on timing information; the content GRL and the DAH GRL show duration information in milliseconds mixed with seconds.  I’m sure there is a clever way to correct that in-stream, but I took the brute-force approach: create separate files from first the rows with seconds, then those with milliseconds, and finally produce a single file from the results.

logparser -i:xml -o:csv "select *, mul(to_real(extract_prefix([autn:duration], 0, ' s')), 1000) as milliseconds into over_1_second.csv from http://server:port/?action=grl&format=xml&tail=10000 where [autn:duration] like '% s'"

Then the rows under 1 second:

logparser -i:xml -o:csv "select *, to_real(extract_prefix([autn:duration], 0, ' ms')) as milliseconds into under_1_second.csv from http://server:port/?action=grl&format=xml&tail=10000 where [autn:duration] like '% ms'"

Then merge the two files into a single file:

logparser -i:csv -o:csv "select milliseconds, [autn:time], [autn:thread], [autn:status], [autn:action], [autn:request], [autn:client] into merged.csv from *.csv"

These three basic steps serve as the basis for most log analysis I do, so I’ve added them into a script.  The result, merged.csv, is a flattened file that contains the data we need.  If you are going to script this, don’t forget to escape the percents, i.e. like '%% s'.
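If you would rather normalize the durations in one pass instead of splitting and merging files, the conversion itself is small.  Here is a minimal Python sketch, assuming the duration values look like '1.62 s' or '340 ms' as described above; the function name duration_to_ms is made up for this illustration.

```python
def duration_to_ms(raw):
    """Convert a GRL duration string such as '2.5 s' or '340 ms' to milliseconds."""
    value, unit = raw.strip().split()
    if unit == "s":
        return float(value) * 1000.0  # seconds to milliseconds
    if unit == "ms":
        return float(value)  # already in milliseconds
    raise ValueError("unexpected duration unit: %r" % raw)


print(duration_to_ms("2.5 s"))   # prints 2500.0
print(duration_to_ms("340 ms"))  # prints 340.0
```

Failing loudly on an unknown unit is deliberate: a silently skipped row would skew the averages later on.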

Next, we run a quant operation on the logs.  I’ve found that a half-hour period makes for a good range to view the average query performance:

logparser -i:csv -o:csv "select quantize(to_timestamp([autn:time], 'dd MMM yy hh:mm:ss'), 1800) as period, avg(milliseconds) from merged.csv group by period order by period"

The 1800 constant there is in seconds, i.e. half an hour.  The result is a list; here’s a short snippet:

Period               Duration (ms)
2008-01-02 22:00:00  624.828913
2008-01-02 22:30:00  415.648974
2008-01-02 23:00:00  2410.331818

This report shows clearly that around 11pm, we see a sharp decline in performance.  You might also want to add a count(*) column to the select to show how much system activity falls in each period.
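If the quantize step feels like magic, here is a minimal Python sketch of what it amounts to: round each timestamp down to the start of its 1800-second bucket, then average (and count) the durations per bucket.  The timestamp layout mirrors the 'dd MMM yy hh:mm:ss' format used in the command, the sample rows are invented, and the helper names are made up for this illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def quantize(ts, seconds=1800):
    """Round a timestamp down to the start of its bucket."""
    elapsed = int((ts - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=elapsed - elapsed % seconds)


def average_by_period(rows, seconds=1800):
    """rows: (time text, milliseconds) pairs -> sorted (period, average, count)."""
    buckets = defaultdict(list)
    for time_text, ms in rows:
        ts = datetime.strptime(time_text, "%d %b %y %H:%M:%S")
        buckets[quantize(ts, seconds)].append(ms)
    return [(period, sum(values) / len(values), len(values))
            for period, values in sorted(buckets.items())]


# Invented sample rows, for illustration only.
rows = [
    ("02 Jan 08 22:05:10", 600.0),
    ("02 Jan 08 22:20:00", 650.0),
    ("02 Jan 08 23:10:00", 2400.0),
]
for period, average, count in average_by_period(rows):
    print(period, average, count)
# prints:
# 2008-01-02 22:00:00 625.0 2
# 2008-01-02 23:00:00 2400.0 1
```

The count column makes it easier to tell a genuinely slow period from one where a single slow query skewed the average.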

Using the Microsoft Log Parser to parse Autonomy Logs

January 13th, 2008

Much has been written about the free Microsoft Log Parser, a simple command-line tool that can access and parse log files from a number of sources, execute SQL-like queries against that data, and present results.  Did I mention it’s free?

Autonomy services will drop log files everywhere (literally all over the place), in different formats, and the challenge is to merge all that log data into a single store in order to get a handle on the big picture.  For instance, a single query against the IDOL server shows up in several logs:

  1. the GetRequestLog
  2. content_index.log
  3. possibly the OGS query log (if you have securityinfo)
  4. possibly a DAH log (if you’re distributing/mirroring)

How can you aggregate all that information to get a single picture for performance analysis and forensics?  And what about aggregating in other trace information, like application trace logs and IIS logs?  Use the Log Parser.  The first example I’ll give here is a simple query against the GRL, which should provide a view of the current queries that the IDOL is servicing.  I am using the latest version of the Log Parser (2.2, from Jan 2005).  Download and install it, then either copy it to your %SYSTEM32% path, or simply add the install directory to your path, and run the following from a command line:

logparser.exe -i:XML -o:DATAGRID "select [autn:action], [autn:request], [autn:client], [autn:time], [autn:duration], [autn:status], [autn:thread] from http://server:port/?action=grl&format=xml"

That opens a pretty little window for you to scroll through.  You can modify the URL with “&tail=[somenumber]” to return a different count of rows (the default is 100).  There are a couple of parameters for the output type (DATAGRID); one is autoScroll, which is on by default.  It scrolls whenever new data shows up, but does not work with URLs, so you will have to re-run the command line to get an update.

Let’s look at a slightly more complicated query.  I’m working with a client on query performance, and we’re studying why certain queries take longer than others.  Most queries take under a second, but every once in a while, they take longer.  With a simple query, we can look at exactly the information we need:

select 
    mul(to_real(extract_prefix([autn:duration], 0, ' s')), 1000),
    [autn:request]
from
    http://server:port/?action=grl&format=xml
where
    [autn:duration] like '% s'

We limit this to rows with a duration in the format '1.62 s', then turn the value into milliseconds.  Removing the [autn:request] column from the select, and wrapping the mul() operation in an AVG(), gives you a handy figure: the average duration of the queries that take over 1 second.  Make sure to use a more meaningful depth by adding something like &tail=10000 to your URL.

I’ll look at more complicated queries next.