Alternative approaches for indexing database-like content


The following is an example of what a university customer may need to provide in order for us to build a course finder. It details the three most common approaches we have for getting access to the data. This example could be adapted for indexing any database-like content.

Notes: 

  • If geospatial search is required, each course and unit record should contain a single geo-coordinate specified as decimal latitude and longitude, eg. <geo>-27.385;42.101</geo>
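A customer preparing the data may want to check coordinates before handing them over. The following is a minimal sketch of such a check, assuming the semicolon-separated "lat;long" layout shown in the example above. The function name parse_geo is hypothetical and is not part of Funnelback.

```python
def parse_geo(value: str) -> tuple[float, float]:
    """Parse a 'lat;long' string in decimal degrees and range-check it."""
    # Assumption: latitude comes first, separated from longitude by ';'
    lat_text, lon_text = value.split(";")
    lat, lon = float(lat_text), float(lon_text)
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinate out of range: {value!r}")
    return lat, lon


print(parse_geo("-27.385;42.101"))
```

A value with a missing part, a non-numeric part, or an out-of-range part raises ValueError, so bad records can be reported before the data is indexed.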

Web crawl of course / unit pages

Funnelback crawls the course and unit pages using the web crawler, downloading and indexing the HTML page for each course and unit.

Funnelback would need a way of identifying the course pages, a list of the URLs of all the pages to crawl, or an index page that links to every course and unit page.

For an effective course finder, rich metadata would need to be added to the course pages for anything that should be treated as a 'field', eg. for presentation in the search results, or for use in faceted navigation and rich query completion.

eg.

Webpage Metadata
<meta name="course_name" content="Bachelor of Science" />
<meta name="faculty" content="Faculty of Science and Engineering" />
<meta name="course_code" content="SCI001" />
<meta name="campuses" content="Campus 1|Campus 2"/>
<meta name="study_areas" content="Physics|Chemistry"/>
<meta name="entrance_mark" content="73.5" />
<meta name="course_type" content="Undergraduate"/>
<meta name="course_schedule" content="Full time"/>

Any metadata fields that contain multiple values should be delimited with a vertical bar character.

eg. <meta name="course_schedule" content="Part time|Full time" /> for a course that is offered as both full time and part time.
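To illustrate how such metadata maps to fields, the sketch below reads the meta tags from a page and splits multi-valued content on the vertical bar. It uses only Python's standard html.parser module. The class and function names are hypothetical, and this is not how Funnelback itself processes pages; it is only a way to check a page's metadata by hand.

```python
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Collect <meta name="..." content="..."> tags as field -> list of values."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            # Multiple values are delimited with a vertical bar
            self.fields[name] = [v.strip() for v in content.split("|")]


def extract_fields(html):
    parser = MetaFieldParser()
    parser.feed(html)
    return parser.fields


page = """<html><head>
<meta name="course_code" content="SCI001" />
<meta name="course_schedule" content="Part time|Full time" />
</head><body></body></html>"""

fields = extract_fields(page)
print(fields)
```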

Export of the courses database

Provide Funnelback with an XML export of the course and unit records from the courses database.

There would need to be a single record for each course and unit containing all the fields that are relevant to the item.  

Data can be provided in a single XML file (quicker to download and update) or as individual XML files.

eg.

Courses & Units XML
<courses>
    <course>
        <course_name>Bachelor of Science</course_name>
        <faculty>Faculty of Science and Engineering</faculty>
        <course_code>SCI001</course_code>
        <campuses>
            <site>Campus 1</site>
            <site>Campus 2</site>
        </campuses>
        <study_areas>
            <area>Physics</area>
            <area>Chemistry</area>
        </study_areas>
        <entrance_mark>73.5</entrance_mark>
        <course_type>Undergraduate</course_type>
        <course_schedule>Full time</course_schedule>
        ...
    </course>
    <course>
        <course_name>Bachelor of Engineering</course_name>
        <faculty>Faculty of Science and Engineering</faculty>
        <course_code>ENG001</course_code>
        <campuses>
            <site>Campus 1</site>
            <site>Campus 2</site>
        </campuses>
        <study_areas>
            <area>Physics</area>
            <area>Chemistry</area>
        </study_areas>
        <entrance_mark>89.5</entrance_mark>
        <course_type>Undergraduate</course_type>
        <course_schedule>Full time</course_schedule>
        ...
    </course>
</courses>
<units>
    <unit>
        <unit_name>Physics 101</unit_name>
        <faculty>Faculty of Science and Engineering</faculty>
        <unit_code>SCI001</unit_code>
        <campuses>
            <site>Campus 1</site>
            <site>Campus 2</site>
        </campuses>
        <degrees>
            <degree>Bachelor of Science</degree>
            <degree>Bachelor of Science, Bachelor of Engineering</degree>
            <degree>Bachelor of Engineering</degree>
        </degrees>
        <study_areas>
            <area>Physics</area>
        </study_areas>
        <points>40</points>
        <course_type>Undergraduate</course_type>
        ...
    </unit>
    <unit>
        <unit_name>Physics 101</unit_name>
        <faculty>Faculty of Science and Engineering</faculty>
        <unit_code>SCI001</unit_code>
        <campuses>
            <site>Campus 1</site>
            <site>Campus 2</site>
        </campuses>
        <degrees>
            <degree>Bachelor of Science</degree>
            <degree>Bachelor of Science, Bachelor of Engineering</degree>
            <degree>Bachelor of Engineering</degree>
        </degrees>
        <study_areas>
            <area>Physics</area>
        </study_areas>
        <points>40</points>
        <course_type>Undergraduate</course_type>
        ...
    </unit>
</units>

This is generally our preferred option, as it offers rapid updating and provides maximum flexibility for faceted navigation and query completion.

The XML file(s) would need to be published at a web-accessible location from which Funnelback can download them (eg. by executing a curl command to retrieve the file).
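Before publishing an export, it can be worth confirming that the file is well-formed and that each record carries the expected fields. The sketch below shows one way to do that with Python's standard xml.etree.ElementTree module, using a shortened copy of the course example above. The function name flatten_records is hypothetical, and the check is only a suggestion, not part of Funnelback.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<courses>
    <course>
        <course_name>Bachelor of Science</course_name>
        <course_code>SCI001</course_code>
        <campuses>
            <site>Campus 1</site>
            <site>Campus 2</site>
        </campuses>
        <entrance_mark>73.5</entrance_mark>
    </course>
</courses>"""


def flatten_records(xml_text, record_tag):
    """Return one dict per record; nested elements become lists of values."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    records = []
    for record in root.iter(record_tag):
        fields = {}
        for child in record:
            if len(child):  # element has children, eg. <campuses>
                fields[child.tag] = [
                    item.text.strip() for item in child if item.text
                ]
            else:
                fields[child.tag] = (child.text or "").strip()
        records.append(fields)
    return records


records = flatten_records(SAMPLE, "course")
print(records)
```

A mismatched tag causes ET.fromstring to raise xml.etree.ElementTree.ParseError, which makes this kind of mistake easy to catch before the file is handed over.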

Direct connection to the database

Funnelback connects to the DBMS via a compatible JDBC driver and executes an SQL query.

The table returned is then indexed by Funnelback, with each row in the table corresponding to an item within the Funnelback index. The returned table should be flattened (denormalised) so that each row contains every field for its record, as Funnelback can't perform table joins once the data has been indexed. This is often facilitated by creating a view within the database that contains the flattened data, with Funnelback accessing this using a query such as SELECT * FROM table_view.

Any database fields that contain multiple values should be delimited with a vertical bar character.
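As a rough illustration of such a view, the sketch below builds a tiny in-memory SQLite database and defines a view that joins two tables into one row per course, with campuses joined by a vertical bar. The table, column and view names are hypothetical. GROUP_CONCAT is SQLite's aggregate; other systems offer equivalents such as STRING_AGG (PostgreSQL, SQL Server) or LISTAGG (Oracle), so the view definition would need adapting to the customer's DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course (course_code TEXT PRIMARY KEY, course_name TEXT);
CREATE TABLE course_campus (course_code TEXT, campus TEXT);

INSERT INTO course VALUES ('SCI001', 'Bachelor of Science');
INSERT INTO course_campus VALUES ('SCI001', 'Campus 1'), ('SCI001', 'Campus 2');

-- One row per course; the join happens in the database, not in the index
CREATE VIEW course_view AS
SELECT c.course_code,
       c.course_name,
       GROUP_CONCAT(cc.campus, '|') AS campuses
FROM course c
LEFT JOIN course_campus cc ON cc.course_code = c.course_code
GROUP BY c.course_code, c.course_name;
""")

rows = conn.execute("SELECT * FROM course_view").fetchall()
print(rows)
```

SQLite does not guarantee the order of values inside GROUP_CONCAT, so the campuses may appear in either order.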

This would require the university to provide access for the Funnelback crawl server to connect to the DBMS.
