londoh VIP
Total posts: 137
27 Oct 2013 15:46

OK, following on from the other search thread: attached is an alpha-grade but working package for Cobalt + ElasticSearch as a proof of concept.

Some notes:

It's based on the JES package.

Most importantly as that page says:

You absolutely need to have an ElasticSearch server if you want to use this extension

The original JES package won't work on J3, so I've updated the code.

I needed to make a couple of plugins to get Cobalt to work with JES; they are included in the attached package.

I've run the package on a blank J3 on my dev server running PHP 5.4, and it installs OK.

I haven't tried J2.5. If you want to try it there, maybe use the original JES package and extract my Cobalt plugins from this one.

After install there are a lot of plugins to enable (under system, elasticsearch and mint),

then set your ElasticSearch config options in admin->elasticsearch,

and publish the search module somewhere.

(IIRC there is a classname clash with some other (Joomla) search module. I forget which, but if the JES search module causes a fatal error, turn the other one off.)

There might also have been some issues with SEF URLs(?), but for now it works with SEF turned off.

It will index all Cobalt records when run from admin, and will also index new and updated Cobalt records.

At the moment the Cobalt plugin is simply indexing whatever is in #__js_res_record.fieldsdata, plus the title.

Of course this means Cobalt is still pushing data into fieldsdata, which defeats the purpose, but anyway it's a basic trial.
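For reference, a minimal sketch of what that indexing amounts to, assuming Elastica's client API; the index/type names, host settings and stub record are illustrative, not the plugin's actual values:

    <?php
    // Minimal sketch: push one Cobalt record into ES via Elastica.
    require 'vendor/autoload.php';

    $client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));
    $type   = $client->getIndex('cobalt_en')->getType('record');

    // Stand-in for a row fetched from #__js_res_record.
    $record = (object) array(
        'id'         => 42,
        'title'      => 'Red widget',
        'fieldsdata' => 'red widget some field text ...',
    );

    // Title plus the raw fieldsdata blob -- nothing parsed per field yet.
    $type->addDocument(new \Elastica\Document($record->id, array(
        'title'      => $record->title,
        'fieldsdata' => $record->fieldsdata,
    )));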

In my test setup with around 14000 entries it takes about 5 minutes to index all records. That's longer than a Cobalt re-index, but I've used Cobalt's ItemsStore to fetch record data, which I guess is not the most efficient method.

Searches certainly seem to be impressively fast, although I haven't done any hard numbers.

Is it worth it?

My Cobalt project has a lot of data across a lot of sites, and unscientifically this seems much faster than standard Cobalt/Joomla search. I haven't looked at the geospatial stuff yet, but apparently it's also fast, and that's a big consideration.

It would need some more work to get it up to standard, but I think it's probably worth it. If you can test it or have an opinion, I'll be grateful for feedback.

[It shouldn't explode your install, but as always back up first! Unlike Sergey I'm useless at support, so if the sky falls in you'll be on your own!]

Last modified: 02 March 2014

Tags: Cobalt 9, Performance


Sackgesicht VIP
Total posts: 1,636
01 Nov 2013 08:38

Is it possible you could make a list of Cobalt fields and what they should map to in ES (if possible, and/or easiest first, etc.)?

I'm now working on some changes to incorporate that.

If we want everything there, we have to get all possible fields for the section. We get this from the ItemsStore, plus we have to get all possible fields out of the SECTION. There can be several TYPES associated with a SECTION.

The first step is to determine the TYPES of the SECTION.

We have to tell the script which section to process, so I used the options setting of the indexer

and manually inserted the id in the WHERE clause... The published = 1 condition should be removed; it can be queried later.

Since we know the section_id, we can get the associated TYPES from the params column, then all the fields of those types.

Most probably the ItemsStore will be helpful here again.

All those fields are the ones from the js_res_record fields column and should be part of the mapping.

Additionally, all fields from $record = ItemsStore::getRecord($record);
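To illustrate, a rough sketch of how that lookup might go directly against the database; the table names (#__js_res_sections, #__js_res_fields) and the params JSON layout are assumptions pieced together from this thread, not verified Cobalt schema:

    <?php
    // Hypothetical: resolve a section's TYPES, then the fields of those
    // types, so the union of fields can drive the ES mapping.
    $db        = JFactory::getDbo();
    $sectionId = 1; // the section we tell the indexer to process

    // Assumed: the section's params JSON lists the attached type ids.
    $db->setQuery('SELECT params FROM #__js_res_sections WHERE id = ' . (int) $sectionId);
    $params  = json_decode($db->loadResult(), true);
    $typeIds = isset($params['types']) ? array_map('intval', (array) $params['types']) : array();

    if ($typeIds) {
        // Assumed: every field row carries its type_id.
        $db->setQuery(
            'SELECT id, label, type FROM #__js_res_fields'
            . ' WHERE type_id IN (' . implode(',', $typeIds) . ')'
        );
        $fields = $db->loadObjectList(); // the union of fields to map in ES
    }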

It seems Elastica already does a sort of bulk index with

$this->pushDocument($document);

$this->flushDocuments();

I observed that the first 2k records go in very quickly, then it slows down... Maybe we should do a flush after every 500-1000 records.


Sackgesicht VIP
Total posts: 1,636
01 Nov 2013 08:39

I assumed storing the category name will allow searching on that name, and so a search for 'red widgets' will return all records in the 'red widget' category.

Then let's try to get the original JSON from the categories column... not decoding it first, just adding it...


Sackgesicht VIP
Total posts: 1,636
01 Nov 2013 08:47

Within that ES type under cobalt_en (or cobalt_fr, for example) the record id is unique and carries with it the type_id, section_id and categories, but is it unique enough?

The type should be the type_id... For language, there is a separate field, langs.


Sackgesicht VIP
Total posts: 1,636
01 Nov 2013 08:54

Additionally, all fields from $record = ItemsStore::getRecord($record);

I will try to make a list of those fields manually and apply some settings to them...


londoh VIP
Total posts: 137
01 Nov 2013 10:57

It seems Elastica already does a sort of bulk index

Yes. If you look a bit further through the extension code, it calls flushDocs() based on the PHP memory_limit.

It remains to be seen how well that works in practice.


Then let's try to get the original JSON from the categories column... not decoding it first, just adding it...

It's done. ES supports arrays. I also indexed the JSON from fields, but I'm not sure it's working as I expected.

I can get the existing field config data into the plugin no problem.

Yes, via ItemsStore is OK for now.

Later it's maybe more efficient to go directly to the DB.

But for now the mapping between Cobalt fields and ES types is the time (and brain!) consuming exercise for me.

I have some detailed ES info which you might find helpful, but I can't post it all up here. Is there a PM system on this forum? Or some way of taking this private? You'll find my forum nick on Twitter.


Sackgesicht VIP
Total posts: 1,636
01 Nov 2013 20:38

But for now the mapping between Cobalt fields and ES types is the time (and brain!) consuming exercise for me.

I wanted to do this also. We had an internet outage/maintenance upgrade last night and I could not access my main resource (the ElasticSearch guide online)... but the connection speed is now upgraded. :D

I can try it now.


I have some detailed ES info which you might find helpful, but I can't post it all up here. Is there a PM system on this forum? Or some way of taking this private? You'll find my forum nick on Twitter.

I don't use Twitter; maybe I have to create an account... Otherwise I have a yahoo.com email address where the name is sack.gesicht...


If you look a bit further through the extension code, it calls flushDocs() based on the PHP memory_limit.

It remains to be seen how well that works in practice.

That might not work well... I had to raise the memory limit to avoid an error; maybe it is too high now to allow proper indexing. Based on some comments, people flush even every ~100 records. I believe it will depend on the content (big HTML/text content), but it would be good to have manual control over it, to better understand the limits and fine-tune it. In a later version it could even be used for automatic adjustments.

At a later stage, additional info like the author's name might be included too. The same goes for some fields like parent/child.


londoh VIP
Total posts: 137
01 Nov 2013 20:43

Just a quick update:

I spent a while trying to get the record->fields data indexed as an array.

While everything suggests it should work, I can't get it to.

It's OK if the JSON is presented as a string, but not as an array or object.

Screenshot of the error attached.

This SO thread is similar,

as are other similar MapperParsingException errors on SO.

I don't know that it matters if the plan is just to follow the Cobalt schema.

Except that being able to index fields as an array or object, and having it indexed and searchable as such, would be more efficient than parsing out individual field data for ES input.
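For what it's worth, a sketch of the two cases with Elastica; the index/type names and the sample JSON are invented for illustration:

    <?php
    require 'vendor/autoload.php';

    $client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));
    $type   = $client->getIndex('cobalt_test')->getType('record');

    $fieldsJson = '{"4":"red","7":["2013-11-01","2013-11-02"]}';

    // Case 1: the raw JSON as a plain string -- ES indexes one text field; works.
    $type->addDocument(new \Elastica\Document(1, array('fields' => $fieldsJson)));

    // Case 2: the decoded structure -- ES maps each key the first time it
    // sees it, and a later document whose value has a different shape
    // (e.g. an object where a string was first seen) raises a
    // MapperParsingException.
    $type->addDocument(new \Elastica\Document(2, array('fields' => json_decode($fieldsJson, true))));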


Sackgesicht VIP
Total posts: 1,636
01 Nov 2013 20:50

This SO thread is similar

The problem in that topic should not apply, since we don't change the mapping per field (if I understood the topic correctly).

Our field mapping should be consistent.

In that topic, the guy uses the field "data" with different types of content. Our fields (field name = field_id) should not change, right?


Sackgesicht VIP
Total posts: 1,636
01 Nov 2013 21:16

I took the fields column and manually indexed it into a new index (test)... I made a screenshot of how it looks there. I believe that is what we should get as the final result, plus all the other fields within the $record from ItemsStore.

Note that the date field will be an array, since it might hold a date range.
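On the mapping side, a hedged Elastica sketch; the index/type names, field id '7' and date format are illustrative. ES lets any mapped field hold an array of values of its type, so a date range can simply arrive as an array of two dates with no special "array" mapping:

    <?php
    require 'vendor/autoload.php';

    $client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));
    $type   = $client->getIndex('cobalt_test')->getType('record');

    // Explicit mapping: the '7' date field accepts a single date or an
    // array of dates (the range case).
    $mapping = new \Elastica\Type\Mapping($type, array(
        'title' => array('type' => 'string'),
        '7'     => array('type' => 'date', 'format' => 'yyyy-MM-dd'),
    ));
    $mapping->send();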


Sackgesicht VIP
Total posts: 1,636
02 Nov 2013 00:07

Yes. If you look a bit further through the extension code, it calls flushDocs() based on the PHP memory_limit.

It remains to be seen how well that works in practice.

As far as I understand the code, it will get all documents per query (I just limited it to a section id and a number of docs) and store them all before it flushes them, so a memory limit does not apply...


londoh VIP
Total posts: 137
02 Nov 2013 01:25

There should be a loop... like push only 100 docs, then flush, push the next 100 docs, etc.

My elasticsearch/cobalt plugin class plgElasticsearchCobalt extends ElasticSearchIndexerAdapter,

and so the method $this->pushDocuments is in admin/comp/com_elas*/helpers/adapter.php.

That method stacks up the documents before they are 'flushed' to ES, and it checks when to do that like this:

    $mem_limit_bytes = trim(ini_get('memory_limit')) * 1024 * 1024;
    // note: this treats e.g. '128M' as the number 128, so it assumes a
    // megabyte-suffixed memory_limit

    if (memory_get_usage() > $mem_limit_bytes * 0.20) { // check memory use
        // if the documents array is too big we flush it
        $this->flushDocuments();
    }

although there isn't any reason not to also count the loop and flushDocuments() every nnn records, something like:
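A rough sketch; the 500-record batch size is a guess to be tuned per the numbers above, and prepareDocument() is a hypothetical helper standing in for however the adapter builds each document (pushDocument()/flushDocuments() are the adapter methods discussed in this thread):

    $batchSize = 500; // hypothetical; the thread suggests ~100-1000
    $pushed    = 0;

    foreach ($records as $record) {
        $document = $this->prepareDocument($record); // hypothetical helper
        $this->pushDocument($document);

        // flush a fixed-size batch in addition to the memory check
        if (++$pushed % $batchSize === 0) {
            $this->flushDocuments();
        }
    }
    $this->flushDocuments(); // send whatever remains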


londoh VIP
Total posts: 137
02 Nov 2013 01:33

I took the fields column and manually indexed it into a new index (test)... I made a screenshot of how it looks there. I believe that is what we should get as the final result,

Yep, I have the plugin doing that insert; it's not a problem. But according to my understanding (and I could easily be wrong) that is being indexed by ES as a string, not an array, and therefore without the benefit of being an array.

If you do json_decode($fields, true) it will convert it to an array, but ES barfs.

I am thinking it has to be converted to an 'ES-friendly' array and not just a PHP array, maybe something like:
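Purely speculative: normalise the decoded fields into one consistent shape (string keys, values always flat arrays of strings) so the per-field mapping never changes between records. The field ids and values are made up, and nested structures would need more thought:

    <?php
    // Stand-in for an ItemsStore record; assumes field values are
    // scalars or flat lists.
    $record = (object) array(
        'id'     => 42,
        'title'  => 'Red widget',
        'fields' => '{"4":"red","7":["2013-11-01","2013-11-02"]}',
    );

    $fields   = json_decode($record->fields, true);
    $esFields = array();
    foreach ($fields as $fieldId => $value) {
        // Always index an array of strings under "field_<id>" so a value
        // that is sometimes scalar, sometimes a list, maps consistently.
        $esFields['field_' . $fieldId] = array_map('strval', (array) $value);
    }

    $document = new \Elastica\Document($record->id, array(
        'title'  => $record->title,
        'fields' => $esFields,
    ));
    // ...then push via the adapter as in the snippets above.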


londoh VIP
Total posts: 137
02 Nov 2013 01:55

The problem in that topic should not apply, since we don't change the mapping per field (if I understood the topic correctly).

Our field mapping should be consistent

Perhaps the thread was relevant to the error message and ES data in arrays.

But generally I agree with your statement there.

In my understanding:

Field mapping is always constant per TYPE of Cobalt field: text, geo, digit, whatever.

But the mapping will change per record, and is directly related to its type.

Unless record->fields data is indexed into ES as an array or object, thereby avoiding the need to switch mapping per field. And this is what I had been trying to do but can't get to work.


londoh VIP
Total posts: 137
02 Nov 2013 02:22

I emailed you an updated plugin which will fail when indexing.

More notes in the mail.

Powered by Cobalt