londoh VIP
Total posts: 137
27 Oct 2013 15:46

OK, following on from the other search thread, attached is an alpha-grade but working package for Cobalt + Elasticsearch as a proof of concept.

Some notes:

It's based on the JES package.

Most importantly, as that page says:

You absolutely need to have an ElasticSearch server if you want to use this extension

The original JES package won't work on J3, so I've updated the code.

I needed to make a couple of plugins to get Cobalt to work with JES; they are included in the attached package.

I've run the package on a blank J3 install on my dev server running PHP 5.4, and it installs OK.

I haven't tried it with J2.5. If you want to try that, maybe use the original JES package and extract my Cobalt plugins from this package.

After install there are a lot of plugins to enable (under the system, elasticsearch and mint groups),

and you need to set your Elasticsearch config options in admin->elasticsearch,

and publish the search module somewhere.

(IIRC there is a classname clash with some other (Joomla) search module. I forget which, but if the JES search module causes a fatal error, turn the other one off.)

There might also be some issues with SEF URLs, but it works for now with SEF turned off.

It will index all Cobalt records when run from the admin, and it will index new and updated Cobalt records.

At the moment the Cobalt plugin simply indexes whatever is in #__js_res_record.fieldsdata, plus the title.
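
Roughly, this is what the indexing step boils down to (a sketch, not the exact plugin code - the autoloader path, index name and type name are assumptions):

```php
<?php
// Sketch: index the raw fieldsdata blob plus the title for one record,
// using the Elastica client bundled with the JES package.
require_once 'Elastica/autoload.php'; // adjust to the package's loader

$client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));
$type   = $client->getIndex('cobalt')->getType('cobalt_en');

// $record comes from #__js_res_record (fetched via Cobalt's ItemsStore here)
$doc = new \Elastica\Document($record->id, array(
    'title'      => $record->title,
    'fieldsdata' => $record->fieldsdata, // the raw blob, as-is for now
));
$type->addDocument($doc);
```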

Of course this means Cobalt is still pushing data into fieldsdata, which defeats the purpose, but anyway it's a basic trial.

In my test setup with around 14,000 entries it takes about 5 minutes to index all records. That's longer than a Cobalt re-index, but I've used Cobalt's ItemsStore to fetch record data, which I guess is not the most efficient method.

Searches certainly seem impressively fast, although I haven't gathered any hard numbers.

Is it worth it?

My Cobalt project has a lot of data across a lot of sites, and unscientifically this seems much faster than standard Cobalt/Joomla search. I haven't looked at the geospatial stuff yet, but apparently that's fast too, and it's a big consideration.

It would need some more work to get it up to standard, but I think it's probably worth it. If you can test it, or have an opinion, I'd be grateful for feedback.

[It shouldn't explode your install - but as always, back up first! Unlike Sergey I'm useless at support, so if the sky falls in you'll be on your own!]

Last Modified: 02 Mar 2014



londoh VIP
Total posts: 137
29 Oct 2013 06:28

So we agree that at least you and I would like to be able to use ES with Cobalt data.

But...

For most people on shared hosting this may not be an option at all (unless they go cloud = paid solution).

For people with modest sites, it's likely there is no need for ES at all.

And that limits the potential users quite a bit.

Anyway, if Sergey is prepared to put the options for integration directly into Cobalt, there's no problem.

And if not - integration has to go into a separate component.

It needs a lot more work to make that component a generic, all-singing all-dancing widget than it does to hack up code dedicated to one site.

Well, at least I have an idea how I'd do that back end, and it's quite a bit of work. For sure I don't have time to code it right now.

The way I did it over the weekend, the ES-Cobalt mapping is hard-coded into the plugin - a crude solution. But once the Cobalt Type is set up for a site it's pretty much fixed for life anyway, so for individual sites it's not that difficult or unrealistic to keep hard-coding the mapping.

It could be defined within the plugin code, done per type with a simple switch statement (see the sketch below).

Or perhaps the mapping could be defined in a file (XML maybe, or a PHP include) and loaded in.

Or maybe, if it were done globally per site, relying only on Cobalt field types, it could be done very easily and neatly within the plugin XML file, and so be editable from the plugin admin screen.
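
For illustration, a minimal sketch of the switch-statement variant (the type IDs and field names are invented):

```php
<?php
// Sketch: hard-coded ES mapping per Cobalt Type.
function getEsMappingForType($typeId)
{
    switch ($typeId) {
        case 1: // e.g. a "products" Type
            return array(
                'title' => array('type' => 'string'),
                'price' => array('type' => 'float'),
                'ctime' => array('type' => 'date'),
            );
        case 2: // e.g. a "places" Type
            return array(
                'title'    => array('type' => 'string'),
                'location' => array('type' => 'geo_point'),
            );
        default:
            return array(); // fall back to ES automatic mapping
    }
}
```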

[plugin being my plugin at plugins/elasticsearch/cobalt]

Of course, if it is hard-coded and something changes in the Cobalt Type, the mapping definition in the file has to change too. Editor->save->upload = a couple of minutes at most.

My point here is that it doesn't need an elaborate, bells-n-whistles back end for the thing to work perfectly well.

Here's a list of what I think is needed to get to a minimum workable system, probably in order of importance and, given I know very little about ES, also difficulty:

1) understanding the hows and whys of saving the data into ES and getting it out

2) understanding how that mapping between ES and Cobalt works out

3) the query (search) module needs work to accept input query params (a sketch follows after this list).

4) the plugin display template needs work to know what to do with the returned data.
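
For point 3, a rough sketch of what the module could do with a user's query string and a couple of params (index/type names and the 'q' input name are assumptions; Elastica as the client):

```php
<?php
// Sketch: turn the module's input into an ES query_string search.
$userInput = JFactory::getApplication()->input->getString('q', '');

$client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));
$query  = new \Elastica\Query(new \Elastica\Query\QueryString($userInput));
$query->setSize(20); // paging params would come from the module settings
$query->setFrom(0);

$resultSet = $client->getIndex('cobalt')->getType('cobalt_en')->search($query);
foreach ($resultSet->getResults() as $result) {
    $data = $result->getData(); // hand this to the display template (point 4)
}
```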

Sorry for the long post; I hope it makes sense.


Sackgesicht VIP
Total posts: 1,636
29 Oct 2013 21:11

For most people on shared hosting this may not be an option at all (unless they go cloud = paid solution).

For people with modest sites, it's likely there is no need for ES at all.

And that limits the potential users quite a bit.

Correct.

But if we look at Cobalt's philosophy, it is, as stated several times, to provide the best possible performance to its users, and beyond that to enhance functionality for additional benefits without sacrificing existing features and performance.

This is precisely what will happen with this "search engine" enhancement. It will allow Cobalt to reach a dimension which it can never reach on its own.

Cobalt never introduced features only because "everyone" would benefit from them; it introduced features to widen the user base.

I only use a very limited feature set and have not even discovered/used all the possibilities. So the argument that this would only be for a minority of users is not really valid.

Obviously, based on its architectural structure, Cobalt has problems with bigger datasets under some conditions (depending on the definition of "bigger" - some would even classify them as small). Even more so the moment you start searching (depending on the search mode) and apply filters like a date range or a geographical radius search. With certain settings, the response time of a page can go up to 10 seconds and even higher (and this is just in a personal testing environment). How much worse under a live scenario with several users querying and accessing it simultaneously?

A good example is the topic here

Even though the SQL time for displaying was brought down to 230 ms, they still need some settings which will slow it down again. But display is not the only thing: querying/searching will be the crucial part here. If every store has a geolocation, I highly doubt the solution can provide acceptable performance for the targeted users. The Cobalt structure (the way it stores the coordinates) is not really ready for this. The same applies to date range filters.

All of those are cases where a search engine "shines".

The way it is now, Cobalt makes it hard even for MySQL to perform well. Dates, numbers and geo locations are all stored as strings, together in an index field which can also hold up to 333 characters from other field types at the same time.

If filtering/searching can be OPTIONALLY "outsourced" to a search engine, Cobalt's performance problems could easily be solved without even touching its existing structure.

Is it worth it?

Based on Cobalt's philosophy - definitely yes.

I would even say it is the biggest feature enhancement ever.

But then again, it is not up to us to decide (except for an independent 3rd party solution).

3rd-party support has been, and still is, an area where Cobalt has some deficiencies, despite a lot of documentation.

If MintJoomla has other priorities, plans or reasons, there is nothing we can do (at least for the core implementation).


londoh VIP
Total posts: 137
30 Oct 2013 00:24

+1 for all of that.

You make forceful and accurate arguments for why Cobalt should include options for ES.

I can only agree with you.

I followed the performance thread at the time, and there was good work finding solutions to get times down.

But as you say, there are in-built limitations to the current schema.

I need geo to be much more usable.

So, if only for that reason, I will come back to this, but for the next week or two my time is limited.

In the meantime I hope there will be some input from the Cobalt devs.


Sergey
Total posts: 13,748
30 Oct 2013 01:05

I read this thread but I cannot completely understand it. Just tell me what to add to Cobalt to prepare it for ES integration.


Sackgesicht VIP
Total posts: 1,636
30 Oct 2013 01:16

I will try to make a document identifying the areas and proposing a core integration. I would prefer something integrated rather than an external solution; both approaches have advantages/disadvantages.

Will do a draft later today and post it here ...

Thanks for answering on this thread. :D


Sackgesicht VIP
Total posts: 1,636
30 Oct 2013 02:41

For basic understanding --

Elasticsearch/Solr will NOT be a replacement; it will be an ADDITIONAL option to store and query data. It will not affect any existing Cobalt features.

If switched off, Cobalt will immediately continue to work as usual.

All data will still be accessible and stored as it is now. The MySQL tables will be the source when indices are created from existing data.

Implementation should start with very basic support as an experimental setup, which will be refined along the way as we gather more intimate knowledge of, and experience with, the technology.

Later versions shall get specific enhancements like the percolator, and shall support other Cobalt areas like the audit trail etc.

Cobalt shall support both ES and Solr to maximize the reach, while the priority/focus will be on ES.

Where to start?

1) Search engine "configuration".

Cobalt needs to know where the search engine is.

Proposal: 2 new fields under Cobalt Configuration

ElasticSearch REST end-point URL: http://localhost:9200

Solr REST end-point URL: http://localhost:8983/solr/

A URL check would be nice, so that the configuration is only saved if the value is correct.
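
A minimal sketch of such a check, assuming Joomla's JHttp is used and that ES answers GET / with HTTP 200:

```php
<?php
// Sketch: only accept the configured endpoint if it answers with HTTP 200.
$url = 'http://localhost:9200'; // the value entered in the configuration

try {
    $response = JHttpFactory::getHttp()->get($url);
    $ok = ($response->code == 200);
} catch (Exception $e) {
    $ok = false; // unreachable: refuse to save / grey out the option
}
```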

That's all we need in the configuration for a start.

2) Activate / Deactivate the search engine.

The most logical place for Cobalt would be the SECTION parameter under "Search Mode".

Cobalt will add 2 new options (Elasticsearch and Solr) to the drop-down when a valid, confirmed URL has been entered in the configuration and the endpoint returns an OK, status 200.

If the parameter is set to a search engine but the engine is not reachable, Cobalt will fall back to full-text search and grey out the search-engine selections (to reduce support questions from people accidentally applying it without an existing search engine).

will be continued....


londoh VIP
Total posts: 137
30 Oct 2013 03:45

Just tell me what to add to Cobalt to prepare it for ES integration.

:D

Just tell me what to add to Cobalt to prepare it for ES integration.

I will try to make a document identifying the areas and proposing a core integration

:D


Sackgesicht VIP
Total posts: 1,636
30 Oct 2013 04:25

For testing purposes, I would suggest installing elasticsearch on your computer.

Use Homebrew:

Make sure you are up-to-date:

brew update

brew upgrade

brew install elasticsearch

Follow the instructions and check it is running by visiting http://localhost:9200

Install some plugins like HQ

http://www.elastichq.org/support_plugin.html

http://mobz.github.io/elasticsearch-head/

https://github.com/andrewvc/elastic-hammer

Most important for a start will be an index/export tool. There are several ways to feed data into the search engine. To begin with, I would suggest an ordinary index tool similar to what ETH000 already started, so that we can also experiment with the data. The index tool will be needed later for creating new indices, updating indices, re-indexing etc. Along the way we will have to increase its functionality and "intelligence"...

Everything we need to know about indices is here

3) Index tool

Creating search-engine indices based on Sections with their respective Types and Fields:

How do we organize data in the search engine?

We should follow the Cobalt methodology:

Section/Type/Record_id, translated into the search-engine structure Index/Type/ID.

With this approach we get a structure where all fields of all types, plus the #__js_res_record core data, are in one document.

For the initial version we can use the automatic field mapping process.

The next step would be a semi-automatic approach where we define the mapping for core fields like ctime, extime, mtime etc., and then, in the final phase of the indexer, mapping is applied to the Cobalt fields either automatically (based on field type/field settings) or via manual parameters within the field definition.
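
A sketch of how that Section/Type/Record_id -> Index/Type/ID translation could look with Elastica ($section and $record are illustrative, as are the naming conventions):

```php
<?php
// Sketch: Cobalt Section -> ES index, Cobalt Type -> ES type,
// record id -> ES document id.
$client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));

$index = $client->getIndex('section_' . $section->id);
$index->create(array(), true); // (re)create; real code would check first

$esType = $index->getType('type_' . $record->type_id);
$esType->addDocument(new \Elastica\Document($record->id, array(
    'title' => $record->title,
    'ctime' => $record->ctime, // core fields from #__js_res_record
    // ... one ES field per Cobalt field in the final version
)));
```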

will continue later on the initial Indexer ....


Sergey
Total posts: 13,748
31 Oct 2013 01:35

So, ES is not a Joomla extension?


Sackgesicht VIP
Total posts: 1,636
31 Oct 2013 01:46

No, elasticsearch is a search engine - something like a NoSQL database.

There is a Joomla plugin - ETH000 already used it for some initial tests.

Check the PHP client mentioned above.

Elasticsearch and Solr have quite some support in Drupal, but not yet so much in Joomla. My initial tests are really convincing.

An initial date range search with my setup under Cobalt takes 1800 ms, then 900 ms once it gets cached. The same query under elasticsearch returns in 3 ms (without optimization).

The results always return the number of hits, so there is no need for additional counting.
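
For reference, this is the kind of query behind those numbers - a filtered date range search (the field name ctime and the index/type names are assumptions):

```php
<?php
// Sketch: date range filter; the hit count comes back with every response.
$client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));

$filter = new \Elastica\Filter\Range('ctime', array(
    'from' => '2013-01-01',
    'to'   => '2013-10-31',
));
$query = new \Elastica\Query(new \Elastica\Query\Filtered(
    new \Elastica\Query\MatchAll(), $filter
));

$resultSet = $client->getIndex('cobalt')->getType('cobalt_en')->search($query);
$total = $resultSet->getTotalHits(); // no separate COUNT query needed
```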


Sackgesicht VIP
Total posts: 1,636
31 Oct 2013 21:38

For most people on shared hosting this may not be an option at all (unless they go cloud = paid solution)

There are even small free starter packages available.

Searchly - 2 indices, 5MB

Openshift

Q: What can you store in 5MB?

A: I added 2 indices through ETH000's plugin indexer: ~20k records --> 5MB.

To adapt to those small or free test accounts, Cobalt should apply a flexible index structure.

In general there might be 3 scenarios:

1) Using 1 index for all sections

2) Using 1 index per section

3) Flexible - decide what index to use per section

This would leave us with a configuration parameter per section. There might be an additional parameter under configuration - like the 3 options (all-in-one, one-on-one, manual) - to make it even easier for starters.
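
A sketch of how the index name could be resolved per section under those three options (the option names and the per-section parameter are assumptions):

```php
<?php
// Sketch: resolve the ES index name for a section from the global mode.
function getIndexNameForSection($section, $mode)
{
    switch ($mode) {
        case 'all-in-one': // scenario 1: one index for all sections
            return 'cobalt';
        case 'one-on-one': // scenario 2: one index per section
            return 'cobalt_section_' . $section->id;
        case 'manual':     // scenario 3: a per-section parameter decides
            return $section->params->get('es_index', 'cobalt');
    }
    return 'cobalt';
}
```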

After initial tests and a comparison between elasticsearch and Solr, I would now suggest fully concentrating on elasticsearch first and (if ever) adding Solr support later.

btw, a nice read about a use case is here

Over the last year we have repeatedly come to a point where Cobalt needed optimization to perform well under various scenarios. This is reflected in several blog posts over time. There is still a lot to do, especially for the filters.

Now, elasticsearch can address all the performance problems we are facing at the moment:

Category counting (smart and fast count)

Sort on field content

Search performance (especially for field contents)

Date range filter

Numeric filter

Filter result counting

Full text search

Full text indexing

Geo range filter

and future use case scenarios like

Real time analytics (log files etc)

Just an example:

A date range search over a section with 58K articles takes between 1800 and 3500 ms (incl. 21 category counts); the same result was delivered by elasticsearch in 3 ms.
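
Those per-category counts can come back in the same round trip through a terms facet (the 2013-era predecessor of aggregations); the field name category_ids is an assumption:

```php
<?php
// Sketch: one query returns the matching records AND the per-category counts.
$client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));

$facet = new \Elastica\Facet\Terms('categories');
$facet->setField('category_ids');
$facet->setSize(21);

$query = new \Elastica\Query(new \Elastica\Query\MatchAll());
$query->addFacet($facet);

$resultSet = $client->getIndex('cobalt')->search($query);
$counts = $resultSet->getFacets(); // per-category counts, no extra queries
```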

Most of those problems stem from the Cobalt structure. While it doesn't matter much with small datasets (a couple of hundred records), it already creates unnecessary load with several thousand records... how much more with 10k and above, not even thinking of bigger numbers.

The average Cobalt user does not need to know anything about databases at all; Cobalt takes care of it. How beautiful would it be if Cobalt could open the door to the next level of search and data analytics with this integration?

But it would need passion and dedication from the programmers to do it in a magical way, creating a product that allows the average Joomla user to access and unfold the power of a search engine without even needing to understand it.

Cobalt has done this before; now it should evolve to the next level.


Sackgesicht VIP
Total posts: 1,636
31 Oct 2013 22:48

See Andrew Eddie's comments about the app store/JED development regarding the usage of a search engine here


londoh VIP
Total posts: 137
31 Oct 2013 23:52

Re: the app store stuff: yeah, I subscribe to the J dev lists and read that stuff at the time.

Sadly it all appeared to get quite divisive, and Joomla didn't make the leap to escape velocity via super-fast search.

You make a lot more good points up there btw.

Free hosted options would mean the benefits are available to all.

I spent a bit of time last night getting the Wijiti jSolr package working as well - just to see.

there's a very relevant quote from that envato post you tagged:

"annoying XML schema files" - and they are just that!

whereas, to quote that guy again:

"It's super easy to interact with Elasticsearch"


Sackgesicht VIP
Total posts: 1,636
01 Nov 2013 00:16

"annoying XML schema files" - and they are just that!

whereas, to quote that guy again:

"It's super easy to interact with Elasticsearch"

Before, I did not see much of a difference, which is why I suggested a parallel integration of elasticsearch and Solr, just for the sake of getting wider acceptance or even support from both camps. But after playing with both, it is clearly elasticsearch that fits better (and will be easier) for a Cobalt integration.

For the indexer, I would suggest concentrating on the bulk API and not bothering with individual indexing, even for a small number of documents. (Trying to set up some data for indexing performance testing -- will report on it later.)
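
With Elastica the bulk route is straightforward - collect documents and send them in one request (index/type names assumed):

```php
<?php
// Sketch: bulk indexing instead of one HTTP call per record.
$client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));
$type   = $client->getIndex('cobalt')->getType('cobalt_en');

$docs = array();
foreach ($records as $record) { // $records fetched from #__js_res_record
    $docs[] = new \Elastica\Document($record->id, array(
        'title' => $record->title,
        // ... one field per Cobalt field
    ));
}
$type->addDocuments($docs); // uses the _bulk endpoint under the hood
$type->getIndex()->refresh();
```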

While creating test queries, I came across another plugin: https://github.com/polyfractal/elasticsearch-inquisitor


londoh VIP
Total posts: 137
01 Nov 2013 00:21

From what I've seen I definitely agree ES is easier.

I would suggest concentrating on the bulk API

Yeah, I saw it. It's easy to factor in.


Sackgesicht VIP
Total posts: 1,636
01 Nov 2013 01:00

Yeah, I saw it. It's easy to factor in.

If you can still spare some time for your ES component/plugin, I have some requests:

Add the category JSON as it is stored, without decoding it first.

Maybe even better: just store the category_id(s), since the stored category description becomes useless once someone changes it.

If only the IDs are stored, we can declare them as short or int for a smaller index size and faster processing. They might be used for the category index count through a terms facet (a mapping sketch follows after this list).

Instead of storing the fieldsdata in 1 field, every included field should be in its own field.

(In a much much later version of an indexer, it should consider fieldkeys to merge content under same field and disregard the id -- but this would be already in the final stage, when multitype configuration would come into the game..)

The type should be the record type_id (now it is cobalt_en).

Adding the other core fields for better 1:1 comparison to the cobalt setup.

indexing through bulk api ...
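
A sketch of a mapping along those lines - integer category IDs and one ES field per Cobalt field (all names illustrative):

```php
<?php
// Sketch: explicit mapping so category ids are integers, not strings,
// and each Cobalt field gets its own typed ES field.
$client = new \Elastica\Client(array('host' => 'localhost', 'port' => 9200));
$type   = $client->getIndex('cobalt')->getType('type_1'); // record type_id

$mapping = new \Elastica\Type\Mapping($type, array(
    'title'        => array('type' => 'string'),
    'category_ids' => array('type' => 'integer'), // ids only, not the JSON blob
    'ctime'        => array('type' => 'date'),
    'field_12'     => array('type' => 'string'),  // one property per Cobalt field
));
$mapping->send();
```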

I would be able to do some of the modifications, but I just want to avoid us ending up with 2 different versions colliding all the time. (Another problem might be my limited skills in this area :D)


londoh VIP
Total posts: 137
01 Nov 2013 06:51

OK, I have some time for the rest of today.

The type should be the record type_id (now it is cobalt_en).

I assumed the Joomla language needs to come into the ES schema. If not there, where is it defined?

so to follow cobalt's schema

within that ES-type under cobalt_en (or cobalt_fr - for example) the record id is unique and carries with it the type_id section_id and categories, but its not unique of it were to store

I assumed storing the category name will allow searching on that name, so a search for 'red widgets' will return all records in the 'red widget' category.

The type should be the record type_id (now it is cobalt_en).

every included field should be in its own field

I agree

Is it possible you could make a list of Cobalt fields and what they should map to in ES (if possible, and/or easiest first etc.)? I'll work now on some changes to incorporate that.


londoh VIP
Total posts: 137
01 Nov 2013 06:55

so to follow cobalt's schema

within that ES-type under cobalt_en (or cobalt_fr - for example) the record id is unique and carries with it the type_id section_id and categories, but its not unique of it were to store

Sorry, that doesn't make sense! I'll try again...

So, to follow Cobalt's schema:

within that ES type under cobalt_en (or cobalt_fr, for example), the record ID is unique and carries with it the type_id, section_id and categories - but is it unique enough?

Oh yeah: what about category hierarchy? Is it important?


londoh VIP
Total posts: 137
01 Nov 2013 07:31

btw, the JES package I worked from uses the Elastica PHP library:

Elastica on github

Elastica Google group: https://groups.google.com/forum/#!forum/elastica-php-client
