londoh VIP
Total posts: 137
27 Oct 2013 15:46

OK, following on from the other search thread, attached is an alpha-grade but working package for Cobalt + Elasticsearch as a proof of concept.

Some notes:

It's based on the JES package.

Most importantly, as that page says:

You absolutely need to have an ElasticSearch server if you want to use this extension

The original JES package won't work on J3, so I've updated the code.

I needed to make a couple of plugins to get Cobalt working with JES; they are included in the attached package.

I've run the package on a blank J3 install on my dev server running PHP 5.4 and it installs OK.

I haven't tried it with J2.5. If you want to try that, maybe use the JES package and extract my Cobalt plugins from this package.

After install there are a lot of plugins to enable (under system, elasticsearch and mint),

set your Elasticsearch config options in admin -> elasticsearch,

and publish the search module somewhere.

(IIRC there is a classname clash with some other (Joomla) search module - I forget which - but if the JES search module causes a fatal error, turn the other one off.)

There might also have been some issues with SEF URLs(?), but anyway it works right now with SEF turned off.

It will index all Cobalt records when run from admin, and will index new and updated Cobalt records as they are saved.

At the moment the Cobalt plugin simply indexes whatever is in #__js_res_record.fieldsdata plus the title.

Of course this means Cobalt is still pushing data into fieldsdata, which defeats the purpose, but anyway it's a basic trial.

In my test setup with around 14,000 entries it takes about 5 minutes to index all records. That's longer than a Cobalt re-index, but I've used the Cobalt ItemsStore to fetch record data, which I guess is not the most efficient method.

Searches certainly seem to be impressively fast, although I haven't done any hard numbers.

Is it worth it?

My Cobalt project has a lot of data across a lot of sites, and unscientifically this seems much faster than standard Cobalt/Joomla search. I haven't looked at the geospatial stuff yet, but apparently that is also fast, and it's a big consideration.

It would need some more work to get it up to standard, but I think it's probably worth it. If you can test it or have an opinion, I'd be grateful for feedback.

[It shouldn't explode your install - but as always, back up first! Unlike Sergey I'm useless at support, so if the sky falls in you'll be on your own!]

Last Modified: 02 Mar 2014



Sackgesicht VIP
Total posts: 1,636
27 Oct 2013 16:40

Cool,

I have an Elasticsearch server with some basic plugins ready to test.

Will try it later today (5:30 am here :D )

and give feedback ...

Excited already ... :D

Of course this means Cobalt is still pushing data into fieldsdata, which defeats the purpose, but anyway it's a basic trial.

Not necessarily. It might even be good, since Cobalt reindex is not ready for types with more than a couple of thousand records. Switching back would be no problem.

For some initial testing of Elasticsearch, I earlier exported a Cobalt type to CSV and imported it into Elasticsearch. I did not measure the time - I believe it was faster than 5 min .. will do it later again ...

If the tests are successful, I hope this gets a core implementation in Cobalt (Solr and Elasticsearch) without the plugins:

  • Section definition would create the index

  • Type definition would even do a mapping

  • Queries will be optimized through field settings (filter order etc)

Thanks again for doing the first step ... :D


Sackgesicht VIP
Total posts: 1,636
27 Oct 2013 16:45

Hmmm ... can not download the plugin ...


londoh VIP
Total posts: 137
27 Oct 2013 16:48

Hohum... I just noticed I left a firebug call in that pkg that will crash it.

I can't see how to delete the previous upload, so here's another one.

Or comment out line 126 [ fb( ] in plugins/elasticsearch/cobalt/cobalt.php.


Sackgesicht VIP
Total posts: 1,636
27 Oct 2013 17:11

The component finds the Elasticsearch server and I started the indexing ... it looks like this:

Should I just be patient?

Now I got errors ...

Notice: Trying to get property of non-object in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 218

Notice: Trying to get property of non-object in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 221

Notice: Trying to get property of non-object in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 224

Warning: Invalid argument supplied for foreach() in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 225

Notice: Trying to get property of non-object in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 231

Notice: Trying to get property of non-object in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 232

Notice: Trying to get property of non-object in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 233

Notice: Undefined variable: categories in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 234

Warning: implode(): Invalid arguments passed in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 234

Notice: Trying to get property of non-object in /latest/plugins/elasticsearch/cobalt/cobalt.php on line 237

and then finally

Fatal error: Maximum execution time of 30 seconds exceeded in latest/libraries/joomla/language/language.php on line 1259

I will increase the execution time now ...

It imported 3k documents and left elasticsearch in a "yellow" condition .. have to check ...


Sackgesicht VIP
Total posts: 1,636
27 Oct 2013 17:18

The yellow condition comes from the index definition: it was declared as 4 shards with 1 replica ... since there is only 1 node, there are no replicas ...
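
For reference, dropping the replica count to 0 turns a single-node index green again - a minimal sketch, assuming the index is called cobalt_en (whatever JES actually created):

    // assumption: index name "cobalt_en"; adjust to whatever JES created
    $settings = json_encode(array('index' => array('number_of_replicas' => 0)));

    $ch = curl_init('http://localhost:9200/cobalt_en/_settings');
    curl_setopt_array($ch, array(
        CURLOPT_CUSTOMREQUEST  => 'PUT',
        CURLOPT_POSTFIELDS     => $settings,
        CURLOPT_RETURNTRANSFER => true,
    ));
    curl_exec($ch);
    curl_close($ch);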


Sackgesicht VIP
Total posts: 1,636
27 Oct 2013 17:32

Elasticsearch cobalt plugin ...

Ordering shows:

Notice: Trying to get property of non-object in /latest/libraries/cms/form/field/ordering.php on line 85

Notice: Trying to get property of non-object in /latest/libraries/cms/form/field/ordering.php on line 86

Warning: Invalid argument supplied for foreach() in /latest/libraries/joomla/database/driver.php on line 1490

New items default to the last position. The ordering can be changed after this item is saved.


londoh VIP
Total posts: 137
27 Oct 2013 17:34

Now I got errors ...

Oh yeah - me too! :D

That code doesn't make perfect sense!

I've gone to watch football, but try something like this in that plugin, around row 210-ish:

private function rowToDocument($record)
{
    // guard against missing or non-object rows to avoid the notices above
    if (!isset($record)) return false;

    if (!is_object($record)) {
        $record = ItemsStore::getRecord($record);
        if (!$record) return false;
    }
    // ... rest of the method unchanged

Something like that should stop the notices.

There are also still some fatal errors in the JES package - JView errors on purge etc.

OK, football and bed.


londoh VIP
Total posts: 137
27 Oct 2013 17:38

Oh yeah, forgot to say this:

there is no checking at all on 'index all' - it will keep pushing duplicate data into ES


Sackgesicht VIP
Total posts: 1,636
27 Oct 2013 20:31

it will keep pushing duplicate data into ES

In the long run this could be avoided if we use the unique Cobalt id as _id

--> Elasticsearch would then create "versions" of the document instead of just creating another new document.

A good read is here for a better understanding of how it works.
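
Roughly what I mean, as an untested sketch (index and type names are just placeholders): indexing with an explicit _id makes a re-run overwrite the document and bump its _version instead of adding a duplicate:

    // assumption: index "cobalt_en", type "record"; $record is the Cobalt record object
    $id  = (int) $record->id;                        // unique Cobalt record id
    $doc = json_encode(array('title' => $record->title /* , ...other fields */));

    $ch = curl_init('http://localhost:9200/cobalt_en/record/' . $id);
    curl_setopt_array($ch, array(
        CURLOPT_CUSTOMREQUEST  => 'PUT',             // PUT with an explicit _id
        CURLOPT_POSTFIELDS     => $doc,
        CURLOPT_RETURNTRANSFER => true,
    ));
    $response = curl_exec($ch);                      // response contains the new _version
    curl_close($ch);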

If Cobalt ever goes for an OPTIONAL integration with Elasticsearch/Solr, things like this will have an enormous impact on how an "Enterprise Cobalt" will work and perform.

I increased the execution time to 60 sec; now 3600 documents were indexed before it crashed without a notice.

The indexing itself did not deliver the expected results: no content in title and fieldsdata, and the URL is not correct either ...

But everything is getting clearer now ... one Elasticsearch query can return everything we need for pagination and facetted search (like the existing filter display).

My query for the first 10 results matching text within the href field:

{
  "query": {
    "bool": {
      "must": [
        { "text": { "cobalt.href": "abalos" } }
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 10,
  "sort": [],
  "facets": {}
}

The result also includes the total number of hits.
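
As a rough sketch of what I mean (field names like cobalt.title and cobalt.categories.id are assumptions about the mapping), one request can give the page of hits, the total count and the facet counts:

    $query = array(
        'query'  => array('text' => array('cobalt.title' => 'abalos')),
        'from'   => 0,     // pagination offset
        'size'   => 10,    // page size
        'facets' => array(
            'by_category' => array('terms' => array('field' => 'cobalt.categories.id')),
        ),
    );

    $ch = curl_init('http://localhost:9200/cobalt_en/_search');
    curl_setopt_array($ch, array(
        CURLOPT_POSTFIELDS     => json_encode($query),
        CURLOPT_RETURNTRANSFER => true,
    ));
    $result = json_decode(curl_exec($ch), true);
    curl_close($ch);

    $total  = $result['hits']['total'];                   // for pagination
    $counts = $result['facets']['by_category']['terms'];  // for the filter counts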

For my existing project it will already have a huge impact. :D

I have an upcoming project with millions of records. With the way Cobalt is conceptualized as of now, I was quite hesitant to consider Cobalt as the tool of choice while trying to find the bottlenecks.

After initial tests with Elasticsearch, I feel very confident that this combination can deliver the job. :D


londoh VIP
Total posts: 137
28 Oct 2013 01:31

Attached is an updated package, mainly the elasticsearch/cobalt plugin.

It's still a bit clunky but should fix some of those issues with the URL etc.

My cobalt_en index shows complete data - but it did before too, so I don't know why yours doesn't have title and fieldsdata etc.?

The Cobalt record id is provided as the index id and should be included as well.

Maybe the problems are from the first aborted run?

Maybe drop the index and try again?

Another point: as it's coded, creating the full index might (will) need lots of PHP memory and time.

Section definition would create the index

Type definition would even do a mapping

Queries will be optimized through field settings (filter order etc)

Wow - yes!

It needs a bit of work yet though!

And no doubt Sergey's input.

I need to read up on ES.

I'm away from my desk for most of the day now.


londoh VIP
Total posts: 137
28 Oct 2013 04:58

There was a problem with the XML when upgrading over some of the JES packages,

and I made a typo in the package XML.

I think it should work now.


Sackgesicht VIP
Total posts: 1,636
28 Oct 2013 06:01

I changed the PHP memory limit from 128MB to 512MB -- now it is the max execution time which limits the indexing. After 60 sec I got 5252 docs indexed, with the following errors:

Undefined variable: categories in /plugins/elasticsearch/cobalt/cobalt.php on line 240

implode(): Invalid arguments passed in /plugins/elasticsearch/cobalt/cobalt.php on line 240

It now gets the fields data etc... :D

Since Elasticsearch gets its mapping on an index/_type level, I would suggest using the Cobalt type_id as _type, to stay closer to the Cobalt structure and make it easier to query on a type level.

For a start it might be good to put everything into one index, to have a better comparison with the existing Cobalt setup.

Depending on performance tests, it might later be a better approach to create an individual index per section (including multiple types if needed).

The index creates a type "article" which represents a module... (1 doc)????

The categories should be indexed as they are stored in Cobalt. That would create an array/object with the id and the description. It will deliver correct results independently of the category name (and most probably even a little faster).

When indexing, the "fields" column should be part of the whole JSON string; then all fields will be represented with their respective values like an array. I tested it manually ...

Actually all record fields should be included, so that we can create a query environment like Cobalt ... published, hidden etc etc ...

For indexing, "bulk" indexing (~80-100 docs per call) is recommended .. See here

I believe the ItemsStore is not needed and will only cost memory and time ... a bulk index direct from #__js_res_record would be the fastest option ...
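
Something like this untested sketch is what I have in mind - it reads straight from #__js_res_record, uses the record id as _id and the type_id as _type (as suggested above), and posts batches of ~100 documents to _bulk. Index and column names may need adjusting:

    $db = JFactory::getDbo();
    $db->setQuery('SELECT id, type_id, title, fieldsdata FROM #__js_res_record');
    $rows = $db->loadObjectList();

    foreach (array_chunk($rows, 100) as $batch) {
        $body = '';
        foreach ($batch as $row) {
            // bulk format: one action line plus one source line per document
            $body .= json_encode(array('index' => array(
                '_index' => 'cobalt_en',
                '_type'  => (string) $row->type_id,
                '_id'    => (string) $row->id,
            ))) . "\n";
            $body .= json_encode(array(
                'title'      => $row->title,
                'fieldsdata' => $row->fieldsdata,
            )) . "\n";
        }

        $ch = curl_init('http://localhost:9200/_bulk');
        curl_setopt_array($ch, array(
            CURLOPT_POSTFIELDS     => $body,
            CURLOPT_RETURNTRANSFER => true,
        ));
        curl_exec($ch);
        curl_close($ch);
    }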

This goes into the right direction .. excellent !!!!!


Sackgesicht VIP
Total posts: 1,636
28 Oct 2013 06:07

missing "/" in href


Sackgesicht VIP
Total posts: 1,636
28 Oct 2013 06:30

Example of a result from an index where I manually indexed the JSON of "fields" ... (partly) ... this, together with the other #__js_res_record columns, would be a good start to play with facets and simulate a Cobalt advanced-filter scenario:

{
  "_shard": 1,
  "_node": "gUTl_41PRHW85TBT4T4KlQ",
  "_index": "test",
  "_type": "post",
  "_id": "1",
  "_score": 1,
  "fields": {
    "1": "12471",
    "2": "1135.00",
    "3": [
      "2016-02-08 00:00:00"
    ],
    "4": [
      {
        "id": "51",
        "title": null,
        "height": null,
        "description": null,
        "width": null,
        "filename": "1335229200_5a24de4f700060d1e274ce790d6fa942.PDF",
        "params": null,
        "realname": "2010.PDF",
        "fullpath": "2012-04-24/1335229200_5a24de4f700060d1e274ce790d6fa942.PDF",
        "ext": "pdf",
        "size": "2249150"
      },
      {
        "id": "248273",
        "title": null,
        "height": null,
        "description": null,
        "width": null,
        "filename": "1360602000_5f279f538519abc8b2a76381edde7a2b.pdf",
        "params": null,
        "realname": "2013.PDF",
        "fullpath": "2013-02-12/1360602000_5f279f538519abc8b2a76381edde7a2b.pdf",
        "ext": "pdf",
        "size": "4130986"
      }
    ],
    "5": [
      "2014-02-27 00:00:00"
    ],
    "6": [
      {
        "id": "53",
        "title": null,
        "height": null,
        "description": null,
        "width": null,
        "filename": "1335229200_4782b90f9f854a0a6a479db2426db2a3.PDF",
        "params": null,
        "realname": "2011.PDF",
        "fullpath": "2012-04-24/1335229200_4782b90f9f854a0a6a479db2426db2a3.PDF",
        "ext": "pdf",
        "size": "2541564"
      },
      {
        "id": "54",
        "title": null,
        "height": null,
        "description": null,
        "width": null,
        "filename": "1335229200_5f758faa6f897dd1b56cbb92a08a34ed.PDF",
        "params": null,
        "realname": "2012.PDF",
        "fullpath": "2012-04-24/1335229200_5f758faa6f897dd1b56cbb92a08a34ed.PDF",
        "ext": "pdf",
        "size": "3039359"
      },
      {
        "id": "52",
        "title": null,
        "height": null,
        "description": null,
        "width": null,
        "filename": "1335229200_6d6c584f2a7fc075ed7305091c78fe74.PDF",
        "params": null,
        "realname": "2010.PDF",
        "fullpath": "2012-04-24/1335229200_6d6c584f2a7fc075ed7305091c78fe74.PDF",
        "ext": "pdf",
        "size": "408543"
      },
      {
        "id": "251985",
        "title": null,
        "height": null,
        "description": null,
        "width": null,
        "filename": "1363665600_6ba2415cca328b5d64793b5158b8b495.pdf",
        "params": null,
        "realname": "2013.PDF",
        "fullpath": "2013-03-19/1363665600_6ba2415cca328b5d64793b5158b8b495.pdf",
        "ext": "pdf",
        "size": "3345047"
      }
    ],
    "25": [
      94296
    ],

The field (field_id) 3 is a date field. In cobalt it can hold a date range --> array

4 --> upload

5 --> date

6 --> upload

25 --> child (in an advanced version, it might be good to add the Parent Title to the array)


londoh VIP
Total posts: 137
28 Oct 2013 08:31

Some good and valid points in your detailed post.

The undefined vars thing: most of the 'alpha' code is without any checks or fallbacks, so they will occur.

Categories: hmm, they should be included as a string. Are they not?

Type ID: yes, I see the same benefit.

+1 for bulk inserts.

Included fields or not: in my view the Cobalt field types need a mapping to ES types. Armed with that, the component backend could easily allow user-defined selection of fields per type.
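
To illustrate the idea (the type translation table and the $cobaltFields list are assumptions, not existing Cobalt API): the backend could build a per-type mapping from the field definitions and PUT it to ES:

    // hypothetical translation of Cobalt field types to ES core types
    $esTypes = array(
        'text'     => array('type' => 'string'),
        'number'   => array('type' => 'double'),
        'datetime' => array('type' => 'date', 'format' => 'yyyy-MM-dd HH:mm:ss'),
        'geo'      => array('type' => 'geo_point'),
    );

    $properties = array('title' => array('type' => 'string'));
    foreach ($cobaltFields as $field) {    // $cobaltFields: hypothetical list of the type's fields
        if (isset($esTypes[$field->type])) {
            $properties['field_' . $field->id] = $esTypes[$field->type];
        }
    }

    $mapping = json_encode(array('record' => array('properties' => $properties)));

    $ch = curl_init('http://localhost:9200/cobalt_en/record/_mapping');
    curl_setopt_array($ch, array(
        CURLOPT_CUSTOMREQUEST  => 'PUT',
        CURLOPT_POSTFIELDS     => $mapping,
        CURLOPT_RETURNTRANSFER => true,
    ));
    curl_exec($ch);
    curl_close($ch);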

@Sergey: are you there?

Can I ask why this doesn't work in the backend:

$url = JRoute::_(Url::record($record));

Well, it works, but I get a URL that begins with /administrator/ or, as per Sackgesicht's screenshot, e.g. /project/latest.

I found some possible corrections but none of them work either.


Sackgesicht VIP
Total posts: 1,636
28 Oct 2013 08:46

Categories: hmm, they should be included as a string. Are they not?

Yes, they are a string, but they would be better like

{"2":"Active"}

resulting in an array, which would give us the advantage of being able to search/count by the ID.

See here
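
For example (the field name categories.id is an assumption about the mapping), a term filter on the id would then work regardless of how the category is labelled:

    $query = json_encode(array(
        'query' => array(
            'filtered' => array(
                'query'  => array('match_all' => new stdClass()),
                'filter' => array('term' => array('categories.id' => '2')),
            ),
        ),
    ));

    $ch = curl_init('http://localhost:9200/cobalt_en/_search');
    curl_setopt_array($ch, array(
        CURLOPT_POSTFIELDS     => $query,
        CURLOPT_RETURNTRANSFER => true,
    ));
    $hits = json_decode(curl_exec($ch), true);
    curl_close($ch);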

In my view the Cobalt field types need a mapping to ES types.

Agree.

If MintJoomla goes for the optional integration of Solr/Elasticsearch, it should be part of the field definition for fine-tuning purposes.

Mapping parameters would be determined by filter/search settings and individual field settings.

/project/latest is my joomla installation directory


londoh VIP
Total posts: 137
28 Oct 2013 10:55

Yes, OK with the category array: good point. In fact I put code there to break them out. It's easier to just leave the existing Cobalt JSON data.

ES + Cobalt integration: I wonder how much enthusiasm Sergey will have for this?

About the mapping: perhaps we see this differently. My expectation is that it can be entirely contained within a 3rd party component. And it could be argued that an external component is the correct place for it. So then, whether it's ES or Solr or whatever, Cobalt doesn't need to know.

Cobalt just needs some way to not care at all about search and filters. Turning off search on each field is easy, but that still leaves all the JSON data in the #__js_res_record.fields column, and I suspect that's still enough to cause large datasets to fall over.

I'm assuming Cobalt needs the data in the fields column for basic functionality aside from search(?), but I don't know Cobalt well enough to be sure about that(?)

/project/latest: yes, understood. I was alluding to the point that there is some reason why JRoute::_($url) from within admin doesn't produce the correct front-end URL, which is why the URLs aren't correct. A small problem that's fixable, but I forget how right now.
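
(For the record, the workaround I half-remember - untested here - is to build the link with the site router instead of the admin one and then fix up the /administrator/ segment, something like:)

    // untested sketch of the usual workaround
    $router = JRouter::getInstance('site', array('mode' => JFactory::getConfig()->get('sef')));
    $uri    = $router->build(Url::record($record));   // Url::record() as above
    $link   = str_replace('/administrator/', '/', $uri->toString());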


Sackgesicht VIP
Total posts: 1,636
28 Oct 2013 19:05

When accessing the component after the initial indexing:

Undefined index: cobalt in /administrator/components/com_elasticsearch/models/default.php on line 109

array_key_exists() expects parameter 2 to be array, null given in /administrator/components/com_elasticsearch/models/default.php on line 109

Undefined index: cobalt in /administrator/components/com_elasticsearch/models/default.php on line 111

ES + Cobalt integration: I wonder how much enthusiasm Sergey will have for this?

Actually I talked to him about this even before the discussion started here, and there was interest ...

About the mapping: perhaps we see this differently. My expectation is that it can be entirely contained within a 3rd party component.

Yes it can, but it might get complicated in a more complex setup (granted that the user should fine-tune and influence the mapping). The mapping will happen to a Cobalt TYPE. Whenever you create/modify a TYPE, there should be a facility to support this. A changed mapping will have a serious impact if you want a "perfect" setup. Adding fields is never a problem, but if you change other attributes you will have to reindex. It is similar to Cobalt, when you change/add searchable fields. See this blog post here

A Cobalt scenario has to take a lot of user settings into account - published, featured, hidden etc. Therefore those values need to be replicated in the index. We can actually ignore the fieldsdata column, if we do a proper mapping.
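
For example (assuming the state columns are indexed as plain fields named published and hidden), a full-text query can then be narrowed the same way Cobalt does it:

    $query = json_encode(array(
        'query' => array(
            'filtered' => array(
                'query'  => array('text' => array('fieldsdata' => 'abalos')),
                'filter' => array('and' => array(
                    array('term' => array('published' => 1)),
                    array('term' => array('hidden' => 0)),
                )),
            ),
        ),
    ));
    // POST $query to http://localhost:9200/cobalt_en/_search as in the earlier examples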

I'm assuming Cobalt needs the data in the fields column for basic functionality aside from search(?)

It is a fast replication of the individual field data, which might still contain data that is no longer relevant to the system. Think of unpublishing or deleting a field: the field content will still be there until the next record modification. Therefore it is important to have the latest TYPE/field relationship applied to it, or at least to follow it when processing a query.

It is not only the full-text search we should look into; it is the overall benefit we get with this optional data storage. It will NOT replace MySQL, it will COMPLEMENT it. All data will still be in MySQL, which can be the source of new indexes etc.

It will bring us solutions for which the Cobalt structure is not so well suited: fast pagination, data analytics, facetted search (including result counts), geo-location-related results and easy scalability when we run into problems.
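
A sketch of the geo side (assuming a field mapped as geo_point, called location here): "records within 10 km of a point" becomes a single filter:

    $query = json_encode(array(
        'query' => array(
            'filtered' => array(
                'query'  => array('match_all' => new stdClass()),
                'filter' => array(
                    'geo_distance' => array(
                        'distance' => '10km',
                        'location' => array('lat' => 14.6, 'lon' => 121.0),
                    ),
                ),
            ),
        ),
    ));
    // POST $query to http://localhost:9200/cobalt_en/_search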


londoh VIP
Total posts: 137
29 Oct 2013 01:05

Yes, I see those errors, but I think they're mostly irrelevant at this stage.

The mapping will happen to a Cobalt TYPE

I agree

Whenever you create/modify a TYPE, there should be a facility to support this

I agree. The debate is about where that's provided. I'm not convinced it's necessary or correct from within Cobalt.

But if it is to happen within Cobalt, it doesn't matter what I think - it's Sergey who has to be convinced! :D

We can actually ignore the fieldsdata column, if we do a proper mapping.

I agree

It will bring us solutions for which the Cobalt structure is not so well suited

And I agree with your reasons here... fast pagination etc. For my current project the geospatial stuff looks compelling.

So we can agree on most aspects. And based on that rough and ready test we just tried, I believe there are some great benefits to using ES (or Solr?).

But perhaps what I need most right now is a good book on ES (and some time!).


Sackgesicht VIP
Total posts: 1,636
29 Oct 2013 02:40

I'm not convinced it's necessary or correct from within Cobalt.

It does not need to be there, but it would simplify the matter. Mapping happens when the field and type definitions change.

In a completely independent solution it can also be controlled from the plugin/component. It is just a matter of workflow and of understanding the process.

But a core integration as an option would streamline the process.

Then again, as you mentioned already, it is not up to us to decide. :D

And based on that rough and ready test we just tried, I believe there are some great benefits to using ES (or Solr?).

Right, I have been testing over the last few days with different datasets and query scenarios -- the longer I "play" with it, the more I can see the advantages of the combination Cobalt/MySQL TOGETHER with Elasticsearch for quite a few use cases.

Actually, it will be a make-or-break feature for an upcoming project of mine. (Not for the project, but for the choice of tools.)

But perhaps what I need most right now is a good book on ES (and some time!).

For the book aspect, maybe some of the following resources will help for a start... for the required time, that's harder to provide. :D

But if there is something I can help with, I am more than willing to provide whatever it takes within my skill set.

ElasticSearch Server Chapter 1

Mastering Elasticsearch Chapter 1

Introducing elasticsearch

Exploring elasticsearch

Safari Online books - ElasticSearch Server
