
Some of us at Netgen attended the eZ Conference in Berlin a few weeks ago (and had a great time there :-), so we would like to give eZ developers a short summary of a few topics that we think are important.

Two code bases

The eZ crew announced that in the future there will be 2 code bases for the eZ Publish CMS: one for the enterprise version and one for the community version. Preparations have already begun: the current code was moved to GitHub (http://github.com/ezsystems/ezpublish ) a few days before the conference.

In our opinion this decision could be very good for the future of eZ Publish, but only if it is carried out properly:

  • The core of the community version should stay under the control of the eZ crew and some prominent eZ community people (a community board). This could enable faster and easier creation of new features and bug fixes, along with other benefits of stronger community engagement (faster testing of new features, etc.). After some time, features added to the community version would be introduced into the enterprise version. The enterprise version would therefore lag a bit in its feature list, but for solutions where stability is the most important factor it would be the correct choice.
  • Extensions in the enterprise version should stay under the control of the eZ crew, who would either build or certify them.
  • Extensions in the community version should be under the control of the community board. Useful and mature extensions from projects.ez.no should be included in the community version (in close cooperation with the extension authors, of course). That way the public benefits from new features being introduced regularly.

In general, we should have a choice between a stable version with fewer features and a community version with more features (but still good enough for a lot of projects).

Solr as a content storage

In future eZ Find versions we should have an option to store complete eZ objects in the Solr engine. Currently, the main application would be archiving old content that no longer needs to be editable. On very large sites with fast-growing content (news portals, forums, etc.) big database tables become a performance bottleneck, so the possibility of periodically moving old content to Solr could be a life saver. Archived content would still be searchable. A few notes though:

  • there should be an “unarchive” function
  • links to old content should work

There are also other use cases for the Solr engine because of the performance gains it brings. One example is shown in our tutorial about replacing standard list/tree fetches with eZ Find search (available here ).

REST API

Last but not least is the announcement of a new API. In forthcoming versions the eZ crew will implement a new, more abstract API. It will not replace the existing one, just add higher-level features for frequent tasks (e.g. content publishing, editing, ...). This API will also be accessible through REST. The audience loudly approved of the announcement. It will surely provide an easier way to implement various adaptable interfaces (e.g. for mobile and tablet devices) and integration utilities.
@eZ please, develop this soon :-)

Berlin

We are glad we were there and met interesting eZ people (from both eZ Systems and the eZ community). And it was very nice to experience the city of Berlin: from the grand Alex and Brandenburger Tor, over the charming Ku’damm, to the artistic and youthful Kreuzberg. Till the next eZ conference (rumor has it that it will be hosted in London)...

In this blog post we present a simple example of how to build a spatial search with eZFind. The most common use of spatial search is finding the nearest available locations, and this is what we describe here.

Prerequisites

1.  eZFind 2.2 extension with Solr configured and working. We need version 2.2 because it introduces the new Solr geopoint field type. It is basically a two-dimensional double for storing latitude and longitude:

<fieldType name="geopoint" class="solr.PointType" dimension="2" subFieldType="double"/>

eZPublish location attributes should get the suffix "_gpt" when indexed by eZFind:

<dynamicField name="*_gpt" type="geopoint" indexed="true" stored="true"/>

Searching for the nearest objects can also be achieved with older versions of eZFind by defining 2 fields based on the double type. There is a great thread about this here.

2.  Preparing location data in eZPublish. We used the ezgmaplocation extension (included in the latest eZPublish versions) as the datatype for storing the location. Other extensions could be used for this purpose, or a datatype can be developed from scratch. The important thing is to have latitude and longitude prepared for indexing.

3.  Indexing. For indexing the gmaplocation datatype we used unofficial eZFind code from this link provided by Paul Borgermans. Hopefully it will be included in the next version of eZFind.
IMPORTANT NOTE: there is currently a bug in Solr when using field names containing a dash "-": https://issues.apache.org/jira/browse/SOLR-1172 , and the indexing code creates such field names (e.g. subattr_gmaps_location-coordinates_gpt). The easiest way to work around this is to create an additional Solr field under the <fields> node in extension/ezfind/java/solr/conf/schema.xml:

<field name="gmaps_coordinates" type="geopoint" indexed="true" stored="true" />

and define a rule for automatically copying the location data to that field (under the <schema> node in schema.xml):

<copyField source="subattr_gmaps_location-coordinates_gpt" dest="gmaps_coordinates"/>

FYI: if some other geolocation datatype is used, the geopoint Solr field (*_gpt) should be filled in the following format: LATITUDE,LONGITUDE (use a name without dashes to skip the copying part).

Reindex and check that the data is in the Solr index via the /solr/admin URL.

Search on mobile phone

Query time boosting with distance

There are 2 main methods for sorting search results in Solr by distance:

  • by using the distance function in the sort parameter, e.g. sort=dist(2, geopoint1, geopoint2) asc
  • by using relevance sorting and query time boosting: _val_:recip(dist(2, geopoint1, geopoint2),1,1,0)

Direct sorting is not as tunable as sorting based on score (relevance), so we decided to use the second method.

The basis for this method is the Solr dist() function. It can calculate distance in several ways: the taxicab method, the Euclidean method, etc. (more about it here ). For small distances the Euclidean method is good enough. The formula is simple:
dist() <=> sqrt( (lat1-lat2)^2 + (lng1-lng2)^2 )

For bigger distances (thousands of km) the Haversine function should be used (more about it here ).
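For readers who want to see what the Haversine calculation involves, here is a minimal Python sketch. It is not taken from Solr or eZFind; the function name and the Earth radius constant are our own assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, good enough for a sketch


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    # Haversine term: squared half-chord length between the two points
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Unlike the Euclidean formula, this takes the curvature of the Earth into account, which is why it is the better fit for distances of thousands of km.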

Because we use distance only for boosting (the search result will not return the distance), the square root is not needed and we can use the faster sqedist() function. The formula is then even simpler:
sqedist() <=> (lat1-lat2)^2 + (lng1-lng2)^2. More info here .

Before boosting we need to normalize the output of sqedist() so that smaller distances give larger values; the recip() function is used for this task. recip(x,m,a,b) computes a/(m*x+b). More info here .

Finally, our Solr function for boosting looks like this: recip(sqedist(geopoint1, geopoint2),1,1,0)
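To get a feeling for how this boost behaves, the arithmetic can be reproduced outside Solr. The following Python sketch only mimics the math of sqedist() and recip(); the sample points and names are made up for illustration.

```python
def sqedist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def recip(x, m, a, b):
    """Same arithmetic as Solr's recip(): a / (m*x + b)."""
    return a / (m * x + b)


def distance_boost(doc_point, ref_point):
    """Boost used in this post: recip(sqedist(...),1,1,0), i.e. 1 / squared distance."""
    return recip(sqedist(doc_point, ref_point), 1, 1, 0)


# Hypothetical reference point and documents (latitude, longitude)
reference = (48.0, -104.0)
docs = {"near": (48.1, -104.1), "mid": (49.0, -103.0), "far": (52.0, -100.0)}

# A higher boost means a closer document, so sort by boost in descending order
ranked = sorted(docs, key=lambda name: distance_boost(docs[name], reference), reverse=True)
```

Note that with b=0 the boost grows without bound as the distance approaches zero, and in this Python sketch a document located exactly at the reference point raises ZeroDivisionError. A small positive b would cap the boost, if that matters for your data.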

To boost with this function when querying, the following string should be placed before the search text:

_val_:"recip(sqedist(gmaps_coordinates,vector(48.166085,-104.326172)),1,1,0)"

More about the whole topic: http://wiki.apache.org/solr/SpatialSearch

Search results based on distance

Testing

To test whether the method works, use the /solr/admin URL again. You will need a geopoint value (latitude and longitude) as a reference point. The easiest way to get one is to use maps.google.com. Pick a spot with a right mouse click and then click on "Center map here". The geopoint value of the selected spot should appear in the field under the Link button (upper right corner) as the 'll' HTTP GET parameter.

Once you have the desired point, enter the following into the /solr/admin search field:

_val_:"recip( sqedist( gmaps_coordinates, vector(YOUR_LAT,YOUR_LONG) ), 1, 1, 0 )"YOUR SEARCH TERM

Search results should be sorted with the nearest first.

Implementation

The first thing is to prepare the reference location: the point from which the distance is calculated. This depends heavily on what you are trying to accomplish. In our case the reference point was the mobile phone location. We used the W3C Geolocation API (http://www.w3.org/TR/geolocation-API ) to get the location data from the client. The location itself is extracted from GPS or calculated by BTS (cell tower) triangulation.

The location is then put into GET parameters together with the search text. In the content/search.tpl template the boosting string is prepared like this:

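{* Default to an empty boost so the later concat() also works when no location was sent *}
{def $dboost=''}
{if ezhttp( 'spot', 'get', 'hasVariable' )}
 {set $dboost=concat('_val_:\"recip(sqedist(attr_location_gpt,vector(',ezhttp( 'spot', 'get' ),')),1,1,0)\"')}
{/if}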

And finally it is included in the search function:

{set search=fetch( ezfind, search, hash(query,concat($dboost,$search_text)))}

And that's it. 
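One thing worth considering: the 'spot' value comes straight from the client and ends up inside the Solr query. Below is a minimal Python sketch of how the value could be validated before the boost string is built. It is not part of our template code; the function names are hypothetical, and the same checks could be done in PHP or in a template operator.

```python
def parse_spot(spot):
    """Parse 'lat,lng' into floats, rejecting anything that is not a valid coordinate pair."""
    parts = spot.split(",")
    if len(parts) != 2:
        raise ValueError("expected 'lat,lng'")
    lat, lng = float(parts[0]), float(parts[1])  # float() raises ValueError on junk input
    # The range check also rejects NaN and infinity, since those comparisons are False
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0):
        raise ValueError("coordinates out of range")
    return lat, lng


def build_boost(spot, field="attr_location_gpt"):
    """Build the query-time boost string for a validated reference point."""
    lat, lng = parse_spot(spot)
    return '_val_:"recip(sqedist(%s,vector(%r,%r)),1,1,0)"' % (field, lat, lng)
```

For example, build_boost("48.166085,-104.326172") produces the same kind of string as the template above, while input that is not two numbers is rejected before it reaches Solr.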

In this blog post we present some details of how we developed the "Netgen Suggest" extension for eZPublish. The extension is shared with the community at projects.ez.no/ngsuggest . Download and installation instructions can be found there.

About the extension

"Netgen Suggest" is an eZPublish extension developed on top of eZFind that implements a drop down with suggestions on search fields, using Solr facets. When a user starts typing, suggestions are presented in a drop down element underneath the search box. Suggestions are sorted by number of occurrences in the search index. When a new word is added to the search box, suggestions are recalculated based on the combination of all the words entered.

The extension is based on a jQuery component. For better results we created a new Solr field type. The existing 'ezf_df_text' field can be used, but it concatenates words, so suggestions are often not precise enough.

Implementation

Netgen Suggest extension on top of ezwebin demo site

Creating Solr query

To retrieve search results from Solr we could call its URL directly, but we wanted better control over the query.

For this purpose we implemented a simple module that acts as a configurable proxy. It constructs the Solr query from the current installation_id and the following INI settings:

  • Limit: maximum number of facets
  • RootNode: root node id as additional filter condition
  • Classes: array of class ids as additional filter condition
  • Section: array of section ids as additional filter condition
  • FacetField: solr field from where facets are pulled

The Solr query is constructed with the following GET parameters:

  • rows=0 (standard search results are not needed)
  • facet=on (enable facets)
  • facet.field=ngsuggest_text (facet field defined in ngsuggest.ini)
  • facet.prefix=con (letters the user entered)
  • facet.limit=10 (maximum number of suggestions, defined in ngsuggest.ini)
  • facet.mincount=1 (minimum occurrences for a word to show up in suggestions)
  • q=*:* (the whole index is searched)
  • fq=meta_is_hidden_b:false+AND+meta_is_invisible_b:false (filter conditions: root node id, classes, installation id, .. )
  • wt=json (return results in JSON format)

The Solr output is simply passed through as the output of the module. The jQuery component can call the module via AJAX:

/ngsuggest/searchsolr?id=[search_id]&keyword=[text]

where [search_id] is the id of the search field (used for accessing the relevant INI settings) and [text] is the letters the user entered.

If the user enters 2 or more words, the query is the same except for the 'q' parameter, which is then used to narrow the search with the previous words.
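To make the query construction more concrete, here is a rough Python sketch of the same idea. It is not the module's actual PHP code: the function names, the Solr base URL and the exact form of the 'q' parameter for previous words are assumptions for illustration, and escaping of Solr special characters is left out.

```python
from urllib.parse import urlencode


def build_suggest_params(text, facet_field="ngsuggest_text", limit=10):
    """Return Solr GET parameters for a suggest request as a list of (name, value) pairs."""
    words = text.lower().split()
    prefix = words[-1] if words else ""  # the word currently being typed
    previous = words[:-1]                # completed words narrow the search
    if previous:
        # Assumption: previous words are matched against the same suggest field
        q = " AND ".join("%s:%s" % (facet_field, w) for w in previous)
    else:
        q = "*:*"
    return [
        ("rows", "0"),
        ("facet", "on"),
        ("facet.field", facet_field),
        ("facet.prefix", prefix),
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),
        ("q", q),
        ("fq", "meta_is_hidden_b:false AND meta_is_invisible_b:false"),
        ("wt", "json"),
    ]


def build_suggest_url(text, base="http://localhost:8983/solr/select"):
    """Assemble the full URL; the base URL here is a placeholder."""
    return base + "?" + urlencode(build_suggest_params(text))
```

With the input "netgen sug", the facet prefix becomes "sug" and 'q' is restricted to documents containing "netgen", so only suggestions that co-occur with the previous word are returned.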

Using jQuery component

To get a smooth visual representation we used a component developed by Tom Coote, jquery-json-suggestsearch-box, and tweaked it to better fit our needs. The component catches every text change on input fields of class 'ngsuggestfield' and launches an AJAX call to the module described above. The JSON data returned by the module is used to fill the drop down div element.
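In Solr's default JSON output, the facet values for a field arrive as a flat list that alternates term and count. The following Python sketch shows how such a response can be turned into suggestion pairs. The jQuery component does this in JavaScript; the function name and the sample response below are made up for illustration.

```python
import json


def parse_suggestions(solr_json, facet_field="ngsuggest_text"):
    """Turn Solr's flat [term, count, term, count, ...] facet list into (term, count) pairs."""
    data = json.loads(solr_json)
    flat = data["facet_counts"]["facet_fields"][facet_field]
    # Even positions hold the terms, odd positions hold their counts
    return list(zip(flat[0::2], flat[1::2]))


# Illustrative response for facet.prefix=con (not real data)
sample = '{"facet_counts": {"facet_fields": {"ngsuggest_text": ["content", 12, "contact", 5]}}}'
suggestions = parse_suggestions(sample)
```

Solr returns facet values sorted by count by default, which is where the ordering by number of occurrences in the drop down comes from.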

Solr results

For the first tests we used the 'ezf_df_text' Solr field because it contains all indexed strings. It quickly became apparent that the results were flooded with concatenated words because of the 'ezf_df_text' field type and how it works. Also, the 'ezf_df_text' field stores a lot of meta content like node ids, installation ids, etc. which should not be exposed. To get better results, a new field type needs to be added to extension/ezfind/java/solr/conf/schema.xml under the <types> node:

<fieldType name="text4suggest" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.StopFilterFactory" 
          ignoreCase="true" 
          words="stopwords.txt" 
          enablePositionIncrements="true" />
  <filter class="solr.WordDelimiterFilterFactory" 
          generateWordParts="1" 
          generateNumberParts="1" 
          catenateWords="0" 
          catenateNumbers="0" 
          catenateAll="0" 
          splitOnCaseChange="1" />
  <filter class="solr.LowerCaseFilterFactory" />
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.WordDelimiterFilterFactory" 
          generateWordParts="1" 
          generateNumberParts="1" 
          catenateWords="0" 
          catenateNumbers="0" 
          catenateAll="0" 
          splitOnCaseChange="1" />
  <filter class="solr.LowerCaseFilterFactory" />
 </analyzer>
</fieldType>

The new field type 'text4suggest' is based on the built-in type 'solr.TextField'. The following steps are applied at index time (query time is similar, but without the stop filter):

  1. text is tokenized by white space
  2. stop filter factory removes unwanted words defined in 'stopwords.txt'
  3. factory for slicing and combining words is called (useful for slicing words with numbers) but without concatenations
  4. lower case factory does casing normalization
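The steps above can be approximated in a few lines of Python to see what kind of tokens end up in the index. This is only a rough imitation of the Solr analyzer chain: the stop word list is hypothetical, and the regular expression covers only ASCII letters and digits, while the real WordDelimiterFilterFactory handles more cases.

```python
import re

STOPWORDS = {"the", "a", "and"}  # hypothetical stand-in for stopwords.txt

# Splits on case changes, letter/digit boundaries and any non-alphanumeric character
WORD_PARTS = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def analyze(text, stopwords=STOPWORDS):
    """Approximate the index-time chain: whitespace split, stop filter, word parts, lowercase."""
    tokens = []
    for raw in text.split():                 # 1. tokenize by white space
        if raw.lower() in stopwords:         # 2. drop stop words (ignoreCase="true")
            continue
        for part in WORD_PARTS.findall(raw): # 3. slice into word and number parts, no concatenation
            tokens.append(part.lower())      # 4. normalize case
    return tokens
```

For example, analyze("The PowerShot SD500 wi-fi") gives ['power', 'shot', 'sd', '500', 'wi', 'fi']. Because nothing is concatenated, each of these parts can be offered as a clean suggestion.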

Next, the new field 'ngsuggest_text' is added under the <fields> node:

<field name="ngsuggest_text" 
       type="text4suggest" 
       indexed="true" 
       stored="true" 
       multiValued="true" 
       termVectors="true" />

Finally, the new field is filled with content by copyField rules under the <schema> node:

<copyField source="attr_*" dest="ngsuggest_text" />
<copyField source="meta_name_s" dest="ngsuggest_text" />
<copyField source="meta_url_alias_s" dest="ngsuggest_text" />

For demo purposes 'ngsuggest_text' is filled with content from all attribute fields and from 2 useful meta fields. This can, of course, be tweaked per project.

Here is a short technical explanation of the topic of the last blog post . That post described how we solved a performance issue on one eZ Publish based web site by developing an eZ Publish extension which overrides eZMutex to use memcache instead of file locking. The extension is published and shared with the community on projects.ez.no . eZMutex is used by the standard file handler (eZFSFileHandler). In a future post we will also try to compare this solution with the next-generation file handler (eZFS2FileHandler) included as an optional module in the current stable eZ Publish version.

Technical details

The default eZMutex has 3 main functions: test(), lock() and unlock(). The test() function returns a boolean indicating whether the mutex exists. In the file-based version it makes 1 or 2 flock() calls:

if ( flock( $fp, LOCK_EX | LOCK_NB ) ) {
    flock( $fp, LOCK_UN );
    return false;
}
return true;

In the memcache version we make only 1 memcache call:

return memcache_get($this->mc,$this->KeyName);

The lock() function takes an exclusive lock on the mutex resource. In the file-based version it makes 1 flock() call, 1 file_exists() call, 2 create_file() calls and 2 rename_file() calls:

if ( flock( $fp, LOCK_EX ) ) {
    $this->clearMeta();
    $this->setMeta( 'timestamp', time() );
    return true;
}
return false;

In the memcache version there are, all in all, 4 memcache calls:

if ( memcache_add($this->mc,$this->KeyName,"1",false, $time) ) {
  $this->clearMeta();
  $this->setMeta( 'timestamp', time() );
  return true;
}
return false;

The unlock() function releases the lock on the mutex resource. In the file-based version it makes 1 fclose() call and 2 file_delete() calls:

if ( $fp = $this->fp() ) {
  fclose( $fp );
  @unlink( $this->MetaFileName );
  @unlink( $this->FileName );
  $GLOBALS['eZMutex_FP_' . $this->FileName] = false;
}
return false;

In the memcache version there are 2 memcache calls:

memcache_delete($this->mc,$this->MetaKeyName, 0);
memcache_delete($this->mc,$this->KeyName, 0);
return true;

To summarize the comparison over a test-lock-unlock cycle: we replaced 10-11 file system interactions with 7 memcache interactions, and in practice that gives huge performance gains.

First, an important remark: this is the first post on the Netgen Blog (powered by eZ Publish ). We hope it will not be the last :)

This post describes how we solved a performance issue on one eZ Publish based web site. The site was using several web servers with a shared disk device based on SAN. For no obvious reason the web servers had high load averages, and what was even worse, increasing load on one web server would quickly increase the load on the other web servers. The main cause of this behavior was the large number of slow IO operations on the shared device. Further inspection revealed that a large part of these IO operations were related to eZMutex files.

So there were 2 difficulties to solve:

  • For some unknown reason, the number of files in the eZMutex cache folder had been growing, slowing down Apache's access to that folder
  • The var folder was mounted on a network shared device with a slow flock() system call (locking is distributed by the file system to all web servers and is therefore slower than usual)

To solve the performance problem we developed an eZ Publish extension which overrides eZMutex to use memcache instead of file locking. The extension is published and shared with the community on projects.ez.no . More about memcache on memcached.org .

In this case memcache is not intended to be a real lock manager, but it handles this scenario very well because the locks do not need to be persistent. eZMutex is generally used for 2 purposes:

  1. to lock file writes in the var folder (locking for seconds)
  2. to lock cronjob scripts to prevent overlapping (locking for hours)

So the extension was built to support both cases (overriding the eZMutex and eZRunCronjobs classes) and to follow the logic of the original mutex as much as possible.

There was one difference though. Memcache supports key expiry, which is not available with lock files. We set the default expiry to 60 seconds, which is a reasonable time for generating caches. Expiry in memcache means that the key/value pair is not deleted immediately but is scheduled for removal when needed. The important thing is that the memcache_add() function (the basis for locking) can reuse expired keys.
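The locking idea can be illustrated with a small, self-contained Python sketch. It does not talk to a real memcached server: FakeMemcache is a hypothetical in-memory stand-in that only imitates the add/get/delete semantics and expiry, and the meta key handling of the real extension is left out.

```python
import time


class FakeMemcache:
    """In-memory stand-in for a memcached client (hypothetical, for illustration only)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at or None)

    def _alive(self, key):
        entry = self._data.get(key)
        return entry is not None and (entry[1] is None or entry[1] > self._clock())

    def add(self, key, value, expire=0):
        """Store the value only if the key is absent or expired, like memcached's add."""
        if self._alive(key):
            return False
        self._data[key] = (value, self._clock() + expire if expire else None)
        return True

    def get(self, key):
        return self._data[key][0] if self._alive(key) else None

    def delete(self, key):
        return self._data.pop(key, None) is not None


class MemcacheMutex:
    """Mutex built on add(): whoever manages to add the key holds the lock."""

    def __init__(self, cache, name, expiry=60):
        self.cache, self.key, self.expiry = cache, "mutex_" + name, expiry

    def test(self):
        return self.cache.get(self.key) is not None

    def lock(self):
        return self.cache.add(self.key, "1", self.expiry)

    def unlock(self):
        self.cache.delete(self.key)
        return True
```

A second lock() on the same name fails while the key is alive and succeeds again once the key has expired, so a crashed process cannot block the others for longer than the expiry time. In real memcached the add operation is atomic on the server, which this single-process sketch does not demonstrate.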

The direct result of using this extension with memcached was a sudden big drop in disk IO usage. It is hard to measure the magnitude of the drop precisely, but a rough estimate would be around 90% (based on a graph from the system monitoring tool). As the disk device was shared by several web servers over an OCFS2 file system, this had a big impact on overall performance, because IO operations are more costly there than on a single-server system.

Using this extension on a normal eZ web site (not too big; var folder on a local, non-shared disk; ezmutex folder not growing) makes no sense, as the performance boost will be very small. But if you have a situation or problem similar to ours, this extension could have a big impact on performance because there will be far fewer IO operations. Remember that for every file that eZ Publish writes to disk it also needs to create 2 ezmutex files and flock() them.

Short backstory of the Blog: Sharing our experience from various web projects based on eZ Publish, HTML5, MySQL, jQuery, CSS, etc. and focusing on solving problems we encountered
