Your approach is actually good. Some popular frameworks like Compass are performing what you describe at a lower level in order to automatically mirror to the index changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).
In addition to what you describe, I would also regularly re-index all the data which lives in MongoDB in order to ensure both Solr and Mongo are sync'd (probably not as long as you might think, depending on the number of document, the number of fields, the number of tokens per field and the performance of the analyzers : I often create index from 5 to 8 millions documents (around 20 fields, but text fields are short) in less than 15 minutes with complex analyzers, just ensure your RAM buffer is not too small and do not commit/optimize until all documents have been added).
Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters the most to you, you could change the value of mergefactor in Solrconfig.xml (high values improve write performance whereas low values improve read performance, 10 is a good value to start with).
You seem to be afraid of the index build time. However, since Lucene indexes storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). However, the warm-up time will increase, so you should ensure that
- there are typical (especially for sorts in order to load the fieldcaches) but not too complex queries in the firstSearcher and newSearcher parameters in your solrconfig.xml config file,
- useColdSearcher is set to
- false in order to have good search performance, or
- true if you want changes performed to the index to be taken faster into account at the price of a slower search.
Moreover, if it is acceptable for you if the data becomes searchable only a few X milliseconds after it has been written to MongoDB, you could use the commitWithin feature of UpdateHandler. This way Solr will have to commit less often.
For more information about Solr performance factors, see
http://wiki.apache.org/solr/SolrPerformanceFactors
To delete documents, you can either delete by document ID (as defined in schema.xml) or by query :
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html
Best Answer
I would suggest storing the access roles (yes, its plural) as document metadata. Here the required field
access_roles
is a facet-able multi-valued string field.The user owning the document is a default access role for that document.
To change the access roles of a document, you edit
access_roles
.When Jane searches, the access roles she belongs to will be part of the query. Solr will retrieve only the documents that match the user's access role.
When Jane (
user_jane
), manager at vienna office (manager_vienna
) searches, her searches go like:which fetches all documents which contains
user_jane
ORmanager_vienna
inaccess_roles
;Doc1
andDoc2
.When Bob, (
user_bob
), member of a special team (specia_team
) searches,which fetches
Doc2
for him.Queries adapted from http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams