
MongoDB speed decrease

I use MongoDB to store compressed HTML files. A complete document looks like this:

{'_id': 1, 'p1': data, 'p2': data2, 'p3': data3}

where data, data2 and data3 are bson.binary.Binary(zlib_compressed_html).

I have 12 million ids, and each dataX averages 90KB, so each document is at least 180KB + sizeof(_id) + some overhead.

The total data size would be at least 2TB.
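The estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes decimal units (1 KB = 1000 bytes) and ignores the `_id` and BSON overhead, and the names `NUM_IDS`, `AVG_PAGE_KB` and `total_tb` are illustrative.

```python
# Rough storage estimate from the figures in the post.
# Assumption: decimal units (1 KB = 1000 bytes); _id and BSON overhead ignored.
NUM_IDS = 12_000_000
AVG_PAGE_KB = 90


def total_tb(pages_per_doc):
    """Total size in TB if every document holds `pages_per_doc` pages."""
    total_kb = NUM_IDS * AVG_PAGE_KB * pages_per_doc
    return total_kb / 1000 ** 3  # KB -> TB


low = total_tb(2)   # two pages per document (the 180KB lower bound)
high = total_tb(3)  # all three pages present
print("%.2f TB to %.2f TB" % (low, high))
```

Under these assumptions the collection lands between roughly 2.16 TB and 3.24 TB, which is far more than typical RAM, so most reads and relocations will touch disk.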

Note that '_id' is indexed.

I insert into MongoDB in the following way:

    def _save(self, mongo_col, my_id, page, html):
        doc = mongo_col.find_one({'_id': my_id})
        key = 'p%d' % page
        success = False
        if doc is None:
            doc = {'_id': my_id, key: html}
            try:
                mongo_col.save(doc, safe=True)
                success = True
            except:
                log.exception('Exception saving to mongodb')
        else:
            try:
                mongo_col.update({'_id': my_id}, {'$set': {key: html}})
                success = True
            except:
                log.exception('Exception updating mongodb')
        return success

As you can see, I first look up the collection to see whether a document with my_id exists.

If it does not exist, I create it and save it to MongoDB; otherwise I update it.
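The lookup-then-write pattern described above costs two round trips per page. One possible alternative, not something the post itself uses, is a single upsert: `$set` with `upsert=True` creates the document when it is missing and adds the key otherwise. A minimal sketch, assuming a newer PyMongo whose collections expose `update_one`; the function name `save_page` is hypothetical.

```python
import logging

log = logging.getLogger(__name__)


def save_page(mongo_col, my_id, page, html):
    """Create or update the document for my_id in one round trip.

    Assumption: mongo_col is a PyMongo collection (any object exposing
    update_one with the same signature works).
    """
    key = 'p%d' % page
    try:
        # upsert=True inserts {'_id': my_id, key: html} if no document matches.
        mongo_col.update_one({'_id': my_id}, {'$set': {key: html}}, upsert=True)
        return True
    except Exception:
        log.exception('Exception upserting to mongodb')
        return False
```

This only removes the extra `find_one`; a document that gains a new page still grows, so it does not by itself address relocation on disk.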

The problem is that, although this was very fast at first, at some point it became really slow.

I will give you some numbers:

When it was fast I was doing about 1,500,000 saves per 4 hours; afterwards it dropped to about 300,000 per 4 hours.

I suspect that the following affects the speed:

Note

When performing update operations that increase the document size beyond the allocated space for that document, the update operation relocates the document on disk and may reorder the document fields depending on the type of update.

As of these driver versions, all write operations will issue a getLastError command to confirm the result of the write operation: { getLastError: 1 } Refer to the documentation on write concern in the Write Operations document for more information.

The above is from: http://docs.mongodb.org/manual/applications/update/

I mention this because the collection could look like this:

{'_id': 1, 'p1': some_data}, ...., {'_id': 10000000, 'p2': some_data2}, ...{'_id': N, 'p1': sd3}

Now imagine that I call the _save method above as:

_save(my_collection, 1, 2, bin_compressed_html)

This should update the document with _id 1. But if what the MongoDB documentation says applies here, then because I am adding a key, the document no longer fits in its allocated space and has to be relocated.

The document may be moved to the end of the collection, which could be very far away on disk. Could this slow things down?

Or does the slowdown have to do with the size of the collection?

In any case, do you think it would be more efficient to change my structure to:

{'_id': ObjectId, 'mid': 1, 'p': 1, 'd': html}

where mid=my_id, p=page, d=compressed html

and modify _save method to do only inserts?

    def _save(self, mongo_col, my_id, page, html):
        doc = {'mid': my_id, 'p': page, 'd': html}
        success = False
        try:
            mongo_col.save(doc, safe=True)
            success = True
        except:
            log.exception('Exception saving to mongodb')
        return success

This way I avoid the update (and so the relocation on disk) and one lookup (find_one), but there would be 3x more documents and I would have two indexes (_id and mid).
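For the insert-only layout, reads by `mid` (and page) need a secondary index. A minimal sketch, assuming a newer PyMongo (`create_index`, `insert_one`); the helper names and the choice of a unique compound index on (`mid`, `p`) are illustrative assumptions, not part of the post.

```python
import logging

log = logging.getLogger(__name__)


def ensure_page_index(mongo_col):
    """Create a unique compound index so each (mid, p) pair is stored once.

    Assumption: mongo_col is a PyMongo collection; 1 means ascending order.
    """
    return mongo_col.create_index([('mid', 1), ('p', 1)], unique=True)


def save_page_insert_only(mongo_col, my_id, page, html):
    """Insert one document per page; documents never grow after insertion."""
    doc = {'mid': my_id, 'p': page, 'd': html}
    try:
        mongo_col.insert_one(doc)
        return True
    except Exception:
        # Includes the duplicate-key error raised when (mid, p) already exists.
        log.exception('Exception inserting to mongodb')
        return False
```

A compound index on (`mid`, `p`) can also serve queries on `mid` alone, so a separate single-field index on `mid` may not be needed.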

What do you suggest?

Answer1:

Document relocation could be an issue if you continue to add pages of HTML as new attributes. Would it really be a problem to move the pages to a new collection, where you could simply add them as one record each? Also, I don't really think MongoDB is a good fit for your use case; Redis, for example, would be much more efficient. Another thing you should take care of is having enough RAM for your _id index. Use db.mongocol.stats() to check the index size.
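The same index-size check can be done from Python. A minimal sketch, assuming a PyMongo `Database` object and a server that supports the `collStats` command (the command behind the shell's `stats()` helper); the function name `index_size_report` and the `ram_bytes` parameter are hypothetical.

```python
def index_size_report(db, collection_name, ram_bytes):
    """Compare a collection's total index size with available RAM.

    Assumptions: db is a PyMongo Database, the server supports the
    collStats command, and ram_bytes is supplied by the caller.
    """
    stats = db.command('collStats', collection_name)
    total = stats['totalIndexSize']  # bytes, summed over all indexes
    return {
        'total_index_bytes': total,
        'per_index_bytes': stats.get('indexSizes', {}),
        'fits_in_ram': total <= ram_bytes,
    }
```

For example, `index_size_report(db, 'mongocol', 16 * 1024 ** 3)` would report whether the indexes fit in 16 GiB, a figure chosen here purely for illustration.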

Answer2:

When new documents are inserted into MongoDB, a document can grow up to a certain point without being moved, because the database analyzes the incoming data and adds padding to each document. So to reduce document movement you can do two things:

1. Manually tweak the padding factor.

2. Preallocate space (attributes) for each document.
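The second option can be sketched as follows: insert each document once with placeholder values of roughly the final size, so that later `$set` updates overwrite space that is already reserved. This assumes a storage engine that updates documents in place, the 90KB average from the question, and three pages per document; `PAGE_BYTES`, `PAGES` and `preallocated_doc` are hypothetical names.

```python
PAGE_BYTES = 90 * 1000   # assumed average compressed page size (from the question)
PAGES = (1, 2, 3)        # assumed page numbers per document


def preallocated_doc(my_id, page_bytes=PAGE_BYTES, pages=PAGES):
    """Build a document whose page fields are zero-filled placeholders."""
    doc = {'_id': my_id}
    for page in pages:
        # bytes(n) is n zero bytes; PyMongo on Python 3 stores bytes as binary data.
        doc['p%d' % page] = bytes(page_bytes)
    return doc


doc = preallocated_doc(1)
print(sorted(doc))      # ['_id', 'p1', 'p2', 'p3']
print(len(doc['p1']))   # 90000
```

Pages larger than the placeholder would still make the document grow, so the placeholder size is a trade-off between wasted space and relocations. Readers of the collection also need a way to tell a placeholder from real data, since zero bytes are not valid zlib output.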

See the article about padding or the MongoDB docs for more information about the padding factor.

By the way, instead of using save for creating new documents, you should use .insert(), which will throw a duplicate key error if the _id already exists (.save() will overwrite your document).
