void ReplicationRecoveryImpl::_applyToEndOfOplog(OperationContext* opCtx,
                                                 Timestamp oplogApplicationStartPoint,
                                                 Timestamp topOfOplog) {
    invariant(!oplogApplicationStartPoint.isNull());
    invariant(!topOfOplog.isNull());

    // Check if we have any unapplied ops in our oplog. It is important that this is done after
    // deleting the ragged end of the oplog.
    if (oplogApplicationStartPoint == topOfOplog) {
        log() << "No oplog entries to apply for recovery. appliedThrough is at the top of the "
                 "oplog.";
        return;  // We've applied all the valid oplog we have.
    } else if (oplogApplicationStartPoint > topOfOplog) {
        severe() << "Applied op " << oplogApplicationStartPoint.toBSON()
                 << " not found. Top of oplog is " << topOfOplog.toBSON() << '.';
        fassertFailedNoTrace(40313);
    }

    log() << "Replaying stored operations from " << oplogApplicationStartPoint.toBSON()
          << " (exclusive) to " << topOfOplog.toBSON() << " (inclusive).";

    DBDirectClient db(opCtx);
    auto cursor = db.query(NamespaceString::kRsOplogNamespace.ns(),
                           QUERY("ts" << BSON("$gte" << oplogApplicationStartPoint)),
                           /*batchSize*/ 0,
                           /*skip*/ 0,
                           /*projection*/ nullptr,
                           QueryOption_OplogReplay);

    // Check that the first document matches our appliedThrough point then skip it since it's
    // already been applied.
    if (!cursor->more()) {
        // This should really be impossible because we check above that the top of the oplog is
        // strictly > appliedThrough. If this fails it represents a serious bug in either the
        // storage engine or query's implementation of OplogReplay.
        severe() << "Couldn't find any entries in the oplog >= "
                 << oplogApplicationStartPoint.toBSON() << " which should be impossible.";
        fassertFailedNoTrace(40293);
    }

    auto firstTimestampFound =
        fassertStatusOK(40291, OpTime::parseFromOplogEntry(cursor->nextSafe())).getTimestamp();
    if (firstTimestampFound != oplogApplicationStartPoint) {
        severe() << "Oplog entry at " << oplogApplicationStartPoint.toBSON()
                 << " is missing; actual entry found is " << firstTimestampFound.toBSON();
        fassertFailedNoTrace(40292);
    }

    // Apply remaining ops one at a time, but don't log them because they are already logged.
    UnreplicatedWritesBlock uwb(opCtx);

    while (cursor->more()) {
        auto entry = cursor->nextSafe();
        fassertStatusOK(40294,
                        SyncTail::syncApply(opCtx, entry, OplogApplication::Mode::kRecovering));
        _consistencyMarkers->setAppliedThrough(
            opCtx, fassertStatusOK(40295, OpTime::parseFromOplogEntry(entry)));
    }
}
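// A minimal, MongoDB-independent sketch of the replay-boundary logic above, using a
// std::vector as a stand-in oplog and plain integers as timestamps. All names here
// (replayToEndOfOplog, applyFn, SketchTimestamp) are hypothetical and exist only for
// illustration; this is not the server's API.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

using SketchTimestamp = std::uint64_t;

void replayToEndOfOplog(const std::vector<SketchTimestamp>& oplog,
                        SketchTimestamp startPoint,
                        SketchTimestamp topOfOplog,
                        const std::function<void(SketchTimestamp)>& applyFn) {
    if (startPoint == topOfOplog) {
        return;  // Nothing to replay: appliedThrough is already at the top of the oplog.
    }
    if (startPoint > topOfOplog) {
        // Corresponds to fassertFailedNoTrace(40313): the applied-through op is missing.
        throw std::runtime_error("applied-through op not found in oplog");
    }

    // Find the first entry >= startPoint (the $gte query above).
    auto it = std::find_if(oplog.begin(), oplog.end(), [&](SketchTimestamp ts) {
        return ts >= startPoint;
    });
    if (it == oplog.end()) {
        throw std::runtime_error("no oplog entries >= start point");  // cf. 40293
    }
    if (*it != startPoint) {
        throw std::runtime_error("oplog entry at start point is missing");  // cf. 40292
    }

    // Skip the start point (already applied) and apply everything after it. The real code
    // additionally advances its applied-through marker after each op; omitted here.
    for (++it; it != oplog.end(); ++it) {
        applyFn(*it);
    }
}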
BSONObj ClusterAggregate::aggRunCommand(DBClientBase* conn,
                                        const Namespaces& namespaces,
                                        BSONObj cmd,
                                        int queryOptions) {
    // Temporary hack. See comment on declaration for details.
    massert(17016,
            "should only be running an aggregate command here",
            str::equals(cmd.firstElementFieldName(), "aggregate"));

    auto cursor = conn->query(namespaces.executionNss.db() + ".$cmd",
                              cmd,
                              -1,    // nToReturn
                              0,     // nToSkip
                              NULL,  // fieldsToReturn
                              queryOptions);
    massert(17014,
            str::stream() << "aggregate command didn't return results on host: "
                          << conn->toString(),
            cursor && cursor->more());

    BSONObj result = cursor->nextSafe().getOwned();

    if (ErrorCodes::SendStaleConfig == getStatusFromCommandResult(result)) {
        throw RecvStaleConfigException("command failed because of stale config", result);
    }

    auto executorPool = grid.getExecutorPool();
    result = uassertStatusOK(storePossibleCursor(HostAndPort(cursor->originalHost()),
                                                 result,
                                                 namespaces.requestedNss,
                                                 executorPool->getArbitraryExecutor(),
                                                 grid.getCursorManager()));
    return result;
}
Status OplogReader::_compareRequiredOpTimeWithQueryResponse(const OpTime& requiredOpTime) {
    auto containsMinValid = more();
    if (!containsMinValid) {
        return Status(
            ErrorCodes::NoMatchingDocument,
            "remote oplog does not contain entry with optime matching our required optime");
    }
    auto doc = nextSafe();
    const auto opTime = fassertStatusOK(40351, OpTime::parseFromOplogEntry(doc));
    if (requiredOpTime != opTime) {
        return Status(ErrorCodes::BadValue,
                      str::stream() << "remote oplog contains entry with matching timestamp "
                                    << opTime.getTimestamp().toString() << " but optime "
                                    << opTime.toString()
                                    << " does not match our required optime");
    }
    if (requiredOpTime.getTerm() != opTime.getTerm()) {
        return Status(ErrorCodes::BadValue,
                      str::stream() << "remote oplog contains entry with term " << opTime.getTerm()
                                    << " that does not match the term in our required optime");
    }
    return Status::OK();
}
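// A minimal sketch of the comparison performed above, with a hypothetical SketchOpTime type
// standing in for repl::OpTime. The point is that both the timestamp and the term must match:
// an entry with the right timestamp but a different term indicates the remote oplog has
// diverged from ours. Names and the std::optional-based error reporting are assumptions made
// for illustration, not the server's types.
#include <cstdint>
#include <optional>
#include <string>

struct SketchOpTime {
    std::uint64_t timestamp;
    std::int64_t term;
    bool operator==(const SketchOpTime& other) const {
        return timestamp == other.timestamp && term == other.term;
    }
};

// Returns an error string, or std::nullopt if the fetched entry satisfies the requirement.
std::optional<std::string> checkRequiredOpTime(const std::optional<SketchOpTime>& fetched,
                                               const SketchOpTime& required) {
    if (!fetched) {
        return "remote oplog does not contain an entry with our required optime";
    }
    if (!(*fetched == required)) {
        if (fetched->timestamp == required.timestamp) {
            return "matching timestamp but different term: remote oplog has diverged";
        }
        return "entry's optime does not match our required optime";
    }
    return std::nullopt;
}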
void ReplicationCoordinatorExternalStateImpl::cleanUpLastApplyBatch(OperationContext* txn) {
    if (_storageInterface->getInitialSyncFlag(txn)) {
        return;  // Initial Sync will take over so no cleanup is needed.
    }

    const auto deleteFromPoint = _storageInterface->getOplogDeleteFromPoint(txn);
    const auto appliedThrough = _storageInterface->getAppliedThrough(txn);

    const bool needToDeleteEndOfOplog = !deleteFromPoint.isNull() &&
        // This version should never have a non-null deleteFromPoint with a null appliedThrough.
        // This scenario means that we downgraded after unclean shutdown, then the downgraded node
        // deleted the ragged end of our oplog, then did a clean shutdown.
        !appliedThrough.isNull() &&
        // Similarly we should never have an appliedThrough higher than the deleteFromPoint. This
        // means that the downgraded node deleted our ragged end then applied ahead of our
        // deleteFromPoint and then had an unclean shutdown before upgrading. We are ok with
        // applying these ops because older versions wrote to the oplog from a single thread so we
        // know they are in order.
        !(appliedThrough.getTimestamp() >= deleteFromPoint);
    if (needToDeleteEndOfOplog) {
        log() << "Removing unapplied entries starting at: " << deleteFromPoint;
        truncateOplogTo(txn, deleteFromPoint);
    }
    _storageInterface->setOplogDeleteFromPoint(txn, {});  // clear the deleteFromPoint

    if (appliedThrough.isNull()) {
        // No follow-up work to do.
        return;
    }

    // Check if we have any unapplied ops in our oplog. It is important that this is done after
    // deleting the ragged end of the oplog.
    const auto topOfOplog = fassertStatusOK(40290, loadLastOpTime(txn));
    if (appliedThrough == topOfOplog) {
        return;  // We've applied all the valid oplog we have.
    }

    log() << "Replaying stored operations from " << appliedThrough << " (exclusive) to "
          << topOfOplog << " (inclusive).";

    DBDirectClient db(txn);
    auto cursor = db.query(rsOplogName,
                           QUERY("ts" << BSON("$gte" << appliedThrough.getTimestamp())),
                           /*batchSize*/ 0,
                           /*skip*/ 0,
                           /*projection*/ nullptr,
                           QueryOption_OplogReplay);

    // Check that the first document matches our appliedThrough point then skip it since it's
    // already been applied.
    if (!cursor->more()) {
        // This should really be impossible because we check above that the top of the oplog is
        // strictly > appliedThrough. If this fails it represents a serious bug in either the
        // storage engine or query's implementation of OplogReplay.
        severe() << "Couldn't find any entries in the oplog >= " << appliedThrough
                 << " which should be impossible.";
        fassertFailedNoTrace(40293);
    }
    auto firstOpTimeFound =
        fassertStatusOK(40291, OpTime::parseFromOplogEntry(cursor->nextSafe()));
    if (firstOpTimeFound != appliedThrough) {
        severe() << "Oplog entry at " << appliedThrough << " is missing; actual entry found is "
                 << firstOpTimeFound;
        fassertFailedNoTrace(40292);
    }

    // Apply remaining ops one at a time, but don't log them because they are already logged.
    const bool wereWritesReplicated = txn->writesAreReplicated();
    ON_BLOCK_EXIT([&] { txn->setReplicatedWrites(wereWritesReplicated); });
    txn->setReplicatedWrites(false);

    while (cursor->more()) {
        auto entry = cursor->nextSafe();
        fassertStatusOK(40294, SyncTail::syncApply(txn, entry, true));
        _storageInterface->setAppliedThrough(
            txn, fassertStatusOK(40295, OpTime::parseFromOplogEntry(entry)));
    }
}
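// A minimal sketch, under assumed types, of the truncation decision encoded in
// needToDeleteEndOfOplog above: the ragged end is removed only when a deleteFromPoint was
// recorded, an appliedThrough point exists, and the appliedThrough timestamp is still strictly
// behind the deleteFromPoint. The SketchTimestamp alias and the use of 0 as the "null"
// timestamp are illustrative assumptions, not the server's representation.
#include <cstdint>

using SketchTimestamp = std::uint64_t;  // 0 stands in for a null timestamp here.

bool needToDeleteEndOfOplog(SketchTimestamp deleteFromPoint, SketchTimestamp appliedThrough) {
    if (deleteFromPoint == 0) {
        return false;  // No batch was in flight at shutdown; nothing to truncate.
    }
    if (appliedThrough == 0) {
        return false;  // Downgrade scenario: a downgraded node already cleaned the ragged end.
    }
    // If appliedThrough caught up to (or passed) the deleteFromPoint, the ops past it were
    // written in order by an older single-threaded writer and are safe to keep and apply.
    return appliedThrough < deleteFromPoint;
}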