Open Bug 1900726 Opened 20 days ago Updated 12 days ago

macOS and Linux need an update mutex

Categories

(Toolkit :: Application Update, defect)

People

(Reporter: bytesized, Unassigned)

Details

(Whiteboard: [fidedi-ope])

We have an update mutex that prevents multiple Firefox instances from updating simultaneously and interfering with one another. Interestingly, this mutex was actually added for a different reason and it was only incidental that it also fixed this issue. This probably explains why it was only ever implemented on Windows.

But this functionality is necessary on every platform. I was originally going to repurpose Bug 1278252 for this, but I decided to create a new bug instead because that one is not quite right. It correctly identifies that multi-user updating is a problem, but fails to recognize that multi-profile updating causes all the same problems (and sometimes more).

To be a bit more clear, this is what the mutex is needed for: The files in the update directory represent the update state. We should never make changes to the update state from multiple Firefox instances at once. One instance gets to take the mutex successfully and all the others should refuse to update and allow the instance with the mutex to run the update process uninterrupted.

I recall that I looked into adding an update mutex for macOS once in the past and had some trouble finding a mechanism that fulfilled all the requirements. IIRC, the requirements were:

  1. Concurrency safety, of course. Exactly one instance should be able to hold it at once, regardless of the timing of when each instance attempts to take it.
  2. It should be file-based since what it is protecting is file-based. Ideally we would place it on the same filesystem as what we are protecting.
  3. If the Firefox instance holding the mutex crashes, the mutex must be released. Ideally it should be released immediately.
  4. Broad compatibility. We don't want this to fail just because someone is using a non-standard filesystem.

Someone raised the idea of using sqlite to do this. I never looked into how reasonable that was. Another option that was raised is nsProfileLock. Reportedly there are some pretty rare issues with certain file systems, but I think this might still be our best option. I'm still looking into whether those issues are documented somewhere.

There is also a macOS-specific issue here. macOS is the one platform that still uses per-user update directories. Migrating to a per-installation update directory is something that we want to do, but I know from experience that it's difficult and I'd rather not make it a prerequisite of this work. But if we put the macOS update mutex in the update directory (where it is on Windows), it won't actually solve Bug 1278252, because instances run by different users would each take a different mutex. I'm not aware of any directory that Firefox already uses that is installation-specific. We might want to do something like /Library/Application Support/<bundle-id>/installHash/mutex? I'm not totally sure.

But once we have answered the relevant questions, I believe the work should be pretty straightforward: Implement createMutex for the remaining platforms, remove this check, and add some testing.


I'd like to quickly mention one related thing: nsIUpdateSyncManager. This is also known as the Multi Instance Lock (MIL) and is very commonly confused with the update mutex. The MIL does exist on all platforms, but it does not work the same way or solve the same problem.

Confusingly, it exists to address a closely related problem, which sometimes makes it appear to fix this one. When it detects other instances of Firefox running, it introduces very long delays into the update system. This mitigates Bug 1480452, and it also makes it look a bit like this bug is fixed. But those delays are only temporary. Eventually Firefox will stop showing the "Another instance is updating" message in the update UI and allow manual or automatic updates to proceed, potentially causing this bug.

The MIL works using GNU file locking. Paraphrased from here:
A write lock gives a process exclusive access. While a write lock is held, no other process can lock the file at all.
A read lock prohibits any other process from requesting a write lock. However, other processes can request read locks.

This allows the MIL to do this:

  1. Very early in Firefox launch, take a read lock
  2. If we want to know if other instances are running, query the lock to find out if we could take a write lock (but don't actually take one).
  3. If we could take the lock, we are the only Firefox instance. If we could not, there are other instances running.
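The three steps above can be sketched roughly as follows. This only illustrates the read-lock-plus-query pattern and is not the nsIUpdateSyncManager implementation: the function names are made up, and the `struct flock` packing assumes the Linux field layout (other platforms, macOS included, order the fields differently).

```python
import fcntl
import os
import struct

# Assumption: Linux layout of `struct flock`
# (l_type, l_whence, l_start, l_len, l_pid).
FLOCK_FORMAT = "hhqqi"


def take_read_lock(path):
    """Step 1: open the lock file and hold a shared (read) lock on it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
    return fd


def other_instances_running(fd):
    """Steps 2 and 3: ask whether a write lock could be taken, without taking it."""
    query = struct.pack(FLOCK_FORMAT, fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
    reply = fcntl.fcntl(fd, fcntl.F_GETLK, query)
    lock_type = struct.unpack(FLOCK_FORMAT, reply)[0]
    # F_UNLCK means no other process holds a conflicting lock.
    # Locks held by the calling process never count as conflicts.
    return lock_type != fcntl.F_UNLCK
```

Every instance that holds a read lock gets the same answer from the query, which is why this pattern can count instances but cannot elect one of them.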

But this does not help determine which instance should drive update.

An obvious followup question is "why not use GNU file write locks for this?"
It's possible that it is reasonable to do so. We would need to look into how well that fulfills the requirements above. I'm not sure off the top of my head.
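For the sake of discussion, here is a minimal sketch of what a write-lock-based mutex could look like. It is an assumption-laden illustration, not a proposal for the actual implementation: the function names and the `update.lock` file name are hypothetical, and it says nothing about how well this behaves on unusual filesystems (requirement 4).

```python
import fcntl
import os


def try_acquire_update_mutex(update_dir):
    """Return an fd holding the write lock, or None if another process holds it."""
    # Hypothetical file name; it sits next to the update state it protects.
    lock_path = os.path.join(update_dir, "update.lock")
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking: fail immediately instead of waiting for the holder.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (BlockingIOError, PermissionError):
        os.close(fd)
        return None
    return fd


def release_update_mutex(fd):
    """Closing the descriptor drops the lock."""
    os.close(fd)
```

The kernel drops the lock when the holding process exits, including on a crash, which is what requirement 3 asks for. One caveat: these locks belong to the process rather than to the file descriptor, so closing any descriptor for the lock file anywhere in the process releases the lock.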

Duplicate of this bug: 1278252

(In reply to Robin Steuber (they/them) [:bytesized] from comment #0)

> Someone raised the idea of using sqlite to do this. I never looked into how reasonable that was. Another option that was raised is nsProfileLock. Reportedly there are some pretty rare issues with certain file systems, but I think this might still be our best option. I'm still looking into whether those issues are documented somewhere.

To elaborate on what nsProfileLock does: despite its name, it allows taking an exclusive lock on any directory; it is not specific to the profile directory. It does so by creating a file in the directory and using one of a few strategies to take an exclusive lock on it. It is used in early startup to allow only one Firefox instance to be in early startup at once (controlling the remoting service and profile selection such that only one instance can select a given profile). It is later used to lock the profile folder itself. It has been around for literally decades and so is battle-tested.

On Windows it works by taking an OS-level file lock by opening a file for exclusive writing. On Linux and macOS it first attempts to use fcntl. If that fails because the filesystem doesn't support locking (generally networked filesystems), it creates a symlink instead, which can be done atomically. The symlink case is the only one that is not guaranteed to release the lock if the process crashes. There is still some protection: the target of the symlink is a special string that includes the pid of the process that took the lock. When another process attempts to take the lock, it checks whether a process with that pid exists; if not, it assumes the lock is stale and deletes it. This is not foolproof, since we don't verify that the process is a Firefox process. There are probably improvements that could be made, but there hasn't been enough need so far because so few users use networked filesystems for their home directories. There may be some filesystems that support neither fcntl nor symlinks; locking would fail in that case.
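To make the symlink fallback concrete, here is a rough sketch of the idea as described above. It is not the nsProfileLock code: the function names and the `pid:<n>` target format are invented for illustration, and like the real thing it does not check that the pid belongs to a Firefox process.

```python
import os


def pid_exists(pid):
    """Return True if some process with this pid exists (any process at all)."""
    try:
        os.kill(pid, 0)  # Signal 0 only checks for existence.
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True


def try_symlink_lock(lock_path):
    """Return True if we took the lock, False if a live holder already has it."""
    target = f"pid:{os.getpid()}"  # Hypothetical format, not nsProfileLock's.
    for _ in range(2):  # At most one retry after clearing a stale lock.
        try:
            os.symlink(target, lock_path)  # Atomic: fails if the link exists.
            return True
        except FileExistsError:
            pass
        try:
            holder = int(os.readlink(lock_path).partition(":")[2])
        except (OSError, ValueError):
            return False  # Lock vanished mid-check or is unparseable; give up.
        if pid_exists(holder):
            return False
        try:
            os.unlink(lock_path)  # Holder is gone, so treat the lock as stale.
        except FileNotFoundError:
            pass
    return False
```

The stale-lock cleanup in this sketch is itself racy: two processes could both decide the lock is stale, and the slower one could delete the lock the faster one just created. A real implementation would need to account for that.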
