Archive: Posts Tagged ‘Recovery’

A good friend…

No comments March 15th, 2011

In Star Wars, “R2-D2” was Luke Skywalker’s good friend. If you’re running a domain with FRS, D2 is your good friend. Even thought (2008) R2 (and DFSR) should be your buddy.

So when should you call your D2 buddy and give him a run?

You experience:

  • One of your DCs are in Journal Wrap
  • The local FRS jet database has become corrupt
  • Assertions in the FRS service
  • Missing FRS junction points
  • Missing FRS attributes/objects
  • Missing SYSVOL/NETLOGON share
  • Corrupt/missing NTFS journal
  • You are bored… (meaning the list is long)

Setting the backup/restore flag , a.k.a. “Burflags”, to D2, and you restart the NTFRS service things start moving.

HKLM\System\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup

The bad DC will move all its SYSVOL data, if it holds any, into the “NtFrs_PreExisting_See_EventLog” folder. The bad DC will compare all these files with the ones of an upstream partner. It will compare the file IDs and the MD5 checksum from the upstream partner with the local ones. If a match is found, it will copy this file from the Pre-Existing folder into the original location. If it don’t match, it will copy the file from its partner.

When the replication has finished (Event ID 13516 is logged), you can delete the content in the Pre-Existing folder to free up space.

Journal Wrap

6 comments June 6th, 2010

Environment:

– Windows 2000/2003/2008 domain controllers using FRS (not DFSR).
– More than one Domain Controller
– Atleast one DC with a healthy SYSVOL

Why do Journal Wraps occur?

Instan at the AD Troubleshooting blog made an excellent blog entry about:

What happens in a Journal Wrap?

You should give it a read to understand what is going on under the hood.

Symptoms that might occur:

  • Event ID 13568 is logged in the NtFrs event log
  • A generic Event ID 1058 may be logged
  • You make changes to a logon script but not all users got the change
  • Changing a GPO or creating a new GPO is not applied to all users or computers
  • Missing SYSVOL share
  • A RSoP or gpresult report that data or policy object is missing or corrupt

If you take a look at the 13568 event you’ll see that there is a “solution” to this problem:

Set the “Enable Journal Wrap Automatic Restore” registry parameter to 1

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NtFrs\Parameters”

Restart ntfrs service.

This is not a good solution for post-Server 2000 SP3.
I don’t know why Microsoft still have this “how-to-fix” in event 13568, but they say in KB 290762:

Important: Microsoft does not recommend that you use this registry setting, and it should not be used post-Windows 2000 SP3. Appropriate options to reduce journal wrap errors include…

Update: I had to ask around about this since it was nagging me:

The event was never changed because the product group didn’t want to pay for the localization cost, nor admit that this registry setting caused more problems than it fixed. It actually came down to ego – the developer of FRS was a real piece of work. So instead the public docs were updated to state not to use that autorecovery registry setting.


Instead you should go for the Burflags method. This will kick start your SYSVOL up and running. Most often a “non-authoritative” (D2) approach will fix you up.

The “D2” key can be set two places in registry:

Global re-initialization:

HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup

or

Replica set specific re-initialization:

HKLM\System\CurrentControlSet\Services\NtFrs\Parameters\Cumulative Replica Sets\GUID

If you’re using DFS replica sets that holds a large amount of data that is healthy, go for the “Replica set specific re-initialization”. If you set the Global Burflags, FRS will re-initialize all replica sets, including the DFS namespace the member holds. If they hold a large amount of data… that might take some time.

To find the GUID of SYSVOL, look for the “Replica Set Name” named “Domain System Volume (SYSVOL SHARE)” under the subkey “HKLM\..\..\Replica Sets”:

This screenshot have only one GUID since I don’t use DFS in my lab.

Change the value of Burflags to D2 (hex).
If you don’t uses DFS you could just set the Global Burflags to D2. It will not make any difference under what subkey you set it. This will re-initialize all replica sets the member holds (in this case the SYSVOL).

After you have set the Burflags key to D2, you have to restart the NTFRS service on the affected DC.

Overview of what happens:

1. The Burflags is set to 0
2. Event ID 13565 is logged. non-authoritative restore has started
3. The content of SYSVOL are moved to the pre-existing folder
4. Event ID 13520 is logged
5. The local FRS database is rebuilt
6. It re-join (vvjoin) the replica set
7.  The “bad DC” will compare all files (file ID and MD5 sum) it has in the Pre-existing folder with the files from an upstream partner.
8. If a match is found, it will copy the file from the Pre-Existing folder to the original location. If they don’t match, it will pull the file from the upstream partner.
9. Event ID 13553 is logged
10. FRS notifies (SysvolReady reg.key = 1) the Netlogon service that SYSVOL is ready and can be shared.
11. The Netlogon service will share SYSVOL and Netlogon.
12. Event ID 13516 is logged (finished)

 

When you have verified that SYSVOL is shared and in sync, you can delete the content in the Pre-Existing folder to free up space.


Authoritative restore (D4):

If your SYSVOL is all messed up on every DC’s, you might have to do an “authoritative restore” using both the D4 and D2 values.

By the way you should never, ever use the D4 flag on more than one DC as you will have a lot of collisions and morphed folders. The D4 flag should only be set like Microsoft says, as a last resort.

Quick overview:

1. Stop the NtFrs service on every DC
2. Set the D4 flag on one DC that will be authoritative for the replica set(s). The SYSVOL content will not be moved to the pre-existing folder on the authoritative member.
3. Set the D2 flag on the other DC’s (non-authoritative)
4. Start the NtFrs service on the “D4” DC.
5. Check that Event ID 13553 and 13516 is logged.
6. If step 5 is ok, start NtFrs on the “D2” DC’s.

For detailed steps, see “How to rebuild the SYSVOL tree and its content in a domain”


References
:

FRS event codes: http://support.microsoft.com/kb/308406

What happens in a Journal Wrap?
http://blogs.technet.com/instan/archive/2009/07/14/what-happens-in-a-journal-wrap.aspx

How to rebuild the SYSVOL tree and its content in a domain
http://support.microsoft.com/kb/315457

Using the BurFlags registry key to reinitialize File Replication Service replica sets
http://support.microsoft.com/kb/290762

Backing Up and Restoring an FRS-Replicated SYSVOL Folder
http://msdn.microsoft.com/en-us/library/cc507518(VS.85).aspx

FRS and the A-record and CNAME

No comments May 29th, 2010

Case:

‘DC01test’ has a modified object that should be replicated to its partner ‘DC4test’:

1. ‘DC01test’ queries AD for a configured replication partner (default defined by the KCC service)

2. ‘DC01test’ knows the name (‘DC4test’) of his replication partner, but needs to find the GUID of ‘DC4test’.

3. ‘DC01test’ compare all CNAME record in the “_msdcs zone” and finds the GUID that match the name ‘DC4test’

4. Next step ‘DC01test’ needs to find is the IP of ‘DC4test’ so it can connect to ‘DC4test’.

5. ‘DC01test’ sends a recursive DNS query to its primary configured DNS server asking for a CNAME (the alias of the GUID).

Query: guid._msdcs.spurs.local
DNS server respond with: ‘DC4test.spurs.local’

6. ‘DC01test’ ask his DNS for the A-record for ‘DC4test.spurs.local’
DNS server returns the IP: 10.1.88.50

7. ‘DC01test’ connects to ‘DC4test’ and flags that “I have a change you need to get from me”.

8. Since FRS is based on PULL (not push), ‘DC4test’ will pull the changes on the object from ‘DC01test’.

If the A-record or the CNAME is missing or not correct, this process will fail. As a result, the replication will fail.
A handy tool that will test that all records are registered on all authoritative DNS servers is “dnslint”. It will create a HTM-report and highlight errors/warnings.

ie. dnslint /ad /s 10.1.88.150 /v

If a CNAME is missing:

Ref:
DNSLint usage: http://support.microsoft.com/kb/321045
Troubleshooting with DNSLint: http://support.microsoft.com/kb/321046

USN-rollback

1 comment April 26th, 2010

1. You discover that the netlogon service is in a pause state
2. Event ID 2095 / 2103 is logged in the directory service event log
3. Inbound and outbound replication is disabled

You’re in a now in an USN-Rollback state!

(This situation will never happen in a single DC forest)

The most common reasons for this state is that an unsupported restore method has occurred.

Like:

– You clone a DC with ie. VMWare Converter
– You revert to a snapshot of a DC and start the DC in normal mode without setting the “Database restored from backup” registry key [1][2]

If your HW is set to do buffered disc writes and an unexpected power failure occur, may also put you in an USN-Rollback state

The way to get out of this state is to remove the bad DC out of your domain:

1. on the bad DC: dcpromo /forceremoval
2. on a healthy DC: Run a Metadata cleanup [3]
3. on a healthy DC: seize the FSMO if necessary [4]
4. rebuild the “bad” DC and promote it back

I’m living in a freezing country (bah!) and we use to say “pee in your pants to get warm (for about 5 seconds…)”.

I have seen some recommendation at some other blogs and forums that provides a “quick” solution to recover from a USN-Rollback state.

Here is a “pee in your pants” solution:

– Delete the “DSA Not Writable” registry key
– Enable replication with repadmin
– Reboot
– Fixed? Nah, not really!

Here is what Microsoft says about this in an extention to KB 875495 [5] that for some reason is hidden for the general public:

Deleting or manually changing the Dsa Not Writable registry entry value puts the rollback domain controller in a permanently unsupported state. Therefore, such changes are not supported. Specifically, modifying the value removes the quarantine behavior added by the USN rollback detection code. The Active Directory partitions on the rollback domain controller will be permanently inconsistent with direct and transitive replication partners in the same Active Directory forest.”

References:

[1] DC’s and VM’s – Avoiding the Do-over
[2] Backup and Restore Considerations for Virtualized Domain Controllers
[3] Metadata cleanup
[4] Seizing the FSMO’s
[5] KB 875495 – How to detect and recover from a USN rollback

Restore an OU

No comments March 11th, 2010

Assuming you have a 2003 DC SP1< and a good System state backup. Not older than your domains tombstone lifetime.

Start the DC in DSRM (F8 at boot)

fig.1

Start NTBackup

Restore Wizard > Next

Choose the System state backup file > Next > Advanced

Restore files to “Original Location”

Leave exsisting files > Next

fig.2

If you have only one DC in your domain. Tick the last checkbox (fig.2). If you have more than one, don’t tick it.

Press “Finish”

Do not restart the DC at this moment.

Mark the object as authoritative (meaning the object(s) will get replicated to other DC’s because it’s authoritative)

http://technet.microsoft.com/en-us/library/cc757068(WS.10).aspx

Open cmd >ntdsutil

> authoritative restore

> restore subtree destinguishedName

ie. An OU accidentally got deleted called “Reserves” holding all the Tottenhams reserves user objects. (What the heck. They aren’t good enough for the first team, but maybe someday they will so let’s get them back).

fig.3

> quit

Restart the DC in normal mode (SP1 or newer). The AD Replication will do the job to get the OU and the user objects replicated to the other DC’s in the domain.

If you have a Windows 2003 SP1 or newer DC, the ntdsutil will create two files if the restored user object have any back-links to group membership. If they do you have to restore the back-link aswell. But wait until all DC’s have got the users replicated.

Syntax: ldifde -i -f <ar*.ldf>

ie: ldifde -i -f ar_20100129-081113_links_spurs.local.ldf

If you have a Win2008 R2 domain and restore a user from the Recycle Bin you don’t have to worry about the back-links. The process will do it for you.