This page provides a description of nfsreplay internals and other developer information.

The NFS benchmarking project page is here: NFSBenchmarking

I can be reached at <shehjart AT gelato DOT NO SPAM unsw DOT edu GREEBLIES DOT au>

Replay overview

The sequence of operations for the nfsreplay tool chain is presented in the figure below.

For now, only NFS version 3 is supported by all the tools.

NFS Anonymizer

The anonymizer is a modified NFSv3 dissector from the Wireshark source code. It uses the Wireshark infrastructure to access and anonymize the fields in the different types of NFS requests. The fields anonymized for each type of NFS request are listed at AnonymizedFields.

The anonymizer uses the pcap format as the wrapper format and, within each pcap frame, uses an anonymizer-specific format defined in AnonymizedTraceFormat.

For more on the pcap format, see Libpcap File Format at Wireshark Wiki.
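As a rough illustration of the wrapper format, the sketch below reads the 24-byte global header of a classic pcap file and detects whether it was written on a machine of the opposite byte order. The struct and function names are made up for this example and are not taken from the anonymizer source. The anonymizer-specific payload inside each frame is defined in AnonymizedTraceFormat and is not covered here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Global header of a classic pcap file: 24 bytes, no padding.
 * Struct name is hypothetical; field layout follows the pcap file format. */
struct pcap_global_header {
    uint32_t magic;          /* 0xa1b2c3d4 in the writer's byte order */
    uint16_t version_major;
    uint16_t version_minor;
    int32_t  thiszone;       /* GMT to local time correction */
    uint32_t sigfigs;        /* accuracy of timestamps */
    uint32_t snaplen;        /* max captured bytes per packet */
    uint32_t network;        /* link-layer header type */
};

static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/*
 * Parse the global header from buf into *out.
 * Returns 0 if the file is in host byte order, 1 if it was byte-swapped
 * (fields in *out are converted to host order), -1 if it is not a pcap header.
 */
int pcap_parse_global_header(const unsigned char *buf, size_t len,
                             struct pcap_global_header *out)
{
    if (len < sizeof *out)
        return -1;
    memcpy(out, buf, sizeof *out);
    if (out->magic == 0xa1b2c3d4u)
        return 0;
    if (out->magic != 0xd4c3b2a1u)
        return -1;
    out->magic         = swap32(out->magic);
    out->version_major = swap16(out->version_major);
    out->version_minor = swap16(out->version_minor);
    out->thiszone      = (int32_t)swap32((uint32_t)out->thiszone);
    out->sigfigs       = swap32(out->sigfigs);
    out->snaplen       = swap32(out->snaplen);
    out->network       = swap32(out->network);
    return 1;
}
```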

tracedigester

tracedigester is used for performing various functions, of which the most important one is generation of Replay Dump files. These files contain synthesized or tracedigest'ed traces from the Anonymized Trace files. The Anonymized Trace files serve as the input for all operations performed by the tracedigester.

The Replay Dump format is described on a separate page, ReplayDumpFormat.

nfsreplay

nfsreplay finally replays the trace in Replay Dump files.

Apart from the rdump (Replay Dump) for a trace, nfsreplay also takes an Orphans file as input. The Orphans file contains information about orphaned filehandles found during tracedigestion. See OrphansFileFormat for more info.

rdumpinfo

A simple tool for inspecting rdump files generated by the tracedigester.

Overview

The figure below shows an overview of nfsreplay internals.

All the code needed by the tools above is organized into libnfsreplay. It's just a container for the following components:

The nfs_frames are the fundamental entities on which the tracedigester, nfsreplay and rdumpinfo work. Replay Dumps are basically portable serialized versions of these nfs_frames. Here portable implies that a Replay Dump can be replayed from a machine of a different endian-ness than the one on which it was generated using tracedigester.
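One common way to get this kind of portability is to write every multi-byte field in a fixed byte order, one byte at a time, instead of dumping structs from memory. The sketch below assumes a big-endian on-disk order purely for illustration; the byte order actually used by Replay Dumps is defined in ReplayDumpFormat and is not stated on this page. The helper names are hypothetical.

```c
#include <stdint.h>

/* Store v into buf[0..3] in big-endian order, independent of host endianness. */
void put_u32_be(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
}

/* Load a big-endian 32-bit value from buf[0..3] into host order. */
uint32_t get_u32_be(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}
```

Because the shifts operate on values rather than on memory layout, the same code produces and reads the same bytes on little-endian and big-endian machines.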

Besides libnfsreplay, tracedigester also needs two other parts:

nfs_frame

nfs_frames are defined in include/nfsframe.h. Each nfs_frame contains information about an individual NFS request or a reply.

Once an anonymized trace is run through the tracedigester to generate a Replay Dump, each frame is assigned a fixed frame id that is unique within that trace. This frame id is stored in the nf_id member and helps identify individual frames during replay mainly for error checking and progress reporting purposes.

Each frame also contains a general purpose index that is used differently by tracedigester and nfsreplay.

During a run of tracedigester, the replies are paired with their requests by storing a reference to the reply nfs_frame in the request nfs_frame's nf_reply member.
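The frame id assignment and request/reply pairing described above could look roughly like the following. Only the member names nf_id and nf_reply come from this page; the struct name, the member types, the xid and is_reply members, and the idea of matching on the RPC transaction id are assumptions made for this sketch, and the real definition lives in include/nfsframe.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the real nfs_frame. */
struct nfs_frame_sketch {
    uint64_t nf_id;                     /* unique within the trace */
    struct nfs_frame_sketch *nf_reply;  /* set on requests only */
    uint32_t xid;                       /* assumed: RPC transaction id */
    int is_reply;                       /* assumed: 0 = request, 1 = reply */
};

/* Give each frame a fixed id equal to its position in the trace. */
void assign_frame_ids(struct nfs_frame_sketch *frames, size_t n)
{
    for (size_t i = 0; i < n; i++)
        frames[i].nf_id = (uint64_t)i;
}

/*
 * For every reply, find the most recent earlier request with the same xid
 * that has no reply yet, and link the two. Quadratic for brevity; a real
 * implementation would more likely use a hash table keyed on xid.
 * Returns the number of request/reply pairs formed.
 */
size_t pair_replies(struct nfs_frame_sketch *frames, size_t n)
{
    size_t paired = 0;

    for (size_t i = 0; i < n; i++) {
        if (!frames[i].is_reply)
            continue;
        for (size_t j = i; j-- > 0; ) {
            if (!frames[j].is_reply &&
                frames[j].xid == frames[i].xid &&
                frames[j].nf_reply == NULL) {
                frames[j].nf_reply = &frames[i];
                paired++;
                break;
            }
        }
    }
    return paired;
}
```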

rdump_writer

rdump writer State

Operations List

Dependency Tree

rdump_reader

rdump_reader State

fshier

FS Hier State

Building FS Tree

Creating the FS Hierarchy

partition

tracedigester

Overview

Generating Replay Dumps

File System Hierarchy Creation

Trace Partitioning

nfsreplay

Overview

Replay State

Timescale Scheduler

Pipeline Scheduler

rdumpinfo

File Formats

Archive

Query: Maybe provide other partitionings as well: on file/directory destination perhaps? Otherwise there may be a synchronisation issue with several machines reading/writing the same file or modifying the same directory. (PeterChubb) Will do too.

nfsreplay

Here's the design of nfsreplay in its current state.

  1. trace_reader FIXME: The trace_reader is responsible for reading in the meta info and organising it into the Operations list and the Dependency list (described later). I don't want the trace player to be tasked with performing large disk I/O, so it's better to offload that work and perform it before the replayer runs. The trace_reader reads in the Ops info, builds the Operations list and Dependency list, and dumps the structures into a binary file, which is deserialized by the replayer to build its own data structures.

     The partitioned trace will be in the Ops info format.

  2. Operations List and Dependency List: The Operations list will be a linked list of operations in the order they appear in the Ops Info file.

     The Dependency list will be a doubly linked list that stores pointers to the nodes in the Operations list which depend on operations in the Operations list (see note 2 under Archived Notes).

     First, definitions:

    • trace_reply_status: Status of the response to NFS request in the captured trace.

    • replay_reply_status: Status of the response to NFS request during replay.

    The contents of the Operations and Dependency lists provide answers to the three questions below:

    • What's the next request/operation that I need to send?
    • What's the trace_reply_status of the current request? This needs request-reply matching, which is performed in the Ops Info collection phase. We need this info to determine whether the response in the replay reply is the same as the response in the trace reply and then decide whether this was a successful replay. Failures need special handling and are discussed later.
    • What are the subsequent operations/requests that are dependent on the current request? We need to determine the subsequent requests that depend on the success of a previous request on the same file or file handle. This structure will allow us to maintain a mapping between an operation and all the subsequent operations that depend on the success of this one.
    The two parts are:
    1. Operations list: The order in which requests are found in the Ops Info file.
    2. Dependency list: The list of Operations dependent on this operation.
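A minimal sketch of the two structures, assuming each operation node carries its own Dependency list, is given below. All type, member and function names are hypothetical; the design was still in flux when this was written, so treat this as one possible layout rather than the layout nfsreplay uses.

```c
#include <stddef.h>
#include <stdlib.h>

struct op;

/* Node of the doubly linked Dependency list: points at a dependent op. */
struct dep_node {
    struct op *dependent;
    struct dep_node *prev, *next;
};

/* Node of the Operations list, kept in Ops Info file order. */
struct op {
    int trace_reply_status;  /* status of the reply in the captured trace */
    struct op *next;         /* next request to send */
    struct dep_node *deps;   /* ops that depend on this one succeeding */
};

/* Record that `dependent` needs `op` to succeed. Returns 0, or -1 on ENOMEM. */
int op_add_dependent(struct op *op, struct op *dependent)
{
    struct dep_node *n = malloc(sizeof *n);

    if (n == NULL)
        return -1;
    n->dependent = dependent;
    n->prev = NULL;
    n->next = op->deps;          /* push at the head of the list */
    if (op->deps != NULL)
        op->deps->prev = n;
    op->deps = n;
    return 0;
}

/* Number of operations that depend on `op`. */
size_t op_count_dependents(const struct op *op)
{
    size_t count = 0;

    for (const struct dep_node *n = op->deps; n != NULL; n = n->next)
        count++;
    return count;
}

/* Release the Dependency list of `op` (the ops themselves are not freed). */
void op_free_dependents(struct op *op)
{
    struct dep_node *n = op->deps;

    while (n != NULL) {
        struct dep_node *next = n->next;
        free(n);
        n = next;
    }
    op->deps = NULL;
}
```

With this layout the three questions map onto `next`, `trace_reply_status` and `deps` respectively.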

Replay Issues

These are some of the details that need to be handled during replay.

Handling failed requests

The question here is, how to handle a request that was successful in the trace but failed during replay.

First, I expect that most requests which succeeded in the trace but fail during replay do so because of differences or inconsistencies in the way the file system was re-created using the information from the trace.

It's not a problem if the failed request is one that is classified as idempotent. A bigger cause for concern is non-idempotent requests, which modify the file system in a way that subsequent requests depend on. For example, if a CREATE request fails, we do not have the file handle required by any further operations on that file. LOOKUP requests lead to the same problem, even though LOOKUP is an idempotent operation.

Handling requests that fail in replay

This is the main task of the response processor. It checks the status of the reply and, for certain operations, ensures that the replay reply is a success. If it is not, the response processor disables the subsequent operations that depend on this particular operation's success. This is done through the Operations dependency tree: a flag is set in the node of every request present in the failed operation's dependency list, signifying to the scheduler that it should skip that request and move on to scheduling the next active request.
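The flag-setting step might be sketched as follows. The names replay_op, disabled, process_reply and next_active are invented for this example, and the dependency list is shown as a plain array to keep it short. The only outside fact relied on is that NFS3_OK is status 0 in NFSv3.

```c
#include <stddef.h>

#define NFS3_OK 0  /* NFSv3 success status */

/* Hypothetical operation node; not the structure used by nfsreplay. */
struct replay_op {
    int trace_reply_status;   /* status seen in the captured trace */
    int disabled;             /* set when an op this one depends on failed */
    struct replay_op *next;   /* Operations list order */
    struct replay_op **deps;  /* ops that depend on this op succeeding */
    size_t ndeps;
};

/*
 * Response processor step: if the op succeeded in the trace but failed in
 * replay, flag every op in its dependency list. Returns 1 if dependents
 * were disabled, 0 otherwise.
 */
int process_reply(struct replay_op *op, int replay_reply_status)
{
    if (op->trace_reply_status != NFS3_OK || replay_reply_status == NFS3_OK)
        return 0;
    for (size_t i = 0; i < op->ndeps; i++)
        op->deps[i]->disabled = 1;
    return 1;
}

/* Scheduler step: next op after `op` that has not been disabled, or NULL. */
struct replay_op *next_active(struct replay_op *op)
{
    for (op = op->next; op != NULL && op->disabled; op = op->next)
        ;
    return op;
}
```

Note that next_active still walks past disabled nodes one by one, which is the list traversal the Archived Notes mention as an open problem.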

Mounting

The replayer will have to implement the mount protocol to explicitly mount the file system being accessed, before the replay can begin.

Scaling

There can be two types of scaling as explained in the TBBT paper.

Temporal scaling is simple. Spatial scaling can be performed only if there are enough independent operations in the Operations list to be carried out in parallel.
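As an illustration of why temporal scaling is the simple case, the sketch below computes when a request should be sent relative to the start of the replay by dividing its offset in the trace by a speed-up factor. The function name, the microsecond unit and the use of a floating-point factor are assumptions made for this example only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offset (in microseconds from the start of replay) at which a request
 * captured at trace_ts_us should be sent. speedup > 1 compresses the
 * trace in time, speedup < 1 stretches it.
 */
uint64_t scaled_offset_us(uint64_t trace_ts_us, uint64_t first_ts_us,
                          double speedup)
{
    assert(speedup > 0.0);
    assert(trace_ts_us >= first_ts_us);
    return (uint64_t)((double)(trace_ts_us - first_ts_us) / speedup);
}
```

For example, a request captured 2 seconds into the trace is sent 1 second into the replay with a speed-up of 2.0, and 4 seconds into the replay with a speed-up of 0.5.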

Asynchronous and Non-blocking RPC

The Sun RPC library that's part of the glibc package only allows blocking-wait RPC. Since we need the replayer to be highly scalable, this library is not ideal. An ideal library would provide non-blocking calls to sending functions and asynchronous notification of replies from the server. I've developed such a library as an extension of the Sun RPC library. The Sun RPC library design allows writing different handlers for different types of transport protocols. The built-in types are the basic ones: clnt_udp for UDP and clnt_tcp for TCP. I've written a new non-blocking and asynchronous extension called clnt_tcp_nb. Its asynchronicity is due to the use of SIGIO for event notification on the TCP socket. For more info, see the Asynchronous RPC project page, AsyncRPC.
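The general mechanism of SIGIO-driven, non-blocking socket I/O can be sketched with standard POSIX/Linux calls as below. This is not code from clnt_tcp_nb, whose internals are described on the AsyncRPC page; the names io_ready, enable_sigio and drain_ready are hypothetical.

```c
#define _GNU_SOURCE          /* for O_ASYNC on glibc */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set by the signal handler; polled by the main loop. */
static volatile sig_atomic_t io_ready;

static void on_sigio(int signo)
{
    (void)signo;
    io_ready = 1;
}

/*
 * Ask the kernel to send SIGIO to this process when fd becomes readable,
 * and make fd non-blocking. Returns 0 on success, -1 on error.
 */
int enable_sigio(int fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;
    if (fcntl(fd, F_SETOWN, getpid()) < 0)   /* deliver SIGIO to us */
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Called from the main loop once io_ready is set. Reads what has arrived
 * without blocking; returns -1 with errno EAGAIN if nothing is pending.
 */
ssize_t drain_ready(int fd, void *buf, size_t len)
{
    io_ready = 0;
    return recv(fd, buf, len, 0);
}
```

The handler only sets a flag, so all actual reply processing stays outside signal context.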

Client side NFS library

I've written a library for client NFS operations from user space, as opposed to the Linux kernel implementation that sits underneath the VFS layer. It's available here: libnfsclient

TODO

Archived Notes

  • 1 Why? Why not do all the trace_reader work offline before starting the replay? Stick the result into a file and mmap it. Then you need no locking, and your replay machine is more lightly loaded. Done. Will change the figures soon.

  • 2 Why not do a topological sort and not bother with a list? The operations are already sorted on the time of capture, so I don't need to perform any re-ordering or sorting. The need for a Dependency list arises only because I need to ensure that the scheduler is as efficient as possible. For example, in case an op fails, we don't want the scheduler to traverse the list of operations to find an op that doesn't depend on the failed op. We also need the Dependency list so that we can determine the ops that can be pipelined in case of spatial scaling; at least that's the intention. Of course, we can use an array instead of a list. The whole concept of the Ops and Dep lists is not clear yet, since I am running into some cases where it fails to prevent a list traversal in the scheduler. Working on it.

IA64wiki: nfsreplayInternals (last edited 2008-01-01 05:09:41 by ShehjarTikoo)
