This page provides a description of nfsreplay internals and other developer information.
The NFS benchmarking project page is here: NFSBenchmarking
I can be reached at <shehjart AT gelato DOT NO SPAM unsw DOT edu GREEBLIES DOT au>
This page is not complete yet. Contact ShehjarTikoo for an ETA. In the meantime, please see nfsreplayTR.
Replay overview
The sequence of operations for the nfsreplay tool chain is presented in the figure below.
For now, only NFS version 3 is supported by all the tools.
NFS Anonymizer
The anonymizer is a modified NFSv3 dissector from Wireshark source code. It uses Wireshark infrastructure to access and anonymize the fields in the different types of NFS requests. The various fields anonymized for different types of NFS requests are classified at AnonymizedFields.
The anonymizer uses the pcap format as the wrapper format and, within each pcap frame, uses an anonymizer-specific format defined in AnonymizedTraceFormat.
For more on the pcap format, see Libpcap File Format at Wireshark Wiki.
tracedigester
tracedigester performs several functions, the most important being the generation of Replay Dump files. These files contain synthesized, or tracedigest'ed, traces derived from the Anonymized Trace files. The Anonymized Trace files serve as the input for all operations performed by the tracedigester.
The Replay Dump format is described on a separate page, ReplayDumpFormat.
nfsreplay
nfsreplay is the final stage: it replays the trace stored in Replay Dump files.
Apart from the rdump (Replay Dump) for a trace, nfsreplay also takes an Orphans file as input. The Orphans file contains information about orphaned filehandles found during tracedigestion. See OrphansFileFormat for more info.
rdumpinfo
A simple tool for inspecting rdump files generated by the tracedigester.
Overview
The figure below shows an overview of nfsreplay internals.
All the code needed by the tools above is organized into libnfsreplay. It's just a container for the following components:
libnfsclient: The user space NFS version 3 client operations library. This also contains the AsyncRPC library.
nfs_frames: nfs_frames are containers for individual NFS requests and their replies, if any. This component also contains various operations on these nfs_frames.
rdump reader/writer: The library used by tracedigester to write Replay Dump files and by nfsreplay and rdumpinfo to read Replay Dump files.
nfsstat: The stats collection code and the interface to dump these into a file or terminal.
Replay logic: Replay logic code uses the previous components to build nfs_frames from Replay Dumps, schedule the nfs_frames and replay them using libnfsclient.
The nfs_frames are the fundamental entities on which the tracedigester, nfsreplay and rdumpinfo work. Replay Dumps are basically portable serialized versions of these nfs_frames. Here portable implies that a Replay Dump can be replayed from a machine of a different endian-ness than the one on which it was generated using tracedigester.
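As a minimal sketch of what portability means here, the snippet below serializes a few frame fields with an explicit byte order, so the bytes read back identically on little-endian and big-endian hosts. The field layout (frame id, XID, procedure number) and the choice of big-endian are assumptions made for illustration only; the real layout is the one defined in ReplayDumpFormat.

```python
import struct

# Hypothetical record layout, for illustration only: frame id, RPC XID and
# NFS procedure number, each a 32-bit unsigned integer. The ">" prefix pins
# the byte order to big-endian regardless of the host CPU.
FRAME_HEADER = struct.Struct(">III")

def pack_frame_header(frame_id, xid, proc):
    """Serialize the header fields in a host-independent byte order."""
    return FRAME_HEADER.pack(frame_id, xid, proc)

def unpack_frame_header(data):
    """Recover the header fields on any host, whatever its native byte order."""
    return FRAME_HEADER.unpack(data[:FRAME_HEADER.size])

raw = pack_frame_header(7, 0x1A2B3C4D, 3)
```

Writing the byte order into the format string, rather than relying on the native one, is what lets a dump produced on one machine be replayed from another.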
Besides libnfsreplay, tracedigester also needs two other modules:
fshier: The module that builds the FS tree from a given trace and contains functionality to create a replica of the FS hierarchy given this FS tree.
partition: The module that allows creating source-address based partitions. It also generates separate Replay Dumps for each partition so that each can be replayed independently.
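The idea behind source-address partitioning can be sketched roughly as follows. The representation of a trace as (source address, frame) pairs and the function name partition_by_source are hypothetical, chosen only to keep the example short; the actual partition module works on nfs_frames and writes one Replay Dump per partition.

```python
from collections import defaultdict

def partition_by_source(frames):
    """Group (source_address, frame) pairs by source, preserving trace order."""
    partitions = defaultdict(list)
    for src, frame in frames:
        partitions[src].append(frame)
    return dict(partitions)

# Placeholder trace: strings stand in for real frames.
trace = [("10.0.0.1", "LOOKUP a"),
         ("10.0.0.2", "READ b"),
         ("10.0.0.1", "WRITE a")]
parts = partition_by_source(trace)
```

Each value in parts keeps the frames of one client in their original order, which is what allows every partition to be replayed independently.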
nfs_frame
nfs_frames are defined in include/nfsframe.h. Each nfs_frame contains information about an individual NFS request or reply.
Once an anonymized trace is run through the tracedigester to generate a Replay Dump, each frame is assigned a fixed frame id that is unique within that trace. This frame id is stored in the nf_id member and helps identify individual frames during replay, mainly for error checking and progress reporting.
Each frame also contains a general purpose index that is used differently by tracedigester and nfsreplay.
During a run of tracedigester, the replies are paired with their requests by storing a reference to the reply nfs_frame in the request nfs_frame's nf_reply member.
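A rough sketch of the pairing step is shown below. Only the member names nf_id and nf_reply come from the description above; the NfsFrame class, the is_reply flag, and matching on the RPC XID alone are assumptions for illustration, not the definitions in include/nfsframe.h.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NfsFrame:
    nf_id: int                              # frame id, unique within the trace
    xid: int                                # RPC XID (assumed matching key)
    is_reply: bool
    nf_reply: Optional["NfsFrame"] = None   # set on requests only

def pair_replies(frames):
    """Attach each reply to the earlier request with the same XID.

    Returns the requests in trace order; a request whose reply never
    appears keeps nf_reply set to None.
    """
    pending = {}      # XID -> request still waiting for its reply
    requests = []
    for frame in frames:
        if frame.is_reply:
            request = pending.pop(frame.xid, None)
            if request is not None:
                request.nf_reply = frame
        else:
            pending[frame.xid] = frame
            requests.append(frame)
    return requests

trace = [NfsFrame(0, 100, False), NfsFrame(1, 101, False),
         NfsFrame(2, 100, True)]
paired = pair_replies(trace)
```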
rdump_writer
rdump writer State
Operations List
Dependency Tree
rdump_reader
rdump_reader State
fshier
FS Hier State
Building FS Tree
Creating the FS Hierarchy
partition
tracedigester
Overview
Generating Replay Dumps
File System Hierarchy Creation
Trace Partitioning
nfsreplay
Overview
Replay State
Timescale Scheduler
Pipeline Scheduler
rdumpinfo
File Formats
Archive
1. Meta info, now known as Ops Info. We need to extract the relevant meta info from the anonymized traces so that the trace parsing and storage overhead inside the trace player can be minimized. We can also perform the pairing (matching requests to replies) in this phase. This further reduces the storage space needed, because we are only interested in the status of the response, not its contents. Performing the pairing here also reduces the number of passes needed during trace replay to get the reply status for a request.
- Ops info contents
- Send time
- Source and Destination addresses
- RPC Message XID
- NFS Op type
- NFS Op Args
- NFS Response Status
- Ops Info file format
- 8 bytes - Send time
- 4 bytes - Source IP address
- 4 bytes - Destination IP address
- 4 bytes - RPC Xid
- 4 bytes - NFS Version, currently, we support only NFSv3
- 4 bytes - NFS Procedure number, use the RFC 1813 numbers
- Variable - Op Args, size depends on the NFS procedure being called. See RFC1813 for exact size. We follow the same format but with a few exceptions.
- 4 bytes - NFS Response Status, response status from the anonymized trace.
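The record layout listed above can be sketched as a pack/unpack pair. The field sizes and their order follow the list; everything else is an assumption labelled in the comments: network byte order, the send time encoded as a 64-bit microsecond count, and the caller supplying the length of the Op Args (which in practice depends on the procedure number).

```python
import socket
import struct

# Fixed-size fields that precede the variable-length Op Args:
# send time (8), source IP (4), destination IP (4), XID (4),
# NFS version (4), NFS procedure number (4).
# Assumptions: network byte order, send time as a 64-bit microsecond count.
HEADER = struct.Struct(">Q4s4sIII")
STATUS = struct.Struct(">I")

def pack_op(send_time_us, src, dst, xid, proc, args, status, version=3):
    """Build one Ops Info record; args is the already-encoded Op Args."""
    header = HEADER.pack(send_time_us, socket.inet_aton(src),
                         socket.inet_aton(dst), xid, version, proc)
    return header + args + STATUS.pack(status)

def unpack_op(record, args_len):
    """Parse one record; args_len must be derived from the procedure number."""
    t, src, dst, xid, version, proc = HEADER.unpack_from(record, 0)
    args = record[HEADER.size:HEADER.size + args_len]
    (status,) = STATUS.unpack_from(record, HEADER.size + args_len)
    return (t, socket.inet_ntoa(src), socket.inet_ntoa(dst),
            xid, version, proc, args, status)

# Procedure 3 is LOOKUP in RFC 1813; the args bytes are a placeholder.
rec = pack_op(1_000_000, "10.0.0.1", "10.0.0.2", 42, 3, b"\x00" * 8, 0)
```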
Trace partitioning: The original anonymized trace will be partitioned on source address so that multiple client hosts can replay different subsets of the trace dump. We can also provide partitioning on a per-file/dir handle basis.
Query: Maybe provide other partitionings as well, on file/directory destination perhaps? Otherwise there may be a synchronisation issue with several machines reading/writing the same file or modifying the same directory. (PeterChubb) Will do too.
nfsreplay
Here's the design of nfsreplay in its current state.
trace_reader (FIXME): The trace_reader is responsible for reading in the meta info and organising it into the Operations list and the Dependency list (described later). I don't want the trace player to be tasked with performing large disk IO, so it's better to offload that work and perform it before the replayer runs. The trace_reader reads in the Ops Info, builds the Ops list and Dependency list, and dumps the structures into a binary file, which is deserialized by the replayer to build its own data structures.
1. The partitioned trace will be in the Ops info format.
Operations List and Dependency List
The Operations list will be a linked list of operations in the order they appear in the Ops Info file.
The Dependency list will be a doubly linked list that stores pointers to those nodes in the Operations list that depend on other operations in the Operations list (see note 2 under Archived Notes). First, definitions:
trace_reply_status: Status of the response to an NFS request in the captured trace.
replay_reply_status: Status of the response to an NFS request during replay.
The contents of the Operations and Dependency lists provide answers to the three questions below:
- What's the next request/operation that I need to send?
- What's the trace_reply_status of the current request? This needs request-reply matching, which is performed in the Ops Info collection phase. We need this info to determine whether the response received during replay is the same as the response in the trace, and then to decide whether this was a successful replay. Failures need special handling and are discussed later.
- What are the subsequent operations/requests that are dependent on the current request? We need to determine the subsequent requests that depend on the success of a previous request on the same file or file handle. This structure will allow us to maintain a mapping between an operation and all the subsequent operations that depend on the success of this one.
- Operations list: The order in which requests are found in the Ops Info file.
- Dependency list: The list of Operations dependent on this operation.
- The Ops list above consists of the operations as read in from the Ops Info file. There are 6 operations, with Ops 2, 3, 4 dependent on Op1 and Op6 dependent on Op5. The Dependency list of Op1 consists of pointers to Ops 2, 3, 4. The structure of the Dependency list is such that the last element is a pointer to the next independent operation in the Operations list. This helps in locating the next operation if the Op1 response fails, in which case none of the ops in Op1's Dependency list can be carried out. To locate the next operation that can be scheduled, we can skip over to the next independent op (by extracting the last element of a Dependency list) instead of traversing the Operations list. Op6 is the last op found in the Ops Info file, so the last element in the Dependency list for Op5 is NULL.
- See hardcopy notes about the special case where dependent ops also have dependent and independent ops.
- Each element of the list is a structure consisting of:
- Operation info, info about the NFS request and its args
- Dependency list, of subsequent operations that depend on this op.
- trace_reply_status, as above
- replay_reply_status, as above
- node_status, Used by the scheduler to determine whether to replay this or not. Takes two values:
- ACTIVE, stays in memory
- FREE, will be freed soon
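The six-operation example above can be reproduced with a small sketch. The Op class and the depends_on field are hypothetical and only illustrate the skip-pointer idea; the sketch handles a single level of dependency and does not cover the special case where dependent ops have dependents of their own.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    depends_on: object = None                  # name of the op this one depends on, if any
    deps: list = field(default_factory=list)   # Dependency list

def build_dependency_lists(ops):
    """Fill the Dependency list of every independent op.

    Dependent ops come first; the last element is the next independent op
    in the Operations list, or None when there is none (NULL in the text).
    """
    by_name = {op.name: op for op in ops}
    independent = [op for op in ops if op.depends_on is None]
    for op in ops:
        if op.depends_on is not None:
            by_name[op.depends_on].deps.append(op)
    for current, following in zip(independent, independent[1:] + [None]):
        current.deps.append(following)
    return ops

ops = build_dependency_lists([
    Op("Op1"), Op("Op2", "Op1"), Op("Op3", "Op1"), Op("Op4", "Op1"),
    Op("Op5"), Op("Op6", "Op5"),
])

def names(dep_list):
    """Readable view of a Dependency list."""
    return [d.name if d is not None else None for d in dep_list]
```

If the reply to Op1 fails, the scheduler in this sketch would read ops[0].deps[-1] to jump straight to Op5 without walking over Ops 2 to 4.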
Scheduler and Scaler
- The Scheduler decides the time at which a request will be sent out. A sub-component is the scaler, which interacts with the scheduler to time the requests in a way that makes the NFS server operate on multiple directory hierarchies in parallel. It handles the spatial scaling functionality, discussed later. The Scheduler maintains a separate list of operations, called the Replay list, which determines the actual order in which the Sender module transmits requests. The list consists of pointers to the nodes in the Operations list.
Stats collector, collects basic information about the requests being sent and the corresponding replies. (FIXME: We don't have a global view of the trace, since we replay a partitioned trace, so is there any benefit in collecting a distribution or other figures about the partitioned trace?)
Send, the module responsible for packing/encapsulating the requests in an RPC and NFS message.
Receive, the module for de-encapsulating replies and mainly extracting the response status.
Response processor, Performs various functions depending on the status of the reply received from the NFS server. It special-cases the following types of operations since they all lead to increase or decrease in the state being maintained and the future schedule.
- CREATE, MKDIR, MKNOD, LINK, SYMLINK, RENAME, READDIRPLUS, READDIR
- LOOKUP
trace_free, the thread that is handed the nodes/elements of the Operations and Dependency lists so that it can free the memory allocated to them. We need a separate thread because the freeing operation requires traversing the lists and is not something that the replayer should spend its time on.
Replay State, consists of the following data
- Anonymized FH to Replay FH map
- file/dir name to FH map
- FH to Xid map, maps a file handle to the RPC xid in which it was first seen. It's used for determining dependency between operations. More later.
- Xid to Outstanding NFS ops map, maps the Xid to NFS requests, for which no response has been received yet.
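The four maps can be pictured as plain dictionaries, as in the sketch below. The ReplayState class, its method names, and the use of strings for file handles are hypothetical; the sketch only shows how the maps would be updated when a request is sent and when a reply that yields a file handle (for example CREATE or LOOKUP) comes back.

```python
class ReplayState:
    def __init__(self):
        self.fh_map = {}        # anonymized FH -> FH returned during replay
        self.name_to_fh = {}    # file/dir name -> replay FH
        self.fh_to_xid = {}     # anonymized FH -> XID in which it was first seen
        self.outstanding = {}   # XID -> request still awaiting a reply

    def sent(self, xid, request):
        """Record a request that has been transmitted."""
        self.outstanding[xid] = request

    def reply(self, xid, trace_fh=None, replay_fh=None, name=None):
        """Retire the request; for FH-producing ops, record the mappings."""
        request = self.outstanding.pop(xid)
        if trace_fh is not None:
            self.fh_map[trace_fh] = replay_fh
            self.fh_to_xid.setdefault(trace_fh, xid)
            if name is not None:
                self.name_to_fh[name] = replay_fh
        return request

    def translate(self, trace_fh):
        """Replay FH to use in place of a FH from the trace, or None if unknown."""
        return self.fh_map.get(trace_fh)

state = ReplayState()
state.sent(7, "CREATE foo")
state.reply(7, trace_fh="anon-1", replay_fh="srv-9", name="foo")
```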
Replay Issues
These are some of the details that need to be handled during replay.
Handling failed requests
The question here is, how to handle a request that was successful in the trace but failed during replay.
Firstly, to a large extent, I expect that the requests which have been successful in the trace but not during replay, generally fail due to differences/inconsistencies in the way the file system was re-created using the information from the trace.
It's not a problem if the failed request is one that is classified as idempotent. A bigger cause of concern is non-idempotent requests, which modify the file system in such a way that subsequent requests depend on the operation being successful. For example, if a CREATE request fails, we do not have the file handle required by any further operations on that file. Similarly, LOOKUP requests also lead to such problems, even though LOOKUP is an idempotent operation.
Handling requests that fail in replay
This is the main task of the response processor. It checks the status of the reply and for certain operations, ensures that the replay reply is a success. If not, it disables the subsequent operations that depend on this particular operation's success. This is performed by maintaining the Operations dependency tree, which sets a flag in the operation structure's node signifying to the scheduler that it should move on to scheduling the next active request. This flag is set for all the requests that are present in the failed operation's dependency list.
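A minimal sketch of this check is given below. The Node class and the function names are hypothetical, and reusing the FREE value of node_status as the "do not replay" flag is an assumption based on the node_status description earlier on this page; the sketch also walks dependents of dependents, which the design above leaves open.

```python
ACTIVE, FREE = "ACTIVE", "FREE"

class Node:
    def __init__(self, name):
        self.name = name
        self.node_status = ACTIVE
        self.deps = []              # ops that depend on this one

def handle_reply(node, trace_reply_status, replay_reply_status):
    """Return True if replay matched the trace; otherwise disable dependents."""
    if replay_reply_status == trace_reply_status:
        return True
    stack = list(node.deps)
    while stack:                    # walk dependents of dependents as well
        dep = stack.pop()
        dep.node_status = FREE
        stack.extend(dep.deps)
    return False

def next_active(ops, start):
    """Index of the first op at or after start that the scheduler may replay."""
    for i in range(start, len(ops)):
        if ops[i].node_status == ACTIVE:
            return i
    return None

create, write, read, getattr_op = (Node(n) for n in
                                   ("CREATE", "WRITE", "READ", "GETATTR"))
create.deps = [write]
write.deps = [read]
ops = [create, write, read, getattr_op]
```

With this setup, a CREATE that succeeded in the trace (status 0) but fails in replay causes WRITE and READ to be skipped, while the unrelated GETATTR is still scheduled.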
Mounting
The replayer will have to implement the mount protocol to explicitly mount the file system being accessed, before the replay can begin.
Scaling
There can be two types of scaling as explained in the TBBT paper.
Temporal, increase the rate of operations
Spatial, increase the number of files/dirs being operated upon in parallel.
Temporal scaling is simple; spatial scaling can be performed only if there are enough independent operations in the Operations list to be carried out in parallel.
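Temporal scaling amounts to shrinking or stretching the gaps between send times. A minimal sketch, assuming send times are plain numbers in seconds and using a hypothetical speedup parameter:

```python
def scale_send_times(send_times, speedup):
    """Compress (speedup > 1) or stretch (speedup < 1) the gaps between sends.

    The first request keeps its original send time; every later request is
    moved so that its offset from the first one is divided by speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not send_times:
        return []
    start = send_times[0]
    return [start + (t - start) / speedup for t in send_times]

scaled = scale_send_times([10.0, 12.0, 16.0], 2.0)
```

Doubling the rate halves each offset from the first request, so the three requests above would go out at 10.0, 11.0 and 13.0.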
Asynchronous and Non-blocking RPC
The Sun RPC library that's part of the glibc package only allows blocking-wait RPC. That may be the ideal way to carry out remote procedure calls in general, but since we need the replayer to be highly scalable, this library is not ideal for our purposes. An ideal library would provide non-blocking calls to sending functions and asynchronous notification of replies from the server. I've developed such a library as an extension of the Sun RPC library. The Sun RPC library design allows writing different handlers for different types of transport protocols. The built-in types are the basic ones for UDP, called clnt_udp, and for TCP, called clnt_tcp. I've written the new non-blocking and asynchronous extension, called clnt_tcp_nb. Its asynchronicity is due to the use of SIGIO for event notification on the TCP socket. For more info, see the Asynchronous RPC project page, AsyncRPC.
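The general shape of a non-blocking client can be illustrated with the sketch below. It is not clnt_tcp_nb: it is written in Python, uses selector-based polling in place of SIGIO, and the AsyncClient class and the 8-byte message format are hypothetical. What it shares with the design described above is that call() returns without waiting and that replies are matched to outstanding requests by XID when they arrive.

```python
import selectors
import socket
import struct

MSG = struct.Struct(">II")   # hypothetical wire format: XID, then one 32-bit value

class AsyncClient:
    """Sends requests without blocking and dispatches replies to callbacks."""

    def __init__(self, sock):
        sock.setblocking(False)
        self.sock = sock
        self.outstanding = {}            # XID -> callback awaiting the reply
        self.buf = b""
        self.sel = selectors.DefaultSelector()
        self.sel.register(sock, selectors.EVENT_READ)

    def call(self, xid, value, callback):
        """Send the request and return immediately."""
        self.outstanding[xid] = callback
        self.sock.send(MSG.pack(xid, value))   # small message; fits the socket buffer

    def poll(self, timeout=1.0):
        """Read whatever replies have arrived and run their callbacks."""
        for key, _events in self.sel.select(timeout):
            self.buf += key.fileobj.recv(4096)
        while len(self.buf) >= MSG.size:
            xid, status = MSG.unpack_from(self.buf)
            self.buf = self.buf[MSG.size:]
            self.outstanding.pop(xid)(xid, status)
```

Because several requests can be in flight at once, replies may arrive in any order; the XID lookup is what ties each reply back to its request.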
Client side NFS library
I've written a library for client NFS operations from user space, as opposed to the Linux kernel implementation that sits underneath the VFS layer. It's available here: libnfsclient
TODO
Contents of meta info and also find a better name, DONE
Ops info file format, DONE
Detailed Operations and Dependency list structure, DONE
Replay related state info and its mappings to the trace info in the operations list/tree, DONE
Operations to be handled specially by the Response Processor, DONE
Handling spatial scaling, DONE
Stats to be collected, DONE
RPC and NFS lib for (en|de)capsulating messages in the Send and Receive function, DONE
- Check out IPBench timing code for scheduler
Archived Notes
1 Why? Why not do all the trace_reader work offline before starting the replay? Stick the result into a file and mmap it. Then you need no locking, and your replay machine is more lightly loaded. Done. Will change the figures soon.
2 Why not do a topological sort and not bother with a list? The operations are already sorted on the time of capture, so I don't need to perform any re-ordering/sorting. The need for a Dependency list arises only because I need to ensure that the scheduler is as efficient as possible. For example, in case an op fails, we don't want the scheduler to traverse the list of operations to find an op that doesn't depend on the failed op. We also need the Dependency list so that we can determine the ops that can be pipelined in case of spatial scaling; at least that's the intention. Of course, we can use an array instead of a list. The whole concept of the Ops and Dep lists is not clear yet, since I am running into some cases where it fails to prevent a list traversal in the scheduler. Working on it.
