In this project, you will be required to: (1) significantly extend an existing 3-tier cluster-based auction service, and (2) operate the service to see what level of availability you can achieve.
The TAs (545 & 553) are currently setting up an instance of the auction service for you. You should coordinate with them to understand how to set up and run the service yourselves as you will need to do this later on. You can find a brief description of the service in our paper:
State Maintenance and its Impact on the Performability of Multi-tiered Internet Services. G. Gama, K. Nagaraja, R. Bianchini, R. P. Martin, W. Meira Jr., T. D. Nguyen. In Proceedings of the 23rd Symposium on Reliable Distributed Systems (SRDS), October 2004.
Your first task is to understand the service, which is structured using the industry's de facto 3-tier cluster architecture. You should then propose an interesting extension that requires you to extensively modify or extend the existing code, so that you get a good taste of what it feels like to build one of these services. Once you decide on an extension, you should clear it with me and Kien.
Your next task is to implement and carefully test your proposed extension. Make sure not to take too long to do this, as there is much more to be done afterward.
Once you have the extension done (perhaps even while some of you are still working on it), you should create a client emulator for your service. We already have a client emulator, but I would like you to modify it to place a more realistic load on your service (e.g., periods of light and peak load, perhaps with diurnal cycles).
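As one way to picture a load with light and peak periods, here is a minimal sketch of a diurnal request-rate profile in Python. It is not taken from the existing emulator: the function names, the sinusoidal shape, and the default rates and peak hour are all assumptions chosen for illustration. Drawing each inter-arrival time from the rate at the current instant is only an approximation of a time-varying Poisson process, which should be adequate when the rate changes slowly relative to the gaps between requests.

```python
import math
import random

SECONDS_PER_DAY = 86400.0

def target_rate(t_seconds, base_rate=20.0, peak_rate=100.0, peak_hour=14.0):
    """Requests per second at time t under a sinusoidal daily cycle.

    All parameter values are illustrative assumptions, not measurements.
    """
    phase = 2 * math.pi * (t_seconds / SECONDS_PER_DAY - peak_hour / 24.0)
    # cos(phase) is 1 at the peak hour and -1 twelve hours later.
    weight = (1 + math.cos(phase)) / 2
    return base_rate + (peak_rate - base_rate) * weight

def arrival_times(start, duration, seed=0):
    """Request timestamps in [start, start + duration).

    Each gap is exponential with the rate in effect at the previous arrival.
    """
    rng = random.Random(seed)
    end = start + duration
    t = start
    times = []
    while True:
        t += rng.expovariate(target_rate(t))
        if t >= end:
            return times
        times.append(t)
```

An emulator built this way would sleep until each timestamp and then issue one client request. A real workload would likely need burstiness and per-session behavior on top of this smooth curve.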
Your initial project proposal should include your proposed extension of the service and of the client emulator.
Finally, you are to set up and use our validation environment so that the TAs, and probably a red team from 553, can perform various maintenance tasks. During the maintenance tasks, the TAs (and red team) will deliberately inject various imaginative mistakes. Your task is to set up your system so that it masks as many of these mistakes from the end clients as possible. You should read the following papers to understand this part of the project:
Understanding and Dealing with Operator Mistakes in Internet Services. K. Nagaraja, F. Oliveira, R. Bianchini, R. P. Martin, T. D. Nguyen. To appear in Proceedings of 6th Symposium on Operating Systems Design and Implementation (OSDI '04), December 2004.
Understanding and Validating Database System Administration. Fábio Oliveira, Kiran Nagaraja, Rekha Bachwani, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. Technical Report DCS-TR-584, Department of Computer Science, Rutgers University, October 2005, Revised December 2005.
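Since the goal is to mask mistakes from the end clients, one simple way to quantify how well you did is to measure availability at the client emulator as the fraction of requests that succeed. The sketch below assumes that definition; the papers above may use different or richer metrics, and the function names and input format here are hypothetical.

```python
def availability(outcomes):
    """Fraction of requests that succeeded; outcomes is an iterable of booleans."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no requests observed")
    return sum(outcomes) / len(outcomes)

def availability_by_window(samples, window):
    """Group (timestamp_seconds, succeeded) pairs into fixed-size windows.

    Returns a dict mapping each window's start time to its availability,
    which makes the dip caused by an injected mistake easy to spot.
    """
    buckets = {}
    for timestamp, succeeded in samples:
        start = int(timestamp // window) * window
        buckets.setdefault(start, []).append(succeeded)
    return {start: availability(oks) for start, oks in sorted(buckets.items())}
```

Reporting availability per window, rather than as one number for the whole run, shows how long each injected mistake was visible to clients.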
You will also probably be asked to be a red team for a similar 553 project. You might skim the following papers to increase your understanding of human mistakes and how they can impact the availability of Internet services.
Oppenheimer, D., Archana Ganapathi, and David A. Patterson. Why do Internet services fail, and what can be done about it? 4th USENIX Symposium on Internet Technologies and Systems (USITS '03), March 2003.
"Failure from the Field: Complexity Kills" George Herbert. Proceedings of the Second Workshop on Evaluating and Architecting System dependabilitY (EASY), 2002.
Here's information on this project from the last time I ran 545. One team from that semester took this on and generated the following project proposal and final report. They had quite a lot of fun doing this, I think. And, if you read the final report, there's obviously quite a lot left to be done. In particular, I would like to extend their effort to a "regular" distributed environment like the department's computing environment, rather than just looking at a 3-tier service, whose architecture is much too regular.
A well-known quote is Lamport's statement: "A distributed system is one in which the failure of a machine I’ve never heard of can prevent me from doing my work." This is becoming more and more true with all the network and web services that we increasingly depend on. I can be prevented from working not only by a failure of a machine I have never heard of but also by a configuration change made by someone I have never heard of on a piece of equipment I have never heard of. So, the point of this project is to build a forensics center where users can get useful answers about the current state of the system and how it got there. Basically, we should start by learning about the various monitoring tools that already exist, perhaps writing a web service front end to these tools to build an automatic forensics center. Then, we need to analyze what current monitoring tools can tell us and what they can't. The readings for the "What's going on in my distributed system?" topic are particularly relevant to this project. Finally, we will improve the state of the art by extending our forensics center to answer questions that current monitoring tools cannot. Identifying the proper questions will be part of the research.
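To make the "how did the system get here" question concrete, here is a minimal Python sketch of one query a forensics center might answer: which recent failures or configuration changes occurred on anything a given machine depends on, directly or indirectly. The event record, the dependency map, and the function names are hypothetical; collecting this information from real monitoring tools is the actual work of the project.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp: float
    component: str
    kind: str    # e.g. "failure" or "config-change" (illustrative labels)
    detail: str

def transitive_dependencies(component, depends_on):
    """Everything the component relies on, directly or indirectly."""
    seen = set()
    stack = [component]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def explain(component, events, depends_on, since):
    """Events since a given time on the component or anything it depends on."""
    relevant = transitive_dependencies(component, depends_on) | {component}
    return sorted(
        (e for e in events if e.component in relevant and e.timestamp >= since),
        key=lambda e: e.timestamp,
    )
```

For example, if a desktop depends on a file server that in turn depends on a switch, a configuration change on that switch shows up in the answer for the desktop even though its user has never heard of the switch.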
For a description of what we currently have, read:
A Cost-Effective Distributed File Service with QoS Guarantees. Kien Le, Ricardo Bianchini, and Thu D. Nguyen. DRAFT -- Please do not distribute.
Your task for this project is pretty well defined: change the single-node front-end of the file system into a distributed, replicated service. Your system should be NFS-mountable so that any NFS client can connect to your file system by mounting it. Kien can easily let you look at the code. He can also be quite a valuable resource for this project, as he understands this implementation and the overall project thoroughly. The system is currently written in Java.