renjue
2026-02-25 23:02:08 +08:00
parent 7e5eb65220
commit ed0efc8477
33 changed files with 5336 additions and 0 deletions


@@ -0,0 +1,94 @@
# 课程基本信息
## 课程结构
6.5840 是一门 12 学分的核心研究生课程,包含课堂讲授、实验、可选项目、期中考试和期末考试。
课程每周二、周四下午 1:00-2:30 在 54-100 教室线下进行。多数课堂将部分用于讲授、部分用于论文讨论。你应在课前阅读指定论文并准备参与讨论。课表会标明每次课对应的论文。
我们将在课表上每篇论文对应课程开始前 24 小时发布一个关于该论文的问题(见每篇论文的 Question 链接)。你的回答只需足够长以表明你理解了论文,通常一两段即可。我们不会逐条反馈,但会浏览你的回答以确认言之有理,这些回答会计入成绩。若对论文有疑问,你也可以(可选)提交问题;我们可能会回答和/或调整讲授内容以解答。
6.5840 将在正常上课时间进行期中考试,在期末考试周进行期末考试。你必须参加两场考试。考试不设补考或冲突时段。若选修 6.5840,请勿选修与上课时间冲突的其他课程。
本学期每隔一两周会有编程实验截止。实验旨在帮助你更深入理解 6.5840 中讨论的部分思想;更一般的目标是让你积累分布式系统编程与调试经验。学期中我们会要求你参加五次随机抽取实验的检查会,届时我们会就你的实验代码如何工作提问。
学期末你可以在基于自己想法的期末项目与做 Lab 5 之间二选一。若选择做项目,须组成 2-3 人小组,项目须与 6.5840 主题紧密相关,且须事先经我们批准。你需要提交简短的项目提案;若获批准,则设计并实现系统;学期末提交结果摘要(我们会公布)与代码,并在课上做简短展示与演示。
要在 6.5840 中取得好成绩,你应已具备 6.1910(6.004)水平的计算机系统基础,并修过 6.1800(6.033)或 6.1810 至少其一,还应擅长调试、实现与设计软件,例如修读过 6.1810、6.1100(6.035)等编程密集型课程。
---
## 成绩评定
最终成绩由以下部分构成:
- 40% 实验(编程作业),含可选项目
- 20% 期中考试
- 20% 期末考试
- 15% 实验检查
- 5% 论文问题回答
- 各次实验提交按完成该作业所给周数(不含开学第一周、期中周和春假周)加权。
为应对突发情况,Lab 1、2、3、4 和 5A 可以迟交,但所有实验迟交时间总和不得超过 72 小时。这 72 小时可在各次实验间任意分配,无需事先告知我们。迟交时长仅适用于 Lab 1、2、3、4 和 5A,不能用于 Lab 5B-D 或项目的任何部分。
若某次实验迟交,且你的总迟交时间(含该次)超过 72 小时,但在学期最后一天前提交,我们将给予按时提交可得分数的一半。若迟交超过 72 小时仍希望我们批改,请发邮件说明。无论迟交时长多少,学期最后一天之后我们不再接受任何作业。若某次作业在学期最后一天前未提交,该次作业记零分。
若需豁免上述规则,请让 S3 向我们发送说明函。
---
## 合作政策
请独立完成课程实验:欢迎与他人讨论实验内容,但请勿查看或提交他人代码。若考虑用 AI 助手代写代码,请注意这样你从实验中学到的也会相应减少。无论如何,我们要求你理解并能解释所提交的全部代码,并能回答考试中与实验相关的问题。
请勿公开你的代码或提供给当前或未来的 6.5840 学生。github.com 上仓库默认公开,因此除非将仓库设为私有,否则请勿将代码放在该处。使用 MIT 的 GitHub 时请务必创建私有仓库。
论文问题可与他人讨论,但不得查看他人答案。答案必须由本人撰写。
## 课程人员
有关课程的问题或意见请发至 6824-staff@lists.csail.mit.edu。
### 主讲教师
Frans Kaashoek 32-G992 kaashoek@csail.mit.edu
Robert Morris 32-G972 rtm@csail.mit.edu
### 助教
Baltasar Dinis
Ayana Alemayehu
Upamanyu Sharma
Yun-Sheng Chang
Danny Villanueva
Brian Shi
Nour Massri
Beshr Islam Bouli
---
## 答疑时间
| 日期 | 时间 | 地点 | 助教 |
| --- | --- | --- | --- |
| 待定 | | | |
如需在所列答疑时间之外与课程人员见面,可通过邮件或私人 Piazza 帖子预约。
---
## 致谢
6.5840 课程材料的很大一部分由 Robert Morris、Frans Kaashoek 和 Nickolai Zeldovich 开发。该课程在 2023 年前名为 6.824。
有关 6.5840 的问题或意见?请发送邮件至 6824-staff@lists.csail.mit.edu。
---
*来源: [General Information](https://pdos.csail.mit.edu/6.824/general.html)*


@@ -0,0 +1,94 @@
# General Information
## Structure
6.5840 is a 12-unit core graduate subject with lectures, labs, an optional project, a mid-term exam, and a final exam.
Class meets TR1-2:30 in person in 54-100. Most class meetings will be part lecture and part paper discussion. You should read the paper before coming to class, and be prepared to discuss it. The schedule indicates the paper to read for each meeting.
We will post a question about each paper 24 hours before the beginning of class on the schedule (see the Question link for each paper). Your answer need only be long enough to demonstrate that you understand the paper; a paragraph or two will usually be enough. We won't give feedback, but we will glance at your answers to make sure they make sense, and they will contribute to your grade. If you have a question about a paper, you may also (optionally) submit it; we may answer and/or adjust the lecture to answer your question.
6.5840 will have a midterm exam during the ordinary lecture time, and a final exam during finals week. You must attend both exams. There will be no make-up or alternate conflict times for the exams. If you take 6.5840, please do not register for any other class with a conflicting lecture time.
There are programming labs due every week or two throughout the term. The labs will help you understand more deeply some of the ideas discussed in 6.5840; a more general goal is for you to gain experience programming and debugging distributed systems. During the semester we will ask you to attend five check-off meetings, for randomly chosen labs, in which we will ask you questions about how your lab code works.
Towards the end of the term you can choose between doing a final project based on your own ideas, or doing Lab 5. If you want to do a project, you must form a team of two or three people, the project must be closely related to 6.5840 topics, and we must approve it in advance. You'll hand in a short project proposal, and, if we approve, you'll design and build a system; at the end of the term you'll hand in a summary of your results (which we'll post) and your code, and do a short presentation and demo in class.
To do well in 6.5840, you should already be familiar with computer systems to the level of 6.1910 (6.004) and at least one of 6.1800 (6.033) or 6.1810, and you should be good at debugging, implementing, and designing software, perhaps as a result of taking programming-intensive courses such as 6.1810 and 6.1100 (6.035).
---
## Grading
Final course grades will be based on:
- 40% labs (programming assignments), including optional project
- 20% mid-term exam
- 20% final exam
- 15% lab check-offs
- 5% paper question answers
- Each lab submission is weighted proportionally to the number of weeks that you have to complete the assignment, excluding the first week of classes, midterm week, and spring break week.
To help you cope with unexpected emergencies, you can hand in your Lab 1, 2, 3, 4, and 5A solutions late, but the total amount of lateness summed over all the lab deadlines must not exceed 72 hours. You can divide up your 72 hours among the labs however you like; you don't have to ask or tell us. You can only use late hours for Labs 1, 2, 3, 4, and 5A; you cannot use late hours for Lab 5B-D or for any aspect of the project.
If you hand a lab in late, and your total late time (including the late time for that assignment) exceeds 72 hours, and you hand it in by the last day of classes, then we'll give it half the credit we would have given if you had handed it in on time. Please send us e-mail if you want us to grade an assignment that's more than 72 hours late. We will not accept any work after the last day of classes, regardless of late hours. If you don't hand in an assignment by the last day of classes, we'll give the assignment zero credit.
If you want an exception to these rules, please ask S3 to send us an excuse note.
---
## Collaboration policy
Please do the course labs individually: you are welcome to discuss the labs with others, but please do not look at (or hand in) anyone else's solution. If you are tempted to use an AI assistant to write code for you, consider that you'll then learn correspondingly less from the labs. Regardless, we will expect you to understand and be able to explain all of the code that you hand in, and to be able to reason about lab-related questions on the exams.
Please do not publish your code or make it available to current or future 6.5840 students. github.com repositories are public by default, so please don't put your code there unless you make the repository private. You may find it convenient to use MIT's GitHub, but be sure to create a private repository.
You may discuss the paper questions with other students, but you may not look at other students' answers. You must write your answers yourself.
## Staff
Please use 6824-staff@lists.csail.mit.edu to send questions or comments about the course to the staff.
### Lecturer
Frans Kaashoek 32-G992 kaashoek@csail.mit.edu
Robert Morris 32-G972 rtm@csail.mit.edu
### Teaching assistants
Baltasar Dinis
Ayana Alemayehu
Upamanyu Sharma
Yun-Sheng Chang
Danny Villanueva
Brian Shi
Nour Massri
Beshr Islam Bouli
---
## Office hours
| Day | Time | Location | TA |
| --- | --- | --- | --- |
| TBD | | | |
Appointments with staff outside of the listed office hours can be set up via email or private Piazza post.
---
## Acknowledgements
Robert Morris, Frans Kaashoek, and Nickolai Zeldovich developed much of the 6.5840 course material. The course was called 6.824 before 2023.
Questions or comments regarding 6.5840? Send e-mail to 6824-staff@lists.csail.mit.edu.
---
*From: [General Information](https://pdos.csail.mit.edu/6.824/general.html)*


@@ -0,0 +1,48 @@
# 实验指南
---
## 作业难度
每个实验任务都标有大致预计用时:
* **Easy(简单)**:数小时。
* **Moderate(中等)**:约每周 6 小时。
* **Hard(困难)**:每周超过 6 小时。若起步较晚,你的实现很可能无法通过全部测试。
多数实验只需适度的代码量(每个实验部分可能几百行),但**概念上可能较难**,需要较多思考和调试。部分测试较难通过。
**不要在截止前一晚才开始做实验**;分多天、多次完成会更高效。由于并发、崩溃和不可靠网络,在分布式系统中排查 bug 较为困难。
---
## 提示
* 完成 [Go 在线教程](https://go.dev/tour/)并参考 [Effective Go](https://go.dev/doc/effective_go)。参见 [Editors](https://go.dev/doc/editors.html) 配置 Go 编辑器。
* 实验的 Makefile 已配置为使用 Go 的 **race detector**。请修复其报告的所有数据竞争。参见 [race detector 博客文章](https://go.dev/blog/race-detector)。
* 实验中的 [加锁建议](../papers/raft-locking-cn.txt)。
* [Raft 实验结构建议](../papers/raft-structure-cn.txt)。
* 该 [Raft 交互示意图](../papers/raft_diagram.pdf) 有助于理解系统各部分之间的代码流程。
* 学习 Go 的 Printf 格式字符串:[Go format strings](https://pkg.go.dev/fmt)。
* 进一步了解 git可参阅 [Pro Git 书](https://git-scm.com/book/en/v2) 或 [git 用户手册](https://git-scm.com/docs/user-manual)。
---
## 调试
高效调试需要经验。**系统化**会很有帮助:对可能原因形成假设;收集可能相关的证据;分析已收集的信息;按需重复。长时间调试时做笔记有助于积累证据并提醒自己为何排除之前的假设。
最有效的调试方法之一往往是在代码中**添加 print 语句**,运行失败的测试并将输出保存到文件,然后浏览输出找出开始出错的位置。可能需多轮迭代,随问题更清晰而增加更多 print。
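下面是一个可全局开关的调试输出辅助函数的简单示意(并非实验框架提供的代码,包名与函数名均为举例),排查结束后把开关改回 false 即可一次性关闭全部调试输出:
```go
package mr // 包名仅为示例,可放在你正在调试的实验包中

import "log"

// Debug 为 true 时输出调试日志;提交前改回 false 即可全局关闭。
const Debug = false

// DPrintf 是一个简单的调试打印辅助函数,包装 log.Printf。
func DPrintf(format string, a ...interface{}) {
	if Debug {
		log.Printf(format, a...)
	}
}
```
运行测试时可以把输出重定向到文件再检索,例如 `go test -race -run TestName > out.txt 2>&1`(其中 TestName 为要运行的测试名)。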
不同节点之间以及同一节点内多线程的**并发**会使操作以意想不到的方式交错。例如,某个 Raft 节点可能在前一个 leader 仍自认为是 leader 时就被选为新 leader;或者 leader 发出 RPC,却在失去 leader 身份后才收到回复。添加 print 有助于发现这类情况。
可以**查看测试代码**(`mr/mr_test.go`、`raft1/raft_test.go` 等)以理解测试在测什么。可以在测试中加 print 帮助理解其行为及失败原因,但**提交前请确保你的代码在原始测试代码下能通过**。
Raft 论文的 **Figure 2** 必须较为严格地遵守。很容易漏掉 Figure 2 要求检查的条件或要求发生的状态变化。若有 bug**请再次核对你的代码是否严格遵循 Figure 2**。
在写代码时(即尚未出现 bug 时),对代码假定成立的条件添加**显式检查**可能很有用,例如使用 Go 的 [panic](https://go.dev/blog/defer-panic-and-recover)。这类检查有助于发现后续代码无意违反假设的情况。
助教乐于在答疑时间帮你思考代码,但若你**已经尽可能深入分析过**,有限的答疑时间才能发挥最大作用。
---
*来源: [Lab guidance](https://pdos.csail.mit.edu/6.824/labs/guidance.html)*


@@ -0,0 +1,48 @@
# Lab Guidance
---
## Hardness of Assignments
Each lab task is tagged to indicate roughly how long we expect the task to take:
* **Easy**: A few hours.
* **Moderate**: ~ 6 hours (per week).
* **Hard**: More than 6 hours (per week). If you start late, your solution is unlikely to pass all tests.
Most of the labs require only a modest amount of code (perhaps a few hundred lines per lab part), but can be **conceptually difficult** and may require a good deal of thought and debugging. Some of the tests are difficult to pass.
**Don't start a lab the night before it is due**; it's more efficient to do the labs in several sessions spread over multiple days. Tracking down bugs in distributed systems is difficult, because of concurrency, crashes, and an unreliable network.
---
## Tips
* Do the [Online Go tutorial](https://go.dev/tour/) and consult [Effective Go](https://go.dev/doc/effective_go). See [Editors](https://go.dev/doc/editors.html) to set up your editor for Go.
* The lab Makefiles are set up to use Go's **race detector**. Fix any races it reports. See the [race detector blog post](https://go.dev/blog/race-detector).
* Advice on [locking](../papers/raft-locking.txt) in labs.
* Advice on [structuring your Raft lab](../papers/raft-structure.txt).
* This [Diagram of Raft interactions](../papers/raft_diagram.pdf) may help you understand code flow between different parts of the system.
* Learn about Go's Printf format strings: [Go format strings](https://pkg.go.dev/fmt).
* To learn more about git, look at the [Pro Git book](https://git-scm.com/book/en/v2) or the [git user's manual](https://git-scm.com/docs/user-manual).
---
## Debugging
Efficient debugging takes experience. It helps to be **systematic**: form a hypothesis about a possible cause of the problem; collect evidence that might be relevant; think about the information you've gathered; repeat as needed. For extended debugging sessions it helps to keep notes, both to accumulate evidence and to remind yourself why you've discarded specific earlier hypotheses.
The most effective debugging technique is often to **add print statements** to your code, run the test that is failing and collect the print output in a file, and then look through the output file to identify the point at which things start to go wrong. You may need to iterate, adding more print statements as you learn more about what is going wrong.
**Concurrency** among different peers and among the threads in a single peer can cause actions to be interleaved in unexpected ways. For example, it's quite possible for a Raft peer to be elected leader while the previous leader still thinks it is the leader, or for a leader to send an RPC but receive the reply after it has lost leadership. Adding print statements may help you spot such situations.
Feel free to **examine the test code** (`mr/mr_test.go`, `raft1/raft_test.go`, &c) to understand what the tests are exploring. You can add print statements to the tests to help you understand what they are doing and why they are failing, but **be sure your code passes with the original test code before submitting**.
The Raft paper's **Figure 2** must be followed fairly exactly. It is easy to miss a condition that Figure 2 says must be checked, or a state change that it says must be made. If you have a bug, **re-check that all of your code adheres closely to Figure 2**.
As you're writing code (i.e., before you have a bug), it may be worth adding **explicit checks** for conditions that the code assumes to be true, perhaps using Go's [panic](https://go.dev/blog/defer-panic-and-recover). Such checks may help detect situations where later code unwittingly violates the assumptions.
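As a concrete illustration, a tiny assertion helper along these lines can be dropped into your package (a sketch only; the package and names are illustrative, not part of the handed-out code):
```go
package raft // illustrative package name

import "fmt"

// assertf panics with a formatted message when cond is false, so a
// violated assumption is reported close to where it actually breaks.
func assertf(cond bool, format string, a ...interface{}) {
	if !cond {
		panic(fmt.Sprintf(format, a...))
	}
}
```
A call such as `assertf(commitIndex <= lastLogIndex, "commitIndex %d > last log index %d", commitIndex, lastLogIndex)` (the variable names here are hypothetical) makes the failure visible at the moment the invariant is violated rather than many steps later.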
The TAs are happy to help you think about your code during office hours, but you're likely to get the most mileage out of limited office hour time if you've **already dug as deep as you can** into the situation.
---
*From: [Lab guidance](https://pdos.csail.mit.edu/6.824/labs/guidance.html)*


@@ -0,0 +1,221 @@
# 6.5840 Lab 1: MapReduce
## 简介
在本实验中你将实现一个 MapReduce 系统。你需要实现:调用应用 Map 和 Reduce 函数并负责读写文件的 worker 进程,以及向 worker 分配任务并应对 worker 失败的 coordinator 进程。你将实现与 [MapReduce 论文](../papers/mapreduce-cn.md) 中类似的内容。(注:本实验使用 "coordinator" 代替论文中的 "master"。)
## 起步
你需要先安装并配置 Go 才能完成实验。
用 git 获取初始实验代码。进一步了解 git 可参阅 [Pro Git 书](https://git-scm.com/book/en/v2) 或 [git 用户手册](https://git-scm.com/docs/user-manual)。
```bash
$ git clone git://g.csail.mit.edu/6.5840-golabs-2026 6.5840
$ cd 6.5840
$ ls
Makefile src
$
```
我们在 `src/main/mrsequential.go` 中提供了一个简单的单进程顺序 MapReduce 实现,在一个进程内逐个执行 map 和 reduce。我们还提供了若干 MapReduce 应用:`mrapps/wc.go` 中的词频统计,以及 `mrapps/indexer.go` 中的文本索引。可以按如下方式顺序运行词频统计:
```bash
$ cd ~/6.5840
$ cd src/main
$ go build -buildmode=plugin ../mrapps/wc.go
$ rm mr-out*
$ go run mrsequential.go wc.so pg*.txt
$ sort mr-out-0
A 509
ABOUT 2
ACT 8
ACTRESS 1
...
```
(若希望 sort 产生上述输出,可能需设置环境变量 `LC_COLLATE=C`:`LC_COLLATE=C sort mr-out-0`)
`mrsequential.go` 将输出写入文件 `mr-out-0`。输入来自名为 `pg-xxx.txt` 的文本文件。
可以复用 `mrsequential.go` 中的代码。也可查看 `mrapps/wc.go` 了解 MapReduce 应用代码的形式。
对本实验及后续实验,我们可能会对提供的代码进行更新。为便于用 `git pull` 获取并合并更新,建议保留我们提供的代码在原始文件中。你可以按实验说明在现有代码上增补,但不要移动它们。可以把你自己新增的函数放在新文件中。
## 你的任务
你的任务是实现一个分布式 MapReduce包含两个程序**coordinator** 和 **worker**。只有一个 coordinator 进程,以及一个或多个并行运行的 worker 进程。在实际系统中 worker 会运行在多台机器上本实验中将全部在同一台机器上运行。Worker 通过 RPC 与 coordinator 通信。每个 worker 进程循环向 coordinator 请求任务,从若干文件中读取该任务的输入,执行任务,将输出写入若干文件,然后再次向 coordinator 请求新任务。Coordinator 应在合理时间内(本实验为十秒)检测 worker 是否未完成任务,并将同一任务交给其他 worker。
我们已提供少量起步代码。Coordinator 和 worker 的 "main" 例程在 `main/mrcoordinator.go` 和 `main/mrworker.go` 中;**不要修改这两个文件**。你的实现应放在 `mr/coordinator.go`、`mr/worker.go` 和 `mr/rpc.go` 中。
以下是在词频统计 MapReduce 应用上运行你的代码的方法。首先构建词频统计插件:
```bash
$ cd main
$ go build -buildmode=plugin ../mrapps/wc.go
```
在一个终端中运行 coordinator:
```bash
$ rm mr-out*
$ go run mrcoordinator.go sock123 pg-*.txt
```
参数 `sock123` 指定 coordinator 接收 worker RPC 的 socket。传给 `mrcoordinator.go` 的 `pg-*.txt` 参数是输入文件;每个文件对应一个 "split",即一个 Map 任务的输入。
在另一个或多个终端中运行若干 worker:
```bash
$ go run mrworker.go wc.so sock123
```
当 worker 和 coordinator 都结束后,查看 `mr-out-*` 中的输出。完成实验后,对所有输出文件排序后的并集应与顺序实现的输出一致,例如:
```bash
$ cat mr-out-* | sort | more
A 509
ABOUT 2
ACT 8
ACTRESS 1
...
```
我们提供了批改时将使用的全部测试。测试源码在 `mr/mr_test.go`。可在 src 目录下运行测试:
```bash
$ cd src
$ make mr
...
```
测试会检查:在给定 pg-xxx.txt 作为输入时,wc 和 indexer 两个 MapReduce 应用是否产生正确输出;你的实现是否并行执行 Map 和 Reduce 任务;以及是否能在运行任务的 worker 崩溃后恢复。
若现在运行测试,会在第一个测试中卡住:
```bash
$ cd ~/6.5840/src
$ make mr
...
cd mr; go test -v -race
=== RUN TestWc
...
```
你可以把 `mr/coordinator.go` 中 `Done` 函数里的 `ret := false` 改为 `true`,这样 coordinator 会立即退出。然后:
```bash
$ make mr
...
=== RUN TestWc
2026/01/22 14:56:24 reduce created no mr-out-X output files!
exit status 1
FAIL 6.5840/mr 4.516s
make: *** [Makefile:44: mr] Error 1
$
```
测试期望看到名为 `mr-out-X` 的输出文件,每个 reduce 任务一个。`mr/coordinator.go` 和 `mr/worker.go` 的空实现不会生成这些文件(也几乎不做别的事),因此测试会失败。
完成后,测试输出应类似:
```bash
$ make mr
...
=== RUN TestWc
--- PASS: TestWc (8.64s)
=== RUN TestIndexer
--- PASS: TestIndexer (5.90s)
=== RUN TestMapParallel
--- PASS: TestMapParallel (7.05s)
=== RUN TestReduceParallel
--- PASS: TestReduceParallel (8.05s)
=== RUN TestJobCount
--- PASS: TestJobCount (10.04s)
=== RUN TestEarlyExit
--- PASS: TestEarlyExit (6.05s)
=== RUN TestCrashWorker
2026/01/22 14:58:14 *re*-starting map ../../main/pg-tom_sawyer.txt 0
2026/01/22 14:58:14 *re*-starting map ../../main/pg-metamorphosis.txt 2
2026/01/22 14:58:39 *re*-starting map ../../main/pg-metamorphosis.txt 2
2026/01/22 14:58:40 map 2 already done
2026/01/22 14:58:45 *re*-starting reduce 0
--- PASS: TestCrashWorker (40.18s)
PASS
ok 6.5840/mr 86.932s
$
```
根据你终止 worker 进程的方式,可能会看到类似错误:
```
2026/02/11 16:21:32 dialing:dial unix /var/tmp/5840-mr-501: connect: connection refused
```
每个测试中出现少量这类消息是可以的;它们出现在 coordinator 已退出后 worker 无法联系到 coordinator 的 RPC 服务时。
## 若干规则
- Map 阶段应将中间 key 划分到 **nReduce** 个 reduce 任务的桶中,其中 nReduce 是 reduce 任务数量,即 `main/mrcoordinator.go` 传给 `MakeCoordinator()` 的参数。每个 mapper 应为 reduce 任务创建 nReduce 个中间文件。
- Worker 实现应把第 X 个 reduce 任务的输出放在文件 **mr-out-X** 中。
- **mr-out-X** 文件中,Reduce 函数的每个输出应占一行。该行应由 Go 的 `"%v %v"` 格式生成,传入 key 和 value。可参考 `main/mrsequential.go` 中注释为 "this is the correct format" 的那行。若你的实现与该格式偏差过大,测试会失败。
- 可以修改 `mr/worker.go`、`mr/coordinator.go` 和 `mr/rpc.go`。可以临时修改其他文件做测试,但须确保在原始版本下你的代码能正确运行;我们会用原始版本测试。
- Worker 应将 Map 的中间输出放在当前目录的文件中,以便之后在 Reduce 任务中读取。
- `main/mrcoordinator.go` 期望 `mr/coordinator.go` 实现 **Done()** 方法,在 MapReduce 作业完全结束时返回 true;此时 mrcoordinator.go 会退出。
- 当作业完全结束时,worker 进程应退出。一种简单做法是利用 `call()` 的返回值:若 worker 无法联系到 coordinator,可认为 coordinator 因作业结束已退出,于是 worker 也可终止。根据你的设计,也可以让 coordinator 给 worker 一个 "please exit" 的伪任务。
## 提示
- [Guidance](./2.%20Lab%20Guidance-cn.md) 页有一些开发和调试建议。
- 一种起步方式是修改 `mr/worker.go` 中的 `Worker()`,向 coordinator 发 RPC 请求任务。然后修改 coordinator,用尚未开始的 map 任务的文件名回复。再修改 worker,读取该文件并调用应用的 Map 函数,如 `mrsequential.go` 中所示。
- 应用的 Map 和 Reduce 函数在运行时通过 Go 的 plugin 包从以 `.so` 结尾的文件加载。
- 若修改了 `mr/` 目录下的任何内容,很可能需要重新构建所用的 MapReduce 插件,例如 `go build -buildmode=plugin ../mrapps/wc.go`;`make mr` 会为你构建插件。可用 `make RUN="-run Wc" mr` 运行单个测试,该命令会把 `-run Wc` 传给 go test,只运行 `mr/mr_test.go` 中匹配 Wc 的测试。
- 本实验依赖 worker 共享文件系统。所有 worker 在同一台机器上时很简单;若 worker 在不同机器上,则需要 GFS 之类的全局文件系统。
- 中间文件的合理命名是 **mr-X-Y**,其中 X 为 Map 任务编号Y 为 reduce 任务编号。
- Worker 的 map 任务代码需要一种方式将中间 key/value 对写入文件,以便在 reduce 任务中正确读回。一种做法是使用 Go 的 `encoding/json` 包。将 key/value 对以 JSON 格式写入已打开的文件:
```go
enc := json.NewEncoder(file)
for _, kv := ... {
err := enc.Encode(&kv)
}
```
读回该文件:
```go
dec := json.NewDecoder(file)
for {
var kv KeyValue
if err := dec.Decode(&kv); err != nil {
break
}
kva = append(kva, kv)
}
```
- Worker 的 map 部分可使用 **ihash(key)** 函数(在 worker.go 中)为给定 key 选择对应的 reduce 任务。
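例如(仅为示意,`kv`、`mapTaskID`、`nReduce` 均为假设已存在的变量),可以这样为一个中间 key/value 对选择 reduce 桶并构造中间文件名:
```go
// 按 key 的哈希选择 reduce 任务编号,并按 mr-X-Y 约定构造中间文件名。
reduceTask := ihash(kv.Key) % nReduce
fname := fmt.Sprintf("mr-%d-%d", mapTaskID, reduceTask)
```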
- 可从 `mrsequential.go` 借鉴读取 Map 输入文件、在 Map 和 Reduce 之间排序中间 key/value 对、以及将 Reduce 输出写入文件的代码。
- Coordinator 作为 RPC 服务器是并发的;别忘了**对共享数据加锁**。
- Worker 有时需要等待,例如 reduce 须等最后一个 map 完成才能开始。一种做法是 worker 周期性地向 coordinator 请求工作,每次请求之间用 `time.Sleep()` 休眠。另一种做法是 coordinator 中相应的 RPC 处理函数里用循环等待,可用 `time.Sleep()` 或 `sync.Cond`。Go 为每个 RPC 在独立线程中运行处理函数,因此一个处理函数在等待时不会阻止 coordinator 处理其他 RPC。
- Coordinator 无法可靠区分崩溃的 worker、存活但卡住的 worker、以及执行过慢的 worker。能做的是让 coordinator 等待一段时间后放弃,并把任务重新发给其他 worker。本实验中请让 coordinator 等待**十秒**;之后应假定该 worker 已死(当然也可能没死)。
- 若选择实现 Backup Tasks(论文 3.6 节),请注意我们会测试:当 worker 正常执行任务而不崩溃时,你的代码不应调度多余任务。Backup tasks 应只在相对较长时间(例如 10 秒)后才调度。
- 测试崩溃恢复可使用 **mrapps/crash.go** 应用插件,它会在 Map 和 Reduce 函数中随机退出。
- 为确保在崩溃情况下无人看到未写完的文件,MapReduce 论文提到使用临时文件并在完全写完后原子重命名的技巧。可用 `ioutil.TempFile`(或 Go 1.17 及以上的 `os.CreateTemp`)创建临时文件,用 `os.Rename` 原子重命名。
- Go RPC 只发送**首字母大写**的 struct 字段名。子结构体的字段名也须大写。
- 调用 RPC 的 `call()` 时,reply 结构体应包含全部默认值。RPC 调用应类似:
```go
reply := SomeType{}
call(..., &reply)
```
在 call 之前不要设置 reply 的任何字段。若传入的 reply 结构体含有非默认字段,RPC 系统可能静默返回错误值。
## 不计分挑战
- 实现你自己的 MapReduce 应用(参考 `mrapps/*` 中的示例),例如分布式 Grep(MapReduce 论文 2.3 节)。
- 让 MapReduce coordinator 和 worker 在不同机器上运行,与实际部署一致。需要将 RPC 改为通过 TCP/IP 而非 Unix socket 通信(参见 `Coordinator.server()` 中的注释行),并通过共享文件系统读写文件。例如可 ssh 到 MIT 的多台 Athena 集群机器,它们使用 AFS 共享文件;或租用几台 AWS 实例并用 S3 存储。
---
*来源: [6.5840 Lab 1: MapReduce](https://pdos.csail.mit.edu/6.824/labs/lab-mr.html)*


@@ -0,0 +1,221 @@
# 6.5840 Lab 1: MapReduce
## Introduction
In this lab you'll build a MapReduce system. You'll implement a worker process that calls application Map and Reduce functions and handles reading and writing files, and a coordinator process that hands out tasks to workers and copes with failed workers. You'll be building something similar to the [MapReduce paper](../papers/mapreduce.md). (Note: this lab uses "coordinator" instead of the paper's "master".)
## Getting started
You need to set up Go to do the labs.
Fetch the initial lab software with git (a version control system). To learn more about git, look at the [Pro Git book](https://git-scm.com/book/en/v2) or the [git user's manual](https://git-scm.com/docs/user-manual).
```bash
$ git clone git://g.csail.mit.edu/6.5840-golabs-2026 6.5840
$ cd 6.5840
$ ls
Makefile src
$
```
We supply you with a simple sequential mapreduce implementation in `src/main/mrsequential.go`. It runs the maps and reduces one at a time, in a single process. We also provide you with a couple of MapReduce applications: word-count in `mrapps/wc.go`, and a text indexer in `mrapps/indexer.go`. You can run word count sequentially as follows:
```bash
$ cd ~/6.5840
$ cd src/main
$ go build -buildmode=plugin ../mrapps/wc.go
$ rm mr-out*
$ go run mrsequential.go wc.so pg*.txt
$ sort mr-out-0
A 509
ABOUT 2
ACT 8
ACTRESS 1
...
```
(You might need to set `LC_COLLATE=C` environment variable for sort to produce the above output: `LC_COLLATE=C sort mr-out-0`)
`mrsequential.go` leaves its output in the file `mr-out-0`. The input is from the text files named `pg-xxx.txt`.
Feel free to borrow code from `mrsequential.go`. You should also have a look at `mrapps/wc.go` to see what MapReduce application code looks like.
For this lab and all the others, we might issue updates to the code we provide you. To ensure that you can fetch those updates and easily merge them using `git pull`, it's best to leave the code we provide in the original files. You can add to the code we provide as directed in the lab write-ups; just don't move it. It's OK to put your own new functions in new files.
## Your Job
Your job is to implement a distributed MapReduce, consisting of two programs, the **coordinator** and the **worker**. There will be just one coordinator process, and one or more worker processes executing in parallel. In a real system the workers would run on a bunch of different machines, but for this lab you'll run them all on a single machine. The workers will talk to the coordinator via RPC. Each worker process will, in a loop, ask the coordinator for a task, read the task's input from one or more files, execute the task, write the task's output to one or more files, and again ask the coordinator for a new task. The coordinator should notice if a worker hasn't completed its task in a reasonable amount of time (for this lab, use ten seconds), and give the same task to a different worker.
We have given you a little code to start you off. The "main" routines for the coordinator and worker are in `main/mrcoordinator.go` and `main/mrworker.go`; **don't change these files**. You should put your implementation in `mr/coordinator.go`, `mr/worker.go`, and `mr/rpc.go`.
Here's how to run your code on the word-count MapReduce application. First, build the word-count plugin:
```bash
$ cd main
$ go build -buildmode=plugin ../mrapps/wc.go
```
In one window, run the coordinator:
```bash
$ rm mr-out*
$ go run mrcoordinator.go sock123 pg-*.txt
```
The `sock123` argument specifies a socket on which the coordinator receives RPCs from workers. The `pg-*.txt` arguments to `mrcoordinator.go` are the input files; each file corresponds to one "split", and is the input to one Map task.
In one or more other windows, run some workers:
```bash
$ go run mrworker.go wc.so sock123
```
When the workers and coordinator have finished, look at the output in `mr-out-*`. When you've completed the lab, the sorted union of the output files should match the sequential output, like this:
```bash
$ cat mr-out-* | sort | more
A 509
ABOUT 2
ACT 8
ACTRESS 1
...
```
We supply you with all the tests that we'll use to grade your submitted lab. The source code for the tests is in `mr/mr_test.go`. You can run the tests in the src directory:
```bash
$ cd src
$ make mr
...
```
The tests check that the wc and indexer MapReduce applications produce the correct output when given the pg-xxx.txt files as input. The tests also check that your implementation runs the Map and Reduce tasks in parallel, and that your implementation recovers from workers that crash while running tasks.
If you run the tests now, they will hang in the first test:
```bash
$ cd ~/6.5840/src
$ make mr
...
cd mr; go test -v -race
=== RUN TestWc
...
```
You can change `ret := false` to `true` in the `Done` function in `mr/coordinator.go` so that the coordinator exits immediately. Then:
```bash
$ make mr
...
=== RUN TestWc
2026/01/22 14:56:24 reduce created no mr-out-X output files!
exit status 1
FAIL 6.5840/mr 4.516s
make: *** [Makefile:44: mr] Error 1
$
```
The tests expect to see output in files named `mr-out-X`, one for each reduce task. The empty implementations of `mr/coordinator.go` and `mr/worker.go` don't produce those files (or do much of anything else), so the test fails.
When you've finished, the test output should look like this:
```bash
$ make mr
...
=== RUN TestWc
--- PASS: TestWc (8.64s)
=== RUN TestIndexer
--- PASS: TestIndexer (5.90s)
=== RUN TestMapParallel
--- PASS: TestMapParallel (7.05s)
=== RUN TestReduceParallel
--- PASS: TestReduceParallel (8.05s)
=== RUN TestJobCount
--- PASS: TestJobCount (10.04s)
=== RUN TestEarlyExit
--- PASS: TestEarlyExit (6.05s)
=== RUN TestCrashWorker
2026/01/22 14:58:14 *re*-starting map ../../main/pg-tom_sawyer.txt 0
2026/01/22 14:58:14 *re*-starting map ../../main/pg-metamorphosis.txt 2
2026/01/22 14:58:39 *re*-starting map ../../main/pg-metamorphosis.txt 2
2026/01/22 14:58:40 map 2 already done
2026/01/22 14:58:45 *re*-starting reduce 0
--- PASS: TestCrashWorker (40.18s)
PASS
ok 6.5840/mr 86.932s
$
```
Depending on your strategy for terminating worker processes, you may see errors like:
```
2026/02/11 16:21:32 dialing:dial unix /var/tmp/5840-mr-501: connect: connection refused
```
It is fine to see a handful of these messages per test; they arise when the worker is unable to contact the coordinator RPC server after the coordinator has exited.
## A few rules
- The map phase should divide the intermediate keys into buckets for **nReduce** reduce tasks, where nReduce is the number of reduce tasks -- the argument that `main/mrcoordinator.go` passes to `MakeCoordinator()`. Each mapper should create nReduce intermediate files for consumption by the reduce tasks.
- The worker implementation should put the output of the X'th reduce task in the file **mr-out-X**.
- A **mr-out-X** file should contain one line per Reduce function output. The line should be generated with the Go `"%v %v"` format, called with the key and value. Have a look in `main/mrsequential.go` for the line commented "this is the correct format". The tests will fail if your implementation deviates too much from this format.
- You can modify `mr/worker.go`, `mr/coordinator.go`, and `mr/rpc.go`. You can temporarily modify other files for testing, but make sure your code works with the original versions; we'll test with the original versions.
- The worker should put intermediate Map output in files in the current directory, where your worker can later read them as input to Reduce tasks.
- `main/mrcoordinator.go` expects `mr/coordinator.go` to implement a **Done()** method that returns true when the MapReduce job is completely finished; at that point, mrcoordinator.go will exit.
- When the job is completely finished, the worker processes should exit. A simple way to implement this is to use the return value from `call()`: if the worker fails to contact the coordinator, it can assume that the coordinator has exited because the job is done, so the worker can terminate too. Depending on your design, you might also find it helpful to have a "please exit" pseudo-task that the coordinator can give to workers.
## Hints
- The [Guidance](./2.%20Lab%20Guidance.md) page has some tips on developing and debugging.
- One way to get started is to modify `mr/worker.go`'s `Worker()` to send an RPC to the coordinator asking for a task. Then modify the coordinator to respond with the file name of an as-yet-unstarted map task. Then modify the worker to read that file and call the application Map function, as in `mrsequential.go`.
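As a sketch of that first step (the RPC type names, fields, and the handler name below are invented for illustration; define whatever suits your design, and match the actual signature of the `call()` helper in `worker.go`):
```go
// Hypothetical RPC types for asking the coordinator for work.
type TaskArgs struct{}

type TaskReply struct {
	Filename string // input file for a map task
	TaskID   int
	NReduce  int
}

// Sketch of a request loop inside Worker(); "Coordinator.AssignTask"
// is an invented handler name for a method you would add to coordinator.go.
for {
	args := TaskArgs{}
	reply := TaskReply{}
	if !call("Coordinator.AssignTask", &args, &reply) {
		break // coordinator unreachable; assume the job is finished
	}
	// ... open reply.Filename, run the application Map function,
	// write intermediate files, report completion, and loop ...
}
```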
- The application Map and Reduce functions are loaded at run-time using the Go plugin package, from files whose names end in `.so`.
- If you change anything in the `mr/` directory, you will probably have to re-build any MapReduce plugins you use, with something like `go build -buildmode=plugin ../mrapps/wc.go`. `make mr` builds the plugins for you. You can run an individual test using `make RUN="-run Wc" mr`, which passes `-run Wc` to go test, and selects any test from `mr/mr_test.go` matching Wc.
- This lab relies on the workers sharing a file system. That's straightforward when all workers run on the same machine, but would require a global filesystem like GFS if the workers ran on different machines.
- A reasonable naming convention for intermediate files is **mr-X-Y**, where X is the Map task number, and Y is the reduce task number.
- The worker's map task code will need a way to store intermediate key/value pairs in files in a way that can be correctly read back during reduce tasks. One possibility is to use Go's `encoding/json` package. To write key/value pairs in JSON format to an open file:
```go
enc := json.NewEncoder(file)
for _, kv := ... {
err := enc.Encode(&kv)
}
```
and to read such a file back:
```go
dec := json.NewDecoder(file)
for {
var kv KeyValue
if err := dec.Decode(&kv); err != nil {
break
}
kva = append(kva, kv)
}
```
- The map part of your worker can use the **ihash(key)** function (in worker.go) to pick the reduce task for a given key.
- You can steal some code from `mrsequential.go` for reading Map input files, for sorting intermediate key/value pairs between the Map and Reduce, and for storing Reduce output in files.
- The coordinator, as an RPC server, will be concurrent; don't forget to **lock shared data**.
- Workers will sometimes need to wait, e.g. reduces can't start until the last map has finished. One possibility is for workers to periodically ask the coordinator for work, sleeping with `time.Sleep()` between each request. Another possibility is for the relevant RPC handler in the coordinator to have a loop that waits, either with `time.Sleep()` or `sync.Cond`. Go runs the handler for each RPC in its own thread, so the fact that one handler is waiting needn't prevent the coordinator from processing other RPCs.
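For instance, a coordinator-side handler might block until the map phase is over, roughly like this (a sketch only: `mu`, `cond`, `mapTasksDone`, and the RPC types are hypothetical names; `cond` would be created with `sync.NewCond(&c.mu)`):
```go
// Sketch: an RPC handler that waits for the map phase to finish.
func (c *Coordinator) AssignReduceTask(args *TaskArgs, reply *TaskReply) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	for !c.mapTasksDone {
		c.cond.Wait() // releases c.mu while waiting; other RPCs keep running
	}
	// ... fill in reply with a reduce task ...
	return nil
}
```
Whatever code marks the last map task as finished would then call `c.cond.Broadcast()` while holding the lock.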
- The coordinator can't reliably distinguish between crashed workers, workers that are alive but have stalled for some reason, and workers that are executing but too slowly to be useful. The best you can do is have the coordinator wait for some amount of time, and then give up and re-issue the task to a different worker. For this lab, have the coordinator wait for **ten seconds**; after that the coordinator should assume the worker has died (of course, it might not have).
- If you choose to implement Backup Tasks (Section 3.6), note that we test that your code doesn't schedule extraneous tasks when workers execute tasks without crashing. Backup tasks should only be scheduled after some relatively long period of time (e.g., 10s).
- To test crash recovery, you can use the **mrapps/crash.go** application plugin. It randomly exits in the Map and Reduce functions.
- To ensure that nobody observes partially written files in the presence of crashes, the MapReduce paper mentions the trick of using a temporary file and atomically renaming it once it is completely written. You can use `ioutil.TempFile` (or `os.CreateTemp` if you are running Go 1.17 or later) to create a temporary file and `os.Rename` to atomically rename it.
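A sketch of that pattern, assuming the final file name has already been computed in `fname`:
```go
// Write to a temporary file in the current directory, then atomically
// rename it so no reader ever observes a partially written file.
tmp, err := os.CreateTemp(".", "mr-tmp-*")
if err != nil {
	log.Fatal(err)
}
// ... write all of the task's output to tmp ...
tmp.Close()
if err := os.Rename(tmp.Name(), fname); err != nil {
	log.Fatal(err)
}
```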
- Go RPC sends only struct fields whose names start with **capital letters**. Sub-structures must also have capitalized field names.
- When calling the RPC `call()` function, the reply struct should contain all default values. RPC calls should look like this:
```go
reply := SomeType{}
call(..., &reply)
```
without setting any fields of reply before the call. If you pass reply structures that have non-default fields, the RPC system may silently return incorrect values.
## No-credit challenge exercises
- Implement your own MapReduce application (see examples in `mrapps/*`), e.g., Distributed Grep (Section 2.3 of the MapReduce paper).
- Get your MapReduce coordinator and workers to run on separate machines, as they would in practice. You will need to set up your RPCs to communicate over TCP/IP instead of Unix sockets (see the commented out line in `Coordinator.server()`), and read/write files using a shared file system. For example, you can ssh into multiple Athena cluster machines at MIT, which use AFS to share files; or you could rent a couple AWS instances and use S3 for storage.
---
*From: [6.5840 Lab 1: MapReduce](https://pdos.csail.mit.edu/6.824/labs/lab-mr.html)*


@@ -0,0 +1,178 @@
# 6.5840 Lab 2: Key/Value Server
## 简介
在本实验中你将构建一个单机 key/value 服务器,在网络故障下保证每次 Put 操作**至多执行一次**,并保证操作满足 **linearizable**(线性一致性)。你将用该 KV 服务器实现一把锁。后续实验会复制此类服务器以应对服务器崩溃。
## KV 服务器
每个客户端通过 **Clerk**(一组库例程)与 key/value 服务器交互,由 Clerk 向服务器发送 RPC。客户端可向服务器发送两种 RPC:**Put(key, value, version)** 和 **Get(key)**。服务器在内存中维护一个 map,为每个 key 记录 **(value, version)** 二元组。key 和 value 均为字符串。version 记录该 key 被写入的次数。
- **Put(key, value, version)** 仅当该 Put 的 version 与服务器上该 key 的 version 一致时,才在 map 中安装或替换该 key 的值。若 version 一致,服务器还会将该 key 的 version 加一。若 version 不一致,服务器应返回 `rpc.ErrVersion`。客户端可通过 version 为 0 的 Put 创建新 key(此时服务器存储的 version 将为 1)。若 Put 的 version 大于 0 且 key 不存在,服务器应返回 `rpc.ErrNoKey`。
- **Get(key)** 获取该 key 的当前值及其 version。若 key 在服务器上不存在,服务器应返回 `rpc.ErrNoKey`
为每个 key 维护 version 有助于用 Put 实现锁,并在网络不可靠、客户端重传时保证 Put 的至多一次语义。
完成本实验并通过全部测试后,从调用 `Clerk.Get``Clerk.Put` 的客户端角度看,你将得到一个 **linearizable** 的 key/value 服务。即:若客户端操作不并发,每个 Clerk.Get 和 Clerk.Put 将观察到由先前操作序列所蕴含的状态修改。对于并发操作,返回值和最终状态将等同于这些操作以某种顺序一次执行一个的结果。若两操作在时间上重叠则视为并发,例如客户端 X 调用 Clerk.Put()、客户端 Y 调用 Clerk.Put(),然后 X 的调用返回。一个操作必须观察到在该操作开始前已完成的全部操作的效果。更多背景见 [linearizability 常见问题](../papers/linearizability-faq-cn.txt)。
Linearizability 对应用很方便,因为其行为与单台一次处理一个请求的服务器一致。例如,若某客户端从服务器得到一次更新请求的成功响应,之后其他客户端发起的读保证能看到该更新的效果。对单机服务器而言,提供 linearizability 相对容易。
## 起步
我们在 `src/kvsrv1` 中提供了骨架代码和测试。`kvsrv1/client.go` 实现了客户端用来管理与服务器 RPC 交互的 Clerk,Clerk 提供 Put 和 Get 方法。`kvsrv1/server.go` 包含服务器代码,包括实现 RPC 请求服务端的 Put 和 Get 处理函数。你需要修改 `client.go` 和 `server.go`。RPC 请求、回复和错误值在 `kvsrv1/rpc` 包的 `kvsrv1/rpc/rpc.go` 中定义,建议阅读,但不必修改 rpc.go。
运行以下命令即可开始。别忘了 `git pull` 获取最新代码。
```bash
$ cd ~/6.5840
$ git pull
...
$ cd src
$ make kvsrv1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1 && go test -v -race
=== RUN TestReliablePut
One client and reliable Put (reliable network)...
kvsrv_test.go:25: Put err ErrNoKey
--- FAIL: TestReliablePut (0.31s)
...
$
```
## 可靠网络下的 key/value 服务器(简单)
第一个任务是在无丢包时实现正确行为。你需要在 `client.go` 的 Clerk Put/Get 方法中加入发送 RPC 的代码,并在 `server.go` 中实现 Put 和 Get 的 RPC 处理函数。
当通过测试套件中的 Reliable 测试时,该任务即完成:
```bash
$ cd src
$ make RUN="-run Reliable" kvsrv1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1 && go test -v -race -run Reliable
=== RUN TestReliablePut
One client and reliable Put (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 5 #Ops 5
--- PASS: TestReliablePut (0.12s)
=== RUN TestPutConcurrentReliable
Test: many clients racing to put values to the same key (reliable network)...
... Passed -- time 6.3s #peers 1 #RPCs 11025 #Ops 22050
--- PASS: TestPutConcurrentReliable (6.36s)
=== RUN TestMemPutManyClientsReliable
Test: memory use many put clients (reliable network)...
... Passed -- time 29.0s #peers 1 #RPCs 50000 #Ops 50000
--- PASS: TestMemPutManyClientsReliable (52.91s)
PASS
ok 6.5840/kvsrv1 60.732s
$
```
每个 Passed 后的数字依次为:实际时间(秒)、常数 1、发送的 RPC 数(含客户端 RPC)、执行的 key/value 操作数(Clerk Get 和 Put 调用)。
## 用 key/value clerk 实现锁(中等)
许多分布式应用中,不同机器上的客户端通过 key/value 服务器协调。例如 ZooKeeper 和 Etcd 允许客户端用分布式锁协调,类似于 Go 程序中线程用锁(如 `sync.Mutex`)协调。Zookeeper 和 Etcd 用条件 Put 实现这种锁。
你的任务是用 key/value 服务器存储锁所需的每把锁的状态,从而实现锁。可以有多把独立的锁,每把锁有各自的名称,作为 `MakeLock` 的参数。锁支持两个方法:**Acquire** 和 **Release**。规范是:同一时刻只有一个客户端能成功 acquire 某把锁;其他客户端须等第一个客户端用 Release 释放后才能 acquire。
我们在 `src/kvsrv1/lock/` 中提供了骨架和测试。你需要修改 `src/kvsrv1/lock/lock.go`。你的 Acquire 和 Release 应通过调用 `lk.ck.Put()` 和 `lk.ck.Get()` 在 key/value 服务器中存储每把锁的状态。
若客户端在持有锁时崩溃,锁将永远不会被释放。在比本实验更复杂的设计中,客户端会为锁附加 [lease](https://en.wikipedia.org/wiki/Lease_(computer_science));lease 过期后,锁服务器会代客户端释放锁。本实验中客户端不会崩溃,可忽略该问题。
实现 Acquire 和 Release。当你的代码通过以下测试时,该练习即完成:
```bash
$ cd src
$ make RUN="-run Reliable" lock1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1/lock; go test -v -race -run Reliable
=== RUN TestReliableBasic
Test: a single Acquire and Release (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 4 #Ops 4
--- PASS: TestReliableBasic (0.13s)
=== RUN TestReliableNested
Test: one client, two locks (reliable network)...
... Passed -- time 0.1s #peers 1 #RPCs 17 #Ops 17
--- PASS: TestReliableNested (0.17s)
=== RUN TestOneClientReliable
Test: 1 lock clients (reliable network)...
... Passed -- time 2.0s #peers 1 #RPCs 477 #Ops 477
--- PASS: TestOneClientReliable (2.14s)
=== RUN TestManyClientsReliable
Test: 10 lock clients (reliable network)...
... Passed -- time 2.2s #peers 1 #RPCs 5704 #Ops 5704
--- PASS: TestManyClientsReliable (2.36s)
PASS
ok 6.5840/kvsrv1/lock 5.817s
$
```
若尚未实现锁,前两个测试也会通过。
该练习代码量不大,但比前一练习需要更多独立思考。
- 每个锁客户端需要一个唯一标识;可调用 **kvtest.RandValue(8)** 生成随机字符串。
## 存在丢包时的 key/value 服务器(中等)
本练习的主要挑战是网络可能重排、延迟或丢弃 RPC 请求和/或回复。为从丢弃的请求/回复中恢复Clerk 必须不断重试每个 RPC 直到收到服务器回复。
- 若网络丢弃了 **RPC 请求**,客户端重发请求即可:服务器只会收到并执行一次重发的请求。
- 但网络也可能丢弃 **RPC 回复**。客户端无法区分哪种情况,只能观察到没收到回复。若是回复被丢弃且客户端重发 RPC 请求,服务器会收到两份请求。对 Get 没问题,因为 Get 不修改服务器状态。用相同 version 重发 Put RPC 也是安全的,因为服务器按 version 条件执行 Put:若服务器已收到并执行过该 Put RPC,会对重传的同一 RPC 回复 `rpc.ErrVersion`,而不会再次执行。
一个棘手情况是:Clerk 重试后,服务器用 `rpc.ErrVersion` 回复。此时 Clerk 无法确定自己的 Put 是否已被执行:可能是第一次 RPC 已被执行,但服务器发出的成功回复被网络丢弃,所以服务器仅对重传的 RPC 回复了 `rpc.ErrVersion`;也可能是另一个 Clerk 在该 Clerk 的第一次 RPC 到达前更新了 key,所以服务器两次都没执行该 Clerk 的 RPC,并对两次都回复 `rpc.ErrVersion`。因此,若 Clerk 对**重传的** Put RPC 收到 `rpc.ErrVersion`,**Clerk.Put 必须向应用返回 `rpc.ErrMaybe`** 而不是 `rpc.ErrVersion`,因为请求可能已执行。应用负责处理这种情况。若服务器对**首次**(非重传)Put RPC 回复 `rpc.ErrVersion`,则 Clerk 应向应用返回 `rpc.ErrVersion`,因为该 RPC 确定未被服务器执行。
若 Put 能实现恰好一次(即没有 `rpc.ErrMaybe` 错误)会对应用开发者更友好,但在不为每个 Clerk 在服务器维护状态的情况下难以保证。本实验最后一个练习中,你将用 Clerk 实现锁,以体会在至多一次 Clerk.Put 下如何编程。
现在应修改 `kvsrv1/client.go`,在 RPC 请求或回复被丢弃时继续重试。客户端 `ck.clnt.Call()` 返回 **true** 表示收到了服务器的 RPC 回复;返回 **false** 表示未收到回复(更准确地说,Call() 会在一个超时时间内等待回复,超时仍未收到则返回 false)。你的 Clerk 应持续重发 RPC 直到收到回复。请牢记上面关于 `rpc.ErrMaybe` 的讨论。你的方案不应要求修改服务器。
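重试循环的大致形状如下(仅为示意:`rpc.GetReply`、"KVServer.Get"、`ck.server` 等名称以及 `ck.clnt.Call` 的参数形式均为假设,请以骨架代码 `client.go` 和 `kvsrv1/rpc/rpc.go` 中的实际定义为准;这里只演示 Get 的重发,Put 还需按上文处理 `rpc.ErrMaybe`):
```go
// 仅为示意:不断重发同一个 RPC,直到收到服务器回复。
for {
	reply := rpc.GetReply{}
	if ck.clnt.Call(ck.server, "KVServer.Get", &args, &reply) {
		// 收到回复:根据 reply.Err 处理并返回结果
		break
	}
	time.Sleep(100 * time.Millisecond) // 未收到回复,稍等后重发
}
```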
在 Clerk 中加入未收到回复时的重试逻辑。当你的代码通过 kvsrv1 的全部测试时,该任务即完成:
```bash
$ make kvsrv1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1 && go test -v -race
=== RUN TestReliablePut
One client and reliable Put (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 5 #Ops 5
--- PASS: TestReliablePut (0.12s)
=== RUN TestPutConcurrentReliable
...
=== RUN TestUnreliableNet
One client (unreliable network)...
... Passed -- time 4.0s #peers 1 #RPCs 268 #Ops 422
--- PASS: TestUnreliableNet (4.13s)
PASS
ok 6.5840/kvsrv1 64.442s
$
```
- 重试前客户端应稍等;可使用 Go 的 time 包并调用 **time.Sleep(100 * time.Millisecond)**
## 不可靠网络下用 key/value clerk 实现锁(简单)
修改你的锁实现,使其在网络不可靠时能与修改后的 key/value 客户端正确配合。当你的代码通过 lock1 的全部测试时,该练习即完成:
```bash
$ make lock1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1/lock; go test -v -race
=== RUN TestReliableBasic
...
=== RUN TestOneClientUnreliable
Test: 1 lock clients (unreliable network)...
... Passed -- time 2.1s #peers 1 #RPCs 66 #Ops 57
--- PASS: TestOneClientUnreliable (2.18s)
=== RUN TestManyClientsUnreliable
Test: 10 lock clients (unreliable network)...
... Passed -- time 4.1s #peers 1 #RPCs 778 #Ops 617
--- PASS: TestManyClientsUnreliable (4.23s)
PASS
ok 6.5840/kvsrv1/lock 12.227s
$
```
---
*来源: [6.5840 Lab 2: Key/Value Server](https://pdos.csail.mit.edu/6.824/labs/lab-kvsrv1.html)*


@@ -0,0 +1,178 @@
# 6.5840 Lab 2: Key/Value Server
## Introduction
In this lab you will build a key/value server for a single machine that ensures that each Put operation is executed at-most-once despite network failures and that the operations are linearizable. You will use this KV server to implement a lock. Later labs will replicate a server like this one to handle server crashes.
## KV server
Each client interacts with the key/value server using a **Clerk**, a set of library routines which sends RPCs to the server. Clients can send two different RPCs to the server: **Put(key, value, version)** and **Get(key)**. The server maintains an in-memory map that records for each key a **(value, version)** tuple. Keys and values are strings. The version number records the number of times the key has been written.
- **Put(key, value, version)** installs or replaces the value for a particular key in the map only if the Put's version number matches the server's version number for the key. If the version numbers match, the server also increments the version number of the key. If the version numbers don't match, the server should return `rpc.ErrVersion`. A client can create a new key by invoking Put with version number 0 (and the resulting version stored by the server will be 1). If the version number of the Put is larger than 0 and the key doesn't exist, the server should return `rpc.ErrNoKey`.
- **Get(key)** fetches the current value for the key and its associated version. If the key doesn't exist at the server, the server should return `rpc.ErrNoKey`.
Maintaining a version number for each key will be useful for implementing locks using Put and ensuring at-most-once semantics for Put's when the network is unreliable and the client retransmits.
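As an illustration of why the version check is useful, a Put with version 0 behaves like an atomic test-and-set: if several clients race to create the same key, exactly one of them succeeds and the rest see `rpc.ErrVersion`. A sketch (the key name and `myID` are illustrative; `ck` is a Clerk as described above):
```go
// Try to create the key; this only succeeds if the key does not exist yet.
err := ck.Put("owner", myID, 0)
if err == rpc.ErrVersion {
	// some other client created the key first
}
```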
When you've finished this lab and passed all the tests, you'll have a **linearizable** key/value service from the point of view of clients calling `Clerk.Get` and `Clerk.Put`. That is, if client operations aren't concurrent, each client Clerk.Get and Clerk.Put will observe the modifications to the state implied by the preceding sequence of operations. For concurrent operations, the return values and final state will be the same as if the operations had executed one at a time in some order. Operations are concurrent if they overlap in time: for example, if client X calls Clerk.Put(), and client Y calls Clerk.Put(), and then client X's call returns. An operation must observe the effects of all operations that have completed before the operation starts. See the FAQ on [linearizability](../papers/linearizability-faq.txt) for more background.
Linearizability is convenient for applications because it's the behavior you'd see from a single server that processes requests one at a time. For example, if one client gets a successful response from the server for an update request, subsequently launched reads from other clients are guaranteed to see the effects of that update. Providing linearizability is relatively easy for a single server.
## Getting Started
We supply you with skeleton code and tests in `src/kvsrv1`. `kvsrv1/client.go` implements a Clerk that clients use to manage RPC interactions with the server; the Clerk provides Put and Get methods. `kvsrv1/server.go` contains the server code, including the Put and Get handlers that implement the server side of RPC requests. You will need to modify `client.go` and `server.go`. The RPC requests, replies, and error values are defined in the `kvsrv1/rpc` package in the file `kvsrv1/rpc/rpc.go`, which you should look at, though you don't have to modify rpc.go.
To get up and running, execute the following commands. Don't forget the `git pull` to get the latest software.
```bash
$ cd ~/6.5840
$ git pull
...
$ cd src
$ make kvsrv1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1 && go test -v -race
=== RUN TestReliablePut
One client and reliable Put (reliable network)...
kvsrv_test.go:25: Put err ErrNoKey
--- FAIL: TestReliablePut (0.31s)
...
$
```
## Key/value server with reliable network (easy)
Your first task is to implement a solution that works when there are no dropped messages. You'll need to add RPC-sending code to the Clerk Put/Get methods in `client.go`, and implement Put and Get RPC handlers in `server.go`.
You have completed this task when you pass the Reliable tests in the test suite:
```bash
$ cd src
$ make RUN="-run Reliable" kvsrv1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1 && go test -v -race -run Reliable
=== RUN TestReliablePut
One client and reliable Put (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 5 #Ops 5
--- PASS: TestReliablePut (0.12s)
=== RUN TestPutConcurrentReliable
Test: many clients racing to put values to the same key (reliable network)...
... Passed -- time 6.3s #peers 1 #RPCs 11025 #Ops 22050
--- PASS: TestPutConcurrentReliable (6.36s)
=== RUN TestMemPutManyClientsReliable
Test: memory use many put clients (reliable network)...
... Passed -- time 29.0s #peers 1 #RPCs 50000 #Ops 50000
--- PASS: TestMemPutManyClientsReliable (52.91s)
PASS
ok 6.5840/kvsrv1 60.732s
$
```
The numbers after each Passed are real time in seconds, the constant 1, the number of RPCs sent (including client RPCs), and the number of key/value operations executed (Clerk Get and Put calls).
## Implementing a lock using key/value clerk (moderate)
In many distributed applications, clients running on different machines use a key/value server to coordinate their activities. For example, ZooKeeper and Etcd allow clients to coordinate using a distributed lock, in analogy with how threads in a Go program can coordinate with locks (i.e., `sync.Mutex`). Zookeeper and Etcd implement such a lock with conditional put.
Your task is to implement locks, using your key/value server to store whatever per-lock state your design needs. There can be multiple independent locks, each with its own name, passed as an argument to `MakeLock`. A lock supports two methods: **Acquire** and **Release**. The specification is that only one client can successfully acquire a given lock at a time; other clients must wait until the first client has released the lock using Release.
We supply you with skeleton code and tests in `src/kvsrv1/lock/`. You will need to modify `src/kvsrv1/lock/lock.go`. Your Acquire and Release should store each lock's state in your key/value server, by calling `lk.ck.Put()` and `lk.ck.Get()`.
If a client crashes while holding a lock, the lock will never be released. In a design more sophisticated than this lab, the client would attach a [lease](https://en.wikipedia.org/wiki/Lease_(computer_science)) to a lock. When the lease expires, the lock server would release the lock on behalf of the client. In this lab clients don't crash and you can ignore this problem.
Implement Acquire and Release. You have completed this exercise when your code passes these tests:
```bash
$ cd src
$ make RUN="-run Reliable" lock1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1/lock; go test -v -race -run Reliable
=== RUN TestReliableBasic
Test: a single Acquire and Release (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 4 #Ops 4
--- PASS: TestReliableBasic (0.13s)
=== RUN TestReliableNested
Test: one client, two locks (reliable network)...
... Passed -- time 0.1s #peers 1 #RPCs 17 #Ops 17
--- PASS: TestReliableNested (0.17s)
=== RUN TestOneClientReliable
Test: 1 lock clients (reliable network)...
... Passed -- time 2.0s #peers 1 #RPCs 477 #Ops 477
--- PASS: TestOneClientReliable (2.14s)
=== RUN TestManyClientsReliable
Test: 10 lock clients (reliable network)...
... Passed -- time 2.2s #peers 1 #RPCs 5704 #Ops 5704
--- PASS: TestManyClientsReliable (2.36s)
PASS
ok 6.5840/kvsrv1/lock 5.817s
$
```
If you haven't implemented the lock yet, the first two tests will succeed.
This exercise requires little code but a bit more independent thought than the previous exercise.
- You will need a unique identifier for each lock client; call **kvtest.RandValue(8)** to generate a random string.
## Key/value server with dropped messages (moderate)
The main challenge in this exercise is that the network may re-order, delay, or discard RPC requests and/or replies. To recover from discarded requests/replies, the Clerk must keep re-trying each RPC until it receives a reply from the server.
- If the network discards an **RPC request** message, then the client re-sending the request will solve the problem: the server will receive and execute just the re-sent request.
- However, the network might instead discard an **RPC reply** message. The client does not know which message was discarded; the client only observes that it received no reply. If it was the reply that was discarded, and the client re-sends the RPC request, then the server will receive two copies of the request. That's OK for a Get, since Get doesn't modify the server state. It is safe to resend a Put RPC with the same version number, since the server executes Put conditionally on the version number; if the server received and executed a Put RPC, it will respond to a re-transmitted copy of that RPC with `rpc.ErrVersion` rather than executing the Put a second time.
A tricky case is if the server replies with an `rpc.ErrVersion` in a response to an RPC that the Clerk retried. In this case, the Clerk cannot know if the Clerk's Put was executed by the server or not: the first RPC might have been executed by the server but the network may have discarded the successful response from the server, so that the server sent `rpc.ErrVersion` only for the retransmitted RPC. Or, it might be that another Clerk updated the key before the Clerk's first RPC arrived at the server, so that the server executed neither of the Clerk's RPCs and replied `rpc.ErrVersion` to both. Therefore, if a Clerk receives `rpc.ErrVersion` for a **retransmitted** Put RPC, **Clerk.Put must return `rpc.ErrMaybe` to the application** instead of `rpc.ErrVersion` since the request may have been executed. It is then up to the application to handle this case. If the server responds to an **initial** (not retransmitted) Put RPC with `rpc.ErrVersion`, then the Clerk should return `rpc.ErrVersion` to the application, since the RPC was definitely not executed by the server.
It would be more convenient for application developers if Put's were exactly-once (i.e., no `rpc.ErrMaybe` errors) but that is difficult to guarantee without maintaining state at the server for each Clerk. In the last exercise of this lab, you will implement a lock using your Clerk to explore how to program with at-most-once Clerk.Put.
Now you should modify your `kvsrv1/client.go` to continue in the face of dropped RPC requests and replies. A return value of **true** from the client's `ck.clnt.Call()` indicates that the client received an RPC reply from the server; a return value of **false** indicates that it did not receive a reply (more precisely, Call() waits for a reply message for a timeout interval, and returns false if no reply arrives within that time). Your Clerk should keep re-sending an RPC until it receives a reply. Keep in mind the discussion of `rpc.ErrMaybe` above. Your solution shouldn't require any changes to the server.
Add code to Clerk to retry if it doesn't receive a reply. You have completed this task if your code passes all the tests for kvsrv1:
```bash
$ make kvsrv1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1 && go test -v -race
=== RUN TestReliablePut
One client and reliable Put (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 5 #Ops 5
--- PASS: TestReliablePut (0.12s)
=== RUN TestPutConcurrentReliable
...
=== RUN TestUnreliableNet
One client (unreliable network)...
... Passed -- time 4.0s #peers 1 #RPCs 268 #Ops 422
--- PASS: TestUnreliableNet (4.13s)
PASS
ok 6.5840/kvsrv1 64.442s
$
```
- Before the client retries, it should wait a little bit; you can use Go's time package and call **time.Sleep(100 * time.Millisecond)**.
## Implementing a lock using key/value clerk and unreliable network (easy)
Modify your lock implementation to work correctly with your modified key/value client when the network is not reliable. You have completed this exercise when your code passes all the lock1 tests:
```bash
$ make lock1
go build -race -o main/kvsrv1d main/kvsrv1d.go
cd kvsrv1/lock; go test -v -race
=== RUN TestReliableBasic
...
=== RUN TestOneClientUnreliable
Test: 1 lock clients (unreliable network)...
... Passed -- time 2.1s #peers 1 #RPCs 66 #Ops 57
--- PASS: TestOneClientUnreliable (2.18s)
=== RUN TestManyClientsUnreliable
Test: 10 lock clients (unreliable network)...
... Passed -- time 4.1s #peers 1 #RPCs 778 #Ops 617
--- PASS: TestManyClientsUnreliable (4.23s)
PASS
ok 6.5840/kvsrv1/lock 12.227s
$
```
---
*From: [6.5840 Lab 2: Key/Value Server](https://pdos.csail.mit.edu/6.824/labs/lab-kvsrv1.html)*


@@ -0,0 +1,247 @@
# 6.5840 Lab 3: Raft
## 简介
这是构建容错 key/value 存储系统系列实验中的第一个。本实验中你将实现 Raft一种复制状态机协议。下一个实验你将在 Raft 之上构建 key/value 服务。之后你将对服务进行 "shard"(分片),在多个复制状态机上获得更高性能。
复制服务通过在多台副本服务器上存储完整状态(即数据)副本来实现容错。复制使服务在部分服务器发生故障(崩溃或网络中断、不稳定)时仍能继续运行。难点在于故障可能导致各副本持有不同的数据副本。
Raft 将客户端请求组织成称为 **log**(日志)的序列,并保证所有副本服务器看到相同的日志。每个副本按日志顺序执行客户端请求,并应用到其本地服务状态副本。由于所有存活副本看到相同的日志内容,它们以相同顺序执行相同请求,从而保持相同的服务状态。若某服务器失败后恢复,Raft 会负责使其日志跟上。只要至少**多数**(majority)服务器存活且能相互通信,Raft 就会持续运行。若不存在这样的多数,Raft 不会取得进展,但一旦多数能再次通信就会从中断处继续。
本实验中你将把 Raft 实现为带有关联方法的 Go 对象类型,作为更大服务中的一个模块使用。一组 Raft 实例通过 RPC 相互通信以维护复制日志。你的 Raft 接口将支持无限长的、带编号的命令序列,也称为 **log entries**。条目用 *index* 编号。给定 index 的日志条目最终会被 **committed**。届时你的 Raft 应将该日志条目交给上层服务执行。
你应遵循 [Raft 扩展论文](../papers/raft-extended-cn.md) 的设计,尤其注意 **Figure 2**。你将实现论文中的大部分内容,包括保存持久状态并在节点失败重启后读取。不需要实现集群成员变更(第 6 节)。
本实验分**四个部分**提交。你必须在各自截止日前提交对应部分。
---
## 起步
若已完成 Lab 1,你已有实验源码。若没有,可在 Lab 1 说明中查看通过 git 获取源码的方法。
我们提供了骨架代码 `src/raft1/raft.go`,以及一组用于驱动实现并用于批改的测试,测试在 `src/raft1/raft_test.go`
批改时我们会在**不带** `-race` 标志下运行测试。但你自己**应用 `-race` 测试**。
运行以下命令即可开始。别忘了 `git pull` 获取最新代码。
```bash
$ cd ~/6.5840
$ git pull
...
$ cd src
$ make raft1
go build -race -o main/raft1d main/raft1d.go
cd raft1 && go test -v -race
=== RUN TestInitialElection3A
Test (3A): initial election (reliable network)...
Fatal: expected one leader, got none
/Users/rtm/824-process-raft/src/raft1/test.go:151
/Users/rtm/824-process-raft/src/raft1/raft_test.go:36
info: wrote visualization to /var/folders/x_/vk0xmxwn1sj91m89wsn5b1yh0000gr/T/porcupine-2242138501.html
--- FAIL: TestInitialElection3A (5.51s)
...
$
```
---
## 代码结构
在 `raft1/raft.go` 中补充代码实现 Raft。该文件中有骨架代码以及发送、接收 RPC 的示例。
你的实现必须支持下列接口,测试程序以及(最终)你的 key/value 服务器都会用到它。更多细节见 `raft.go` 和 `raftapi/raftapi.go` 中的注释。
```go
// create a new Raft server instance:
rf := Make(peers, me, persister, applyCh)
// start agreement on a new log entry:
rf.Start(command interface{}) (index, term, isleader)
// ask a Raft for its current term, and whether it thinks it is leader
rf.GetState() (term, isLeader)
// each time a new entry is committed to the log, each Raft peer
// should send an ApplyMsg to the service (or tester).
type ApplyMsg
```
服务通过调用 `Make(peers, me, …)` 创建 Raft 节点。**peers** 是 Raft 节点(含本节点)的网络标识数组,用于 RPC。**me** 是本节点在 peers 数组中的 index。**Start(command)** 请求 Raft 开始将命令追加到复制日志的流程。**Start()** 应立即返回,不等待日志追加完成。服务期望你的实现在每条新提交的日志条目时向 **Make()****applyCh** 参数发送一条 **ApplyMsg**
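从使用方(服务或测试程序)的角度看,这套接口的用法大致如下(仅为示意;`ApplyMsg` 的具体字段以 `raftapi/raftapi.go` 中的定义为准,`apply` 为假设的服务端处理函数):
```go
// 提交一条新命令;Start 立即返回,不等待提交完成。
index, term, isLeader := rf.Start(cmd)
_, _, _ = index, term, isLeader // index 为该命令若被提交将出现的位置

// 另一个 goroutine 持续读取已提交的日志条目并应用到服务状态:
for msg := range applyCh {
	if msg.CommandValid {
		apply(msg.Command)
	}
}
```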
`raft.go` 中有发送 RPC`sendRequestVote()`)和处理传入 RPC`RequestVote()`)的示例代码。你的 Raft 节点应使用 labrpc Go 包(源码在 `src/labrpc`)交换 RPC。测试可以指示 labrpc 延迟、重排或丢弃 RPC 以模拟各种网络故障。可以临时修改 labrpc但须确保你的 Raft 在原始 labrpc 下能工作因为我们会用其测试和批改。Raft 实例之间只能通过 RPC 交互;例如不允许通过共享 Go 变量或文件通信。
后续实验建立在本实验之上,因此留足时间写出可靠代码很重要。
---
## Part 3A: Leader Election
实现 Raft 的 leader 选举与心跳(不含日志条目的 AppendEntries RPC)。Part 3A 的目标是选出一个 leader、在无故障时保持该 leader、以及当旧 leader 失败或与之相关的包丢失时由新 leader 接管。在 `src` 目录下运行 `make RUN="-run 3A" raft1` 测试 3A 代码。
* 遵循论文 **Figure 2**。当前阶段关注发送和接收 RequestVote RPC、与选举相关的 Rules for Servers以及 leader 选举相关的 State。
*`raft.go` 的 Raft 结构体中加入 Figure 2 中与 leader 选举相关的状态。
* 填写 **RequestVoteArgs****RequestVoteReply** 结构体。修改 **Make()**,创建一个后台 goroutine在一段时间未收到其他节点消息时定期发起 leader 选举、发送 RequestVote RPC。实现 **RequestVote()** RPC 处理函数,使服务器能相互投票。
* 为实现心跳,定义 **AppendEntries** RPC 结构体(可能暂时不需要所有参数),并让 leader 定期发送。实现 **AppendEntries** RPC 处理函数。
* 测试要求 leader 发送心跳 RPC **每秒不超过十次**
* 测试要求你的 Raft 在旧 leader 失败后**五秒内**选出新 leader若多数节点仍能通信
* 论文 5.2 节提到选举超时在 150 到 300 毫秒范围。该范围仅在 leader 发送心跳远高于每 150 毫秒一次(例如每 10 毫秒)时合理。因测试限制为每秒十次心跳,你须使用**大于**论文 150 到 300 毫秒的选举超时,但不宜过大,否则可能无法在五秒内选出 leader。
* 可使用 Go 的 **rand**
* 需要编写定期或延迟执行动作的代码。最简单的方式是创建一个带循环的 goroutine 并调用 **time.Sleep()**;参见 **Make()** 中为此创建的 `ticker()` goroutine。**不要使用 Go 的 time.Timer 或 time.Ticker**,它们难以正确使用。
* 若测试难以通过,请再次阅读论文 Figure 2leader 选举的完整逻辑分布在图的多个部分。
* 别忘了实现 **GetState()**
* Go RPC 只发送**首字母大写**的 struct 字段。子结构字段名也须大写(例如数组中日志记录的字段)。**labgob** 包会对此给出警告;不要忽略。
* 本实验最具挑战的部分可能是调试。调试建议见 [Guidance](./2.%20Lab%20Guidance-cn.md) 页。
* 若测试失败,测试程序会生成一个可视化时间线文件,标出事件、网络分区、崩溃的服务器和执行的检查。参见[可视化示例](https://pdos.csail.mit.edu/6.824/labs/raft-tester.html)。你也可以添加自己的标注,例如 `tester.Annotate("Server 0", "short description", "details")`
提交 Part 3A 前请确保通过 3A 测试,看到类似输出:
```bash
$ make RUN="-run 3A" raft1
go build -race -o main/raft1d main/raft1d.go
cd raft1 && go test -v -race -run 3A
=== RUN TestInitialElection3A
Test (3A): initial election (reliable network)...
... Passed -- time 3.5s #peers 3 #RPCs 32 #Ops 0
--- PASS: TestInitialElection3A (3.84s)
=== RUN TestReElection3A
Test (3A): election after network failure (reliable network)...
... Passed -- time 6.2s #peers 3 #RPCs 68 #Ops 0
--- PASS: TestReElection3A (6.54s)
=== RUN TestManyElections3A
Test (3A): multiple elections (reliable network)...
... Passed -- time 9.8s #peers 7 #RPCs 684 #Ops 0
--- PASS: TestManyElections3A (10.68s)
PASS
ok 6.5840/raft1 22.095s
$
```
每行 "Passed" 包含五个数字测试耗时、Raft 节点数、测试期间发送的 RPC 数、RPC 消息总字节数、Raft 报告已提交的日志条数。你的数字会与示例不同。可以忽略这些数字,但它们有助于 sanity-check 实现发送的 RPC 数量。对 Lab 3、4、5 全部测试,若总耗时超过 600 秒或任一测试超过 120 秒,批改脚本会判为不通过。
批改时我们会在不带 `-race` 下运行测试。但请确保你的代码**在带 `-race` 时能稳定通过测试**。
---
## Part 3B: Log
实现 leader 和 follower 追加新日志条目的逻辑,使 `make RUN="-run 3B" raft1` 通过全部测试。
* 运行 `git pull` 获取最新实验代码。
* Raft 论文中日志从 1 开始编号,但我们建议实现为**从 0 开始**,在 index=0 放一个 term 为 0 的哑元条目。这样第一次 AppendEntries RPC 可以包含 PrevLogIndex 为 0且是日志中的有效 index。
* 首要目标应是通过 **TestBasicAgree3B()**。先实现 **Start()**,然后按 Figure 2 编写通过 AppendEntries RPC 发送和接收新日志条目的代码。在每个节点上对每条新提交的条目向 **applyCh** 发送。
* 需要实现**选举限制**(论文 5.4.1 节)。
* 代码中可能有反复检查某事件的循环。不要让这些循环无暂停地连续执行,否则会拖慢实现导致测试失败。使用 Go 的**条件变量**,或在每次循环迭代中 **time.Sleep(10 * time.Millisecond)**
* 为后续实验着想,尽量把代码写清楚。
* 若测试失败,查看 `raft_test.go` 并沿测试代码追踪,理解在测什么。
后续实验的测试可能会因代码过慢而判为不通过。可用 `time` 命令查看实际时间和 CPU 时间。典型输出:
```bash
$ make RUN="-run 3B" raft1
go build -race -o main/raft1d main/raft1d.go
cd raft1 && go test -v -race -run 3B
=== RUN TestBasicAgree3B
Test (3B): basic agreement (reliable network)...
... Passed -- time 1.6s #peers 3 #RPCs 18 #Ops 3
--- PASS: TestBasicAgree3B (1.96s)
=== RUN TestRPCBytes3B
...
=== RUN TestCount3B
Test (3B): RPC counts aren't too high (reliable network)...
... Passed -- time 2.7s #peers 3 #RPCs 32 #Ops 0
--- PASS: TestCount3B (3.05s)
PASS
ok 6.5840/raft1 71.716s
$
```
"ok 6.5840/raft 71.716s" 表示 Go 测得的 3B 测试实际(墙上)时间为 71.716 秒。若 3B 测试实际时间远超过几分钟,后续可能出问题。检查是否有长时间 sleep 或等待 RPC 超时、是否有不 sleep 或不等待条件/ channel 的循环、或是否发送了过多 RPC。
---
## Part 3C: Persistence
基于 Raft 的服务器重启后应从断点恢复。这要求 Raft 维护在重启后仍存在的**持久状态**。论文 Figure 2 指明了哪些状态应持久化。
真实实现会在每次状态变化时将 Raft 的持久状态写入磁盘,并在重启时从磁盘读取。你的实现不使用磁盘,而是从 **Persister** 对象(见 `tester1/persister.go`)保存和恢复持久状态。调用 **Raft.Make()** 的一方提供 Persister其初始内容为 Raft 最近持久化的状态若有。Raft 应从该 Persister 初始化状态,并在每次状态变化时用它保存持久状态。使用 Persister 的 **ReadRaftState()****Save()** 方法。
`raft.go` 中完成 **persist()****readPersist()**,添加保存和恢复持久状态的代码。需要将状态编码(或“序列化”)为字节数组才能传给 Persister。使用 **labgob** 编码器;参见 **persist()****readPersist()** 中的注释。labgob 类似 Go 的 gob但若尝试编码小写字段名的结构体会打印错误。目前将 **nil** 作为第二个参数传给 **persister.Save()**。在实现修改持久状态的位置插入对 **persist()** 的调用。完成上述工作且其余实现正确时,应能通过全部 3C 测试。
你可能需要**每次将 nextIndex 回退多于一个条目的优化**。参见 Raft 扩展论文第 7 页末、第 8 页初(灰线标记处)。论文对细节描述较模糊,需要自行补全。一种做法是让拒绝消息包含:
* **XTerm**:冲突条目的 term若有
* **XIndex**:该 term 第一条目的 index若有
* **XLen**:日志长度
则 leader 的逻辑可以是:
* **Case 1**leader 没有 XTerm → `nextIndex = XIndex`
* **Case 2**leader 有 XTerm → `nextIndex = (leader 中 XTerm 最后一条的 index) + 1`
* **Case 3**follower 日志过短 → `nextIndex = XLen`
其他提示:
* 运行 `git pull` 获取最新实验代码。
* 3C 测试比 3A 或 3B 更苛刻,失败可能由 3A 或 3B 代码中的问题引起。
你的代码应通过全部 3C 测试(如下所示),以及 3A 和 3B 测试。
```bash
$ make RUN="-run 3C" raft1
...
PASS
ok 6.5840/raft1 180.983s
$
```
提交前多跑几遍测试是个好习惯。
---
## Part 3D: Log Compaction
目前重启的服务器会重放完整 Raft 日志以恢复状态。但对长期运行的服务而言,永远记住完整 Raft 日志不现实。你需要修改 Raft使其与不定期持久化状态**快照snapshot**的服务协作,届时 Raft 丢弃快照之前的日志条目。结果是持久数据更少、重启更快。但可能出现 follower 落后太多leader 已丢弃其赶上来所需的日志;此时 leader 必须发送快照以及从快照时刻起的日志。Raft 扩展论文 [**Section 7**](../papers/raft-extended-cn.md) 概述了该方案;你需要设计细节。
你的 Raft 必须提供服务可调用的以下函数,传入其状态的序列化快照:
```go
Snapshot(index int, snapshot []byte)
```
在 Lab 3D 中,测试程序会定期调用 **Snapshot()**。在 Lab 4 中,你将编写会调用 **Snapshot()** 的 key/value 服务器;快照将包含完整的 key/value 表。服务层在每个节点(不仅是 leader上调用 **Snapshot()**
**index** 参数表示快照所反映的日志中最高条目的 index。Raft 应丢弃该点之前的日志条目。需要修改 Raft 代码,使其在只保存日志**尾部**的情况下运行。
需要实现论文中讨论的 **InstallSnapshot** RPC使 Raft leader 能告知落后的 Raft 节点用快照替换其状态。可能需要理清 InstallSnapshot 与 Figure 2 中的状态和规则如何交互。
当 follower 的 Raft 代码收到 **InstallSnapshot** RPC 时,可通过 **applyCh** 将快照以 **ApplyMsg** 形式发给服务。`raftapi/raftapi.go` 中 ApplyMsg 结构体定义已包含所需字段(也是测试期望的)。注意这些快照只能推进服务状态,不能使其回退。
若服务器崩溃,必须从持久数据重启。你的 Raft 应**同时持久化 Raft 状态和对应快照**。使用 **persister.Save()** 的第二个参数保存快照。若无快照,第二个参数传 **nil**
服务器重启时,应用层读取持久化的快照并恢复其保存的应用状态。重启后,应用层期望 **applyCh** 上的第一条消息要么是 **SnapshotIndex** 高于初始恢复快照的快照,要么是 **CommandIndex** 紧接在初始恢复快照 index 之后的普通命令。
实现 **Snapshot()****InstallSnapshot** RPC以及 Raft 为支持它们所需的修改(例如在截断日志下运行)。当通过 3D 测试(及此前全部 Lab 3 测试)时,你的方案即完成。
* `git pull` 确保使用最新代码。
* 一个好的起点是修改代码使其能只保存从某 index X 开始的日志部分。初始可将 X 设为 0 并跑 3B/3C 测试。然后让 **Snapshot(index)** 丢弃 index 之前的日志,并将 X 设为 index。顺利的话应能通过第一个 3D 测试。
* 第一个 3D 测试失败的常见原因是 follower 追上 leader 耗时过长。
* 接下来:若 leader 没有使 follower 赶上所需的日志条目,则发送 **InstallSnapshot** RPC。
* **在单次 InstallSnapshot RPC 中发送完整快照**。不要实现 Figure 13 中分片快照的 offset 机制。
* Raft 必须以允许 Go 垃圾回收器释放并重用内存的方式丢弃旧日志条目;这要求**不存在对已丢弃日志条目的可达引用(指针)**。
* Raft 节点重启时,传给 **Make()** 的 persister 会包含应用状态快照以及 Raft 保存的状态。若日志已被截断Raft 每次调用 **persister.Save()** 时都必须包含非 nil 快照,因此 **Make()** 中调用 **persister.ReadSnapshot()** 并保存结果是好做法。
* 在不带 `-race` 时,完整 Lab 3 测试3A+3B+3C+3D合理耗时约为**实际时间 6 分钟**、**CPU 时间 1 分钟**。带 `-race` 时约为**实际时间 10 分钟**、**CPU 时间 2 分钟**。
你的代码应通过全部 3D 测试(如下所示),以及 3A、3B、3C 测试。
```bash
$ make RUN="-run 3D" raft1
...
PASS
ok 6.5840/raft1 301.406s
$
```
---
*来源: [6.5840 Lab 3: Raft](https://pdos.csail.mit.edu/6.824/labs/lab-raft1.html)*
# 6.5840 Lab 3: Raft
## Introduction
This is the first in a series of labs in which you'll build a fault-tolerant key/value storage system. In this lab you'll implement Raft, a replicated state machine protocol. In the next lab you'll build a key/value service on top of Raft. Then you will "shard" your service over multiple replicated state machines for higher performance.
A replicated service achieves fault tolerance by storing complete copies of its state (i.e., data) on multiple replica servers. Replication allows the service to continue operating even if some of its servers experience failures (crashes or a broken or flaky network). The challenge is that failures may cause the replicas to hold differing copies of the data.
Raft organizes client requests into a sequence, called the **log**, and ensures that all the replica servers see the same log. Each replica executes client requests in log order, applying them to its local copy of the service's state. Since all the live replicas see the same log contents, they all execute the same requests in the same order, and thus continue to have identical service state. If a server fails but later recovers, Raft takes care of bringing its log up to date. Raft will continue to operate as long as at least a **majority** of the servers are alive and can talk to each other. If there is no such majority, Raft will make no progress, but will pick up where it left off as soon as a majority can communicate again.
In this lab you'll implement Raft as a Go object type with associated methods, meant to be used as a module in a larger service. A set of Raft instances talk to each other with RPC to maintain replicated logs. Your Raft interface will support an indefinite sequence of numbered commands, also called **log entries**. The entries are numbered with *index numbers*. The log entry with a given index will eventually be **committed**. At that point, your Raft should send the log entry to the larger service for it to execute.
You should follow the design in the [extended Raft paper](../papers/raft-extended.md), with particular attention to **Figure 2**. You'll implement most of what's in the paper, including saving persistent state and reading it after a node fails and then restarts. You will not implement cluster membership changes (Section 6).
This lab is due in **four parts**. You must submit each part on the corresponding due date.
---
## Getting Started
If you have done Lab 1, you already have a copy of the lab source code. If not, you can find directions for obtaining the source via git in the Lab 1 instructions.
We supply you with skeleton code `src/raft1/raft.go`. We also supply a set of tests, which you should use to drive your implementation efforts, and which we'll use to grade your submitted lab. The tests are in `src/raft1/raft_test.go`.
When we grade your submissions, we will run the tests **without** the `-race` flag. However, you should **test with `-race`**.
To get up and running, execute the following commands. Don't forget the `git pull` to get the latest software.
```bash
$ cd ~/6.5840
$ git pull
...
$ cd src
$ make raft1
go build -race -o main/raft1d main/raft1d.go
cd raft1 && go test -v -race
=== RUN TestInitialElection3A
Test (3A): initial election (reliable network)...
Fatal: expected one leader, got none
/Users/rtm/824-process-raft/src/raft1/test.go:151
/Users/rtm/824-process-raft/src/raft1/raft_test.go:36
info: wrote visualization to /var/folders/x_/vk0xmxwn1sj91m89wsn5b1yh0000gr/T/porcupine-2242138501.html
--- FAIL: TestInitialElection3A (5.51s)
...
$
```
---
## The Code
Implement Raft by adding code to `raft1/raft.go`. In that file you'll find skeleton code, plus examples of how to send and receive RPCs.
Your implementation must support the following interface, which the tester and (eventually) your key/value server will use. You'll find more details in comments in `raft.go` and in `raftapi/raftapi.go`.
```go
// create a new Raft server instance:
rf := Make(peers, me, persister, applyCh)
// start agreement on a new log entry:
rf.Start(command interface{}) (index, term, isleader)
// ask a Raft for its current term, and whether it thinks it is leader
rf.GetState() (term, isLeader)
// each time a new entry is committed to the log, each Raft peer
// should send an ApplyMsg to the service (or tester).
type ApplyMsg
```
A service calls `Make(peers, me, …)` to create a Raft peer. The **peers** argument is an array of network identifiers of the Raft peers (including this one), for use with RPC. The **me** argument is the index of this peer in the peers array. **Start(command)** asks Raft to start the processing to append the command to the replicated log. **Start()** should return immediately, without waiting for the log appends to complete. The service expects your implementation to send an **ApplyMsg** for each newly committed log entry to the **applyCh** channel argument to **Make()**.
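For orientation, here is a minimal sketch (not the tester's actual code) of a service creating a Raft peer and consuming committed entries from **applyCh**; `apply()` is a hypothetical service-side function, and the snapshot-related ApplyMsg fields are ignored until Part 3D.
```go
// Sketch: create a Raft peer and consume committed entries in log order.
// Assumes the raftapi package for ApplyMsg.
applyCh := make(chan raftapi.ApplyMsg)
rf := Make(peers, me, persister, applyCh)

go func() {
	for msg := range applyCh {
		if msg.CommandValid {
			apply(msg.Command, msg.CommandIndex) // hypothetical: update the service's state
		}
	}
}()

// Later, ask Raft to replicate a command; Start() returns immediately.
index, term, isLeader := rf.Start(cmd)
```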
`raft.go` contains example code that sends an RPC (`sendRequestVote()`) and that handles an incoming RPC (`RequestVote()`). Your Raft peers should exchange RPCs using the labrpc Go package (source in `src/labrpc`). The tester can tell labrpc to delay RPCs, re-order them, and discard them to simulate various network failures. While you can temporarily modify labrpc, make sure your Raft works with the original labrpc, since that's what we'll use to test and grade your lab. Your Raft instances must interact only through RPC; for example, they are not allowed to communicate using shared Go variables or files.
Subsequent labs build on this lab, so it is important to give yourself enough time to write solid code.
---
## Part 3A: Leader Election
Implement Raft leader election and heartbeats (AppendEntries RPCs with no log entries). The goal for Part 3A is for a single leader to be elected, for the leader to remain the leader if there are no failures, and for a new leader to take over if the old leader fails or if packets to/from the old leader are lost. Run `make RUN="-run 3A" raft1` in the `src` directory to test your 3A code.
* Follow the paper's **Figure 2**. At this point you care about sending and receiving RequestVote RPCs, the Rules for Servers that relate to elections, and the State related to leader election.
* Add the Figure 2 state for leader election to the Raft struct in `raft.go`.
* Fill in the **RequestVoteArgs** and **RequestVoteReply** structs. Modify **Make()** to create a background goroutine that will kick off leader election periodically by sending out RequestVote RPCs when it hasn't heard from another peer for a while. Implement the **RequestVote()** RPC handler so that servers will vote for one another.
* To implement heartbeats, define an **AppendEntries** RPC struct (though you may not need all the arguments yet), and have the leader send them out periodically. Write an **AppendEntries** RPC handler method.
* The tester requires that the leader send heartbeat RPCs **no more than ten times per second**.
* The tester requires your Raft to **elect a new leader within five seconds** of the failure of the old leader (if a majority of peers can still communicate).
* The paper's Section 5.2 mentions election timeouts in the range of 150 to 300 milliseconds. Such a range only makes sense if the leader sends heartbeats considerably more often than once per 150 milliseconds (e.g., once per 10 milliseconds). Because the tester limits you to ten heartbeats per second, you will have to use an election timeout **larger** than the paper's 150 to 300 milliseconds, but not too large, because then you may fail to elect a leader within five seconds.
* You may find Go's **rand** useful.
* You'll need to write code that takes actions periodically or after delays in time. The easiest way to do this is to create a goroutine with a loop that calls **time.Sleep()**; see the `ticker()` goroutine that **Make()** creates for this purpose. **Don't use Go's time.Timer or time.Ticker**, which are difficult to use correctly. A sketch of this pattern appears after this list.
* If your code has trouble passing the tests, read the paper's Figure 2 again; the full logic for leader election is spread over multiple parts of the figure.
* Don't forget to implement **GetState()**.
* Go RPC sends only struct fields whose names start with capital letters. Sub-structures must also have capitalized field names (e.g. fields of log records in an array). The **labgob** package will warn you about this; don't ignore the warnings.
* The most challenging part of this lab may be the debugging. Refer to the [Guidance](./2.%20Lab%20Guidance.md) page for debugging tips.
* If you fail a test, the tester produces a file that visualizes a timeline with events marked along it, including network partitions, crashed servers, and checks performed. Here's an [example of the visualization](https://pdos.csail.mit.edu/6.824/labs/raft-tester.html). Further, you can add your own annotations by writing, for example, `tester.Annotate("Server 0", "short description", "details")`.
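As a concrete illustration of the hints above about election timeouts and the `ticker()` loop, here is a minimal sketch. The field names (`state`, `lastHeard`), the `Leader` constant, the `startElection()` helper, and the 300-600 ms range are assumptions for illustration, not required values; it uses the `math/rand` and `time` packages.
```go
// Sketch of a ticker goroutine with a randomized election timeout.
func (rf *Raft) ticker() {
	for !rf.killed() {
		// Randomize the timeout so peers rarely time out at the same moment.
		timeout := time.Duration(300+rand.Intn(300)) * time.Millisecond
		time.Sleep(timeout)

		rf.mu.Lock()
		heard := time.Since(rf.lastHeard) < timeout // lastHeard is reset by AppendEntries and granted votes
		leader := rf.state == Leader
		rf.mu.Unlock()

		if !heard && !leader {
			rf.startElection() // hypothetical helper: become candidate, vote for self, send RequestVotes
		}
	}
}
```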
Be sure you pass the 3A tests before submitting Part 3A, so that you see something like this:
```bash
$ make RUN="-run 3A" raft1
go build -race -o main/raft1d main/raft1d.go
cd raft1 && go test -v -race -run 3A
=== RUN TestInitialElection3A
Test (3A): initial election (reliable network)...
... Passed -- time 3.5s #peers 3 #RPCs 32 #Ops 0
--- PASS: TestInitialElection3A (3.84s)
=== RUN TestReElection3A
Test (3A): election after network failure (reliable network)...
... Passed -- time 6.2s #peers 3 #RPCs 68 #Ops 0
--- PASS: TestReElection3A (6.54s)
=== RUN TestManyElections3A
Test (3A): multiple elections (reliable network)...
... Passed -- time 9.8s #peers 7 #RPCs 684 #Ops 0
--- PASS: TestManyElections3A (10.68s)
PASS
ok 6.5840/raft1 22.095s
$
```
Each "Passed" line contains five numbers; these are the time that the test took in seconds, the number of Raft peers, the number of RPCs sent during the test, the total number of bytes in the RPC messages, and the number of log entries that Raft reports were committed. Your numbers will differ from those shown here. You can ignore the numbers if you like, but they may help you sanity-check the number of RPCs that your implementation sends. For all of labs 3, 4, and 5, the grading script will fail your solution if it takes more than 600 seconds for all of the tests, or if any individual test takes more than 120 seconds.
When we grade your submissions, we will run the tests without the `-race` flag. However, you should make sure that your code **consistently passes the tests with the -race flag**.
---
## Part 3B: Log
Implement the leader and follower code to append new log entries, so that `make RUN="-run 3B" raft1` passes all tests.
* Run `git pull` to get the latest lab software.
* The Raft paper views the log as 1-indexed, but we suggest that you implement it as **0-indexed**, starting with a dummy entry at index=0 that has term 0. That allows the very first AppendEntries RPC to contain 0 as PrevLogIndex, and be a valid index into the log.
* Your first goal should be to pass **TestBasicAgree3B()**. Start by implementing **Start()**, then write the code to send and receive new log entries via AppendEntries RPCs, following Figure 2. Send each newly committed entry on **applyCh** on each peer.
* You will need to implement the **election restriction** (section 5.4.1 in the paper).
* Your code may have loops that repeatedly check for certain events. Don't have these loops execute continuously without pausing, since that will slow your implementation enough that it fails tests. Use Go's **condition variables**, or insert a **time.Sleep(10 * time.Millisecond)** in each loop iteration. A condition-variable sketch appears after this list.
* Do yourself a favor for future labs and write (or re-write) code that's clean and clear.
* If you fail a test, look at `raft_test.go` and trace the test code from there to understand what's being tested.
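To make the condition-variable hint concrete, here is a minimal sketch of an "applier" goroutine that waits until new entries are committed and then sends them on **applyCh** in order. The field names (`applyCond`, `commitIndex`, `lastApplied`, `log`) are assumptions; `applyCond` is assumed to be a `sync.Cond` created with `rf.mu` and signalled whenever `commitIndex` advances, and the ApplyMsg fields assumed are the usual `CommandValid`/`Command`/`CommandIndex`.
```go
// Sketch of an applier goroutine driven by a sync.Cond.
func (rf *Raft) applier() {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	for !rf.killed() {
		for rf.lastApplied >= rf.commitIndex && !rf.killed() {
			rf.applyCond.Wait() // signalled whenever commitIndex grows
		}
		for rf.lastApplied < rf.commitIndex {
			rf.lastApplied++
			msg := raftapi.ApplyMsg{
				CommandValid: true,
				Command:      rf.log[rf.lastApplied].Command,
				CommandIndex: rf.lastApplied,
			}
			rf.mu.Unlock() // don't hold the lock while blocking on the channel
			rf.applyCh <- msg
			rf.mu.Lock()
		}
	}
}
```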
The tests for upcoming labs may fail your code if it runs too slowly. You can check how much real time and CPU time your solution uses with the `time` command. Here's typical output:
```bash
$ make RUN="-run 3B" raft1
go build -race -o main/raft1d main/raft1d.go
cd raft1 && go test -v -race -run 3B
=== RUN TestBasicAgree3B
Test (3B): basic agreement (reliable network)...
... Passed -- time 1.6s #peers 3 #RPCs 18 #Ops 3
--- PASS: TestBasicAgree3B (1.96s)
=== RUN TestRPCBytes3B
...
=== RUN TestCount3B
Test (3B): RPC counts aren't too high (reliable network)...
... Passed -- time 2.7s #peers 3 #RPCs 32 #Ops 0
--- PASS: TestCount3B (3.05s)
PASS
ok 6.5840/raft1 71.716s
$
```
The "ok 6.5840/raft 71.716s" means that Go measured the time taken for the 3B tests to be 71.716 seconds of real (wall-clock) time. If your solution uses much more than a few minutes of real time for the 3B tests, you may run into trouble later on. Look for time spent sleeping or waiting for RPC timeouts, loops that run without sleeping or waiting for conditions or channel messages, or large numbers of RPCs sent.
---
## Part 3C: Persistence
If a Raft-based server reboots it should resume service where it left off. This requires that Raft keep **persistent state** that survives a reboot. The paper's Figure 2 mentions which state should be persistent.
A real implementation would write Raft's persistent state to disk each time it changed, and would read the state from disk when restarting after a reboot. Your implementation won't use the disk; instead, it will save and restore persistent state from a **Persister** object (see `tester1/persister.go`). Whoever calls **Raft.Make()** supplies a Persister that initially holds Raft's most recently persisted state (if any). Raft should initialize its state from that Persister, and should use it to save its persistent state each time the state changes. Use the Persister's **ReadRaftState()** and **Save()** methods.
Complete the functions **persist()** and **readPersist()** in `raft.go` by adding code to save and restore persistent state. You will need to encode (or "serialize") the state as an array of bytes in order to pass it to the Persister. Use the **labgob** encoder; see the comments in **persist()** and **readPersist()**. labgob is like Go's gob encoder but prints error messages if you try to encode structures with lower-case field names. For now, pass **nil** as the second argument to **persister.Save()**. Insert calls to **persist()** at the points where your implementation changes persistent state. Once you've done this, and if the rest of your implementation is correct, you should pass all of the 3C tests.
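For concreteness, a minimal sketch of `persist()` and `readPersist()` might look like the following, assuming the persistent state from Figure 2 lives in fields named `currentTerm`, `votedFor`, and `log` (a slice of a hypothetical `LogEntry` type). It follows the pattern suggested by the skeleton's comments and needs the `bytes` package and the lab's labgob package.
```go
// Sketch: serialize persistent state with labgob and hand it to the Persister.
// The snapshot argument stays nil until Part 3D.
func (rf *Raft) persist() {
	w := new(bytes.Buffer)
	e := labgob.NewEncoder(w)
	e.Encode(rf.currentTerm)
	e.Encode(rf.votedFor)
	e.Encode(rf.log)
	rf.persister.Save(w.Bytes(), nil)
}

func (rf *Raft) readPersist(data []byte) {
	if len(data) == 0 {
		return // no previously persisted state
	}
	r := bytes.NewBuffer(data)
	d := labgob.NewDecoder(r)
	var currentTerm, votedFor int
	var logEntries []LogEntry
	if d.Decode(&currentTerm) == nil && d.Decode(&votedFor) == nil && d.Decode(&logEntries) == nil {
		rf.currentTerm, rf.votedFor, rf.log = currentTerm, votedFor, logEntries
	}
}
```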
You will probably need the **optimization that backs up nextIndex by more than one entry at a time**. Look at the extended Raft paper starting at the bottom of page 7 and top of page 8 (marked by a gray line). The paper is vague about the details; you will need to fill in the gaps. One possibility is to have a rejection message include:
* **XTerm**: term in the conflicting entry (if any)
* **XIndex**: index of first entry with that term (if any)
* **XLen**: log length
Then the leader's logic can be something like:
* **Case 1**: leader doesn't have XTerm → `nextIndex = XIndex`
* **Case 2**: leader has XTerm → `nextIndex = (index of leader's last entry for XTerm) + 1`
* **Case 3**: follower's log is too short → `nextIndex = XLen`
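In Go, the three cases above might look roughly like this on the leader side. It assumes the follower sets `XTerm` to -1 (and `XLen` to its log length) when its log has no entry at `PrevLogIndex`; the `AppendEntriesReply` fields and the `lastIndexForTerm` helper are illustrative, not a required design.
```go
// Sketch of the leader's fast-backup handling of a rejected AppendEntries.
// lastIndexForTerm is a hypothetical helper returning the leader's last
// index with the given term, or -1 if the leader has no entry with that term.
func (rf *Raft) backupNextIndex(peer int, reply *AppendEntriesReply) {
	if reply.XTerm == -1 {
		rf.nextIndex[peer] = reply.XLen // Case 3: follower's log is too short
		return
	}
	if last := rf.lastIndexForTerm(reply.XTerm); last == -1 {
		rf.nextIndex[peer] = reply.XIndex // Case 1: leader has no entry with XTerm
	} else {
		rf.nextIndex[peer] = last + 1 // Case 2: leader has entries with XTerm
	}
}
```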
A few other hints:
* Run `git pull` to get the latest lab software.
* The 3C tests are more demanding than those for 3A or 3B, and failures may be caused by problems in your code for 3A or 3B.
Your code should pass all the 3C tests (as shown below), as well as the 3A and 3B tests.
```bash
$ make RUN="-run 3C" raft1
...
PASS
ok 6.5840/raft1 180.983s
$
```
It is a good idea to run the tests multiple times before submitting.
---
## Part 3D: Log Compaction
As things stand now, a rebooting server replays the complete Raft log in order to restore its state. However, it's not practical for a long-running service to remember the complete Raft log forever. Instead, you'll modify Raft to cooperate with services that persistently store a **"snapshot"** of their state from time to time, at which point Raft discards log entries that precede the snapshot. The result is a smaller amount of persistent data and faster restart. However, it's now possible for a follower to fall so far behind that the leader has discarded the log entries it needs to catch up; the leader must then send a snapshot plus the log starting at the time of the snapshot. **Section 7** of the [extended Raft paper](../papers/raft-extended.md) outlines the scheme; you will have to design the details.
Your Raft must provide the following function that the service can call with a serialized snapshot of its state:
```go
Snapshot(index int, snapshot []byte)
```
In Lab 3D, the tester calls **Snapshot()** periodically. In Lab 4, you will write a key/value server that calls **Snapshot()**; the snapshot will contain the complete table of key/value pairs. The service layer calls **Snapshot()** on every peer (not just on the leader).
The **index** argument indicates the highest log entry that's reflected in the snapshot. Raft should discard its log entries before that point. You'll need to revise your Raft code to operate while storing only the **tail** of the log.
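One way to operate on a trimmed log is to keep the index and term of the last snapshotted entry and translate absolute log indices into slice positions. The sketch below makes that concrete; the field names (`snapshotIndex`, `snapshot`) and the `entryAt` helper are assumptions, not a required design.
```go
// Sketch: log[0] is a dummy entry standing in for the last snapshotted entry,
// so the entry with absolute index i lives at log[i-snapshotIndex].
func (rf *Raft) entryAt(index int) LogEntry {
	return rf.log[index-rf.snapshotIndex]
}

func (rf *Raft) Snapshot(index int, snapshot []byte) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if index <= rf.snapshotIndex {
		return // stale: we already snapshotted at or beyond this index
	}
	// Copy the surviving suffix into a fresh slice so the garbage collector
	// can reclaim the discarded entries.
	kept := append([]LogEntry{{Term: rf.entryAt(index).Term}},
		rf.log[index-rf.snapshotIndex+1:]...)
	rf.log = kept
	rf.snapshotIndex = index
	rf.snapshot = snapshot
	rf.persist() // in 3D, persist() must save the snapshot alongside the Raft state
}
```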
You'll need to implement the **InstallSnapshot** RPC discussed in the paper that allows a Raft leader to tell a lagging Raft peer to replace its state with a snapshot. You will likely need to think through how InstallSnapshot should interact with the state and rules in Figure 2.
When a follower's Raft code receives an **InstallSnapshot** RPC, it can use the **applyCh** to send the snapshot to the service in an **ApplyMsg**. The ApplyMsg struct definition in `raftapi/raftapi.go` already contains the fields you will need (and which the tester expects). Take care that these snapshots only advance the service's state, and don't cause it to move backwards.
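For example, a follower's InstallSnapshot handling might hand the snapshot to the service like this; the `args` field names follow Figure 13 of the paper and are assumptions, while the ApplyMsg fields assumed are the usual `SnapshotValid`/`Snapshot`/`SnapshotTerm`/`SnapshotIndex`.
```go
// Sketch (fragment): deliver an installed snapshot to the service on applyCh.
msg := raftapi.ApplyMsg{
	SnapshotValid: true,
	Snapshot:      args.Data,              // snapshot bytes from the leader
	SnapshotTerm:  args.LastIncludedTerm,  // term of the last entry the snapshot covers
	SnapshotIndex: args.LastIncludedIndex, // index of that entry
}
rf.applyCh <- msg // the service replaces its state with the snapshot
```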
If a server crashes, it must restart from persisted data. Your Raft should **persist both Raft state and the corresponding snapshot**. Use the second argument to **persister.Save()** to save the snapshot. If there's no snapshot, pass **nil** as the second argument.
When a server restarts, the application layer reads the persisted snapshot and restores its saved application state. After a restart, the application layer expects the first message on **applyCh** to either contain a snapshot with a **SnapshotIndex** higher than that of the initial restored snapshot, or an ordinary command with **CommandIndex** immediately following the index of the initial restored snapshot.
Implement **Snapshot()** and the **InstallSnapshot** RPC, as well as the changes to Raft to support these (e.g., operation with a trimmed log). Your solution is complete when it passes the 3D tests (and all the previous Lab 3 tests).
* `git pull` to make sure you have the latest software.
* A good place to start is to modify your code so that it is able to store just the part of the log starting at some index X. Initially you can set X to zero and run the 3B/3C tests. Then make **Snapshot(index)** discard the log before index, and set X equal to index. If all goes well you should now pass the first 3D test.
* A common reason for failing the first 3D test is that followers take too long to catch up to the leader.
* Next: have the leader send an **InstallSnapshot** RPC if it doesn't have the log entries required to bring a follower up to date.
* Send the **entire snapshot in a single InstallSnapshot RPC**. Don't implement Figure 13's offset mechanism for splitting up the snapshot.
* Raft must discard old log entries in a way that allows the Go garbage collector to free and re-use the memory; this requires that there be **no reachable references (pointers)** to the discarded log entries.
* When a Raft peer is re-started, the persister passed to **Make()** will contain a snapshot of application state as well as Raft's saved state. Raft must include a non-nil snapshot with every call to **persister.Save()** (if the log has been trimmed), which means that it's a good idea for **Make()** to call **persister.ReadSnapshot()** and save the result.
* A reasonable amount of time to consume for the full set of Lab 3 tests (3A+3B+3C+3D) without `-race` is **6 minutes** of real time and **one minute** of CPU time. When running with `-race`, it is about **10 minutes** of real time and **two minutes** of CPU time.
Your code should pass all the 3D tests (as shown below), as well as the 3A, 3B, and 3C tests.
```bash
$ make RUN="-run 3D" raft1
...
PASS
ok 6.5840/raft1 301.406s
$
```
---
*From: [6.5840 Lab 3: Raft](https://pdos.csail.mit.edu/6.824/labs/lab-raft1.html)*
# 6.5840 Lab 4: Fault-tolerant Key/Value Service
## 简介
在本实验中你将使用 Lab 3 的 Raft 库构建一个容错 key/value 存储服务。对客户端而言,该服务与 Lab 2 的服务器类似。但服务不是单台服务器,而是由一组使用 Raft 保持数据库一致的服务器组成。只要**多数**majority服务器存活且能通信你的 key/value 服务就应继续处理客户端请求,即使存在其他故障或网络分区。完成 Lab 4 后,你将实现 Raft 交互图中所示的全部部分Clerk、Service 和 Raft
客户端通过 **Clerk** 与你的 key/value 服务交互,与 Lab 2 相同。Clerk 实现的 **Put****Get** 方法与 Lab 2 语义一致Put 为**至多一次**Put/Get 必须形成 **linearizable** 历史。
对单机而言提供 linearizability 相对容易。在复制服务中更难,因为所有服务器必须对并发请求选择相同的执行顺序、必须避免用未更新的状态回复客户端、且必须在故障后以保留所有已确认客户端更新的方式恢复状态。
本实验分**三个部分**。在 **Part A** 中,你将用你的 Raft 实现一个与具体请求无关的复制状态机包 **rsm**。在 **Part B** 中,你将用 rsm 实现一个复制的 key/value 服务,但不使用快照。在 **Part C** 中,你将使用 Lab 3D 的快照实现,使 Raft 能丢弃旧日志条目。请分别在各自截止日前提交各部分。
建议复习 [Raft 扩展论文](../papers/raft-extended-cn.md),尤其是 **Section 7**(不含 8。更广视角可参阅 Chubby、Paxos Made Live、Spanner、Zookeeper、Harp、Viewstamped Replication 以及 [Bolosky et al](https://pdos.csail.mit.edu/6.824/papers/bolosky-usenix2010.pdf)。
**尽早开始。**
---
## 起步
我们在 `src/kvraft1` 中提供了骨架代码和测试。骨架使用 `src/kvraft1/rsm` 包复制服务器。服务器必须实现 rsm 中定义的 **StateMachine** 接口才能通过 rsm 复制自身。你的主要工作是在 rsm 中实现与具体服务器无关的复制逻辑。此外需要修改 `kvraft1/client.go``kvraft1/server.go` 实现服务器相关部分。这种拆分便于在下一实验复用 rsm。可以复用部分 Lab 2 代码(例如复制或导入 "src/kvsrv1" 包中的服务器代码),但非必须。
运行以下命令即可开始。别忘了 `git pull` 获取最新代码。
```bash
$ cd ~/6.5840
$ git pull
...
```
---
## Part A: Replicated State Machine (RSM)
```bash
$ cd src/kvraft1/rsm
$ go test -v
=== RUN TestBasic
Test RSM basic (reliable network)...
..
config.go:147: one: took too long
```
在使用 Raft 复制的常见客户端/服务架构中,服务与 Raft 有两种交互:服务 leader 通过调用 **raft.Start()** 提交客户端操作,所有服务副本通过 Raft 的 **applyCh** 接收已提交的操作并执行。在 leader 上,这两类活动会交织:某些服务器 goroutine 在处理客户端请求、已调用 **raft.Start()**,各自在等待其操作提交并得到执行结果;而 **applyCh** 上出现的已提交操作需要由服务执行,且结果需交给曾调用 **raft.Start()** 的 goroutine 以便返回给客户端。
**rsm** 包封装上述交互。它位于服务(如 key/value 数据库)与 Raft 之间。你需要在 **rsm/rsm.go** 中实现:
1. 一个**“reader” goroutine**,读取 applyCh
2. 一个 **rsm.Submit()** 函数,为客户端操作调用 **raft.Start()**,然后等待 reader goroutine 把该操作的执行结果交回
使用 rsm 的服务对 rsm 的 reader goroutine 呈现为提供 **DoOp()** 方法的 **StateMachine** 对象。Reader goroutine 应将每条已提交操作交给 **DoOp()****DoOp()** 的返回值应交给对应的 **rsm.Submit()** 调用并返回。**DoOp()** 的参数和返回值类型为 **any**;实际类型应分别与服务传给 **rsm.Submit()** 的参数和返回值类型一致。
服务应将每个客户端操作传给 **rsm.Submit()**。为便于 reader goroutine 将 applyCh 消息与等待中的 **rsm.Submit()** 调用对应,**Submit()** 应将每个客户端操作与一个**唯一标识**一起包装进 **Op** 结构。**Submit()** 然后等待该操作提交并执行完毕,返回执行结果(**DoOp()** 的返回值)。若 **raft.Start()** 表明当前节点不是 Raft leader**Submit()** 应返回 **rpc.ErrWrongLeader** 错误。**Submit()** 须检测并处理这种情况:在调用 **raft.Start()** 后 leadership 发生变化,导致该操作丢失(从未提交)。
在 Part A 中rsm 测试程序充当服务提交被解释为对单个整数状态做自增的操作。Part B 中你将把 rsm 用作实现 **StateMachine**(及 **DoOp()**)并调用 **rsm.Submit()** 的 key/value 服务的一部分。
顺利时,一次客户端请求的事件序列为:
1. 客户端向服务 leader 发送请求。
2. 服务 leader 用该请求调用 **rsm.Submit()**
3. **rsm.Submit()** 用该请求调用 **raft.Start()**,然后等待。
4. Raft 提交该请求并向所有节点的 applyCh 发送。
5. 每个节点上的 rsm reader goroutine 从 applyCh 读取该请求并交给服务的 **DoOp()**
6. 在 leader 上rsm reader goroutine 将 **DoOp()** 的返回值交给最初提交该请求的 **Submit()** goroutine**Submit()** 返回该值。
你的服务器之间不应直接通信;它们只应通过 Raft 交互。
**实现 rsm.go****Submit()** 方法以及 reader goroutine。当通过 rsm 的 4A 测试时,该任务即完成:
```bash
$ cd src/kvraft1/rsm
$ go test -v -run 4A
=== RUN TestBasic4A
Test RSM basic (reliable network)...
... Passed -- 1.2 3 48 0
--- PASS: TestBasic4A (1.21s)
=== RUN TestLeaderFailure4A
... Passed -- 9223372036.9 3 31 0
--- PASS: TestLeaderFailure4A (1.50s)
PASS
ok 6.5840/kvraft1/rsm 2.887s
```
* 不应需要给 Raft 的 ApplyMsg 或 AppendEntries 等 Raft RPC 增加字段,但允许这样做。
* 你的方案须处理rsm leader 已为 **Submit()** 提交的请求调用了 **Start()**,但在该请求提交到日志前失去了 leadership。一种做法是 rsm 通过发现 Raft 的 term 已变或 **Start()** 返回的 index 处出现了不同请求,检测到已失去 leadership并从 **Submit()** 返回 **rpc.ErrWrongLeader**。若旧 leader 独自处于分区中,它无法得知新 leader但同一分区内的客户端也无法联系新 leader因此服务器无限等待直到分区恢复是可以接受的。
* 测试在关闭节点时会调用你的 Raft 的 **rf.Kill()**。Raft 应**关闭 applyCh**,以便 rsm 得知关闭并退出所有循环。
---
## Part B: 无快照的 Key/value 服务
```bash
$ cd src/kvraft1
$ go test -v -run TestBasic4B
=== RUN TestBasic4B
Test: one client (4B basic) (reliable network)...
kvtest.go:62: Wrong error
$
```
现在用 rsm 包复制 key/value 服务器。每台服务器("**kvserver**")对应一个 rsm/Raft 节点。Clerk 向与 Raft leader 对应的 kvserver 发送 **Put()****Get()** RPC。kvserver 代码将 Put/Get 操作提交给 rsmrsm 通过 Raft 复制并在每个节点调用你服务器的 **DoOp**,将操作应用到该节点的 key/value 数据库;目标是使各服务器维护一致的 key/value 数据库副本。
Clerk 有时不知道哪台 kvserver 是 Raft leader。若 Clerk 向错误的 kvserver 发 RPC 或无法到达该 kvserverClerk 应**重试**,向其他 kvserver 发送。若 key/value 服务将操作提交到其 Raft 日志(从而应用到 key/value 状态机leader 通过回复该 RPC 将结果报告给 Clerk。若操作未提交例如 leader 被替换服务器报告错误Clerk 向其他服务器重试。
你的 kvserver 之间不应直接通信;它们只应通过 Raft 交互。
第一个任务是实现**在无丢包、无服务器失败时**正确的方案。
可以将 Lab 2 的客户端代码(`kvsrv1/client.go`)复制到 `kvraft1/client.go`。需要增加决定每次 RPC 发往哪台 kvserver 的逻辑。
还需在 **server.go** 中实现 **Put()****Get()** 的 RPC 处理函数。这些处理函数应通过 **rsm.Submit()** 将请求提交给 Raft。rsm 包从 **applyCh** 读取命令时,会调用 **DoOp** 方法,你将在 **server.go** 中实现。
当你能**稳定**通过测试套件中第一个测试(`go test -v -run TestBasic4B`)时,该任务即完成。
* 若 kvserver **不处于多数**,则**不应完成 Get() RPC**(以免提供过期数据)。一种简单做法是像每个 **Put()** 一样,通过 **Submit()** 把每个 **Get()** 也写入 Raft 日志。不必实现论文 Section 8 中只读操作的优化。
* 最好从一开始就加锁,因为避免死锁有时会影响整体代码设计。用 **go test -race** 检查代码无数据竞争。
接下来应修改方案以**在网络和服务器故障下继续正确工作**。你会遇到的一个问题是 Clerk 可能需多次发送 RPC 才能找到能正常回复的 kvserver。若 leader 在将条目提交到 Raft 日志后立即失败Clerk 可能收不到回复,从而向另一台 leader 重发请求。对同一 version 的每次 **Clerk.Put()** 调用应**只执行一次**。
**加入故障处理代码。** 你的 Clerk 可采用与 lab 2 类似的重试策略,包括在重试的 Put RPC 的回复丢失时返回 **ErrMaybe**。当你的代码能稳定通过全部 4B 测试(`go test -v -run 4B`)时,即完成。
* 回忆rsm leader 可能失去 leadership 并从 **Submit()** 返回 **rpc.ErrWrongLeader**。此时应让 Clerk 向其他服务器重发请求直到找到新 leader。
* 可能需要修改 Clerk使其**记住上一轮 RPC 中哪台服务器是 leader**,并优先将下一轮 RPC 发往该服务器。这样可避免每次 RPC 都重新找 leader有助于在限定时间内通过部分测试。
你的代码此时应能通过 Lab 4B 测试,例如:
```bash
$ cd kvraft1
$ go test -run 4B
Test: one client (4B basic) ...
... Passed -- 3.2 5 1041 183
Test: one client (4B speed) ...
... Passed -- 15.9 3 3169 0
Test: many clients (4B many clients) ...
... Passed -- 3.9 5 3247 871
Test: unreliable net, many clients (4B unreliable net, many clients) ...
... Passed -- 5.3 5 1035 167
Test: unreliable net, one client (4B progress in majority) ...
... Passed -- 2.9 5 155 3
Test: no progress in minority (4B) ...
... Passed -- 1.6 5 102 3
Test: completion after heal (4B) ...
... Passed -- 1.3 5 67 4
Test: partitions, one client (4B partitions, one client) ...
... Passed -- 6.2 5 958 155
Test: partitions, many clients (4B partitions, many clients (4B)) ...
... Passed -- 6.8 5 3096 855
Test: restarts, one client (4B restarts, one client 4B ) ...
... Passed -- 6.7 5 311 13
Test: restarts, many clients (4B restarts, many clients) ...
... Passed -- 7.5 5 1223 95
Test: unreliable net, restarts, many clients (4B unreliable net, restarts, many clients ) ...
... Passed -- 8.4 5 804 33
Test: restarts, partitions, many clients (4B restarts, partitions, many clients) ...
... Passed -- 10.1 5 1308 105
Test: unreliable net, restarts, partitions, many clients (4B unreliable net, restarts, partitions, many clients) ...
... Passed -- 11.9 5 1040 33
Test: unreliable net, restarts, partitions, random keys, many clients (4B unreliable net, restarts, partitions, random keys, many clients) ...
... Passed -- 12.1 7 2801 93
PASS
ok 6.5840/kvraft1 103.797s
```
每个 Passed 后的数字依次为:**实际时间(秒)**、**节点数**、**发送的 RPC 数**(含客户端 RPC、**执行的 key/value 操作数**Clerk Get/Put 调用)。
---
## Part C: 带快照的 Key/value 服务
目前你的 key/value 服务器没有调用 Raft 库的 **Snapshot()** 方法,因此重启的服务器必须重放完整持久化 Raft 日志才能恢复状态。现在将修改 kvserver 和 rsm与 Raft 协作以节省日志空间并缩短重启时间,使用 Lab 3D 的 Raft **Snapshot()**
测试程序将 **maxraftstate** 传给你的 **StartKVServer()**,你再传给 rsm。**maxraftstate** 表示持久化 Raft 状态的**最大允许大小**(字节),含日志但不含快照。应将 **maxraftstate****rf.PersistBytes()** 比较。每当 rsm 检测到 Raft 状态大小接近该阈值时,应通过调用 Raft 的 **Snapshot** 保存快照。rsm 可通过调用 **StateMachine** 接口的 **Snapshot** 方法获取 kvserver 的快照来创建该快照。若 **maxraftstate****-1**则不必做快照。maxraftstate 限制适用于你的 Raft 作为第一个参数传给 **persister.Save()** 的 GOB 编码字节。
persister 对象的源码在 **tester1/persister.go**
**修改你的 rsm**,使其在检测到持久化 Raft 状态过大时向 Raft 提交快照。rsm 服务器重启时,应用 **persister.ReadSnapshot()** 读取快照,若快照长度大于零则传给 StateMachine 的 **Restore()** 方法。当通过 rsm 的 **TestSnapshot4C** 时,该任务即完成。
```bash
$ cd kvraft1/rsm
$ go test -run TestSnapshot4C
=== RUN TestSnapshot4C
... Passed -- 9223372036.9 3 230 0
--- PASS: TestSnapshot4C (3.88s)
PASS
ok 6.5840/kvraft1/rsm 3.882s
```
* 考虑 rsm **何时**应对状态做快照,以及快照中除服务器状态外还应**包含什么**。Raft 用 **Save()** 将每个快照与对应 Raft 状态一起存入 persister。可用 **ReadSnapshot()** 读取最新存储的快照。
* 快照中存储的结构体**所有字段名首字母大写**。
**实现 kvraft1/server.go 中的 Snapshot() 和 Restore() 方法**,供 rsm 调用。**修改 rsm 以处理 applyCh 上包含快照的消息。**
* 该任务可能暴露出 Raft 和 rsm 库中的 bug。若修改了 Raft 实现,请确保其仍能通过全部 Lab 3 测试。
* Lab 4 测试的合理耗时为**实际时间 400 秒**、**CPU 时间 700 秒**。
你的代码应通过 4C 测试(如下例),以及 4A+B 测试(且 Raft 须继续通过 Lab 3 测试)。
```bash
$ go test -run 4C
Test: snapshots, one client (4C SnapshotsRPC) ...
Test: InstallSnapshot RPC (4C) ...
... Passed -- 4.5 3 241 64
Test: snapshots, one client (4C snapshot size is reasonable) ...
... Passed -- 11.4 3 2526 800
Test: snapshots, one client (4C speed) ...
... Passed -- 14.2 3 3149 0
Test: restarts, snapshots, one client (4C restarts, snapshots, one client) ...
... Passed -- 6.8 5 305 13
Test: restarts, snapshots, many clients (4C restarts, snapshots, many clients ) ...
... Passed -- 9.0 5 5583 795
Test: unreliable net, snapshots, many clients (4C unreliable net, snapshots, many clients) ...
... Passed -- 4.7 5 977 155
Test: unreliable net, restarts, snapshots, many clients (4C unreliable net, restarts, snapshots, many clients) ...
... Passed -- 8.6 5 847 33
Test: unreliable net, restarts, partitions, snapshots, many clients (4C unreliable net, restarts, partitions, snapshots, many clients) ...
... Passed -- 11.5 5 841 33
Test: unreliable net, restarts, partitions, snapshots, random keys, many clients (4C unreliable net, restarts, partitions, snapshots, random keys, many clients) ...
... Passed -- 12.8 7 2903 93
PASS
ok 6.5840/kvraft1 83.543s
```
---
*来源: [6.5840 Lab 4: Fault-tolerant Key/Value Service](https://pdos.csail.mit.edu/6.824/labs/lab-kvraft1.html)*
# 6.5840 Lab 4: Fault-tolerant Key/Value Service
## Introduction
In this lab you will build a fault-tolerant key/value storage service using your Raft library from Lab 3. To clients, the service looks similar to the server of Lab 2. However, instead of a single server, the service consists of a set of servers that use Raft to help them maintain identical databases. Your key/value service should continue to process client requests as long as a **majority** of the servers are alive and can communicate, in spite of other failures or network partitions. After Lab 4, you will have implemented all parts (Clerk, Service, and Raft) shown in the diagram of Raft interactions.
Clients will interact with your key/value service through a **Clerk**, as in Lab 2. A Clerk implements the **Put** and **Get** methods with the same semantics as Lab 2: Puts are **at-most-once** and the Puts/Gets must form a **linearizable** history.
Providing linearizability is relatively easy for a single server. It is harder if the service is replicated, since all servers must choose the same execution order for concurrent requests, must avoid replying to clients using state that isn't up to date, and must recover their state after a failure in a way that preserves all acknowledged client updates.
This lab has **three parts**. In **part A**, you will implement a replicated-state machine package, **rsm**, using your raft implementation; rsm is agnostic of the requests that it replicates. In **part B**, you will implement a replicated key/value service using rsm, but without using snapshots. In **part C**, you will use your snapshot implementation from Lab 3D, which will allow Raft to discard old log entries. Please submit each part by the respective deadline.
You should review the [extended Raft paper](../papers/raft-extended.md), in particular **Section 7** (but not 8). For a wider perspective, have a look at Chubby, Paxos Made Live, Spanner, Zookeeper, Harp, Viewstamped Replication, and [Bolosky et al](https://pdos.csail.mit.edu/6.824/papers/bolosky-usenix2010.pdf).
**Start early.**
---
## Getting Started
We supply you with skeleton code and tests in `src/kvraft1`. The skeleton code uses the skeleton package `src/kvraft1/rsm` to replicate a server. A server must implement the **StateMachine** interface defined in rsm to replicate itself using rsm. Most of your work will be implementing rsm to provide server-agnostic replication. You will also need to modify `kvraft1/client.go` and `kvraft1/server.go` to implement the server-specific parts. This split allows you to re-use rsm in the next lab. You may be able to re-use some of your Lab 2 code (e.g., re-using the server code by copying or importing the "src/kvsrv1" package) but it is not a requirement.
To get up and running, execute the following commands. Don't forget the `git pull` to get the latest software.
```bash
$ cd ~/6.5840
$ git pull
...
```
---
## Part A: Replicated State Machine (RSM)
```bash
$ cd src/kvraft1/rsm
$ go test -v
=== RUN TestBasic
Test RSM basic (reliable network)...
..
config.go:147: one: took too long
```
In the common situation of a client/server service using Raft for replication, the service interacts with Raft in two ways: the service leader submits client operations by calling **raft.Start()**, and all service replicas receive committed operations via Raft's **applyCh**, which they execute. On the leader, these two activities interact. At any given time, some server goroutines are handling client requests, have called **raft.Start()**, and each is waiting for its operation to commit and to find out what the result of executing the operation is. And as committed operations appear on the **applyCh**, each needs to be executed by the service, and the results need to be handed to the goroutine that called **raft.Start()** so that it can return the result to the client.
The **rsm** package encapsulates the above interaction. It sits as a layer between the service (e.g. a key/value database) and Raft. In **rsm/rsm.go** you will need to implement:
1. A **"reader" goroutine** that reads the applyCh
2. A **rsm.Submit()** function that calls **raft.Start()** for a client operation and then waits for the reader goroutine to hand it the result of executing that operation
The service that is using rsm appears to the rsm reader goroutine as a **StateMachine** object providing a **DoOp()** method. The reader goroutine should hand each committed operation to **DoOp()**; **DoOp()**'s return value should be given to the corresponding **rsm.Submit()** call for it to return. **DoOp()**'s argument and return value have type **any**; the actual values should have the same types as the argument and return values that the service passes to **rsm.Submit()**, respectively.
The service should pass each client operation to **rsm.Submit()**. To help the reader goroutine match applyCh messages with waiting calls to **rsm.Submit()**, **Submit()** should wrap each client operation in an **Op** structure along with a **unique identifier**. **Submit()** should then wait until the operation has committed and been executed, and return the result of execution (the value returned by **DoOp()**). If **raft.Start()** indicates that the current peer is not the Raft leader, **Submit()** should return an **rpc.ErrWrongLeader** error. **Submit()** should detect and handle the situation in which leadership changed just after it called **raft.Start()**, causing the operation to be lost (never committed).
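A minimal sketch of this pattern follows; the RSM field names (`mu`, `rf`, `waiters`), the exact `Submit()` signature, the `rpc.OK` value, and the timeout fallback are illustrative assumptions, and it uses the `math/rand` and `time` packages.
```go
// Op wraps a client request with a unique id so the reader goroutine can tell
// whether the entry committed at a given index is still the one we submitted.
type Op struct {
	Id  int64
	Req any
}

type waiter struct {
	id int64
	ch chan any
}

func (r *RSM) Submit(req any) (rpc.Err, any) {
	op := Op{Id: rand.Int63(), Req: req}
	index, _, isLeader := r.rf.Start(op)
	if !isLeader {
		return rpc.ErrWrongLeader, nil
	}

	w := waiter{id: op.Id, ch: make(chan any, 1)}
	r.mu.Lock()
	r.waiters[index] = w
	r.mu.Unlock()

	select {
	case res := <-w.ch:
		return rpc.OK, res // the reader goroutine saw our Op commit at this index
	case <-time.After(time.Second):
		// Our entry never committed at this index (e.g. we lost leadership right
		// after Start()); a more careful solution might also watch for a term
		// change or for a different Op appearing at the index.
		return rpc.ErrWrongLeader, nil
	}
}
```
In this sketch, when the reader goroutine sees an Op commit at `index`, it looks up `waiters[index]`: if the stored id matches, it hands over the `DoOp()` result; if it differs, a different request won that slot, which means this peer lost leadership.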
For Part A, the rsm tester acts as the service, submitting operations that it interprets as increments on a state consisting of a single integer. In Part B you'll use rsm as part of a key/value service that implements **StateMachine** (and **DoOp()**), and calls **rsm.Submit()**.
If all goes well, the sequence of events for a client request is:
1. The client sends a request to the service leader.
2. The service leader calls **rsm.Submit()** with the request.
3. **rsm.Submit()** calls **raft.Start()** with the request, and then waits.
4. Raft commits the request and sends it on all peers' applyChs.
5. The rsm reader goroutine on each peer reads the request from the applyCh and passes it to the service's **DoOp()**.
6. On the leader, the rsm reader goroutine hands the **DoOp()** return value to the **Submit()** goroutine that originally submitted the request, and **Submit()** returns that value.
Your servers should not directly communicate; they should only interact with each other through Raft.
**Implement rsm.go:** the **Submit()** method and a reader goroutine. You have completed this task if you pass the rsm 4A tests:
```bash
$ cd src/kvraft1/rsm
$ go test -v -run 4A
=== RUN TestBasic4A
Test RSM basic (reliable network)...
... Passed -- 1.2 3 48 0
--- PASS: TestBasic4A (1.21s)
=== RUN TestLeaderFailure4A
... Passed -- 9223372036.9 3 31 0
--- PASS: TestLeaderFailure4A (1.50s)
PASS
ok 6.5840/kvraft1/rsm 2.887s
```
* You should not need to add any fields to the Raft ApplyMsg, or to Raft RPCs such as AppendEntries, but you are allowed to do so.
* Your solution needs to handle an rsm leader that has called **Start()** for a request submitted with **Submit()** but loses its leadership before the request is committed to the log. One way to do this is for the rsm to detect that it has lost leadership, by noticing that Raft's term has changed or a different request has appeared at the index returned by **Start()**, and return **rpc.ErrWrongLeader** from **Submit()**. If the ex-leader is partitioned by itself, it won't know about new leaders; but any client in the same partition won't be able to talk to a new leader either, so it's OK in this case for the server to wait indefinitely until the partition heals.
* The tester calls your Raft's **rf.Kill()** when it is shutting down a peer. Raft should **close the applyCh** so that your rsm learns about the shutdown, and can exit out of all loops.
---
## Part B: Key/value Service without Snapshots
```bash
$ cd src/kvraft1
$ go test -v -run TestBasic4B
=== RUN TestBasic4B
Test: one client (4B basic) (reliable network)...
kvtest.go:62: Wrong error
$
```
Now you will use the rsm package to replicate a key/value server. Each of the servers ("**kvservers**") will have an associated rsm/Raft peer. Clerks send **Put()** and **Get()** RPCs to the kvserver whose associated Raft is the leader. The kvserver code submits the Put/Get operation to rsm, which replicates it using Raft and invokes your server's **DoOp** at each peer, which should apply the operations to the peer's key/value database; the intent is for the servers to maintain identical replicas of the key/value database.
A Clerk sometimes doesn't know which kvserver is the Raft leader. If the Clerk sends an RPC to the wrong kvserver, or if it cannot reach the kvserver, the Clerk should **re-try** by sending to a different kvserver. If the key/value service commits the operation to its Raft log (and hence applies the operation to the key/value state machine), the leader reports the result to the Clerk by responding to its RPC. If the operation failed to commit (for example, if the leader was replaced), the server reports an error, and the Clerk retries with a different server.
Your kvservers should not directly communicate; they should only interact with each other through Raft.
Your first task is to implement a solution that **works when there are no dropped messages, and no failed servers**.
Feel free to copy your client code from Lab 2 (`kvsrv1/client.go`) into `kvraft1/client.go`. You will need to add logic for deciding which kvserver to send each RPC to.
You'll also need to implement **Put()** and **Get()** RPC handlers in **server.go**. These handlers should submit the request to Raft using **rsm.Submit()**. As the rsm package reads commands from **applyCh**, it should invoke the **DoOp** method, which you will have to implement in **server.go**.
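A sketch of the Put handler might look like this, assuming the Lab 2 `rpc.PutArgs`/`rpc.PutReply` types and a `Submit()` that returns the `DoOp()` result; the exact shapes are assumptions.
```go
// Sketch of a kvserver Put handler: replicate the request through rsm, then
// return whatever DoOp computed on this replica.
func (kv *KVServer) Put(args *rpc.PutArgs, reply *rpc.PutReply) {
	err, res := kv.rsm.Submit(*args)
	if err != rpc.OK {
		reply.Err = err // e.g. rpc.ErrWrongLeader: the Clerk should try another server
		return
	}
	*reply = res.(rpc.PutReply) // DoOp's result for this request
}
```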
You have completed this task when you **reliably** pass the first test in the test suite, with `go test -v -run TestBasic4B`.
* A kvserver should **not complete a Get() RPC if it is not part of a majority** (so that it does not serve stale data). A simple solution is to enter every **Get()** (as well as each **Put()**) in the Raft log using **Submit()**. You don't have to implement the optimization for read-only operations that is described in Section 8.
* It's best to add locking from the start because the need to avoid deadlocks sometimes affects overall code design. Check that your code is race-free using **go test -race**.
Now you should modify your solution to **continue in the face of network and server failures**. One problem you'll face is that a Clerk may have to send an RPC multiple times until it finds a kvserver that replies positively. If a leader fails just after committing an entry to the Raft log, the Clerk may not receive a reply, and thus may re-send the request to another leader. Each call to **Clerk.Put()** should result in **just a single execution** for a particular version number.
**Add code to handle failures.** Your Clerk can use a similar retry plan as in lab 2, including returning **ErrMaybe** if a response to a retried Put RPC is lost. You are done when your code reliably passes all the 4B tests, with `go test -v -run 4B`.
* Recall that the rsm leader may lose its leadership and return **rpc.ErrWrongLeader** from **Submit()**. In this case you should arrange for the Clerk to re-send the request to other servers until it finds the new leader.
* You will probably have to modify your Clerk to **remember which server turned out to be the leader** for the last RPC, and send the next RPC to that server first. This will avoid wasting time searching for the leader on every RPC, which may help you pass some of the tests quickly enough.
Your code should now pass the Lab 4B tests, like this:
```bash
$ cd kvraft1
$ go test -run 4B
Test: one client (4B basic) ...
... Passed -- 3.2 5 1041 183
Test: one client (4B speed) ...
... Passed -- 15.9 3 3169 0
Test: many clients (4B many clients) ...
... Passed -- 3.9 5 3247 871
Test: unreliable net, many clients (4B unreliable net, many clients) ...
... Passed -- 5.3 5 1035 167
Test: unreliable net, one client (4B progress in majority) ...
... Passed -- 2.9 5 155 3
Test: no progress in minority (4B) ...
... Passed -- 1.6 5 102 3
Test: completion after heal (4B) ...
... Passed -- 1.3 5 67 4
Test: partitions, one client (4B partitions, one client) ...
... Passed -- 6.2 5 958 155
Test: partitions, many clients (4B partitions, many clients (4B)) ...
... Passed -- 6.8 5 3096 855
Test: restarts, one client (4B restarts, one client 4B ) ...
... Passed -- 6.7 5 311 13
Test: restarts, many clients (4B restarts, many clients) ...
... Passed -- 7.5 5 1223 95
Test: unreliable net, restarts, many clients (4B unreliable net, restarts, many clients ) ...
... Passed -- 8.4 5 804 33
Test: restarts, partitions, many clients (4B restarts, partitions, many clients) ...
... Passed -- 10.1 5 1308 105
Test: unreliable net, restarts, partitions, many clients (4B unreliable net, restarts, partitions, many clients) ...
... Passed -- 11.9 5 1040 33
Test: unreliable net, restarts, partitions, random keys, many clients (4B unreliable net, restarts, partitions, random keys, many clients) ...
... Passed -- 12.1 7 2801 93
PASS
ok 6.5840/kvraft1 103.797s
```
The numbers after each Passed are: **real time in seconds**, **number of peers**, **number of RPCs sent** (including client RPCs), and **number of key/value operations executed** (Clerk Get/Put calls).
---
## Part C: Key/value Service with Snapshots
As things stand now, your key/value server doesn't call your Raft library's **Snapshot()** method, so a rebooting server has to replay the complete persisted Raft log in order to restore its state. Now you'll modify kvserver and rsm to cooperate with Raft to save log space and reduce restart time, using Raft's **Snapshot()** from Lab 3D.
The tester passes **maxraftstate** to your **StartKVServer()**, which passes it to rsm. **maxraftstate** indicates the **maximum allowed size** of your persistent Raft state in bytes (including the log, but not including snapshots). You should compare **maxraftstate** to **rf.PersistBytes()**. Whenever your rsm detects that the Raft state size is approaching this threshold, it should save a snapshot by calling Raft's **Snapshot**. rsm can create this snapshot by calling the **Snapshot** method of the **StateMachine** interface to obtain a snapshot of the kvserver. If **maxraftstate** is **-1**, you do not have to snapshot. The maxraftstate limit applies to the GOB-encoded bytes your Raft passes as the first argument to **persister.Save()**.
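Inside the rsm reader loop, the trigger might be as simple as the following fragment; `maxraftstate` and the StateMachine `sm` are assumed to be stored in the rsm struct, and `msg` is the ApplyMsg just applied.
```go
// Sketch (fragment): after applying the committed command at msg.CommandIndex,
// ask the service for a snapshot once the persisted Raft state nears the limit.
if r.maxraftstate != -1 && r.rf.PersistBytes() >= r.maxraftstate {
	r.rf.Snapshot(msg.CommandIndex, r.sm.Snapshot())
}
```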
You can find the source for the persister object in **tester1/persister.go**.
**Modify your rsm** so that it detects when the persisted Raft state grows too large, and then hands a snapshot to Raft. When a rsm server restarts, it should read the snapshot with **persister.ReadSnapshot()** and, if the snapshot's length is greater than zero, pass the snapshot to the StateMachine's **Restore()** method. You complete this task if you pass **TestSnapshot4C** in rsm.
```bash
$ cd kvraft1/rsm
$ go test -run TestSnapshot4C
=== RUN TestSnapshot4C
... Passed -- 9223372036.9 3 230 0
--- PASS: TestSnapshot4C (3.88s)
PASS
ok 6.5840/kvraft1/rsm 3.882s
```
* Think about **when** rsm should snapshot its state and **what** should be included in the snapshot beyond just the server state. Raft stores each snapshot in the persister object using **Save()**, along with corresponding Raft state. You can read the latest stored snapshot using **ReadSnapshot()**.
* **Capitalize all fields** of structures stored in the snapshot.
**Implement the kvraft1/server.go Snapshot() and Restore() methods**, which rsm calls. **Modify rsm to handle applyCh messages that contain snapshots.**
* You may have bugs in your Raft and rsm library that this task exposes. If you make changes to your Raft implementation make sure it continues to pass all of the Lab 3 tests.
* A reasonable amount of time to take for the Lab 4 tests is **400 seconds** of real time and **700 seconds** of CPU time.
Your code should pass the 4C tests (as in the example here) as well as the 4A+B tests (and your Raft must continue to pass the Lab 3 tests).
```bash
$ go test -run 4C
Test: snapshots, one client (4C SnapshotsRPC) ...
Test: InstallSnapshot RPC (4C) ...
... Passed -- 4.5 3 241 64
Test: snapshots, one client (4C snapshot size is reasonable) ...
... Passed -- 11.4 3 2526 800
Test: snapshots, one client (4C speed) ...
... Passed -- 14.2 3 3149 0
Test: restarts, snapshots, one client (4C restarts, snapshots, one client) ...
... Passed -- 6.8 5 305 13
Test: restarts, snapshots, many clients (4C restarts, snapshots, many clients ) ...
... Passed -- 9.0 5 5583 795
Test: unreliable net, snapshots, many clients (4C unreliable net, snapshots, many clients) ...
... Passed -- 4.7 5 977 155
Test: unreliable net, restarts, snapshots, many clients (4C unreliable net, restarts, snapshots, many clients) ...
... Passed -- 8.6 5 847 33
Test: unreliable net, restarts, partitions, snapshots, many clients (4C unreliable net, restarts, partitions, snapshots, many clients) ...
... Passed -- 11.5 5 841 33
Test: unreliable net, restarts, partitions, snapshots, random keys, many clients (4C unreliable net, restarts, partitions, snapshots, random keys, many clients) ...
... Passed -- 12.8 7 2903 93
PASS
ok 6.5840/kvraft1 83.543s
```
---
*From: [6.5840 Lab 4: Fault-tolerant Key/Value Service](https://pdos.csail.mit.edu/6.824/labs/lab-kvraft1.html)*
# 6.5840 Lab 5: Sharded Key/Value Service
## 简介
你可以选择做基于自己想法的期末项目,或做本实验。
本实验中你将构建一个 key/value 存储系统,在一组由 Raft 复制的 key/value 服务器组(**shardgrps**)上对 key 进行 **"shard"**(分片/分区)。一个 **shard** 是 key/value 对的一个子集;例如所有以 "a" 开头的 key 可以是一个 shard以 "b" 开头的为另一个,等等。分片的目的是**性能**。每个 shardgrp 只处理少数 shard 的 put 和 get各 shardgrp 并行工作;因此系统总吞吐(单位时间内的 put/get 数)随 shardgrp 数量增加。
分片 key/value 服务的组件见实验示意图。**Shardgrps**(蓝色方块)存储带 key 的 shardshardgrp 1 存 key "a" 的 shardshardgrp 2 存 key "b" 的 shard。客户端通过 **clerk**绿色圆与服务交互clerk 实现 **Get****Put** 方法。为找到 Put/Get 所传 key 对应的 shardgrpclerk 从 **kvsrv**(黑色方块,即你在 **Lab 2** 实现的)获取 **configuration**。Configuration 描述从 shard 到 shardgrp 的映射(例如 shard 1 由 shardgrp 3 服务)。
管理员(即测试程序)使用另一个客户端 **controller**(紫色圆)向集群添加/移除 shardgrp 并更新应由哪个 shardgrp 服务哪个 shard。Controller 有一个主要方法:**ChangeConfigTo**,以新 configuration 为参数,将系统从当前 configuration 切换到新 configuration这涉及将 shard 迁移到新加入的 shardgrp、以及从即将离开的 shardgrp 迁出。为此 controller 1) 向 shardgrp 发 RPC**FreezeShard**、**InstallShard**、**DeleteShard**2) 更新存储在 kvsrv 中的 configuration。
引入 controller 是因为分片存储系统必须能**在 shardgrp 之间迁移 shard**:用于负载均衡,或当 shardgrp 加入、离开时(新容量、维修、下线)。
本实验的主要挑战是在 1) shard 到 shardgrp 的分配发生变化,以及 2) controller 在 **ChangeConfigTo** 期间失败或处于分区时恢复的情况下,保证 Get/Put 操作的 **linearizability**
1. **若 ChangeConfigTo 在重配置过程中失败**,部分 shard 可能已开始但未完成从一 shardgrp 迁到另一 shardgrp从而不可访问。测试会启动新的 controller你的任务是确保新的能完成旧 controller 未完成的重配置。
2. **ChangeConfigTo 会在 shardgrp 之间迁移 shard**。你必须保证**任意时刻每个 shard 最多只有一个 shardgrp 在服务请求**,这样使用旧 shardgrp 与新 shardgrp 的客户端不会破坏 linearizability。
本实验用 "configuration" 指 **shard 到 shardgrp 的分配**。这与 Raft 集群成员变更**不是**一回事;不需要实现 Raft 集群成员变更。
一个 shardgrp 服务器只属于一个 shardgrp。给定 shardgrp 内的服务器集合不会改变。
客户端与服务器之间的交互**只能通过 RPC**(不得使用共享 Go 变量或文件)。
* **Part A**:实现可用的 **shardctrler**(在 kvsrv 中存储/读取 configuration、**shardgrp**(用 Raft rsm 复制)和 **shardgrp clerk**。shardctrler 通过 shardgrp clerk 迁移 shard。
* **Part B**:修改 shardctrler在 configuration 变更期间处理故障与分区。
* **Part C**:允许多个 controller 并发且互不干扰。
* **Part D**:以任意方式扩展你的方案(可选)。
本实验的设计与 Flat Datacenter Storage、BigTable、Spanner、FAWN、Apache HBase、Rosebud、Spinnaker 等思路一致(细节不同)。
Lab 5 将使用你在 **Lab 2****kvsrv**,以及 **Lab 4****rsm 和 Raft**。Lab 5 与 Lab 4 必须使用相同的 rsm 和 Raft 实现。
迟交时长仅可用于 **Part A**,**不能**用于 Part B–D。
---
## 起步
执行 `git pull` 获取最新实验代码。
我们在 **src/shardkv1** 中提供了测试和骨架代码:
* **shardctrler** 包:`shardctrler.go`,包含 controller 变更 configuration 的 **ChangeConfigTo** 和获取 configuration 的 **Query**
* **shardgrp** 包:shardgrp clerk 与 server
* **shardcfg** 包:计算 shard configuration
* **client.go**:shardkv clerk
运行以下命令即可开始:
```bash
$ cd ~/6.5840
$ git pull
...
$ cd src/shardkv1
$ go test -v
=== RUN TestInitQuery5A
Test (5A): Init and Query ... (reliable network)...
shardkv_test.go:46: Static wrong null 0
...
```
---
## Part A: 迁移 Shard(困难)
第一个任务是:在无故障时实现 shardgrp 以及 **InitConfig**、**Query**、**ChangeConfigTo**。Configuration 的代码在 **shardkv1/shardcfg**。每个 **shardcfg.ShardConfig** 有唯一编号 **Num**、从 shard 编号到 group 编号的映射、以及从 group 编号到复制该 group 的服务器列表的映射。通常 shard 数多于 group 数,以便以较细粒度调整负载。
### 1. InitConfig 与 Query(尚无 shardgrp)
**shardctrler/shardctrler.go** 中实现:
* **Query**:返回当前 configuration,从 kvsrv 读取(由 InitConfig 存储)。
* **InitConfig**:接收第一个 configuration(测试程序提供的 **shardcfg.ShardConfig**)并存入 Lab 2 的 **kvsrv** 实例。
用 **ShardCtrler.IKVClerk** 的 Get/Put 与 kvsrv 通信,用 **ShardConfig.String()** 序列化后 Put,用 **shardcfg.FromString()** 反序列化。通过第一个测试时即完成:
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run TestInitQuery5A
Test (5A): Init and Query ... (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 3 #Ops 0
PASS
ok 6.5840/shardkv1 0.197s
```
### 2. Shardgrp 与 shardkv clerk(Static 测试)
通过从 Lab 4 kvraft 方案复制,在 **shardkv1/shardgrp/server.go** 实现 **shardgrp** 的初始版本,在 **shardkv1/shardgrp/client.go** 实现 **shardgrp clerk**。在 **shardkv1/client.go** 实现 **shardkv clerk**:用 **Query** 找到 key 对应的 shardgrp,再与该 shardgrp 通信。通过 **Static** 测试时即完成。
* 创建时,第一个 shardgrp(**shardcfg.Gid1**)应将自己初始化为**拥有所有 shard**。
* **shardkv1/client.go** 的 Put 在回复可能丢失时必须返回 **ErrMaybe**;内部(shardgrp)的 Put 可用错误表示这一点。
* 要向 shardgrp put/get 一个 key,shardkv clerk 应通过 **shardgrp.MakeClerk** 创建 shardgrp clerk,传入 configuration 中的服务器以及 shardkv clerk 的 **ck.clnt**。用 **ShardConfig.GidServers()** 获取 shard 的 group。
* 用 **shardcfg.Key2Shard()** 得到 key 的 shard 编号。测试程序将 **ShardCtrler** 传给 **shardkv1/client.go** 的 **MakeClerk**;用 **Query** 获取当前 configuration。
* 可从 kvraft 的 **client.go****server.go** 复制 Put/Get 及相关代码。
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run Static
Test (5A): one shard group ... (reliable network)...
... Passed -- time 5.4s #peers 1 #RPCs 793 #Ops 180
PASS
ok 6.5840/shardkv1 5.632s
```
### 3. ChangeConfigTo 与 shard 迁移(JoinBasic、DeleteBasic)
通过实现 **ChangeConfigTo** 支持**在 group 之间迁移 shard**:从旧 configuration 切换到新 configuration。新 configuration 可能加入新 shardgrp 或移除现有 shardgrp。Controller 必须迁移 shard **数据**,使每个 shardgrp 存储的 shard 与新 configuration 一致。
**迁移一个 shard 的建议流程:**
1. 在源 shardgrp **Freeze** 该 shard(该 shardgrp 拒绝对迁移中 shard 的 key 的 Put)。
2. **Install**(复制)该 shard 到目标 shardgrp。
3. 在源端 **Delete** 已 freeze 的 shard。
4. **Post** 新 configuration使客户端能找到迁移后的 shard。
这样避免 shardgrp 之间直接交互,并允许继续服务未参与变更的 shard。
**顺序**:每个 configuration 有唯一 **Num**(见 **shardcfg/shardcfg.go**)。Part A 中测试程序顺序调用 ChangeConfigTo,新 config 的 **Num** 比前一个大 1。为拒绝过时 RPC,**FreezeShard**、**InstallShard**、**DeleteShard** 应包含 **Num**(见 **shardgrp/shardrpc/shardrpc.go**),且 shardgrp 须记住每个 shard 见过的**最大 Num**。
在 **shardctrler/shardctrler.go** 中实现 **ChangeConfigTo**,并扩展 shardgrp 支持 **freeze**、**install**、**delete**。在 **shardgrp/client.go** 和 **shardgrp/server.go** 中实现 **FreezeShard**、**InstallShard**、**DeleteShard**,使用 **shardgrp/shardrpc** 中的 RPC,并根据 Num 拒绝过时 RPC。修改 **shardkv1/client.go** 中的 shardkv clerk 以处理 **ErrWrongGroup**(当 shardgrp 不负责该 shard 时返回)。先通过 **JoinBasic** 和 **DeleteBasic**(加入 group;离开可稍后)。
* 像 Put 和 Get 一样,通过你的 **rsm** 包执行 **FreezeShard**、**InstallShard**、**DeleteShard**。
* 若 RPC 回复中包含属于服务器状态的 **map**,可能产生数据竞争;**在回复中附带该 map 的副本**。
* 可以在 RPC 请求/回复中发送整个 map,使 shard 迁移代码更简单。
* 若 Put/Get 的 key 的 shard 未分配给该 shardgrp,shardgrp 应返回 **ErrWrongGroup**;**shardkv1/client.go** 应重新读取 configuration 并重试。
### 4. 离开的 Shardgrps(TestJoinLeaveBasic5A)
扩展 **ChangeConfigTo** 以处理**离开**的 shardgrp(在当前 config 中但不在新 config 中)。通过 **TestJoinLeaveBasic5A**。
### 5. 全部 Part A 测试
你的方案必须**在 configuration 变更进行时继续服务未受影响的 shard**。通过全部 Part A 测试:
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run 5A
Test (5A): Init and Query ... (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 3 #Ops 0
Test (5A): one shard group ... (reliable network)...
... Passed -- time 5.1s #peers 1 #RPCs 792 #Ops 180
Test (5A): a group joins... (reliable network)...
... Passed -- time 12.9s #peers 1 #RPCs 6300 #Ops 180
...
Test (5A): many concurrent clerks unreliable... (unreliable network)...
... Passed -- time 25.3s #peers 1 #RPCs 7553 #Ops 1896
PASS
ok 6.5840/shardkv1 243.115s
```
---
## Part B: 处理失败的 Controller(简单)
Controller 生命周期短,在迁移 shard 时可能**失败或失去连接**。任务是在启动新 controller 时**恢复**:新 controller 必须**完成**前一个未完成的重配置。测试程序在启动 controller 时调用 **InitController**;你可以在其中实现恢复。
**做法**:在 controller 的 kvsrv 中维护**两个 configuration**:**current** 和 **next**。Controller 开始重配置时存储 next configuration。完成时把 next 变为 current。在 **InitController** 中,若存在存储的 **next** configuration 且其 Num 大于 current,则**完成 shard 迁移**以重配置到该 next config。
从前一个失败 controller 继续的 controller 可能**重复** FreezeShard、InstallShard、Delete RPC;shardgrp 可用 **Num** 检测重复并拒绝。
在 shardctrler 中实现上述逻辑。通过 Part B 测试时即完成:
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run 5B
Test (5B): Join/leave while a shardgrp is down... (reliable network)...
... Passed -- time 9.2s #peers 1 #RPCs 899 #Ops 120
Test (5B): recover controller ... (reliable network)...
... Passed -- time 26.4s #peers 1 #RPCs 3724 #Ops 360
PASS
ok 6.5840/shardkv1 35.805s
```
* 在 **shardctrler/shardctrler.go** 的 **InitController** 中实现恢复。
---
## Part C: 并发 Configuration 变更(中等)
修改 controller 以允许**多个 controller 并发**。当某个失败或处于分区时,测试会启动新的,新 controller 必须完成任何进行中的工作(同 Part B)。因此多个 controller 可能并发运行,向 shardgrp 和 kvsrv 发 RPC。
**挑战**:确保 controller **互不干扰**。Part A 中你已用 **Num** 对 shardgrp RPC 做 fencing,过时 RPC 会被拒绝;多个 controller 的重复工作是安全的。剩余问题是**只有一个 controller** 应更新 **next** configuration,这样两个 controller(例如分区中的与新的)不会为同一 Num 写入不同 config。测试会并发运行多个 controller,每个读取当前 config、为 join/leave 更新、然后调用 ChangeConfigTo;因此多个 controller 可能用**同一 Num 的不同 config** 调用 ChangeConfigTo。可使用 **version 与带 version 的 Put**,使只有一个 controller 能成功提交该 Num 的 next config,其他直接返回不做任何事。
修改 controller,使**对给定 configuration Num 只有一个 controller 能提交 next configuration**。通过并发测试:
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run TestConcurrentReliable5C
Test (5C): Concurrent ctrlers ... (reliable network)...
... Passed -- time 8.2s #peers 1 #RPCs 1753 #Ops 120
PASS
ok 6.5840/shardkv1 8.364s
$ go test -run TestAcquireLockConcurrentUnreliable5C
Test (5C): Concurrent ctrlers ... (unreliable network)...
... Passed -- time 23.8s #peers 1 #RPCs 1850 #Ops 120
PASS
ok 6.5840/shardkv1 24.008s
```
* 参见 **test.go** 中的 **concurCtrler** 了解测试如何并发运行 controller。
**恢复 + 新 controller**:新 controller 仍应执行 Part B 的恢复。若旧 controller 在 ChangeConfigTo 期间处于分区,确保旧的不干扰新的。若所有 controller 更新都用 Num 正确 fencing(Part B),可能不需要额外代码。通过 **Partition** 测试:
```bash
$ go test -run Partition
Test (5C): partition controller in join... (reliable network)...
... Passed -- time 7.8s #peers 1 #RPCs 876 #Ops 120
...
Test (5C): controllers with leased leadership ... (unreliable network)...
... Passed -- time 60.5s #peers 1 #RPCs 11422 #Ops 2336
PASS
ok 6.5840/shardkv1 217.779s
```
重新运行全部测试,确保最近的 controller 修改没有破坏前面的部分。
**Gradescope** 会运行 Lab 3A–D、Lab 4A–C 和 5C 测试。提交前:
```bash
$ go test ./raft1
$ go test ./kvraft1
$ go test ./shardkv1
```
---
## Part D: 扩展你的方案
在这最后一部分你可以**以任意方式扩展**你的方案。你必须**为自己的扩展编写测试**。
实现下列想法之一或你自己的想法。在 **extension.md** 中写**一段话**描述你的扩展,并将 **extension.md** 上传到 Gradescope。对较难、开放式的扩展可与另一名同学组队。
**想法(前几个较易,后面更开放):**
* **(难)** 修改 shardkv 以支持**事务**(跨 shard 的多个 Put 和 Get 原子执行)。实现两阶段提交与两阶段锁。编写测试。
* **(难)** 在 kvraft 中支持**事务**(多个 Put/Get 原子)。这样带 version 的 Put 不再必要。参见 [etcd's transactions](https://etcd.io/docs/v3.4/learning/api/)。编写测试。
* **(难)** 让 kvraft **leader 不经 rsm 直接处理 Get**(Raft 论文 Section 8 末尾的优化,含 **leases**),并保持 linearizability。通过现有 kvraft 测试。增加测试:优化后的 Get 更快(如更少 RPC),以及 term 切换更慢(新 leader 等待 lease 过期)。
* **(中等)** 为 kvsrv 增加 **Range** 函数(从 low 到 high 的 key)。偷懒做法:遍历 key/value map;更好做法:用支持范围查询的数据结构(如 B-tree)。包含一个在偷懒实现下失败、在更好实现下通过的测试。
* **(中等)** 将 kvsrv 改为**恰好一次**的 Put/Get 语义(即像 Lab 2 那样网络可能丢包时也保证恰好一次)。在 kvraft 中也实现恰好一次。可移植 2024 年的测试。
* **(简单)** 修改测试程序,使 controller 使用 **kvraft** 而非 kvsrv(例如在 test.go 的 MakeTestMaxRaft 中用 kvraft.StartKVServer 替换 kvsrv.StartKVServer)。编写测试:在一个 kvraft 节点宕机时 controller 仍能查询/更新 configuration。测试代码在 **src/kvtest1**、**src/shardkv1**、**src/tester1**。
---
## 提交步骤
提交前最后运行一遍全部测试:
```bash
$ go test ./raft1
$ go test ./kvraft1
$ go test ./shardkv1
```
---
*来源: [6.5840 Lab 5: Sharded Key/Value Service](https://pdos.csail.mit.edu/6.824/labs/lab-shard1.html)*


@@ -0,0 +1,258 @@
# 6.5840 Lab 5: Sharded Key/Value Service
## Introduction
You can either do a final project based on your own ideas, or this lab.
In this lab you'll build a key/value storage system that **"shards,"** or partitions, the keys over a set of Raft-replicated key/value server groups (**shardgrps**). A **shard** is a subset of the key/value pairs; for example, all the keys starting with "a" might be one shard, all the keys starting with "b" another, etc. The reason for sharding is **performance**. Each shardgrp handles puts and gets for just a few of the shards, and the shardgrps operate in parallel; thus total system throughput (puts and gets per unit time) increases in proportion to the number of shardgrps.
The sharded key/value service has the components shown in the lab diagram. **Shardgrps** (blue squares) store shards with keys: shardgrp 1 holds a shard storing key "a", and shardgrp 2 holds a shard storing key "b". Clients interact with the service through a **clerk** (green circle), which implements **Get** and **Put** methods. To find the shardgrp for a key passed to Put/Get, the clerk gets the **configuration** from the **kvsrv** (black square), which you implemented in **Lab 2**. The configuration describes the mapping from shards to shardgrps (e.g., shard 1 is served by shardgrp 3).
An administrator (i.e., the tester) uses another client, the **controller** (purple circle), to add/remove shardgrps from the cluster and update which shardgrp should serve a shard. The controller has one main method: **ChangeConfigTo**, which takes as argument a new configuration and changes the system from the current configuration to the new configuration; this involves moving shards to new shardgrps that are joining and moving shards away from shardgrps that are leaving. To do so the controller 1) makes RPCs (**FreezeShard**, **InstallShard**, and **DeleteShard**) to shardgrps, and 2) updates the configuration stored in kvsrv.
The reason for the controller is that a sharded storage system must be able to **shift shards among shardgrps**: for load balancing, or when shardgrps join and leave (new capacity, repair, retirement).
The main challenges in this lab will be ensuring **linearizability** of Get/Put operations while handling 1) changes in the assignment of shards to shardgrps, and 2) recovering from a controller that fails or is partitioned during **ChangeConfigTo**.
1. **If ChangeConfigTo fails while reconfiguring**, some shards may be inaccessible if they have started but not completed moving from one shardgrp to another. The tester starts a new controller; your job is to ensure that the new one completes the reconfiguration that the previous controller started.
2. **ChangeConfigTo moves shards** from one shardgrp to another. You must ensure that **at most one shardgrp is serving requests for each shard at any one time**, so that clients using old vs new shardgrp don't break linearizability.
This lab uses "configuration" to refer to the **assignment of shards to shardgrps**. This is **not** the same as Raft cluster membership changes; you don't have to implement Raft cluster membership changes.
A shardgrp server is a member of only a single shardgrp. The set of servers in a given shardgrp will never change.
**Only RPC** may be used for interaction among clients and servers (no shared Go variables or files).
* **Part A**: Implement a working **shardctrler** (store/retrieve configurations in kvsrv), the **shardgrp** (replicated with Raft rsm), and a **shardgrp clerk**. The shardctrler talks to shardgrp clerks to move shards.
* **Part B**: Modify shardctrler to handle failures and partitions during config changes.
* **Part C**: Allow concurrent controllers without interfering with each other.
* **Part D**: Extend your solution in any way you like (optional).
This lab's design is in the same general spirit as Flat Datacenter Storage, BigTable, Spanner, FAWN, Apache HBase, Rosebud, Spinnaker, and others (details differ).
Lab 5 will use your **kvsrv from Lab 2**, and your **rsm and Raft from Lab 4**. Lab 5 and Lab 4 must use the same rsm and Raft implementations.
You may use late hours for **Part A** only; you may **not** use late hours for Parts B–D.
---
## Getting Started
Do a `git pull` to get the latest lab software.
We supply you with tests and skeleton code in **src/shardkv1**:
* **shardctrler** package: `shardctrler.go` with methods for the controller to change a configuration (**ChangeConfigTo**) and to get a configuration (**Query**)
* **shardgrp** package: shardgrp clerk and server
* **shardcfg** package: for computing shard configurations
* **client.go**: shardkv clerk
To get up and running:
```bash
$ cd ~/6.5840
$ git pull
...
$ cd src/shardkv1
$ go test -v
=== RUN TestInitQuery5A
Test (5A): Init and Query ... (reliable network)...
shardkv_test.go:46: Static wrong null 0
...
```
---
## Part A: Moving Shards (hard)
Your first job is to implement shardgrps and the **InitConfig**, **Query**, and **ChangeConfigTo** methods when there are no failures. The code for describing a configuration is in **shardkv1/shardcfg**. Each **shardcfg.ShardConfig** has a unique identifying number **Num**, a mapping from shard number to group number, and a mapping from group number to the list of servers replicating that group. There will usually be more shards than groups so that load can be shifted at a fairly fine granularity.
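For orientation, the configuration type might have roughly the following shape; the field and type names here only mirror the prose above and are illustrative, so consult **shardkv1/shardcfg/shardcfg.go** for the authoritative definitions:

```go
// Illustrative only -- the real definitions live in shardkv1/shardcfg/shardcfg.go.
const NShards = 12 // hypothetical shard count; use the constant from shardcfg

type Tnum int // configuration number
type Tgid int // shardgrp (replica group) id

type ShardConfig struct {
	Num    Tnum              // unique, increasing configuration number
	Shards [NShards]Tgid     // Shards[s] = gid of the group serving shard s
	Groups map[Tgid][]string // gid -> servers of the Raft group replicating it
}
```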
### 1. InitConfig and Query (no shardgrps yet)
Implement in **shardctrler/shardctrler.go**:
* **Query**: returns the current configuration; read it from kvsrv (stored there by InitConfig).
* **InitConfig**: receives the first configuration (a **shardcfg.ShardConfig** from the tester) and stores it in an instance of Lab 2's **kvsrv**.
Use **ShardCtrler.IKVClerk** Get/Put to talk to kvsrv, **ShardConfig.String()** to serialize for Put, and **shardcfg.FromString()** to deserialize. You're done when you pass the first test:
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run TestInitQuery5A
Test (5A): Init and Query ... (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 3 #Ops 0
PASS
ok 6.5840/shardkv1 0.197s
```
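A minimal sketch of how these two methods might round-trip the configuration through the kvsrv clerk. The key name `"config"`, the Lab 2 clerk signatures (`Get(key) (value, version, err)`, `Put(key, value, version) err`), and the "version 0 creates the key" behavior are assumptions stated here, not the skeleton's definitive interface:

```go
const cfgKey = "config" // arbitrary key name chosen for this sketch

func (sck *ShardCtrler) InitConfig(cfg *shardcfg.ShardConfig) {
	// Version 0 asks the Lab 2 kvsrv to create the key if it doesn't exist yet.
	sck.IKVClerk.Put(cfgKey, cfg.String(), 0)
}

func (sck *ShardCtrler) Query() *shardcfg.ShardConfig {
	val, _, err := sck.IKVClerk.Get(cfgKey)
	if err != rpc.OK {
		return nil // no configuration stored yet
	}
	return shardcfg.FromString(val)
}
```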
### 2. Shardgrp and shardkv clerk (Static test)
Implement an initial version of **shardgrp** in **shardkv1/shardgrp/server.go** and a **shardgrp clerk** in **shardkv1/shardgrp/client.go** by copying from your Lab 4 kvraft solution. Implement the **shardkv clerk** in **shardkv1/client.go** that uses **Query** to find the shardgrp for a key, then talks to that shardgrp. You're done when you pass the **Static** test.
* Upon creation, the first shardgrp (**shardcfg.Gid1**) should initialize itself to **own all shards**.
* **shardkv1/client.go**'s Put must return **ErrMaybe** when its reply may have been lost; the inner (shardgrp) Put can signal this with an error.
* To put/get a key from a shardgrp, the shardkv clerk should create a shardgrp clerk via **shardgrp.MakeClerk**, passing the servers from the configuration and the shardkv clerk's **ck.clnt**. Use **ShardConfig.GidServers()** to get the group for a shard.
* Use **shardcfg.Key2Shard()** to find the shard number for a key. The tester passes a **ShardCtrler** to **MakeClerk** in **shardkv1/client.go**; use **Query** to get the current configuration.
* You can copy Put/Get and related code from kvraft **client.go** and **server.go**.
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run Static
Test (5A): one shard group ... (reliable network)...
... Passed -- time 5.4s #peers 1 #RPCs 793 #Ops 180
PASS
ok 6.5840/shardkv1 5.632s
```
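As one possible shape for the routing logic, the clerk can look up the shard, ask the owning group, and fall back to re-reading the configuration. The signatures of Query, GidServers, shardgrp.MakeClerk, and the ErrWrongGroup constant are assumptions drawn from the description above:

```go
// Sketch: look up the key's shard, find the group serving it, and ask that group.
func (ck *Clerk) Get(key string) (string, rpc.Tversion, rpc.Err) {
	for {
		cfg := ck.sck.Query() // current shard -> shardgrp assignment
		_, srvs, ok := cfg.GidServers(shardcfg.Key2Shard(key))
		if ok {
			grp := shardgrp.MakeClerk(ck.clnt, srvs) // clerk for that shardgrp
			val, ver, err := grp.Get(key)
			if err != rpc.ErrWrongGroup {
				return val, ver, err
			}
		}
		// The group no longer owns the shard (or none is assigned yet):
		// re-read the configuration and try again.
		time.Sleep(100 * time.Millisecond)
	}
}
```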
### 3. ChangeConfigTo and shard movement (JoinBasic, DeleteBasic)
Support **movement of shards among groups** by implementing **ChangeConfigTo**: it changes from an old configuration to a new one. The new configuration may add new shardgrps or remove existing ones. The controller must move shard **data** so that each shardgrp's stored shards match the new configuration.
**Suggested approach for moving a shard:**
1. **Freeze** the shard at the source shardgrp (that shardgrp rejects Puts for keys in the moving shard).
2. **Install** (copy) the shard to the destination shardgrp.
3. **Delete** the frozen shard at the source.
4. **Post** the new configuration so clients can find the moved shard.
This avoids direct shardgrp-to-shardgrp interaction and allows serving shards not involved in the change.
**Ordering:** Each configuration has a unique **Num** (see **shardcfg/shardcfg.go**). In Part A the tester calls ChangeConfigTo sequentially; the new config has **Num** one larger than the previous. To reject stale RPCs, **FreezeShard**, **InstallShard**, and **DeleteShard** should include **Num** (see **shardgrp/shardrpc/shardrpc.go**), and shardgrps must remember the **largest Num** they have seen for each shard.
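A sketch of that per-shard fencing on the shardgrp side. The state fields (`maxNum`, `frozen`), the helper `copyShardState`, the reply layout, and the choice of rejection error are all invented for illustration; the handler is assumed to run through rsm so every replica applies the same decision:

```go
// Sketch: executed via rsm, so every replica of the shardgrp applies the same check.
func (kv *KVServer) doFreezeShard(args *shardrpc.FreezeShardArgs) *shardrpc.FreezeShardReply {
	reply := &shardrpc.FreezeShardReply{}
	if args.Num < kv.maxNum[args.Shard] {
		reply.Err = rpc.ErrWrongGroup // stale request from an older configuration
		return reply
	}
	kv.maxNum[args.Shard] = args.Num
	kv.frozen[args.Shard] = true                // start rejecting Puts for keys in this shard
	reply.State = kv.copyShardState(args.Shard) // a copy, so the reply doesn't race with the live map
	reply.Num = args.Num
	reply.Err = rpc.OK
	return reply
}
```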
Implement **ChangeConfigTo** in **shardctrler/shardctrler.go** and extend shardgrp to support **freeze**, **install**, and **delete**. Implement **FreezeShard**, **InstallShard**, **DeleteShard** in **shardgrp/client.go** and **shardgrp/server.go** using the RPCs in **shardgrp/shardrpc**, and reject old RPCs based on Num. Modify the shardkv clerk in **shardkv1/client.go** to handle **ErrWrongGroup** (returned when the shardgrp is not responsible for the shard). Pass **JoinBasic** and **DeleteBasic** first (joining groups; leaving can come next).
* Run **FreezeShard**, **InstallShard**, **DeleteShard** through your **rsm** package, like Put and Get.
* If an RPC reply includes a **map** that is part of server state, you may get races; **include a copy** of the map in the reply.
* You can send an entire map in an RPC request/reply to keep shard transfer code simple.
* A shardgrp should return **ErrWrongGroup** for a Put/Get whose key's shard is not assigned to it; **shardkv1/client.go** should reread the configuration and retry.
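Putting the four steps together, the controller's side of a configuration change might be organized roughly as follows; the shardgrp-clerk method signatures and the `postConfig` helper are placeholders, not the skeleton's actual API:

```go
// Sketch: move every shard whose owner changes, then publish the new configuration.
func (sck *ShardCtrler) ChangeConfigTo(newCfg *shardcfg.ShardConfig) {
	oldCfg := sck.Query()
	for sh := 0; sh < shardcfg.NShards; sh++ {
		shard := shardcfg.Tshid(sh)
		oldGid, oldSrvs, _ := oldCfg.GidServers(shard)
		newGid, newSrvs, _ := newCfg.GidServers(shard)
		if oldGid == newGid {
			continue // unaffected shard: it keeps being served throughout
		}
		src := shardgrp.MakeClerk(sck.clnt, oldSrvs)
		dst := shardgrp.MakeClerk(sck.clnt, newSrvs)
		state, _ := src.FreezeShard(shard, newCfg.Num) // 1. source stops accepting Puts
		dst.InstallShard(shard, state, newCfg.Num)     // 2. copy the shard to its new group
		src.DeleteShard(shard, newCfg.Num)             // 3. drop the frozen copy at the source
	}
	sck.postConfig(newCfg) // 4. store the new config in kvsrv so clerks can find the moved shards
}
```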
### 4. Shardgrps that leave (TestJoinLeaveBasic5A)
Extend **ChangeConfigTo** to handle shardgrps that **leave** (in current config but not in the new one). Pass **TestJoinLeaveBasic5A**.
### 5. All Part A tests
Your solution must **continue serving shards that are not affected** by an ongoing configuration change. Pass all Part A tests:
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run 5A
Test (5A): Init and Query ... (reliable network)...
... Passed -- time 0.0s #peers 1 #RPCs 3 #Ops 0
Test (5A): one shard group ... (reliable network)...
... Passed -- time 5.1s #peers 1 #RPCs 792 #Ops 180
Test (5A): a group joins... (reliable network)...
... Passed -- time 12.9s #peers 1 #RPCs 6300 #Ops 180
...
Test (5A): many concurrent clerks unreliable... (unreliable network)...
... Passed -- time 25.3s #peers 1 #RPCs 7553 #Ops 1896
PASS
ok 6.5840/shardkv1 243.115s
```
---
## Part B: Handling a Failed Controller (easy)
The controller is short-lived and may **fail or lose connectivity** while moving shards. The task is to **recover** when a new controller is started: the new controller must **finish the reconfiguration** that the previous one started. The tester calls **InitController** when starting a controller; you can implement recovery there.
**Approach:** Keep **two configurations** in the controller's kvsrv: **current** and **next**. When a controller starts a reconfiguration, it stores the next configuration. When it completes, it makes the next configuration the current one. In **InitController**, check if there is a stored **next** configuration with a higher Num than current; if so, **complete the shard moves** to reconfigure to that next config.
A controller that continues from a failed one may **repeat** FreezeShard, InstallShard, Delete RPCs; shardgrps can use **Num** to detect duplicates and reject them.
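One way InitController might check for and finish an interrupted change, assuming the controller stores the two configurations under kvsrv keys (the names "current" and "next" are arbitrary choices for this sketch, and the clerk signatures are as assumed earlier):

```go
// Sketch: if a previous controller stored a newer "next" but never finished,
// redo the reconfiguration; the per-shard Num checks make repeated RPCs harmless.
func (sck *ShardCtrler) InitController() {
	curVal, _, errCur := sck.IKVClerk.Get("current")
	nextVal, _, errNext := sck.IKVClerk.Get("next")
	if errCur != rpc.OK || errNext != rpc.OK {
		return // nothing stored yet, so nothing to recover
	}
	cur := shardcfg.FromString(curVal)
	next := shardcfg.FromString(nextVal)
	if next.Num > cur.Num {
		sck.ChangeConfigTo(next) // finish what the failed controller started
	}
}
```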
Implement this in the shardctrler. You're done when you pass the Part B tests:
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run 5B
Test (5B): Join/leave while a shardgrp is down... (reliable network)...
... Passed -- time 9.2s #peers 1 #RPCs 899 #Ops 120
Test (5B): recover controller ... (reliable network)...
... Passed -- time 26.4s #peers 1 #RPCs 3724 #Ops 360
PASS
ok 6.5840/shardkv1 35.805s
```
* Implement recovery in **InitController** in **shardctrler/shardctrler.go**.
---
## Part C: Concurrent Configuration Changes (moderate)
Modify the controller to allow **concurrent controllers**. When one crashes or is partitioned, the tester starts a new one, which must finish any in-progress work (as in Part B). So several controllers may run concurrently and send RPCs to shardgrps and to the kvsrv.
**Challenge:** Ensure controllers **don't step on each other**. In Part A you already fenced shardgrp RPCs with **Num** so old RPCs are rejected; duplicate work from multiple controllers is safe. The remaining issue is that **only one controller** should update the **next** configuration, so two controllers (e.g. partitioned and new) don't write different configs for the same Num. The tester runs several controllers concurrently; each reads the current config, updates it for a join/leave, then calls ChangeConfigTo—so multiple controllers may call ChangeConfigTo with **different configs with the same Num**. You can use **version numbers and versioned Puts** so that only one controller successfully posts the next config and others return without doing anything.
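A sketch of how a versioned Put can act as the test-and-set that decides which controller gets to install the next configuration. The key name and clerk signatures are the same assumptions as in the earlier sketches; the version check is what makes the race safe:

```go
// Sketch: the versioned Put succeeds for at most one controller, so at most one
// "next" configuration is ever posted for a given Num.
func (sck *ShardCtrler) tryPostNext(next *shardcfg.ShardConfig) bool {
	val, version, err := sck.IKVClerk.Get("next")
	if err != rpc.OK {
		return false
	}
	if stored := shardcfg.FromString(val); stored.Num >= next.Num {
		return false // someone already posted a configuration at least this new
	}
	// Versioned Put: fails unless "next" still has the version we read above.
	if sck.IKVClerk.Put("next", next.String(), version) != rpc.OK {
		return false // another controller won the race; return without doing anything
	}
	return true
}
```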
Modify the controller so that **only one controller can post a next configuration for a given configuration Num**. Pass the concurrent tests:
```bash
$ cd ~/6.5840/src/shardkv1
$ go test -run TestConcurrentReliable5C
Test (5C): Concurrent ctrlers ... (reliable network)...
... Passed -- time 8.2s #peers 1 #RPCs 1753 #Ops 120
PASS
ok 6.5840/shardkv1 8.364s
$ go test -run TestAcquireLockConcurrentUnreliable5C
Test (5C): Concurrent ctrlers ... (unreliable network)...
... Passed -- time 23.8s #peers 1 #RPCs 1850 #Ops 120
PASS
ok 6.5840/shardkv1 24.008s
```
* See **concurCtrler** in **test.go** for how the tester runs controllers concurrently.
**Recovery + new controller:** A new controller should still perform Part B recovery. If the old controller was partitioned during ChangeConfigTo, ensure the old one doesn't interfere with the new one. If all controller updates are properly fenced with Num (from Part B), you may not need extra code. Pass the **Partition** tests:
```bash
$ go test -run Partition
Test (5C): partition controller in join... (reliable network)...
... Passed -- time 7.8s #peers 1 #RPCs 876 #Ops 120
...
Test (5C): controllers with leased leadership ... (unreliable network)...
... Passed -- time 60.5s #peers 1 #RPCs 11422 #Ops 2336
PASS
ok 6.5840/shardkv1 217.779s
```
Rerun all tests to ensure recent controller changes didn't break earlier parts.
**Gradescope** will run the Lab 3A–D, Lab 4A–C, and 5C tests. Before submitting:
```bash
$ go test ./raft1
$ go test ./kvraft1
$ go test ./shardkv1
```
---
## Part D: Extend Your Solution
In this final part you may **extend your solution** in any way you like. You must **write your own tests** for your extensions.
Implement one of the ideas below or your own. Write a **paragraph in extension.md** describing your extension and upload **extension.md** to Gradescope. For harder, open-ended extensions, you may partner with another student.
**Ideas (first few easier, later more open-ended):**
* **(hard)** Modify shardkv to support **transactions** (several Puts and Gets atomically across shards). Implement two-phase commit and two-phase locking. Write tests.
* **(hard)** Support **transactions in kvraft** (several Put/Get atomically). Then versioned Puts are unnecessary. See [etcd's transactions](https://etcd.io/docs/v3.4/learning/api/). Write tests.
* **(hard)** Let the kvraft **leader serve Gets without going through rsm** (optimization at end of Section 8 of the Raft paper, including **leases**), preserving linearizability. Pass existing kvraft tests. Add a test that optimized Gets are faster (e.g. fewer RPCs) and a test that term switches are slower (new leader waits for lease expiry).
* **(moderate)** Add a **Range** function to kvsrv (keys from low to high). Lazy: iterate the key/value map; better: data structure for range search (e.g. B-tree). Include a test that fails the lazy solution but passes the better one.
* **(moderate)** Change kvsrv to **exactly-once** Put/Get semantics (i.e., correct even when the network drops messages, as in Lab 2). Implement exactly-once in kvraft as well. You may port tests from 2024.
* **(easy)** Change the tester to use **kvraft** instead of kvsrv for the controller (e.g. replace kvsrv.StartKVServer in MakeTestMaxRaft in test.go with kvraft.StartKVServer). Write a test that the controller can query/update configuration while one kvraft peer is down. Tester code lives in **src/kvtest1**, **src/shardkv1**, **src/tester1**.
---
## Handin Procedure
Before submitting, run all tests one final time:
```bash
$ go test ./raft1
$ go test ./kvraft1
$ go test ./shardkv1
```
---
*From: [6.5840 Lab 5: Sharded Key/Value Service](https://pdos.csail.mit.edu/6.824/labs/lab-shard1.html)*