Abstract
From entertainment to personal communication, and from business to safety-critical applications, the world increasingly relies on distributed systems. Despite looking simple, distributed systems hide a major source of complexity: tolerating faults and component crashes is very difficult, due to the incompleteness of (remote) knowledge. The need to overcome this problem, and provide different guarantees to applications, sparked a huge research effort and resulted in a large body of communication protocols, and middleware. Thus, it is worthwhile to survey the state of the art in distributed systems, with a particular emphasis on reliable communication. We discuss key concepts in reliable communication, such as interaction patterns (e.g., one-way vs. request-response, synchronous vs. asynchronous), reliability semantics (e.g., at-least-once, at-most-once), and reliability targets(e.g., message, conversation), and we analyze a wide set of current communication solutions, which map to the different concepts. Building on the concepts, we analyze applications that have different reliable communication needs. As a result, we observe that, in most cases, elaborate communication solutions offering superior guarantees are purely academic efforts that cannot compete with the popularity and maturity of established, albeit poorer solutions. Based on our analysis, we identify and discuss open research topics in this area.