Apache Thrift architecture

The Apache Thrift Framework can be organized into five layers:

* The RPC Server Library
* RPC Service Stubs
* User-Defined Type Serialization
* The Serialization Protocol Library
* The Transport Library

Applications requiring a common way to serialize data structures for storage or messaging may need nothing more than the bottom three layers of this model.

The top two layers, the Apache Thrift library of RPC servers and the IDL compiler-generated service stubs, add RPC support to the stack.

Apache Thrift is conceptually an object-oriented framework, though it supports both object-oriented and non-object-oriented languages. The Transport, Protocol, and Server libraries are often referred to as class libraries, though they may be implemented in other ways in non-object-oriented languages. The classes within the Apache Thrift libraries are typically named with a leading capital T, for example, TTransport, TProtocol, and TServer.

Transports

At the bottom of the stack we have transports (see figure 2.2). The Apache Thrift transport library insulates the upper layers of Apache Thrift from device-specific details. ~In particular, transports enable protocols to read and write byte streams without knowledge of the underlying device.~

For example, imagine you developed a set of programs to move stock price quotations over the Sockets networking API. After the application is deployed, the requirements expand and you’re asked to add support for stock price transmission over an AMQP messaging system as well.

With Apache Thrift, the expanded capability will be fairly easy to implement. The new AMQP code can implement the existing Apache Thrift Transport interface, allowing the upper layer of code to use either the Socket solution or the AMQP solution without knowing the difference.

The modular nature of Apache Thrift transports allows them to be selected and changed at compile time or run time, giving applications plug-in support for a range of devices (see figure 2.4).

The Transport interface

::The Apache Thrift transport layer exposes a simple byte-oriented I/O interface to upper layers of code.:: The interface is typically defined in an abstract base class called TTransport. Table 2.1 describes the TTransport methods present in most language implementations. Each Apache Thrift language implementation has its own subtleties. Apache Thrift language library implementations tend to play to the strengths of the language in question, so a degree of variety across implementations is the norm.

For example, certain languages define transport interfaces with additional methods for performance or other purposes. A case in point: the C++ TTransport interface defines borrow() and consume() methods, which enable more efficient buffer processing. The examples here focus on the conceptual architecture of Apache Thrift.
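To make the idea concrete, here is a minimal Python sketch of a byte-oriented transport interface with one in-memory implementation. The names `Transport`, `MemoryTransport`, and `send_quote` are hypothetical stand-ins chosen for this illustration; the real TTransport classes define more methods and differ between languages.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical stand-in for TTransport: byte-oriented I/O only."""

    @abstractmethod
    def write(self, data: bytes) -> None: ...

    @abstractmethod
    def read(self, size: int) -> bytes: ...

    def flush(self) -> None:
        """Unbuffered transports can leave this as a no-op."""


class MemoryTransport(Transport):
    """End point transport backed by an in-memory byte buffer."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def write(self, data: bytes) -> None:
        self._buffer.extend(data)

    def read(self, size: int) -> bytes:
        chunk = bytes(self._buffer[:size])
        del self._buffer[:size]
        return chunk


def send_quote(transport: Transport, symbol: str) -> None:
    """Upper-layer code: it knows only the Transport interface."""
    transport.write(symbol.encode("ascii"))
    transport.flush()


memory = MemoryTransport()
send_quote(memory, "IBM")
received = memory.read(3)
```

Because `send_quote` depends only on the interface, a socket-backed or AMQP-backed class with the same methods could be passed in without changing it.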

End point transports

In this book we refer to Apache Thrift transports that write to a physical or logical device as “end point transports”. End point transports are always at the bottom of an Apache Thrift transport stack and most use cases require precisely one end point transport.

Apache Thrift languages supply end point transports for memory, file, and network devices.
* Memory oriented transports, such as TMemoryBuffer, are often used to collect multiple small write operations that are later transmitted as a single block.
* File-based transports, such as TSimpleFileTransport, are often used for logging and state persistence.

::The most important Apache Thrift Transport types are network oriented and used to support RPC operations.:: The most commonly used Apache Thrift network transport is TSocket. The TSocket transport uses the Socket API to transmit bytes over TCP/IP (see figure 2.5).

Other devices and networking protocols can be exposed through the TTransport interface as well. For example, many Apache Thrift language libraries provide HTTP transports to read and write using the HTTP protocol. Building a custom transport for an unsupported network protocol or device isn’t typically difficult, and doing so enables the entire framework to operate over the new end point type.

Layered transports

Because Apache Thrift transports are defined by the generic TTransport interface, client code is independent of the underlying transport implementation. This gives transports the ability to overlay anything, even other transports. Layering allows generic transport behavior to be separated into interoperable and reusable components.

Imagine you’re building a banking application that makes calls to a service hosted by another company and you need to encrypt all the bytes traveling between your client and the RPC server. If you create a layered transport to provide the encryption, the client and server code could use your new encryption layer on top of the original network transport. The benefits of isolating this new encryption feature in a layered transport are several, not the least of which is that it can be inserted between the existing client code and the old network transport with potentially no impact. The client code will see the encryption transport layer as another transport. The network end point transport will see the encryption transport as another client.

The encryption transport can be layered on top of any end point transport, allowing you to encrypt network I/O as well as file I/O and memory I/O. The layering approach allows the encryption concern to be separated from the device I/O concern.
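The layering idea can be sketched in a few lines of Python. Both classes below are hypothetical, and the XOR step is only a placeholder for a real cipher: it provides no security and is used here purely to show how one transport wraps another.

```python
class MemoryTransport:
    """Hypothetical end point transport that stores bytes in memory."""

    def __init__(self):
        self.data = bytearray()

    def write(self, payload):
        self.data.extend(payload)

    def read(self, size):
        chunk = bytes(self.data[:size])
        del self.data[:size]
        return chunk


class XorTransport:
    """Hypothetical layered transport. XOR with a one-byte key stands in
    for a real cipher and offers no actual protection."""

    def __init__(self, inner, key):
        self.inner = inner  # any object with write() and read()
        self.key = key

    def write(self, payload):
        self.inner.write(bytes(b ^ self.key for b in payload))

    def read(self, size):
        return bytes(b ^ self.key for b in self.inner.read(size))


endpoint = MemoryTransport()
secure = XorTransport(endpoint, key=0x5A)
secure.write(b"balance=100")
on_the_wire = bytes(endpoint.data)  # what the end point actually holds
round_trip = secure.read(11)
```

The caller talks to `secure` exactly as it would talk to `endpoint`, and the end point never learns that its bytes were transformed on the way down.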

In this book we refer to all Apache Thrift transports that aren’t end point transports as “layered transports.” Layered transports expose the standard Apache Thrift Transport interface to clients and depend on the Transport interface of the layer below. In this way one or more transport layers can be used to form a transport stack.

A commonly used Apache Thrift layered transport is the framing transport. The transport is called TFramedTransport in most language libraries and ~it adds a four-byte message size as a prefix to each Apache Thrift message.~ This enables more efficient message processing in certain scenarios, allowing a receiver to read the frame size and then provide buffers of the exact size needed by the frame, for example.
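As a rough illustration of framing, the Python sketch below prefixes each message with a four-byte size and then splits a byte stream back into messages. The helper names `frame` and `read_frame` are hypothetical, and the sketch assumes the size is written as a big-endian (network order) integer.

```python
import struct


def frame(message):
    """Prefix the message with its length as a 4-byte big-endian integer."""
    return struct.pack("!i", len(message)) + message


def read_frame(stream):
    """Return (first message, remaining bytes) from a framed byte stream."""
    (size,) = struct.unpack("!i", stream[:4])
    return stream[4:4 + size], stream[4 + size:]


wire = frame(b"hello") + frame(b"thrift")
first, rest = read_frame(wire)
second, rest = read_frame(rest)
```

Knowing the size up front is what lets the receiver allocate one buffer of the right length instead of growing it as bytes arrive.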

NOTE Clients and servers must use compatible transport stacks to communicate. If the server is using a TSocket transport the client will need to use a TSocket transport. If the server is using a TFramedTransport layer on top of a TSocket, the client will have to use a TFramedTransport layer on top of a TSocket. Apache Thrift doesn’t have a built-in runtime transport or protocol discovery mechanism, though custom discovery systems can be created on top of Apache Thrift.

Another important feature offered by layered transports is buffering. The TFramedTransport implicitly buffers writes until the flush() method is called, at which point the frame size and data are written to the layer below. The TBufferedTransport is an alternative to the TFramedTransport that can provide buffering when framing isn’t needed. Several languages build buffering into the end point solution and don’t provide a TBufferedTransport (Java is an example).
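The buffering behavior described here can be sketched as follows. `CountingEndpoint` and `FramedWriter` are hypothetical classes written for this illustration, not library types; the sketch assumes a four-byte big-endian size prefix.

```python
import struct


class CountingEndpoint:
    """Hypothetical end point that records each write it receives."""

    def __init__(self):
        self.writes = []

    def write(self, data):
        self.writes.append(bytes(data))


class FramedWriter:
    """Hypothetical layered transport: buffers writes, frames on flush()."""

    def __init__(self, inner):
        self.inner = inner
        self.pending = bytearray()

    def write(self, data):
        self.pending.extend(data)  # nothing reaches the layer below yet

    def flush(self):
        payload = bytes(self.pending)
        self.pending.clear()
        # One write to the layer below: 4-byte size, then the data.
        self.inner.write(struct.pack("!i", len(payload)) + payload)


endpoint = CountingEndpoint()
framed = FramedWriter(endpoint)
framed.write(b"ab")
framed.write(b"cd")
writes_before_flush = len(endpoint.writes)
framed.flush()
```

Two small writes reach the end point as a single block, which is the main practical benefit of buffering on a network device.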

Server transports

When two processes connect over a network to facilitate communications, the server must listen for clients to connect, accepting new connections as they arrive.

* ::The abstract interface for the server’s connection acceptor is usually named `TServerTransport`.::
* The most popular implementation of `TServerTransport` is `TServerSocket` used for TCP/IP networking. The server transport wires each new connection to a `TTransport` to handle the individual connection’s I/O.

Server transports follow the factory pattern with TServerSockets manufacturing TSockets, TServerPipes manufacturing TPipes, and so on.

Server transports typically have only four methods (see table 2.2). The listen() and close() methods prepare the server transport for use and shut it down, respectively. Clients cannot connect before listen() is invoked or after close() is invoked. The accept() method blocks until a client connection arrives.
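The listen/accept/close life cycle can be sketched with Python's standard `socket` module. `ServerSocketTransport` is a hypothetical class in the spirit of TServerSocket; a real server transport would wrap each accepted connection in a transport object rather than return the raw socket.

```python
import socket


class ServerSocketTransport:
    """Hypothetical acceptor in the spirit of TServerSocket."""

    def __init__(self, host="127.0.0.1", port=0):
        self.address = (host, port)  # port 0 asks the OS for a free port
        self.sock = None

    def listen(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.bind(self.address)
        self.sock.listen(5)
        self.address = self.sock.getsockname()

    def accept(self):
        """Block until a client connects, then return its connection."""
        connection, _peer = self.sock.accept()
        return connection

    def close(self):
        self.sock.close()


server = ServerSocketTransport()
server.listen()
client = socket.create_connection(server.address)  # queued until accept()
connection = server.accept()
client.sendall(b"ping")
request = connection.makefile("rb").read(4)  # read exactly four bytes
for s in (client, connection):
    s.close()
server.close()
```

The acceptor manufactures one connection object per client, which is the factory relationship described above.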

Protocols

::In the context of Apache Thrift, a protocol is a means for serializing types.:: Apache Thrift RPC doesn’t support every type defined in every language. Rather, the Apache Thrift type system includes all the important base types found in most languages (int, double, string, and so on), as well as a few heavily used and widely supported container types (map, set, list). All protocols must be capable of reading and writing all the types in the Apache Thrift type system.

Protocols sit on top of a transport stack (see figure 2.8). Labor is divided between ~the transport that’s responsible for manipulating bytes~ and ~the protocol that’s responsible for working with data types.~ Transports see only opaque byte streams; protocols turn data types into byte streams (see figure 2.9).

For example, if you want to store an integer into a disk file on one system and make it readable on another system, you need to ensure that the integer is stored in an agreed-upon byte order. Either the most significant or least significant byte must be first. The choice between these two options is made by the serialization protocol. The transport simply writes the bytes supplied to it in the order presented.
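The byte-order choice is easy to see with Python's standard `struct` module. This is a small sketch using an arbitrary value; it shows the same integer written both ways and what happens when a reader assumes the wrong order.

```python
import struct

value = 1000  # 0x000003E8

big_endian = struct.pack(">i", value)     # most significant byte first
little_endian = struct.pack("<i", value)  # least significant byte first

# A reader that assumes the wrong order recovers a different number.
misread = struct.unpack("<i", big_endian)[0]
```

The transport would carry either four-byte sequence unchanged; only the protocol on each side needs to agree on which one is in use.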

Apache Thrift provides several serialization protocols, each with its own goals:

* The Binary protocol: simple and fast
* The Compact protocol: smaller data size without excessive overhead
* The JSON protocol: standards-based, human-readable, broad interoperability

The Binary Protocol is the default Apache Thrift protocol and at the time of initial release, it was the only protocol.

The Compact Protocol is designed to minimize the size of the serialized representation of data. The Compact Protocol is fairly simple but does use more CPU in the process of shuffling bits into smaller spaces. In cases where I/O is the bottleneck and CPU abounds (a common situation), this is a good protocol to consider.

The JSON Protocol converts inputs into JSON formatted text. Of the three common Apache Thrift protocols, JSON is likely to produce the largest representation on the wire and consume the most CPU. The advantages of JSON are broad interoperability and human readability.
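To give a feel for the size versus CPU trade-off between the Binary and Compact approaches, the Python sketch below encodes a 32-bit integer two ways: as a fixed four-byte value, and as a zigzag variable-length integer of the kind the Compact Protocol uses for integers. The function names are hypothetical, and the sketch covers a single integer only; real protocols also write field headers and other metadata.

```python
import struct


def fixed_i32(value):
    """Fixed-width encoding: always four bytes, big-endian."""
    return struct.pack(">i", value)


def zigzag_varint_i32(value):
    """Variable-width encoding: zigzag maps small magnitudes to small
    unsigned numbers, then seven bits are emitted per byte."""
    n = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


# Map each sample value to (fixed size, variable size) in bytes.
sizes = {v: (len(fixed_i32(v)), len(zigzag_varint_i32(v)))
         for v in (1, -1, 300, 1_000_000, 2**31 - 1)}
```

Small values shrink to one or two bytes at the cost of extra shifting and masking, while values near the 32-bit limit grow to five bytes.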

Apache Thrift languages typically provide an abstract protocol interface, called TProtocol, adhered to by all concrete protocol implementations. The interface defines methods for reading and writing each of the Apache Thrift types as well as compositional methods used to serialize containers, user-defined types, and RPC messages.

A key feature of the Apache Thrift type system is its support for user-defined types in the form of structs. Apache Thrift structs are IDL-based composite types built from a set of fields. The fields can be of any legal Apache Thrift type, including base types, containers, and other structs.

::Apache Thrift messages are the envelopes used to deliver RPC calls and responses over transports.:: Key bits of these messages are implemented as specialized Apache Thrift structs.

Table 2.3 lists several of the typical TProtocol methods that define the Apache Thrift type system. Each write method listed here has a corresponding read method with the same suffix (for example, writeBool()/readBool()).


Apache Thrift IDL

Combining Apache Thrift Protocols and Transports provides a way to serialize doubles, lists of strings, and other such generic data representations. While useful, most applications also deal in user-defined data types. For example, a stock trading application may deal in trade reports, a social platform may deal in status updates, and a flight simulator may deal in telemetry.

Interface Definition Languages (IDLs) can be used to define application-level types and service interfaces, enabling tools to generate code to automate serialization for these types. Rather than hand-coding the serialization of a Trade Report for a stock trading program, you can describe the trade type in IDL and let the Apache Thrift IDL Compiler generate the serialization code for you.

Apache Thrift IDL is implementation language independent. The IDL compiler reads IDL files and can output serialization code and RPC client/server stubs in any of the Apache Thrift target languages (see figure 2.10).

Imagine you’re writing a program in Python for the California Fisheries Bureau and you want to call a server maintained by the Seattle Ocean Research Center to retrieve halibut catch levels, but you discover the server is written in Java. If the server was coded with an Apache Thrift API, you could compile the server interface IDL for Python and then use the Python stubs to call the Java server directly.

The following listing shows an example of what such an interface definition might look like.

struct Date {
    1: i16 year
    2: i16 month
    3: i16 day
}

service HalibutTracking {
    i32 get_catch_in_pounds_today()
    i32 get_catch_in_pounds_by_date(1: Date dt, 2: double tm)
}

User-defined types and serialization

User-Defined Types (UDTs) are an important aspect of external interfaces. While it’s possible to compose the get_catch_in_pounds_by_date() method in our above example with discrete year/month/day parameters, the Date type is much more expressive, reusable, and concise. Apache Thrift IDL allows user-defined types to be created with the “struct” keyword.

The IDL compiler generates language-specific types from IDL types; for example, the struct keyword will cause the IDL Compiler to produce a class in C++, a record in Erlang, and a package in Perl. These generated types have built-in serialization capabilities, making it easy to serialize them using any Apache Thrift protocol/transport stack.

The following listing shows a pseudo code example of what an IDL Compiler generated UDT for your example data type might look like.

class Date {
    public:
        short year;
        short month;
        short day;
        
        read(TProtocol protocol) { ... }
        write(TProtocol protocol) { ... }
};

The trivial Date type illustrated in pseudo code above has the exact fields described in the IDL and is organized into a class with the same name as your IDL struct. The Apache Thrift compiler creates read() and write() methods to automate the process of serializing the type through the Apache Thrift TProtocol interface. This makes transmitting a complex data structure as easy as calling read or write on the structure with the target Apache Thrift Protocol as a parameter.

Apache Thrift structs are used internally within the Apache Thrift Framework as the means to package all RPC data transmissions. The argument list of each Apache Thrift Service method is defined in an “args” struct. This allows Apache Thrift to use the same convenient struct read() and write() methods to send and receive RPC parameter lists.

The implementation of a struct’s write method is a simple sequential invocation of the appropriate TProtocol methods. The following listing shows the pseudo code for the write method of the Date struct.

Date::write(TProtocol protocol) {
    protocol.writeStructBegin("Date");

    protocol.writeFieldBegin("year", T_I16, 1);
    protocol.writeI16(this.year);
    protocol.writeFieldEnd();

    protocol.writeFieldBegin("month", T_I16, 2);
    protocol.writeI16(this.month);
    protocol.writeFieldEnd();

    protocol.writeFieldBegin("day", T_I16, 3);
    protocol.writeI16(this.day);
    protocol.writeFieldEnd();

    protocol.writeFieldStop();
    protocol.writeStructEnd();
}
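For readers who want to run something, here is a Python sketch of the same write sequence against a much-reduced protocol object. `SketchBinaryProtocol` is hypothetical and implements only what the Date struct needs. It assumes a layout in the style of the Binary Protocol: a one-byte type code and a two-byte field ID before each value, a single zero byte as the stop marker, and type codes matching Thrift's TType values.

```python
import struct

T_STOP, T_I16 = 0, 6  # type codes assumed to match Thrift's TType values


class SketchBinaryProtocol:
    """Hypothetical, much-reduced protocol that appends to a bytearray."""

    def __init__(self):
        self.out = bytearray()

    def write_struct_begin(self, name):
        pass  # this encoding does not put struct names on the wire

    def write_field_begin(self, name, field_type, field_id):
        self.out += struct.pack(">bh", field_type, field_id)

    def write_i16(self, value):
        self.out += struct.pack(">h", value)

    def write_field_end(self):
        pass

    def write_field_stop(self):
        self.out += struct.pack(">b", T_STOP)

    def write_struct_end(self):
        pass


class Date:
    def __init__(self, year, month, day):
        self.year, self.month, self.day = year, month, day

    def write(self, protocol):
        protocol.write_struct_begin("Date")
        for field_id, name in enumerate(("year", "month", "day"), start=1):
            protocol.write_field_begin(name, T_I16, field_id)
            protocol.write_i16(getattr(self, name))
            protocol.write_field_end()
        protocol.write_field_stop()
        protocol.write_struct_end()


protocol = SketchBinaryProtocol()
Date(2024, 5, 17).write(protocol)
encoded = bytes(protocol.out)
```

Each field costs five bytes here (three for the header, two for the value), and the struct ends with the one-byte stop marker.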

The ability to compose serializable, language-agnostic types is a key feature of the Apache Thrift IDL. Types can be serialized to memory and then sent over messaging systems, types can be used directly in RPC methods, and types can be serialized to files. This cross-language serialization capability is one of the key Apache Thrift features used by commercial applications.

RPC services

For many programmers, building cross-language RPC services is the primary reason for using Apache Thrift. Defining services in Apache Thrift IDL allows the IDL Compiler to generate client and server stubs that supply all of the plumbing necessary to call a function remotely. In our previous example, the IDL Compiler would generate client and server stub code to support the HalibutTracking service.

The following listing shows the pseudo code for the compiler’s HalibutTracking service interface.

interface HalibutTracking {
    int32 get_catch_in_pounds_today();
    int32 get_catch_in_pounds_by_date(Date dt, double tm);
};

This service has two methods, both of which return a 32-bit integer and one of which takes a Date struct and a double as input. In addition to defining the interface in the target language, the IDL Compiler will generate a pair of classes to support RPC on this interface: a client stub for use in the client process and a server stub, called a processor, for use in the server process. The client class is used as a proxy for the remote service. The processor is used to invoke the user-defined service implementation on behalf of the remote client.

The client stub

A client interested in calling a service in a remote server can call the desired method on the client proxy object. Under the covers the client proxy sends a message to the server, including information regarding the method to invoke and any parameters. Typically, the client proxy then waits to receive the result of the call from the server (see figure 2.13). Using the generated client proxy makes developing software utilizing RPC services as natural as coding to local functions.

The following listing shows a pseudo code listing for the IDL Compiler-generated client implementation of the HalibutTracking Service get_catch_in_pounds_by_date() method.

// Thrift-generated client code
int32 HalibutTrackingClient::get_catch_in_pounds_by_date(Date dt, double tm) {
    send_get_catch_in_pounds_by_date(dt, tm);
    return recv_get_catch_in_pounds_by_date();
}

void HalibutTrackingClient::send_get_catch_in_pounds_by_date(Date dt, double tm) {
    protocol.writeMessageBegin("get_catch_in_pounds_by_date", T_CALL, 0);
    // args is the generated argument struct, initialized with dt and tm
    args.write(protocol);
    protocol.writeMessageEnd();
    protocol.getTransport().flush();
}

In this example, the client implementation of get_catch_in_pounds_by_date() calls an internal “send_” method to send a message to the server. This is followed by a call to a second “recv_” method to receive the results. Clients send messages to servers to invoke methods, and servers send results back.

The second method in the listing is the pseudo code for the send method. The send method creates a message to send to the server. The message begins with the protocol writeMessageBegin() call. This serializes the T_CALL constant that informs the server that this is an “RPC call”-type message. The string “get_catch_in_pounds_by_date” is serialized to indicate which method you want to invoke. The zero passed here indicates you won’t use sequence numbers. Message sequence numbers are useful in certain applications but aren’t employed in most normal Apache Thrift RPC (for more information on the use of Apache Thrift messages, see chapter 8, Implementing services).

As you discovered in the previous section, the Apache Thrift IDL Compiler can generate read() and write() serialization methods for any struct defined in IDL. Rather than reinvent the wheel, Apache Thrift generates an internal structure for each method’s argument list called args. To add the method’s arguments to the byte stream, the args struct is instantiated and initialized with the parameters for the method call. Calling the args object’s write() method serializes all the parameters required to invoke the get_catch_in_pounds_by_date() method.

The message serialization is completed by calling writeMessageEnd() to bookend the writeMessageBegin() call. Once the message has been completely serialized, the transport stack is asked to flush() the bytes out to the network (in case they have been buffered).

The client follows the send_get_catch_in_pounds_by_date() call with the complementary recv_get_catch_in_pounds_by_date() call. The server may respond to an RPC invocation with one of two messages. The first is a normal T_REPLY and the second is a T_EXCEPTION. Consistent with the creation of the args class, the Thrift Compiler generates a result class for each service method to package the method’s results.

The recv_get_catch_in_pounds_by_date() method performs the same operations as the send_get_catch_in_pounds_by_date() method but in reverse, using the result object’s read() method to recover the server’s response.
* If the recv_get_catch_in_pounds_by_date() method decodes a normal result, it’s returned.
* If an exception is decoded, language-specific processing occurs, such as throwing the exception.
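The send/receive flow can be sketched in Python without any networking. Everything below is hypothetical: `LoopbackChannel` stands in for the protocol and transport stack, `fake_server` plays the remote side, messages are plain tuples rather than serialized bytes, and the catch figure is made up. The message type codes are assumed to match Thrift's CALL, REPLY, and EXCEPTION values.

```python
T_CALL, T_REPLY, T_EXCEPTION = 1, 2, 3  # assumed message type codes


class RemoteError(Exception):
    """Hypothetical exception raised when the server reports a failure."""


class LoopbackChannel:
    """Hypothetical stand-in for a protocol/transport stack. A canned
    responder plays the role of the remote server."""

    def __init__(self, responder):
        self.responder = responder
        self.reply = None

    def send(self, message):
        self.reply = self.responder(message)

    def receive(self):
        return self.reply


class HalibutTrackingClient:
    def __init__(self, channel):
        self.channel = channel

    def get_catch_in_pounds_by_date(self, dt, tm):
        self.send_get_catch_in_pounds_by_date(dt, tm)
        return self.recv_get_catch_in_pounds_by_date()

    def send_get_catch_in_pounds_by_date(self, dt, tm):
        args = {"dt": dt, "tm": tm}  # plays the role of the args struct
        self.channel.send(("get_catch_in_pounds_by_date", T_CALL, 0, args))

    def recv_get_catch_in_pounds_by_date(self):
        _name, message_type, _seqid, body = self.channel.receive()
        if message_type == T_EXCEPTION:
            raise RemoteError(body)
        return body  # plays the role of the result struct's success field


def fake_server(message):
    name, _type, seqid, args = message
    if args["dt"] == (2024, 5, 17):
        return (name, T_REPLY, seqid, 1250)  # made-up figure
    return (name, T_EXCEPTION, seqid, "no data for that date")


client = HalibutTrackingClient(LoopbackChannel(fake_server))
pounds = client.get_catch_in_pounds_by_date((2024, 5, 17), 0.0)
```

From the caller's point of view the remote call is one ordinary method call; the split into send and receive halves stays hidden inside the proxy.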

While high level, this is a fairly concise summary of the function of Apache Thrift RPC from the client’s perspective. There are additional considerations on the server side of the equation.

Service processors

The server side of an RPC call consists of two code elements: the processor and the handler. The processor is the server-side IDL-generated stub, the counterpart of the client stub. The Thrift Compiler generates a client and processor pair for each IDL-defined service. The processor uses the protocol stack to deserialize service method call requests, invoking the appropriate local function. The result of the local function call is packaged into a result structure by the processor and sent back to the client. The processor is essentially a dispatcher, receiving requests from the client and then dispatching them to the appropriate internal function.

Service handlers

Processors depend on a user-coded service “handler” to implement the service interface (hey, you gotta do some work around here). The handler is supplied to the processor to complete the RPC support chain. The service handler is the only code you need to write to implement a complete Apache Thrift service.
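The relationship between processor and handler can be sketched in Python as a small dispatcher. Both classes are hypothetical, messages are plain tuples rather than serialized bytes, and the catch figures are made up. A generated processor would read the request from a protocol and write the reply back, but the dispatch step has the same shape.

```python
T_CALL, T_REPLY, T_EXCEPTION = 1, 2, 3  # assumed message type codes


class HalibutTrackingHandler:
    """User-written service logic. The figures are made up for the sketch."""

    def get_catch_in_pounds_today(self):
        return 900

    def get_catch_in_pounds_by_date(self, dt, tm):
        return 1250


class HalibutTrackingProcessor:
    """Hypothetical dispatcher in the spirit of a generated processor."""

    def __init__(self, handler):
        # Fixed table of callable service methods, keyed by wire name.
        self.methods = {
            "get_catch_in_pounds_today": handler.get_catch_in_pounds_today,
            "get_catch_in_pounds_by_date": handler.get_catch_in_pounds_by_date,
        }

    def process(self, message):
        name, _type, seqid, args = message
        method = self.methods.get(name)
        if method is None:
            return (name, T_EXCEPTION, seqid, "unknown method: " + name)
        try:
            return (name, T_REPLY, seqid, method(**args))
        except Exception as error:  # report handler failures to the caller
            return (name, T_EXCEPTION, seqid, str(error))


processor = HalibutTrackingProcessor(HalibutTrackingHandler())
reply = processor.process(("get_catch_in_pounds_today", T_CALL, 0, {}))
unknown = processor.process(("get_tuna", T_CALL, 0, {}))
```

Only `HalibutTrackingHandler` corresponds to code a service author would write by hand; the dispatcher corresponds to generated code.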

Servers

In the context of Apache Thrift, a server is a program specifically designed to host one or more Apache Thrift Services. As it turns out, the job of an Apache Thrift RPC server is fairly formulaic. Servers listen for client connections, dispatch calls to services using generated processors, and get shut down by admins on occasion.

The boilerplate nature of server operation allows Thrift to supply a library of server classes with a wide range of implementations. Different language libraries support different server classes based on the community’s needs and the capabilities of the language. For example, Java offers single and multithreaded servers, as well as servers that use dedicated client threads and servers that use thread pools to process requests, while Go servers use goroutines. Concurrency models are the key distinction between the various servers offered (for more server details see chapter 10).

Most production needs can be met with an existing library server. Apache Thrift is open source, so even unique requirements can be met by customizing an existing server. Let’s look at a simplified Java program in the following listing that uses an Apache Thrift library server to support the HalibutTracking service.

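import org.apache.thrift.server.TServer;
import org.apache.thrift.server.TSimpleServer;
import org.apache.thrift.transport.TServerSocket;
import org.apache.thrift.transport.TServerTransport;
import org.apache.thrift.transport.TTransportException;

public class JavaServer {
    public static void main(String[] args) throws TTransportException {
        // HalibutTrackingHandler is the user-written class that implements the service interface
        HalibutTrackingHandler handler = new HalibutTrackingHandler();
        TServerTransport svrTransport = new TServerSocket(8585);  // listen on TCP port 8585
        HalibutTracking.Processor<HalibutTrackingHandler> processor = new HalibutTracking.Processor<>(handler);
        TServer server = new TSimpleServer(new TServer.Args(svrTransport).processor(processor));
        server.serve();  // blocks while the server handles client requests
    }
}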

Security

Summary
