einarvalur.co

AlthingiAggregator

The Icelandic Parliament (Althingi) maintains a service that provides information about its day-to-day activity. This includes sessions, speeches and the list of MPs, among other things. The service is an HTTP (REST-like) service that returns data in XML format. althingi-aggregator is a program that fetches these XML documents, then formats and validates them before handing them over to an external API for persistent storage.

The overall architecture.

To accomplish this, only three components are needed. A Provider, the thing that fetches the information. An Extractor that formats and validates the information. A Consumer that accepts the formatted information and knows how to persist it (or knows how to hand it over to something that knows how to persist it).

There are a few other moving parts to glue this all together: the Handler, which kick-starts everything; the Router, which selects the correct Handler; and the Service layer, which provides all the concrete implementations, plus a few other bits and bobs.

For the most part, it comes down to these three initial things. Next I'll talk about them in some detail.

The Provider.

The Provider interface is very simple. It only has a get method: provide it with a Uniform Resource Locator and it will return a DOMDocument (DOMDocument is an OOP way of representing XML documents in PHP).

interface ProviderInterface
{
    public function get(string $url): DOMDocument;
}

This is the only method that the rest of the application sees. To put it another way: a concrete implementation of this interface can add more methods, but the application will not call them, because the application is only presented with the ProviderInterface interface and is therefore only aware of this one method.

I'm saying this because we are going to add methods to the implementation of this interface. These methods will be called in the ServiceManager to configure the implementation so that it can do the job we want it to do.

To accomplish this, I'm turning to the SOLID principles.

These principles are often explained with the Bird example. Let's say you were creating an interface for a Bird. You might have a fly() method, a swim() method and a sing() method. Now you want to implement this interface for a Dove. Doves fly and sing (well, they tweet) but they don't swim. Right away you see that the interface does not align with the implementation. You brush it aside and construct an implementation for a Duck. This looks better: a duck can swim, fly and... does it sing or quack? If it quacks (which I think ducks do), then again the interface and the implementation don't line up. Lastly you want to implement a Penguin; well, they don't fly and they don't sing. At this point you go back to the drawing board, where you rediscover the Interface segregation principle and realize that you should define and implement it like this:

interface Bird {}

interface Singable {
    public function sing();
}

interface Flyable {
    public function fly();
}

interface Swimmable {
    public function swim();
}

class Dove implements Bird, Flyable, Singable {
    // Methods in a concrete class need a body, even an empty one
    public function fly() { /* ... */ }
    public function sing() { /* ... */ }
}

class Duck implements Bird, Flyable, Swimmable {
    public function fly() { /* ... */ }
    public function swim() { /* ... */ }
}

class Penguin implements Bird, Swimmable {
    public function swim() { /* ... */ }
}

The way this relates to the Provider is that, depending on where it will be used, the concrete class can be decorated with different interfaces.

An implementation for testing might need to be injected with data as an array, in which case the class signature could be

interface ArrayInjectable {
    public function setData(array $data);
}

class ProviderTest implements ProviderInterface, ArrayInjectable
{
    private array $data;

    public function setData(array $data): self {
        $this->data = $data;
        return $this;
    }

    public function get(string $url): DOMDocument {
        return convertArrayToDom($this->data);
    }
}
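The convertArrayToDom() helper used above is not shown in this post. A minimal sketch of what such a helper could look like, assuming a flat array of scalar values whose keys are valid XML element names (the $rootName parameter is also an assumption made for illustration):

```php
<?php

/**
 * Hypothetical helper: turn a flat key/value array into a DOMDocument.
 * Keys must be valid XML element names, values must be scalars.
 */
function convertArrayToDom(array $data, string $rootName = 'root'): DOMDocument
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $root = $dom->createElement($rootName);
    $dom->appendChild($root);

    foreach ($data as $key => $value) {
        $child = $dom->createElement((string) $key);
        // A text node escapes characters such as & and < for us.
        $child->appendChild($dom->createTextNode((string) $value));
        $root->appendChild($child);
    }

    return $dom;
}

$dom = convertArrayToDom(['name' => 'Tom & Jerry', 'id' => 42], 'person');
```

Nested arrays and numeric keys are deliberately left out; a test double rarely needs more than this.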

Because the main body of the application will only see the get() method (because only the ProviderInterface is exposed to the application), the Dependency inversion principle and Liskov substitution principle are satisfied.

To go from the abstract to the concrete: an implementation of the ProviderInterface that can be used to fetch information from althingi.is over an HTTP connection, cache the results and dispatch events will look like this:

class ServerProvider implements
    ProviderInterface,
    ClientAwareInterface,
    CacheableAwareInterface,
    EventDispatcherAware
{
    //... see https://github.com/fizk/althingi-aggregator
}

ClientAwareInterface provides a setter method for dependency injection, so that a PSR-18 HTTP client can be injected.

use Psr\Http\Client\ClientInterface;

interface ClientAwareInterface
{
    public function setHttpClient(ClientInterface $client): self;
}

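CacheableAwareInterface provides a setter method for dependency injection, so that a caching adapter can be injected. The signature below is typed against Laminas\Cache\Storage\StorageInterface rather than against a PSR-6 interface directly (this application uses Redis, but could use Memcached or any other backend without code changes).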

use Laminas\Cache\Storage\StorageInterface;

interface CacheableAwareInterface
{
    public function setCache(StorageInterface $cache): self;
}

EventDispatcherAware provides a setter method for dependency injection, so that a PSR-14 event dispatcher can be injected.

use Psr\EventDispatcher\EventDispatcherInterface;

interface EventDispatcherAware
{
    public function getEventDispatcher(): EventDispatcherInterface;

    public function setEventDispatcher(EventDispatcherInterface $eventDispatch): self;
}

An important point about the Provider is that it will happily fetch and return any XML document from any service. It is agnostic to the information it is fetching. It does not try to poke inside the payload. The only thing it is concerned about is whether the XML is well-formed, and that is about it.
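One way a Provider could perform that well-formedness check is with PHP's libxml error handling. This is a sketch and not the code from the repository; the function name parseXmlOrFail is made up for illustration:

```php
<?php

/**
 * Parse a string into a DOMDocument, failing loudly if it is not well-formed XML.
 * Hypothetical helper, not part of althingi-aggregator.
 */
function parseXmlOrFail(string $xml): DOMDocument
{
    if ($xml === '') {
        throw new RuntimeException('XML is not well-formed: empty document');
    }

    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadXML($xml);
    $errors = libxml_get_errors();

    // Restore the previous error-handling state before returning or throwing.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        $reason = $errors !== [] ? trim($errors[0]->message) : 'unknown error';
        throw new RuntimeException('XML is not well-formed: ' . $reason);
    }

    return $dom;
}
```

Note that nothing here looks at element names or values; the check stays payload agnostic.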

Summary: The Provider is responsible for fetching information and returning it as a DOMDocument. It accepts a URL through the get() method. Implementations of this interface can (and do) extend it to allow for caching and event delegation. The main implementation uses a PSR-compliant HTTP client. The Provider is payload agnostic: as long as the payload is well-formed XML, it will do its job.

The Extractor.

Here is where the main logic of the application lives. The role of The Extractor is to take a DOMElement, derived from the DOMDocument, extract the relevant information out of it, flatten it and store it in a PHP array (PHP arrays are similar to dictionaries in Python or objects in JavaScript; that is, a key/value pair structure).

The interface looks like this

use DOMElement;

interface ExtractionInterface
{
    public function populate(DOMElement $object): self;
    public function extract(): array;
}

There is another interface that is sometimes used in conjunction with ExtractionInterface called IdentityInterface.

interface IdentityInterface
{
    public function setIdentity(string $id): self;

    public function getIdentity(): string;
}

Together they can look like this:

use App\Extractor\ExtractionInterface;
use App\Lib\IdentityInterface;
use DOMElement;

class Assembly implements ExtractionInterface, IdentityInterface
{
    private string $id;
    private DOMElement $object;

    public function populate(DOMElement $object): self
    {
        $this->object = $object;
        return $this;
    }

    public function extract(): array
    {
        $this->setIdentity('1');
        return [
            // $this->object->getElementsByTagName(...)
        ];
    }

    public function setIdentity(string $id): self
    {
        $this->id = $id;
        return $this;
    }

    public function getIdentity(): string
    {
        return $this->id;
    }
}

To explain why IdentityInterface is needed and what happens inside an ExtractionInterface implementation, I'm going to give a brief intro to the information provided by althingi.is.

althingi.is provides non-normalized snippets of data in XML documents over HTTP. It is obviously a RESTful API that has evolved over a long period of time. It uses many different date formats, it duplicates data over many datasets, and when a specific property is not available it will sometimes use the empty string, sometimes an empty XML element and sometimes omit the element altogether.

The role of The Extractor is to make sense of the document, fill in the blanks when needed, unify date formats, convert empty strings, empty elements and omitted elements into nulls, and so on and so forth.
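To make that concrete, here is a rough sketch of two helpers an Extractor could lean on. Both function names, the tag names in the usage lines and the list of accepted date formats are assumptions made for illustration; the formats althingi.is really uses are not listed in this post.

```php
<?php

/**
 * Return the trimmed text of the first matching descendant element,
 * or null if the element is missing, empty or contains only whitespace.
 */
function textOrNull(DOMElement $parent, string $tagName): ?string
{
    $node = $parent->getElementsByTagName($tagName)->item(0);
    if ($node === null) {
        return null; // element omitted altogether
    }
    $value = trim($node->textContent);
    return $value === '' ? null : $value; // empty element or empty string
}

/**
 * Try a list of known input formats and return one unified format, or null.
 * The default formats are placeholders, not the real althingi.is formats.
 */
function normalizeDate(?string $raw, array $formats = ['Y-m-d\TH:i:s', 'Y-m-d', 'd.m.Y']): ?string
{
    if ($raw === null) {
        return null;
    }
    foreach ($formats as $format) {
        // The leading "!" resets unparsed fields (e.g. the time) to zero.
        $date = DateTimeImmutable::createFromFormat('!' . $format, $raw);
        // Round-trip check rejects overflowing dates such as 2020-02-31.
        if ($date !== false && $date->format($format) === $raw) {
            return $date->format('Y-m-d H:i:s');
        }
    }
    return null;
}

$dom = new DOMDocument();
$dom->loadXML('<mp><name> Jon </name><party/><born>01.02.1970</born></mp>');
$mp = $dom->documentElement;

$row = [
    'name'  => textOrNull($mp, 'name'),
    'party' => textOrNull($mp, 'party'),
    'email' => textOrNull($mp, 'email'),
    'born'  => normalizeDate(textOrNull($mp, 'born')),
];
```

Whatever shape the XML arrives in, the array handed to The Consumer then only ever contains a clean value or null.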

The Extractor therefore knows all about the dataset and is very much coupled with the domain. There is one Extractor for each type of data. They are very easy to unit-test: throw whatever variation of the XML file at an Extractor and test whether it can withstand all the discrepancies.

Behind the scenes, althingi.is is probably storing the data in a relational database with primary keys. These keys are part of the payload and will often show up as foreign keys in another dataset. It becomes important to capture these keys and identify them as unique identifiers. The IdentityInterface serves this purpose and becomes a vital clue for The Consumer (which we will look at next) to know under which key to store the extracted dataset. Extractors that implement IdentityInterface will provide a unique key to store the formatted dataset under; Extractors that don't put no constraints on how the dataset is stored. As we will see later, when exploring The Consumer, this decides whether it should use the HTTP POST or PUT verb.

Summary: The Handler will use The Provider to fetch an XML document, convert it into a DOMDocument, select the correct Extractor and populate that Extractor with data through populate(DOMElement $object) before handing it over to The Consumer.

The Consumer.

The final piece of the puzzle is The Consumer. Its job is to extract formatted data from The Extractor and persist it.

The Consumer has a very simple interface as well

use App\Extractor\ExtractionInterface;

interface ConsumerInterface
{
    public function save(string $storageKey, ExtractionInterface $extract);
}

There is only a single method, save, which takes a storage key and an Extractor implementation. This application uses an external API to persist the data; that is, The Consumer hands data over to an API for persistent storage. The role of The Consumer then becomes picking the right URL to call with the right HTTP verb and the right data. It can also take appropriate action depending on the HTTP response code.

The concrete implementation is called HttpConsumer and its signature looks like this

class HttpConsumer implements
    ConsumerInterface,
    ClientAwareInterface,
    CacheableAwareInterface,
    UriAwareInterface,
    EventDispatcherAware
{
    //... see https://github.com/fizk/althingi-aggregator
}

This looks pretty similar to The Provider. The interfaces ClientAwareInterface, CacheableAwareInterface and EventDispatcherAware were all described in The Provider section. The only outstanding interface is UriAwareInterface. Think of it as a ValueObject that holds the host name and protocol of the API that will be receiving the data.

The rest of the logic within HttpConsumer's save method would go something like this:

  1. Call extract() on the $extract
  2. Check if $extract implements IdentityInterface
    • If so: call getIdentity() and append the result to $storageKey and use the PUT verb when calling external API
    • Else: just use $storageKey and issue the POST verb
  3. Append the $storageKey to UriAwareInterface and pass that along with the data array from $extract into ClientAwareInterface. Make a request to the external API and wait for the response.

The StorageKey and Identity.

I have touched on IdentityInterface and $storageKey in passing, but let me clarify them here. Let's say we were implementing an API to store musical artists and their albums/releases. The logical URL structure would be to have an endpoint called /artists. It would probably list all artists in the system. Next we would want /artists/:id, which would return a single artist.

We don't want the API to be responsible for creating a unique ID for our artists... we don't want /artists/5tyq467bgxgfh, we want /artists/the-beatles. In that case we construct the URL /artists/the-beatles and issue a PUT request along with the data. Some people think of the PUT verb as an update, but what it really is, is a replace: if the API has something stored under that unique key, it will replace it with the incoming data, otherwise it will create a new record under this key. The most important part is that when issuing a PUT request, the whole dataset needs to be provided, because the receiver will delete the old record and store the new one. The API should still tell us what it did. If it needed to create a new record, it should return a 201 Created; if it needed to replace one, it could return a 205 Reset Content (the HTTP specification itself suggests 200 OK or 204 No Content for a successful replace).

Next, the API will also accept user comments. For that, there is a /artists/:id/comments URL. In this case, we are happy with the API creating a unique key for us. The URL /artists/the-beatles/comments/67ghslrt76 looks fine. To create a comment, we would use the POST verb and call the /artists/the-beatles/comments endpoint. The API would create a new comment for us and return a Location header.

HTTP/1.1 201 Created
Location: /artists/the-beatles/comments/67ghslrt76
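The difference between the two verbs can be sketched with a tiny in-memory stand-in for the API. The InMemoryApi class is purely hypothetical and only mimics the behaviour described above, including the 201/205 status codes:

```php
<?php

/** Hypothetical in-memory stand-in for the storage API, for illustration only. */
final class InMemoryApi
{
    /** @var array<string, array> records keyed by their full path */
    private array $records = [];

    /** PUT: the caller picks the key; the stored record is replaced wholesale. */
    public function put(string $path, array $data): int
    {
        $status = array_key_exists($path, $this->records) ? 205 : 201;
        $this->records[$path] = $data; // no merging with the old record
        return $status;
    }

    /** POST: the API picks the key and reports it back as a location. */
    public function post(string $collection, array $data): array
    {
        $location = $collection . '/' . bin2hex(random_bytes(5));
        $this->records[$location] = $data;
        return ['status' => 201, 'location' => $location];
    }

    public function get(string $path): ?array
    {
        return $this->records[$path] ?? null;
    }
}

$api = new InMemoryApi();

$first  = $api->put('/artists/the-beatles', ['name' => 'The Beatles', 'formed' => 1960]);
$second = $api->put('/artists/the-beatles', ['name' => 'The Beatles']);
// After the second PUT the 'formed' field is gone: the record was replaced, not updated.

$comment = $api->post('/artists/the-beatles/comments', ['text' => 'Great band']);
```

The second PUT is the reason the whole dataset must always be sent: anything left out is lost.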

Let's put this in the context of the Aggregator. Here is the Artist Extractor:

class Artist implements ExtractionInterface, IdentityInterface
{
    private string $id;
    private DOMElement $object;

    public function populate(DOMElement $object): self
    {
        $this->object = $object;
        return $this;
    }

    public function extract(): array
    {
        $this->setIdentity(
            $this->object->getElementsByTagName('name')->item(0)->nodeValue
        );
        return [
            // ... rest of the data
        ];
    }

    public function setIdentity(string $id): self
    {
        $this->id = $id;
        return $this;
    }

    public function getIdentity(): string
    {
        return $this->id;
    }
}

Notice that this Extractor implements IdentityInterface which will tell The Consumer that we want to control the unique identifier and that The Consumer should use the PUT verb. Using this Extractor would then look like this

$data; //comes from The Provider
$extractor = new Artist();
$extractor->populate($data);

$consumer = new HttpConsumer();
$consumer->save('/artists', $extractor);

Inside of the HttpConsumer things would look like this:

public function save(string $storageKey, ExtractionInterface $extract) {
    // extract() has to run first: it is what sets the identity
    $data = $extract->extract();

    if ($extract instanceof IdentityInterface) {
        $this->url->setPath($storageKey . '/' . $extract->getIdentity());
        // The URL becomes `/artists/the-beatles`
        $request = new Request($this->url, 'PUT', http_build_query($data));
        // and the verb becomes PUT
        $this->client->sendRequest($request);
    } else {
        // ..
    }
}

For the Comment Extractor, it will look like this:

class Comment implements ExtractionInterface
{
    private DOMElement $object;

    public function populate(DOMElement $object): self
    {
        $this->object = $object;
        return $this;
    }

    public function extract(): array
    {
        return [
            // ... rest of the data
        ];
    }
}

It is used in a similar way:

$data; //comes from The Provider
$extractor = new Comment();
$extractor->populate($data);

$consumer = new HttpConsumer();
$consumer->save('/artists/the-beatles/comments', $extractor);

...and here is what the HttpConsumer would look like.

public function save(string $storageKey, ExtractionInterface $extract) {

    if ($extract instanceof IdentityInterface) {
        // ...
    } else {
        $this->url->setPath($storageKey);
        // The URL becomes `/artists/the-beatles/comments`
        $request = new Request($this->url, 'POST', http_build_query($extract->extract()));
        // and the verb becomes POST
        $this->client->sendRequest($request);
    }
}

The flow through HttpConsumer is a bit more complicated. Rather than typing it all out I have left a decision-tree for you to walk through.

[Figure: decision tree for HttpConsumer. Decision nodes: is IdentityInterface, is cached, 2xx response, Has Location Header, 409 response. Actions: PUT, POST, PATCH, add to cache. Outcomes: Done, Error.]

Summary: The responsibility of The Consumer is to consume The Extractor and by looking at its construction, determine how to persist the data. The $storageKey and IdentityInterface play a role in how that decision is made.

Handlers, Router and the ServiceManager.

Now comes the time to glue this all together. First off, all requests go to index.php, where the request is mapped to a Handler via The Router. The Router does a little bit of magic where it extracts values out of the request parameters. But for our intents and purposes, we can think of The Router as an array where the key is the request and the value is a pointer to The Handler.

$routes = [
    'request1' => HandlerOne::class,
    'request2' => HandlerTwo::class,
];
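Looking a handler up in such an array is then a one-liner. The sketch below assumes the simplified array-as-router picture from above; resolveHandler is a made-up name and the two handler classes are empty placeholders:

```php
<?php

// Empty placeholders standing in for real handlers.
final class HandlerOne {}
final class HandlerTwo {}

/** Return the handler class name registered for a request key. */
function resolveHandler(array $routes, string $request): string
{
    if (!array_key_exists($request, $routes)) {
        throw new OutOfBoundsException('No handler registered for ' . $request);
    }
    return $routes[$request];
}

$routes = [
    'request1' => HandlerOne::class,
    'request2' => HandlerTwo::class,
];

$handlerClass = resolveHandler($routes, 'request2');
```

The router only returns a class name; turning that name into a configured object is left to the ServiceManager.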

A handler is a PSR-15 interface, which is very simple:

namespace Psr\Http\Server;

use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\ServerRequestInterface;

interface RequestHandlerInterface
{
    public function handle(ServerRequestInterface $request): ResponseInterface;
}

It simply says: give me a Request via the handle method, I will process it and give you a Response. For The Handler to do its job, it needs additional resources, and that's where The ServiceManager comes in.

A wise person once said:

If you have the keyword new in your code, you are doing something wrong.

I wouldn't go as far..., but it is true that if you instantiate a class inside your code, you will have a hard dependency of that particular implementation and you will not be able to switch it out (when you are doing testing for example).

As discussed previously, a Handler will require a ProviderInterface and a ConsumerInterface, agnostic of the implementation. The Handler should be able to do its job whether The Provider is a ServerProvider, a ReadFromDiskProvider or a ForUnitTestingProvider. The same goes for The Consumer. So instead of doing this:

class HandlerOne implements RequestHandlerInterface
{
    public function handle(ServerRequestInterface $request): ResponseInterface
    {
        $provider = new ServerProvider(); // <-- wrong
    }
}

...The Handler is decorated with ProviderAwareInterface to make The Handler aware of its capabilities.

namespace App\Provider;

interface ProviderAwareInterface
{
    public function setProvider(ProviderInterface $provider): self;
}

And then The Handler becomes

class HandlerOne implements RequestHandlerInterface, ProviderAwareInterface
{
    private ProviderInterface $provider;
    public function handle(ServerRequestInterface $request): ResponseInterface
    {
        $provider = $this->provider; // <-- correct
    }

    public function setProvider(ProviderInterface $provider): self
    {
        $this->provider = $provider;
        return $this;
    }
}

This is known in the industry as dependency injection; more specifically, interface injection.

This is where the ServiceManager comes in. Sometimes called The Container, a ServiceManager contract is defined in the PSR-11 specs.

namespace Psr\Container;

interface ContainerInterface
{
    public function get($id);
    public function has($id);
}

It then requires a config file that defines which classes to instantiate and which dependencies to inject into which instances. Historically, config files were long and complicated XML files (Java Spring). For languages that allow it, dependency injection has moved into annotations (Java Spring), and other languages have started defining their configs in code. Laminas\ServiceManager\ServiceManager organizes its config into keys that map to factory functions, each of which returns a configured instance of the desired class. To put it simply: calling the get method on the ContainerInterface will produce a concrete and configured implementation of the interface that was requested.

$serviceManager = new ServiceManager([
    'factories' => [
        App\Handler\HandlerOne::class => function (ContainerInterface $container) {
            return (new App\Handler\HandlerOne())
                ->setProvider($container->get(App\Provider\ProviderInterface::class))
                ->setConsumer($container->get(App\Consumer\ConsumerInterface::class))
                ;
        },
        App\Provider\ProviderInterface::class => function (ContainerInterface $container) {
            return (new App\Provider\ServerProvider())
                ->setHttpClient($container->get(Psr\Http\Client\ClientInterface::class))
                ->setCache($container->get('ProviderCache'))
                ->setEventDispatcher($container->get(Phly\EventDispatcher\EventDispatcherInterface::class))
                ;
        },
        App\Consumer\ConsumerInterface::class => function (ContainerInterface $container) {
            return (new App\Consumer\HttpConsumer())
                ->setHttpClient($container->get(Psr\Http\Client\ClientInterface::class))
                ->setCache($container->get('ConsumerCache'))
                ->setUri(new Laminas\Diactoros\Uri())
                ->setEventDispatcher($container->get(Phly\EventDispatcher\EventDispatcherInterface::class))
                ;
        },
        Psr\Http\Client\ClientInterface::class => function (ContainerInterface $container, $requestedName) {
            return new Curl(new Psr17Factory())
            ;
        },
        Phly\EventDispatcher\EventDispatcherInterface::class => function (ContainerInterface $container, $requestedName) {
            return new Phly\EventDispatcher\EventDispatcher()
            ;
        },
        'ConsumerCache' => function () {
            return new Laminas\Cache\Storage\Adapter\Redis()
            ;
        },

        'ProviderCache' => function () {
            return new Laminas\Cache\Storage\Adapter\Redis()
            ;
        },
    ]
]);

echo $serviceManager->get(App\Handler\HandlerOne::class)->handle(new Request());

This might look a little bit verbose, but what is happening is this: starting from the initial request for the handler, each factory function recursively calls The ServiceManager to fulfill each dependency. Maybe it is simpler to see it in a graph.

[Figure: dependency graph starting at the initial request. Nodes: HandlerOne, <ProviderInterface> ServerProvider, <ConsumerInterface> HttpConsumer, <ClientInterface> Curl, {ProviderCache} Redis, {ConsumerCache} Redis, <EventDispatcherInterface> EventDispatcher, Psr17Factory, Uri.]
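The recursion is easier to follow in a stripped-down container. The sketch below is not Laminas\ServiceManager; TinyContainer and the Fake* classes are invented for illustration, and it uses constructor injection for brevity rather than the setter interfaces shown earlier. Remembering built instances is also an assumption of this sketch.

```php
<?php

// Hypothetical stand-ins for the real classes.
final class FakeCache {}

final class FakeProvider
{
    public FakeCache $cache;
    public function __construct(FakeCache $cache) { $this->cache = $cache; }
}

final class FakeHandler
{
    public FakeProvider $provider;
    public function __construct(FakeProvider $provider) { $this->provider = $provider; }
}

/** A tiny container: maps ids to factory closures and remembers what it built. */
final class TinyContainer
{
    private array $factories;
    private array $instances = [];

    public function __construct(array $factories)
    {
        $this->factories = $factories;
    }

    public function has(string $id): bool
    {
        return isset($this->factories[$id]);
    }

    public function get(string $id)
    {
        if (!$this->has($id)) {
            throw new RuntimeException('No factory registered for ' . $id);
        }
        // Build on first use; the factory may call get() again for its own dependencies.
        if (!isset($this->instances[$id])) {
            $this->instances[$id] = ($this->factories[$id])($this);
        }
        return $this->instances[$id];
    }
}

$container = new TinyContainer([
    FakeHandler::class  => fn (TinyContainer $c) => new FakeHandler($c->get(FakeProvider::class)),
    FakeProvider::class => fn (TinyContainer $c) => new FakeProvider($c->get('ProviderCache')),
    'ProviderCache'     => fn () => new FakeCache(),
]);

// One call at the top pulls in the provider, which in turn pulls in the cache.
$handler = $container->get(FakeHandler::class);
```

Swapping FakeProvider for another implementation only means changing one factory; FakeHandler never notices.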

Summary: The Router will pick the relevant Handler depending on the command given to the application. Once The Router has a pointer to a Handler, it will request a configured instance of The Handler, provided with all its dependencies, from the ServiceManager. After that it will execute The Handler and retrieve a Response.

Cache and Events.

You might have noticed that HttpConsumer and ServerProvider have CacheableAwareInterface and EventDispatcherAware dependencies. I guess that CacheableAwareInterface is self-explanatory. It allows a caching adapter to be injected into The Consumer and The Provider so they can use a cached version of the data if they have it. For The Provider, it means that althingi.is doesn't have to be called if there is already a valid version of the file in the cache. For The Consumer, it means that if nothing has changed in a dataset since the last time it was sent to the persisting system, then there is no need to bother that system with data it already has.
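The Consumer side of this can be sketched as a fingerprint comparison. This is an assumption about one way to do it, not a description of the real HttpConsumer; shouldSend is a made-up name and a plain array stands in for the cache adapter:

```php
<?php

/**
 * Decide whether a dataset needs to be sent, and remember its fingerprint if so.
 * $cache is a plain array here; a real implementation would use a cache adapter
 * and would presumably store the fingerprint only after the API confirmed the write.
 */
function shouldSend(array &$cache, string $storageKey, array $data): bool
{
    $fingerprint = md5(serialize($data));

    if (isset($cache[$storageKey]) && $cache[$storageKey] === $fingerprint) {
        return false; // same payload as last time, nothing to do
    }

    $cache[$storageKey] = $fingerprint;
    return true;
}

$cache = [];

$firstRun  = shouldSend($cache, '/artists/the-beatles', ['name' => 'The Beatles']);
$secondRun = shouldSend($cache, '/artists/the-beatles', ['name' => 'The Beatles']);
$changed   = shouldSend($cache, '/artists/the-beatles', ['name' => 'Beatles, The']);
```

Only the first and the last call would result in a request to the persisting system.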

While the system is running, there are a lot of events happening that might be of interest to us: a file was fetched, the HTTP client timed out... things like that.

Let's for a moment imagine that we want to log when ServerProvider fetches a file from althingi.is. That's no problem, we would just make ServerProvider dependent on the PSR-3 Logger Interface and inject it in the ServiceManager. Now let's say we want to implement a mechanism so that if the system crashes midway through, it can pick up where it left off. For that we need to know which file was last fetched from althingi.is. That would also depend on the file-fetched event to maintain state. We could implement that logic and inject it into ServerProvider. The list of dependencies is growing, and so is the complexity of the code inside of the ServerProvider.

A better way would be to extract that logic out of ServerProvider and use a PSR-14: Event Dispatcher to notify us if an event occurs in The Provider.

In our first example, the code would look like this:

class ServerProvider implements ProviderInterface, EventDispatcherAware
{
    private function httpRequest(string $url)
    {
        $request = (new Request())
            ->withUri(new Uri($url));
        $response = $this->client->sendRequest($request);

        $this->getLogger()->info([$request, $response]);
        $this->getState()->saveState($request, $response);

        //any new requirements/features need to be added here.
        // ...
    }
}

Every time we need new functionality, we need to change our code. That breaks the Open–closed principle which is one of the SOLID principles.

"[...] software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification"; that is, such an entity can allow its behavior to be extended without modifying its source code.

Here is a much cleaner version.

class ServerProvider implements ProviderInterface, EventDispatcherAware
{
    private function httpRequest(string $url)
    {
        $request = (new Request())
            ->withUri(new Uri($url));
        $response = $this->client->sendRequest($request);

        // ...here is where the magic happens
        $this->getEventDispatcher()
            ->dispatch(new ProviderSuccessEvent($request, $response));
    }
}

Over in The ServiceManager configuration file

    EventDispatcherInterface::class => function (ContainerInterface $container) {
        $logger = $container->get(Psr\Log\LoggerInterface::class);

        $provider = new AttachableListenerProvider();

        $provider->listen(ProviderSuccessEvent::class, function (ProviderSuccessEvent $event) use ($logger) {
            $logger->debug((string) $event);
            // Additional logic can be added here without touching the code where the
            // event happens
        });

        return new EventDispatcher($provider);
    },

This can also be illustrated with an image. On the left hand side, each dependency is injected into the system where it is responsible for calling each command. There are a lot of lines that are going between the system and The ServiceManager representing dependencies.

On the right hand side, there is only one dependency, namely the EventDispatcherInterface. It will dispatch an event over to The ServiceManager, where all the dependencies live. This means that more commands can be added without changing the system's code.

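[Figure: two diagrams of a system and a service manager. Left: dependency 1, dependency 2 and dependency 3 are each injected into the system, which calls command1(), command2() and command3() itself. Right: the system only calls dispatch() on an event manager; the event reaches the service manager, where dependency 1, dependency 2 and dependency 3 run command1(), command2() and command3().]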

Currently, in this system, this is only used for logging but could be extended.
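As a sketch of what such an extension could look like, here is a bare-bones dispatcher with two listeners on the same event: one that logs, and one that remembers the last fetched URL. SimpleDispatcher and FileFetched are invented for illustration (the real system uses a PSR-14 library), and the URL is a placeholder.

```php
<?php

/** Hypothetical event carrying the URL that was fetched. */
final class FileFetched
{
    public string $url;
    public function __construct(string $url) { $this->url = $url; }
}

/** Bare-bones dispatcher: listeners are registered per event class. */
final class SimpleDispatcher
{
    /** @var array<string, callable[]> */
    private array $listeners = [];

    public function listen(string $eventType, callable $listener): void
    {
        $this->listeners[$eventType][] = $listener;
    }

    public function dispatch(object $event): object
    {
        foreach ($this->listeners as $eventType => $listeners) {
            if ($event instanceof $eventType) {
                foreach ($listeners as $listener) {
                    $listener($event);
                }
            }
        }
        return $event;
    }
}

$log = [];
$lastFetched = null;

$dispatcher = new SimpleDispatcher();

// Listener 1: logging.
$dispatcher->listen(FileFetched::class, function (FileFetched $event) use (&$log) {
    $log[] = 'fetched ' . $event->url;
});

// Listener 2: added later, without touching the code that dispatches the event.
$dispatcher->listen(FileFetched::class, function (FileFetched $event) use (&$lastFetched) {
    $lastFetched = $event->url;
});

$dispatcher->dispatch(new FileFetched('https://example.org/assembly.xml'));
```

The code that dispatches FileFetched stays the same no matter how many listeners are attached to it.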

Conclusion.

In a simple system like this there are many design decisions to be made. This is by far not the only way to write such a system, but hopefully my decisions make some sense.