{"url":"/","title":"Shipping Software Products","text":"Shipping Software Products Hi! I'm Rinat Abdullin. Through my software career I have been involved in everything from warehouse management to social platforms and international logistics. These days customers come to me for technical advice on shipping LLM-driven products faster . See About me for more details or check out BitGN - a platform for agent benchmarks & challenges. I work worldwide from Vienna, Austria 🇦🇹. My newsletter \" ML Under the Hood \" talks about ChatGPT and building ML-driven software products. It covers important news, technical insights, and interesting product cases. Want to get started with ChatGPT/LLM? Check out ChatGPT quick-start guide for developers ! It is concise and beginner-friendly. Blog Sections 🔥 Schema-Guided Reasoning (SGR) - improve LLM accuracy by forcing it to think according to the predefined plan. Shipping products with LLMs and ChatGPT - answers to the frequently asked questions (also benchmarks) ♻️ Event Sourcing - event sourcing and CQRS tutorials and explanations. 🤔 Opinionated Tech - thoughts about software design and tech. 💊 Simulation - dive into deterministic simulation of actors, clusters and logistic systems. It is awesome! 🦾 Robotics - story about trying to build a robot hand. 📦 SkuVault - evolving and scaling online warehouse management system (C# .NET and a lot of event sourcing at scale). 👍 HappyPancake - scaling the largest dating website in Sweden. 🎙️ Being the Worst - adventures in practical software design with Kerry and Rinat. Blog Updates 🔥 New experiment in 2026 How to talk to your knowledge base? My team has no experience with ML/GPT. How do we proceed? How to get into ML for a developer? ChatGPT quickstart for developers How to segment texts for embeddings? Better performance with DOD Black Friday Experiment - Report 1 You can find my older posts in the archive ."}
{"url":"/2015/","title":"2015 Retrospective","text":"2015 Retrospective Our family almost doubled in size in 2015: we took a pug puppy in May and our daughter was born in July (events are ordered chronologically and not by importance). This was a wonderful experience which completely changed the way we perceive life, spend days and nights. As a side effect, time became a very limited commodity for us, continuous sleep - a fairy tale from a long ago. Time constraints influenced my approach to software development and continuous learning - there was much less coding in spare time, but more reading, learning and improving existing skills. SkuVault At SkuVault we are gradually maturing V2 project branch. It is in closed beta right now (full release of its features is expected in Q1'16). However, some ideas that proved themselves in V2 are already being taken back into V1. While working on V2 we also have to maintain V1 and all systems in between. So I had to learn bits of: InfluxDB time-series database (which felt like a more robust alternative to Carbon/Graphite for real-time metrics at a scale); R and plenty of libraries in the ecosystem - for performing data-driven research for devops, development and marketing; JavaScript with ES6, ReactJS, Webpack and all the joy of building interactive frontends without ASP.NET and IIS; even more Linux (mostly Ubuntu) for managing devops infrastructure. In the long term, I look forward to the slow (because we have to be extra careful) process of scaling SkuVault, while adding more features and unhooking it from Azure. Plus, there is a lot of work to be done with capturing data coming from SkuVault, analyzing it and leveraging to improve the product further. Cooking I think, I perfected our cheese cake recipe (using ingredients still available in Russian stores despite the sanctions) and got hang of making reasonable steaks in our electrical oven. I'd love to try making some cured meat in 2016. Erlang I invested time in learning me some more Erlang . It is a very nice language with a lot of value in OTP libraries. I don't think, I'll get to apply Erlang practically any time soon, however lessons already affected they way I think, reason and design distributed systems in the other languages. Emacs and Org-mode I believe, I became slightly better at managing everyday flows of tasks, problems and information in an unpredictable environment. All credit goes to Time Management for System Administrators by Limoncelli, Moleskine notebook and awesome org-mode in Emacs. Here's the layout: A bunch of org-mode files contains all my notes, tasks, projects and calendar entries (including recurring events and routines). I try to review and rectify them on a regular basis. All work hours are tracked in org-mode as well, they are used for invoicing. I use outstanding org-mode agenda to get a high-level overview of current projects and planned tasks, while planning my next day in Moleskine. Things are crossed out from the Moleskine list as I complete them. Unfinished tasks are moved to the next day or scheduled further down the road via org-mode. Any new information coming in (e.g. \"buy diapers tomorrow\" or \"plan driving school next month\") is logged immediately either to voice recorder app or in notebook, to get them out of my mind. As a part of my daily routine (I'm still trying to make it a routine), I incorporate them into my digital notes and agenda. Stand-alone projects get their own org files. They are also added to the global agenda list. 
This way, if any of the files contains tasks or scheduled entries, they show up in the agenda views. org-mode also has great support for literate programming, making it trivial to intermix executable code with prose. This affected the way I write about programming and learn. My Emacs config and Erlang Basics are both examples of literate programming. I also use Emacs Speaks Statistics with org-mode and R to perform documented and reproducible data research. As you can see, I'm spending a lot of time in Emacs. Almost all of my development (except for .NET) happens in there as well. Mastering this environment and tuning it makes me a better developer, so I'm definitely going to continue this trend in 2016. JVM and Clojure .NET is a nice runtime with a decent ecosystem. It currently is the development tool I'm most familiar with. However, it lacks some features and libraries which are available in the Java world (e.g. the Apache Kafka consumer library, proper bindings for LevelDB/RocksDB, libraries for sending real-time tagged metrics etc). Besides, managing servers at scale is more pleasant with Linux than with Windows (and Java is a first-class citizen there). My previous attempts to get into the Java world were spoiled by C# and F#, which are nice .NET languages, one with strong support for object-oriented programming, the other with an emphasis on functional programming style. Compared to these primary .NET languages, Java 8 felt too verbose, while Scala felt too complex and slow to compile (even with a background compilation daemon). However, this year I finally paid attention to Clojure, a dialect of Lisp developed for the JVM and JS by Rich Hickey. Clojure places a strong emphasis on simplicity, functional programming, immutability and interoperability with the existing Java ecosystem. There is also a version of Clojure for the CLR, which isn't as popular as the JVM or JS versions. In fact, if you look beyond the funny Lisp syntax with tons of brackets, you encounter the simplicity of golang (with go-like channels and light-weight processes), backed by the JVM ecosystem and the templating capabilities of Lisp.

(defn old? [blob period]
  (let [date  (c/from-date (.getLastModified (.getProperties blob)))
        limit (t/minus (t/now) period)]
    (t/before? date limit)))

Yes, the syntax is very funny and unusual. This is because code is data in Lisp (and I'm still grasping all the implications of that). I'm not sure if I will end up using Clojure in my everyday work, however learning it will make me a better developer. I'm sure of it. It is a nice coincidence that Emacs itself is an interpreter for a different flavor of Lisp: ELisp. Mastering Clojure would make it easier to customize Emacs itself. There are a few tricks with Clojure macros and evolving neural networks that I want to give a shot :) Internet of Things When Maya was born, we realized that we needed to measure temperature and humidity around her crib. This became especially important during the winter period. \"Thanks\" to the central heating and cold winters, humidity could easily drop to 10% (very bad for the health of the family). I started playing with a DHT22 temperature/humidity sensor and Arduino/ATTiny controllers to capture the data. However, it turned out that reading data from an OLED display in an inconvenient location is no fun. A few iterations down the road I discovered the ESP8266 controller - a tiny chip smaller than a quarter. It has a Wi-Fi module and a programmable micro-controller. 
That baby has everything needed to grab temperature/humidity readings from the sensor and dump them into the InfluxDB time-series database via our Wi-Fi network. Then it all gets charted on a Grafana dashboard. Obviously, this is just a start. There are a few other things I could do to improve our home IT infrastructure (e.g. better control of the music player daemon, better tracking of our baby's sleep patterns etc). I believe this year could see a few more hardware projects, more closely related to everyday programming. The End I didn't achieve much in the area of software development in 2015; rather, I discovered things I want to learn more about in 2016. This already makes 2016 a more interesting year :) Unfortunately, some areas of my life were sacrificed in 2015 to free up room for work: writing, blogging and sports. I hope to bring them back to life in the upcoming year."}
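For a sense of scale, the whole sensor loop described above fits in a dozen lines. A minimal sketch in MicroPython (rather than the Arduino C the post implies); the pin number, host address and database name are placeholder assumptions:

import time
import dht
import machine
import urequests

# DHT22 data line on GPIO4 - an assumption, adjust to the actual wiring
sensor = dht.DHT22(machine.Pin(4))

while True:
    sensor.measure()
    # InfluxDB line protocol: measurement,tag field=value,field=value
    line = "climate,room=nursery temperature={:.1f},humidity={:.1f}".format(
        sensor.temperature(), sensor.humidity())
    # hypothetical InfluxDB write endpoint on the home network
    urequests.post("http://192.168.1.10:8086/write?db=home", data=line).close()
    time.sleep(60)  # one reading per minute is plenty for room climate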
{"url":"/2024-webinar-ai-assistants-4/","title":"2024-webinar-ai-assistants-4","text":"2024-webinar-ai-assistants-4 Вебинар “Проектируем и внедряем AI-ассистенты с LLM под капотом” Для кого: для инженеров, которые руководят созданием подобных ассистентов. А еще - для продактов и сейлзов, которым нужно понимать возможности и ограничения подобных продуктов. О чем: Мы посмотрим на работу и реализацию одного такого ассистента на конкретной задаче в отдельной предметной области. Это будет end-to-end кейс с реальными документами. Помимо кода, мы обсудим с вами принципы построения и внедрения подобных систем, от первого AI навыка до создания платформ. Поговорим про контроль качества, сбор данных и построение работы с пользователями. Расскажу про Knowledge Maps для адаптирования ассистентов с LLM под капотом к работе в конкретных доменных областях. Время проведения (московское): 7 марта, четверг, 13:00 GMT+3 (места закончились) 9 марта, суббота, 15:30 GMT+3 (места закончились) 5 апреля, пятница, 11:00 GMT+3 (места закончились) 25 апреля, четверг, 11:00 GMT+3 (места закончились) Update : места в последний поток закончились. Если вы не попали на вебинар - не расстраивайтесь! В дальнейшем я планирую записать лекционную часть вебинара, без обсуждения кейсов участников, специально для более широкой аудитории и на английском языке. Ее можно будет купить отдельно. Длительность - 2 часа. Язык - русский. Что нужно для участия? Для участия вам понадобится телеграм-аккаунт: вся основная информация, вопросы, обсуждения будут в телеграм-чате потока. После оплаты вам придет персональный инвайт в чат. Вебинар будет проводиться в Zoom. Будет ли запись? Запись будет доступна участникам вебинара только для персонального использования - мы обсуждаем рабочие кейсы, и не все участники хотели бы ими делиться на публику. Пожалуйста, не делитесь этой записью. Участники получат запись вебинара, презентацию, ноутбуки из демо-кейса и архив с документами для задачи, которую мы будем рассматривать. Можно ли купить запись вебинара отдельно? Конкретно эта запись продаваться не будет - чтобы можно было свободно обсуждать рабочие кейсы на вебинаре. В дальнейшем я сделаю отдельную запись лекционной части этого вебинара, специально для более широкой аудитории. Хочу, чтобы мой кейс разобрали на вебинаре, что нужно сделать? После оплаты вебинара вам на емейл придет приглашение в чат потока - можете прислать кейс туда, я прочитаю его заранее и постараюсь на вебинаре или в чате рассказать, как сам решал бы подобные задачи. Если кейсов будет много, то на вебинаре выберу несколько показательных или кластеризую похожие кейсы. Можно ли оплатить участие другого человека? Да, только обязательно укажите его емейл при оплате. После оплаты вам придет email со ссылкой на закрытый телеграм-чат потока. Если вдруг по какой-то причине емейл не пришел в течение 30 минут, пишите мне на rinat @ abdullin.com (но сначала, как водится, проверьте папку “спам”). Совершая оплату, вы соглашаетесь с условиями проведения вебинара (Terms & Conditions)."}
{"url":"/404.html","title":"404 - Not Found","text":"404 - Not Found I'm really sorry about that. You arrived at a page that was moved to another location or dropped completely for being no longer relevant. Ping me on twitter and I'll locate the article for you. Or just check out the archive ."}
{"url":"/about-me/","title":"Rinat Abdullin","text":"Rinat Abdullin LinkedIn | Twitter | Github | rinat @ abdullin . com Hello! I'm a technical advisor. I help ship LLM-driven products faster. You might know me from LLM Product Benchmarks , Schema-Guided Reasoning or from the Enterprise RAG Challenge . These days I focus on developing BitGN - a Platform for Benchmarks and Challenges around Autonomous Agents. I specialize in identifying patterns for AI solutions in business and driving them to practical application. I work with teams in EU and the USA, helping them to build LLM-driven business products and ship them faster. This is the current snapshot of successful and proven AI Cases that I currently have in my portfolio. Darker colors mean more cases. High-level perspective across industries and AI Cases gives me systemic insights into the approaches that work. I guide companies to leverage these insights, and they save months of effort on the way to market. If you have an interesting AI Case to complement this matrix or would like to know more about proven applications - feel free to reach out for a chat via rinat @ abdullin . com . I don't take any development projects . However, for large-scale enterprise projects I can refer to TIMETOACT GROUP Austria (and the entire international GROUP). I'm Head of ML and Innovation there, working to establish an institute of building LLM-driven business solutions that are based on relevant real-world practices. Longer Story I have been working in commercial projects since 2003. Through my career, I have been involved in various domains from warehouse management to social platforms to international logistics. This exposure allows me to understand business problems across domains. Helping to set up IT development offices for 4 different companies gave me a deeper insight into the challenges of organisational dynamics, leadership and building healthy teams. I have worked with software startups for the most of my career. This created a strong connection with product development, a focus on value delivery and adaptation to changing market conditions. Along the way, I have been lucky to learn from outstanding people and work with them on the exciting projects: contributed to the book on Implementing Domain-Driven Design ; designed architecture of a social network ; helped to deliver big data analytics for retail in the cloud ; helped to scale multi-channel inventory management software past 1.5B events; delivered real-time analytics for a social network ; helped to build a data science infrastructure for an international transport organization for the whole past year, I've been working on my AI Research: a series of interviews with companies on adopting LLMs for solving their business challenges. I used to blog and occasionally talk at the conferences about CQRS, Event Sourcing, Domain-Driven Design and scaling event-driven systems . You can check out my newsletter ML Under the Hood and latest blog posts to find things I'm passionate about. These days, I'm focused on Machine Learning. I help companies to build and ship ML-driven software products faster. If you want to talk, ask a question or work with me, don't hesitate to write me an email at rinat @ abdullin . com . I'm learning a lot from the smart and kind people around the world. I'd be glad to help out someone as well."}
{"url":"/about-this-site/","title":"About This Site","text":"About This Site This website started around year 2007 as a simple blog. It went through multiple changes since then. This is the story behind it. History Initially I hosted blob posts on open platforms like LiveJournal and Blogger . They tended to be limiting, clunky and with advertisements. At some point I discovered Linux for the first time and decided to give a try to WordPress . It worked out well for a couple of years, till I got tired from maintaining it. Some time later, commercial Squarespace platform attracted me with good performance and sleek designs. It was hassle-free and enjoyable for some time. At some point, I got tired with limitations of the platform and their pricing. Around the year 2014 I imported my posts to Octopress . Octopress is a blogging framework for Jekyll - static site generator powering GiHub pages. I loved the idea of static websites: low maintenance and high availability; great performance; hosting prices are negligible. However Octopress was slow and it was running on Ruby. I experienced occasional frustrating issues with runtime versions and ruby gems. Hugo While working on HappyPancake project, I discovered the ecosystem of Go. Go (or golang) is an opinionated language, introduced by Google in year 2007. It is [[L2 golang|simple, fast and compiles to native code]]. It wasn't long before Hugo got my attention. This static website generator was really fast, building full site out of my 200+ posts in less than a second. This was worth migrating to. Roughly at the same time I moved my site to the Amazon cloud for static hosting. Static website hosting requires resources to be pre-generated, but provides high availability at low cost. This can apply to the entire web site or to specific parts. The combination worked great. Fast site regeneration allowed have smooth feedback loop. HTML templates were a bit uncomfortable and limiting, but that is the nature of html/template in go. Custom Node.JS + React.JS In 2014, while working on front-end logic for 👍 HappyPancake , I discovered Node.js and React.JS. Node.js ecosystem is totally different from ecosystems of golang, .NET, Erlang or Haskell. It is build upon JavaScript, which is probably the most widely used language out there. There are a lot of bad features in the language and poor quality projects written in it. Yet, if you stick with the good parts of the language, nice packages and align your design with the ecosystem, you would get a brilliant set of tools for front-end development. I also loved templating capabilities of React.JS - a JavaScript Library for building web interfaces. It has a convenient way of expressing composable User Interface components in a single file. Here is an example of a simple page listing all posts within a tag: render: function () { var site = this.props.site; var title = \"Articles in \" + this.props.tag; var list = []; var posts = site.tags[this.props.tag]; lists.sortLatestFirst(posts).forEach(function(p) { list.push( <li key= {p.url} > <Permalink post= {p} /> </li> ); }); return ( <PostLayout title= {title} site= {this.props.site} nav= \"archive\" > <article> <ul> {list} </ul> </article> </PostLayout> ); } So I decided to replace Hugo with a hand-written static HTML generator, using markdown for content and React.JS with JSX for HTML. I purposefully misused ReactJS in this project for the benefit of the syntax. 
Normally ReactJS shines when used with a Flux-based architecture for building fast and responsive web applications. December 3 - New Version Deployed This week I pushed a new version of the web-site to production. It took roughly 2 weeks of working in spare time to replace Hugo with my own version. I simply picked existing node.js packages, wired them together and then rewrote the Hugo HTML templates as ReactJS components. This was also an opportunity to improve the web design. React has a renderToStaticMarkup method which produces raw HTML without the attributes needed for the DOM magic. Here are some of the features that came for free with this approach. Fast generation Hugo generated HTML for my website in less than a second. It used all 4 cores of my MacBook Air. Golang works well for building utilities and servers which use multiple CPU cores efficiently. Node.js processes, on the other hand, run on a single core by default. The build takes just 4 seconds, which came as a surprise. I run some brutal HTML validation and regex replacements on old posts. Without this logic, the build would be faster. In-memory processing HTML processing happens in memory. At the start, node.js loads all my source files into a virtual file system. All further operations happen there. The mock-fs package provides the virtual file system capabilities. At the end, I either dump the /build folder into the real filesystem for sync with Amazon S3 or launch a simple HTTP server that behaves exactly like Amazon. Since regeneration is fast, and it doesn't write thousands of files to my SSD, I can watch the input directory for changes and rebuild on every save. Continuous compilation makes the writing flow more convenient. No vendor lock-in If, at some point, I decide to move away from Amazon, it would require little effort. Nginx with a simple configuration could host the entire website in exactly the same way. Code snippets are compiled I added syntax for extracting named code snippets from source files and injecting them into the HTML. They are auto-indented and then highlighted with highlight.js . Obviously, the build script lints, compiles and tests the source files, ensuring that they have consistent quality. This process is fast for golang, which is one of the reasons I started using it for new snippets. HTML-specific processing Node.js has an ecosystem for dealing with web resources. I use it for: HTML minification CSS linting and minification Atom feed generation Markdown rendering Semantic HTML5 New layouts use semantic HTML5 elements like aside , header , footer or section . Not only does this help with the design, it also makes the website friendly to search engines. Publishing Features There are also some miscellaneous features related to publishing. They were either easy to add or came for free. The build process extracts all TODO comments across all content, updating a simple org-mode outline file. I can easily see what needs to be done next. There are some broken links in my old content; the build process finds them and lists them by priority. The current UI layout forces new content to be book-ready. In fact, new content could be converted from the sources into O'Reilly AsciiDoctor format or transformed to Epub directly. Some HTML templates display the most relevant articles to read after the content. If relevant articles aren't provided manually, the build process computes them (sketched below). Org-mode is an Emacs extension for managing outlines, tasks, documents and agendas. It was the initial reason why I switched to Emacs for all my work. 
I also introduced additional constraints for the new prose. For example, all essays should have a central idea that they focus on (the build will fail if that property is not provided). This idea is then used in the description HTML header or in the tweet summary, should you click the Discuss button by the post header.

if (form === \"essay\") {
  if (idea === undefined) {
    throw new Error(\"Essays MUST specify an idea they focus on.\");
  }
} else if (form === \"bio\") {
  /* ... */
}

Different forms of prose have different constraints. To be continued... The blog engine got another rewrite, to Python, in 2020."}
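The related-articles computation mentioned above can be as simple as scoring tag overlap between posts. A minimal sketch in Python (the language of the later rewrite); the Post structure and compute_related are illustrative assumptions, not the site's actual build code:

from dataclasses import dataclass, field

@dataclass
class Post:
    url: str
    tags: set = field(default_factory=set)
    related: list = field(default_factory=list)  # manual overrides win

def compute_related(posts, limit=3):
    """Rank other posts by shared-tag count; keep manually provided lists."""
    related = {}
    for post in posts:
        if post.related:  # a manual list takes priority over the heuristic
            related[post.url] = post.related
            continue
        scored = sorted(
            (p for p in posts if p is not post),
            key=lambda p: len(post.tags & p.tags),
            reverse=True,
        )
        # drop candidates that share no tags at all
        related[post.url] = [p.url for p in scored[:limit] if post.tags & p.tags]
    return related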
{"url":"/ai-assistants-course","title":"2024-ai-assistants-course-landing","text":"2024-ai-assistants-course-landing Курс в записи “LLM под капотом: выбираем эффективные технические решения для AI-ассистентов” О чем этот курс · Для кого? · Курс будет полезен · Кому курс не подойдет? · Структура · Комьюнити курса · Формат курса · Стоимость · Вопросы и ответы · Купить Update: a revised english version of this course was published in April 2026: building-ai-assistants-course . О чем этот курс Этот курс о том, как обоснованно выбирать эффективно реализуемые технические решения продуктовых задач на базе LLM в различных доменных областях так, чтобы они: давали стабильное качество; были контролируемыми и тестируемыми; развивались без деградации при расширении функциональности. Главная проблема большинства LLM-ассистентов в продакшене - качество часто получается нестабильным и плохо управляемым. Появляются галлюцинации, рассыпается формат, ответы становятся непредсказуемыми. В курсе мы разберем, почему это происходит, и как строить решения, которые работают надежно. Я покажу вам подход к проектированию и диагностике LLM-систем на практических кейсах внедрений. Отдельный акцент курса — SGR (Schema-Guided Reasoning) : подход, который помогает управлять рассуждением модели через структурированный вывод, чеклисты и этапы самопроверки. Это повышает предсказуемость, снижает «творческую самодеятельность» модели и упрощает автоматическую проверку результата. Методология обучения основана на моем подходе в консалтинге и кейсах успешных внедрений AI. Купить курс Для кого? Этот курс для тех, кто уже разрабатывает продукты с LLM самостоятельно или в составе команды, и сталкивался с типовыми вопросами: почему качество нестабильно и «то работает, то нет»; как перестать лечить всё промптами; какая модель лучше всего подойдет для конкретной задачи; как сохранить качество при добавлении новых возможностей. Мы будем учиться проектировать решения для продуктов с LLM под капотом так, чтобы качество было воспроизводимым: снижать галлюцинации через структуру и проверки, выбирать архитектуру под задачу и выстраивать процесс улучшений через набор паттернов, которые делают поведение модели предсказуемым и пригодным для продакшена. Вместо того чтобы переизобретать решения с нуля, вы получите библиотеку проверенных паттернов и подходов, которые можно сразу применять и адаптировать под свою задачу - экономя бюджет и время. Курс будет полезен: инженерам - увидите разбор диагностики качества на реальном сценарии, получите набор практик для управляемых пайплайнов, включая Schema-guided Reasoning; техлидам / CTO - получите набор паттернов, из которых складываются устойчивые архитектуры, разберетесь, как масштабировать решения без сюрпризов; продактам - чтобы выбирать жизнеспособные LLM-сценарии, задавать измеримые критерии качества и формулировать требования так, чтобы команда предсказуемо доводила фичи до продакшена; фаундерам - получите ориентиры по выбору решений на MVP и при развитии продукта: где обычно возникают проблемы качества, как закладывать устойчивую траекторию развития и какие паттерны дают быстрый прогресс. Непосредственно опыт программирования для курса не обязателен, но понимание принципов разработки важно. Кому курс не подойдет? Тем, кто только начинает и пока не строил никаких LLM-решений. Тем, кого интересуют ассистенты исключительно на локальных моделях (в курсе акцент на инженерных принципах и паттернах, применимых везде, но это не курс по локальной инфраструктуре). 
Тем, кто ожидает туториал «как подключить LangChain/LlamaIndex и индексировать документы». Курс не про фреймворки, а про архитектурные решения, качество и контроль информационного потока. Структура курса Модуль 1: Основы Разбираем типичный сценарий «ассистент по документам», воспроизводим проблемы и учимся находить основные причины ограничений. Описание структуры курса Путь к галлюцинациям - воспроизводим проблему RAG-ов на годовых отчетах компаний Как по годовым отчетам компаний отвечают лучшие RAG-и и в чем их ограничения? Минимальное воспроизведение проблемы в один промпт LogProbs - инструмент анализа галлюцинаций Паттерны галлюцинаций LLM Первопричины проблем в RAG-ах Как избежать тупика? Приходим к простому решению Модуль 2: Кейсы и паттерны Разбираем повторяющиеся архитектурные паттерны из кейсов успешных внедрений проектов c AI. Prompt и Query Expansion Prompt 2 Knowledge Base Learn from Feedback Searching for data Sandbox и практические задания Router - когда нужна специализация Instruction Distillation Dedicated Agents Workflow Human in the Loop Structured Data Extraction и визуализация ошибок REPL (Self-review) Checklists и Prompt Caching Structured Outputs & Custom CoT - Deep Dive Для каждого паттерна я показываю на кейсах, как формулируется задача, какие ограничения проявляются в реальной системе, где именно появляются ошибки качества, и какие решения дают устойчивый эффект в продакшене. Комьюнити курса После покупки вы получаете персональный инвайт в закрытое сообщество курса. Это рабочее пространство, где можно задавать вопросы по материалам курса, разбирать свои кейсы, обсуждать инженерные и продуктовые решения. За время жизни курса в сообществе собрались практики, которые строят и внедряют LLM-решения: инженеры и техлиды, CTO, фаундеры, руководители DS/ML-команд, директора Gen-AI направлений. Опыт - из разных контекстов: финтех и банки, HR-tech, B2B SaaS, контакт-центры и sales-аналитика, медтех, enterprise-автоматизация, документные пайплайны, knowledge-боты по внутренней документации, VC-аналитика. Формат курса Формат : видеолекции, разделенные по темам, с навигацией; дополнительные разборы, практические примеры и иллюстрации. Длительность видеолекций : около 4 часов. Время прохождения курса : self-paced; для вдумчивого изучения требуется в среднем неделя времени. Язык : русский. Доступ к курсу : насовсем, включая обновления материала. Оплата : только зарубежной картой. Доступ к комьюнити курса : персональный инвайт в Телеграм-чат, где можно будет задавать вопросы мне и общаться с другими специалистами, проходящими курс. Купить курс Стоимость Курс можно оплатить любой карточкой, которая работает зарубежом и принимается Stripe. Закрывающие документы генерируются при оплате и придут вам на указанную электронную почту. Компании при оплате могут указать свой billing address и EU VAT ID. Принимающее оплату юрлицо зарегистрировано в EU. Личный доступ: 1 место на курсе Курс в записи с обратной связью в чате поддержки: €258.00 (VAT included). Командный доступ: 5 мест на курсе для команды из одной организации Курс в записи с обратной связью в чате поддержки: EUR €1290.00 (VAT included). Вопросы и ответы Я работаю в Европе / США / РФ, подойдет ли мне этот курс? Да, принципы и паттерны решений основаны на статистике успешных внедрений кейсов AI в разных странах и отраслях. На курсе мы разберем как повторяющиеся архитектурные паттерны, так и особенности их применения в разных кейсах. Что такое паттерны Case/Solution? 
Я проводил AI Research, в ходе которого пообщался с десятками компаний из разных областей. Это позволило мне выявить паттерны проблем и решений, в которых можно применить LLM в разных доменах на различных задачах. Об этих паттернах я рассказываю на курсе, и показываю на реальных примерах, как пользоваться библиотекой паттернов применения LLM для поиска оптимально работающих решений. Вместо того, чтобы придумывать (и тестировать!) свои решения, можно применить Pattern Matching и переиспользовать проверенный опыт других. Нужен ли опыт программирования? Непосредственно писать код самому не обязательно. Курс ориентирован на проектирование и принятие решений. Инженеры смогут применять материал сразу в коде. Есть ли практическая часть? Курс лекционный: видео в записи и методические материалы. Я показываю практические кейсы, разбираю решения для них и даю инструментарий для выбора оптимального решения. Используя библиотеку паттернов, о которой я рассказываю на курсе, вы сможете научиться делать это самостоятельно. В курсе есть два практических задания - если вы не пишете код, можете смело пропускать их. Можно ли оплатить участие другого человека? Да, только обязательно укажите его емейл при оплате: доступ на курс и в чат поддержки будет персональным. Можно ли оплатить курс криптовалютой, PayPal или переводом на карточку? К сожалению, нет. Оплата возможна только международными картами и средствами платежа, которые принимает Stripe. Остались вопросы? Пишите мне в телеграм или на почту. Купить курс Купить можно здесь: https://labs.abdullin.com/courses/ai-assistants-ru?a=unlock Курс расположен на моей площадке Abdullin Labs, перед покупкой нужно будет завести там аккаунт для доступа к курсу."}
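Schema-Guided Reasoning, as described above, is easiest to grasp from a schema. A minimal illustrative sketch in Python with Pydantic; the field names and checklist content are hypothetical, not the course's actual schemas:

from typing import Literal
from pydantic import BaseModel, Field

class ChecklistItem(BaseModel):
    question: str   # e.g. "Is every figure quoted verbatim from the source?"
    passed: bool

class AssistantAnswer(BaseModel):
    # the schema forces the model to reason along a predefined plan
    relevant_quotes: list[str] = Field(description="verbatim quotes from the source documents")
    reasoning_steps: list[str] = Field(description="short, ordered reasoning steps")
    checklist: list[ChecklistItem]  # self-review before the final answer
    confidence: Literal["high", "medium", "low"]
    final_answer: str

Any structured-output API can then be pointed at AssistantAnswer, turning the reasoning steps and the checklist into machine-checkable fields instead of free-form text.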
{"url":"/ai-coding/kata-1/","title":"AI+Coding Kata","text":"AI+Coding Kata AI in Coding helps a lot with high-level tasks like prototyping, reasoning and finding bugs. Please use your favourite tools (no limitations) to implement as much of this spec as possible in a language of your choice. This should be a parser that can parse any document in this spec. Imagine that your team will have to support this code for a few years, so you want to do a thorough job here. There are no limits here. If you spot something unusual - use your best judgement. BizDocumentAI Spec Let's define a simple document format that could describe a contract, procedure, or any other business document in a structured way. It may be used to load this business data into AI Assistants (like in Enterprise RAG Challenge ). We’ll work with the documents. Our documents will consist of blocks . A block is a logical piece of text (like a paragraph). It can optionally have a head , number , and body . A block’s body can contain: Another block Text A list A dictionary Blocks can contain heterogeneous content—texts, other blocks, dictionaries, etc. Lists can contain only similar block items that also have a number . Document Layout The document below describes a simple text format that can be deterministically parsed into JSON objects. This document is also a test suite ! Code admonitions always come in pairs: first input and then json . When the parser is implemented, parsed input should always produce output that is structurally similar to the expected JSON. The headline before the code blocks is the name of the text. Python Data Structures Below is an example of how you might structure your data models in Python using Pydantic: from typing import List , Optional , Union , Dict , Literal from pydantic import BaseModel , Field # This type alias helps with readability and forward references. ContentNode = Union [ str , \"Block\" , \"ListBlock\" , \"Dictionary\" ] class Dictionary ( BaseModel ): \"\"\" A distinct dictionary structure for key-value pairs. \"\"\" kind : Literal [ \"dict\" ] items : Dict [ str , str ] = Field ( default_factory = dict ) class Block ( BaseModel ): \"\"\" A general-purpose container for a 'section' or item. - 'number' can store a section number (e.g., \"5\", \"5.1\") if applicable. - 'head' is an optional heading for the block. - 'body' can hold any mix of strings, sub-blocks, dictionaries, or lists. \"\"\" kind : Literal [ \"block\" ] number : Optional [ str ] = None head : Optional [ str ] = None body : List [ ContentNode ] = Field ( default_factory = list ) class ListBlock ( BaseModel ): \"\"\" A container for a list of items, each item being a 'Block'. \"\"\" kind : Literal [ \"list\" ] items : List [ Block ] = Field ( default_factory = list ) # Important for forward references within union types Block . model_rebuild () Specifications Empty text Empty text results in an empty document block. Input : (there is no content) JSON : { \"kind\" : \"block\" } Body Plain text goes into the block body straight away. Different paragraphs are separated by new lines. Input : First paragraph. Second paragraph. JSON : { \"kind\" : \"block\" , \"body\" : [ \"First paragraph.\" , \"Second paragraph.\" ] } Note that we strip and skip empty lines! Input : First paragraph. Second paragraph. (An empty line in between) JSON : { \"kind\" : \"block\" , \"body\" : [ \"First paragraph.\" , \"Second paragraph.\" ] } Head Text marked with <head> goes directly into the head of the current block. 
Input : <head> Test Document </head> Content JSON : { \"kind\" : \"block\" , \"head\" : \"Test Document\" , \"body\" : [ \"Content\" ] } Blocks You've seen that the document is parsed into a root block . But everything is a block, and blocks can be nested explicitly: Input : <head> AI Coding Kata </head> Let's get started with the kata <block> <head> Preface </head> Here is a little story </block> JSON : { \"kind\" : \"block\" , \"head\" : \"AI Coding Kata\" , \"body\" : [ \"Let's get started with the kata\" , { \"kind\" : \"block\" , \"head\" : \"Preface\" , \"body\" : [ \"Here is a little story\" ] } ] } Dictionaries Dictionaries are used to capture key-value pairs. By default, they are separated by : . Input : <dict sep= \":\" > Key One: Value One Key Two: Value Two Key Three: Value Three </dict> JSON : { \"kind\" : \"block\" , \"body\" : [ { \"kind\" : \"dict\" , \"items\" : { \"Key One\" : \"Value One\" , \"Key Two\" : \"Value Two\" , \"Key Three\" : \"Value Three\" } } ] } We can also have a non-standard separator and empty values: Input : <dict sep= \"-\" > Title - AI Coding - for TAT Kata Number - </dict> JSON : { \"kind\" : \"block\" , \"body\" : [ { \"kind\" : \"dict\" , \"items\" : { \"Title\" : \"AI Coding - for TAT\" , \"Kata Number\" : \"\" } } ] } Lists Lists are very important! By default, each non-empty line is a list item. They go inside the root block. There are multiple kinds: . for ordered lists that are dot-separated * for bulleted lists Note that the list item’s text goes into head and the item number goes into number . Ordered Lists Input : <list kind= \".\" > 1. First 2. Second </list> JSON : { \"kind\" : \"block\" , \"body\" : [ { \"kind\" : \"list\" , \"items\" : [ { \"kind\" : \"block\" , \"number\" : \"1.\" , \"head\" : \"First\" }, { \"kind\" : \"block\" , \"number\" : \"2.\" , \"head\" : \"Second\" } ] } ] } As a convenience, nested lists are automatically detected: Input : <list kind= \".\" > 1. First 2. Second 2.1. Subitem 1 2.2. Subitem 2 </list> JSON : { \"kind\" : \"block\" , \"body\" : [ { \"kind\" : \"list\" , \"items\" : [ { \"kind\" : \"block\" , \"number\" : \"1.\" , \"head\" : \"First\" }, { \"kind\" : \"block\" , \"number\" : \"2.\" , \"head\" : \"Second\" , \"body\" : [ { \"kind\" : \"list\" , \"items\" : [ { \"kind\" : \"block\" , \"number\" : \"2.1.\" , \"head\" : \"Subitem 1\" }, { \"kind\" : \"block\" , \"number\" : \"2.2.\" , \"head\" : \"Subitem 2\" } ] } ] } ] } ] } Unordered lists We can have unordered lists too: Input : <list kind= \"*\" > • First • Second • Third </list> JSON : { \"kind\" : \"block\" , \"body\" : [ { \"kind\" : \"list\" , \"items\" : [ { \"kind\" : \"block\" , \"number\" : \"•\" , \"head\" : \"First\" }, { \"kind\" : \"block\" , \"number\" : \"•\" , \"head\" : \"Second\" }, { \"kind\" : \"block\" , \"number\" : \"•\" , \"head\" : \"Third\" } ] } ] } And nesting can be done with \"o\" : Input : <list kind= \"*\" > • First o Subitem • Second • Third </list> JSON : { \"kind\" : \"block\" , \"body\" : [ { \"kind\" : \"list\" , \"items\" : [ { \"kind\" : \"block\" , \"number\" : \"•\" , \"head\" : \"First\" , \"body\" : [ { \"kind\" : \"list\" , \"items\" : [ { \"kind\" : \"block\" , \"number\" : \"o\" , \"head\" : \"Subitem\" } ] } ] }, { \"kind\" : \"block\" , \"number\" : \"•\" , \"head\" : \"Second\" }, { \"kind\" : \"block\" , \"number\" : \"•\" , \"head\" : \"Third\" } ] } ] } Mixed lists We can mix lists, but we need to designate different types separately with tags. Input : <list kind= \".\" > 1. Beginning 2. 
Main 2.1. Subsection <list kind= \"*\" > * Bullet 1 * Bullet 2 </list> 3. Ending </list> JSON : { \"kind\" : \"block\" , \"body\" : [ { \"kind\" : \"list\" , \"items\" : [ { \"kind\" : \"block\" , \"number\" : \"1.\" , \"head\" : \"Beginning\" }, { \"kind\" : \"block\" , \"number\" : \"2.\" , \"head\" : \"Main\" , \"body\" : [ { \"kind\" : \"list\" , \"items\" : [ { \"kind\" : \"block\" , \"number\" : \"*\" , \"head\" : \"Bullet 1\" }, { \"kind\" : \"block\" , \"number\" : \"*\" , \"head\" : \"Bullet 2\" } ] } ] }, { \"kind\" : \"block\" , \"number\" : \"3.\" , \"head\" : \"Ending\" } ] } ] } Lists with content Lists can also have additional content. If something in the current list doesn't match the prefix, then it is treated as a block body : Input : <list kind= \".\" > 1. First First body 2. Second Some more text <dict sep= \":\" > Key: Value Another Key: Another Value </dict> </list> JSON : { \"kind\" : \"block\" , \"body\" : [ { \"kind\" : \"list\" , \"items\" : [ { \"kind\" : \"block\" , \"number\" : \"1.\" , \"head\" : \"First\" , \"body\" : [ \"First body\" ] }, { \"kind\" : \"block\" , \"number\" : \"2.\" , \"head\" : \"Second\" , \"body\" : [ \"Some more text\" , { \"kind\" : \"dict\" , \"items\" : { \"Key\" : \"Value\" , \"Another Key\" : \"Another Value\" } } ] } ] } ] }"}
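To make the spec concrete: once the Pydantic models above are defined, every input/json pair can be turned into a test. A minimal sketch; parse_document is a hypothetical name for the parser entry point you would implement:

# validate one expected JSON from the spec against the models above
expected = {
    "kind": "block",
    "head": "AI Coding Kata",
    "body": [
        "Let's get started with the kata",
        {"kind": "block", "head": "Preface", "body": ["Here is a little story"]},
    ],
}

doc = Block.model_validate(expected)  # Pydantic resolves the nested union
assert doc.head == "AI Coding Kata"
assert isinstance(doc.body[1], Block) and doc.body[1].head == "Preface"

# a real test suite would walk every input/json pair in the spec:
# assert parse_document(input_text).model_dump(exclude_none=True) == expected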
{"url":"/ai-research/","title":"ai-research","text":"ai-research Call for Research: Identifying patterns for AI solutions in business Duration : 45-minute focused session Your input : highlight your key business challenges and review potential AI applications together My role : provide expert AI insights and potential solutions Outcome : collaborative exploration to identify practical AI applications for business Let's Explore AI Innovations Together Throughout my career, I have been involved in various domains from warehouse management to social platforms to international logistics. This exposure allows me to understand business problems across domains. In the past year, while working as a consultant, I’ve been gaining more and more understanding on how to power up businesses, solving their key challenges with the benefits that AI has to offer. If you want to read more about me and my expertise: this is my website and my newsletter where I from time to time publish my AI-driven cases and solutions. The Core of Our Research My current research revolves around the process of AI adoption in various business sectors. The goal is to identify patterns of opportunities for custom AI solutions. This involves understanding specific industry challenges, potential AI applications, and the impact of these technologies on existing workflows. Is This Research Right for You? This initiative is designed for decision-makers who are actively shaping the IT and AI landscape of their organizations. Ideal collaborators are those in roles such as Senior Product Managers, CTOs, CEOs, Founders, or other high-level positions where strategic decisions about technology are made. If you're someone who can drive change and implement innovative IT solutions in your business, your insights and participation would be highly valuable in this research. I'm looking for thought leaders eager to explore AI's potential and impact at the top levels of business decision-making. I focus on the cases of AI adoption in EU and USA businesses. Let's Collaborate on AI Research I'm looking for businesses interested in exploring AI possibilities within their domains. Our collaboration will include a focused 45-minute call to exchange thoughts on potential AI applications in your industry. I'll share insights from similar cases, discuss common pitfalls, and explore potential shortcuts that can potentially be applicable as your solutions. My approach is about partnership and mutual learning – I believe that the best way to understand and leverage AI is by sharing experiences and ideas. Let's explore the possibilities of AI together. Your Unique Perspective In return for sharing my insights, I'm eager to hear about the specific challenges and AI adoption hurdles your business might be facing. Your input is invaluable in shaping the direction of this research. If your case is part of a recurring pattern observed at different companies, I might use it for a generalized analysis in my posts or newsletter articles. Client names and sensitive details are always kept confidential, ensuring that your proprietary information remains secure. Ready to Dive into AI Exploration? If you believe we could work well together, please complete the form below. I will review each submission, then reach out to discuss more and plan a joint session. Expect a reply from me within a few days. Form is loading…"}
{"url":"/analyze-caddy-logs-with-clickhouse/","title":"Analyze website logs with clickhouse-local","text":"Analyze website logs with clickhouse-local Lean web analytics (see Lean web analytics ) has been in place on my website for a couple of days. It doesn't share data with third parties, doesn't track individual users, and is treated nicely by all adblockers. It tries to get away by: running the website as a single-page application annotating each normal request with page number and page time. Let's see if we can get any useful insights out of the limited dataset. It would be a waste of time to move forward without proving that the approach works. Fail fast or succeed fast. Given the output log in Caddy structured log format, what is the shortest path to experiment with the data? We need to: convert JSON to some table format; use a fast database engine capable of SQL, because nothing beats SQL; plot results on a chart. Let's skip the \" production-worthy \" setup for the time being. No streaming and real-time updates. Just batch processing and a few utilities: jq , Clickhouse, and Jupyter notebooks to connect these two. Tools Here is a list of tools that I always have installed on my dev machine for cases like that: jq - command-line utility for JSON data Jupyter - interactive development environment for code and data Clickhouse - fast column-oriented database. It also has a clickhouse-local utility for manipulating data. Along the way, we'll use a few \"must-have\" Python libraries: pandas - data analysis library matplotlib - visualization with Python plotly - another graphing library for Python Workflow First, use jq to convert a structured log.json to a clean.tsv : FORMAT = \"\"\"[ .ts, .request.method, .request.uri, .request.headers.Referer[0?], .request.headers[ \\\\ \"X-Page-Num \\\\ \"][0?], .request.headers[ \\\\ \"X-Page-Sec \\\\ \"][0?] ] | @tsv \"\"\" ! jq -- raw - output - r - c \"$FORMAT\" log . json > clean . tsv Note: this is a python code running in Jupyter. Lines that start with ! are executed as shell commands by the environment. Then, use Clickhouse in local mode to load clean.tsv , run a query, and print the results. For example, to display TOP 10 visited pages: SCHEMA = \"\"\" CREATE TABLE table ( ts Float32, method Enum8('GET' = 1, 'HEAD' = 2, 'POST'=3), uri String, ref String, num UInt8, sec UInt8 ) ENGINE = File(TSV, 'clean.tsv'); \"\"\" QUERY = \"\"\" SELECT uri, COUNT(uri) AS visits FROM table WHERE endsWith(uri,'/') GROUP BY uri ORDER BY visits DESC LIMIT 10 FORMAT Pretty \"\"\" ! clickhouse - local -- query \"$SCHEMA;$QUERY\" The output would be like that (thanks to FORMAT Pretty ): ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┃ uri ┃ visits ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ │ /beautiful-tech-debt/ │ 1337 │ ├───────────────────────────────────────────┼────────┤ │ / │ 650 │ ├───────────────────────────────────────────┼────────┤ │ /lean-web-analytics/ │ 342 │ ├───────────────────────────────────────────┼────────┤ │ /designing-privacy-first-analytics/ │ 147 │ ├───────────────────────────────────────────┼────────┤ │ /post/dddd-cqrs-and-other-enterprise-... 
│ 46 │ ├───────────────────────────────────────────┼────────┤ │ /about-me/ │ 46 │ ├───────────────────────────────────────────┼────────┤ │ /my-productivity-system/ │ 42 │ ├───────────────────────────────────────────┼────────┤ │ /post/when-not-to-use-cqrs/ │ 41 │ ├───────────────────────────────────────────┼────────┤ │ /cqrs/ │ 40 │ ├───────────────────────────────────────────┼────────┤ │ /archive/ │ 37 │ └───────────────────────────────────────────┴────────┘ To render a chart, save query results as a file that matplotlib will render: QUERY = \"\"\" SELECT uri, COUNT(uri) AS visits FROM table WHERE endsWith(uri,'/') GROUP BY uri ORDER BY visits DESC LIMIT 10 FORMAT CSVWithNames; DROP TABLE table \"\"\" data = ! clickhouse - local -- query \"$SCHEMA;$QUERY\" > tmp . csv df = pd . read_csv ( 'tmp.csv' ) plt . pie ( df [ 'visits' ], labels = df [ 'uri' ]) plt . title ( \"TOP 10 pages\" ) plt . show () This should render something like that: And here is how it looks all together in Jupyter interface: The amazing part of clickhouse-local - it runs a full column database engine in a single command-line request . You could use Clickhouse queries to build advanced transformations on top of the plain-text files. For example, to display daily visits to the most popular pages: CREATE TEMPORARY TABLE top_pages AS SELECT uri FROM table WHERE endsWith ( uri , '/' ) GROUP BY uri ORDER BY COUNT ( uri ) DESC LIMIT 9 ; SELECT date_trunc ( 'day' , toDateTime ( ts )) as date , uri as page , COUNT ( * ) as visits FROM table WHERE uri in ( SELECT uri from top_pages ) GROUP BY date , uri FORMAT CSVWithNames Then pivot the results with pandas and display as area chart: df = pd . read_csv ( 'tmp.csv' , parse_dates = [ 'date' ]) . set_index ( 'date' ) dfp = df . pivot_table ( index = df . index , columns = 'page' , values = 'visits' , aggfunc = 'sum' ) dfp . plot . area ( figsize = ( 12 , 6 )) plt . title ( \"Daily visits to most popular pages\" ) plt . show () To create a sankey diagram, use plotly library: QUERY = \"\"\" SELECT ref as src, uri as trg, num, COUNT(*) as count FROM table WHERE num > 0 GROUP BY ref, uri, num FORMAT CSVWithNames \"\"\" ! clickhouse - local -- query \"$SCHEMA;$QUERY\" > tmp . csv df = pd . read_csv ( 'tmp.csv' ) labels , sources , targets , values = {},[],[],[] # brutally simple processing below, will not scale def index ( s ): if s not in labels : labels [ s ] = len ( labels ) return labels [ s ] for i , row in df . iterrows (): count = row [ 'count' ] if count == 1 : continue num = row [ 'num' ] src = f \" { row [ 'src' ] } : { num - 1 } \" trg = f \"https://abdullin.com { row [ 'trg' ] } : { num } \" sources . append ( index ( src )) targets . append ( index ( trg )) values . append ( count ) fig = go . Figure ( data = [ go . Sankey ( node = dict ( pad = 15 , label = list ( labels . keys ()), color = \"#cd2e29\" ), link = dict ( source = sources , target = targets , value = values ) )]) fig . update_layout ( title_text = \"User flow\" , font_size = 12 ) fig . show () I don't have enough visits to render a nice chart (most visitors drop off after the first page), so here is what we have: Summary The approach looks good enough for my purposes. Small gotchas: some requests come without a referrer field, which messes up the sankey diagram; not much multi-step data on my website, so it is hard to tell where this stops scaling; it is hard to tell bots from the real people; perhaps should try using User-Agent field. All in all, the experience is quite nice. 
More Reading A retrospective on building a lean platform for real-time analytics: Real-Time Analytics with Go and LMDB . A retrospective about a high-load project at SkuVault (inventory management SaaS): High Availability and Performance ."}
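On the bot-detection gotcha above: a first cut could filter on the User-Agent inside the same clickhouse-local setup. A hedged sketch - it assumes the log is re-exported with an extra ua column (the jq format would need one more field); the substring list is illustrative, not exhaustive:

SCHEMA = """
CREATE TABLE table (
    ts Float32, method Enum8('GET' = 1, 'HEAD' = 2, 'POST' = 3),
    uri String, ref String, num UInt8, sec UInt8, ua String
) ENGINE = File(TSV, 'clean.tsv');
"""

QUERY = """
SELECT uri, COUNT(uri) AS visits
FROM table
WHERE endsWith(uri, '/')
  AND NOT multiSearchAnyCaseInsensitive(ua, ['bot', 'crawler', 'spider', 'curl'])
GROUP BY uri
ORDER BY visits DESC
LIMIT 10
FORMAT Pretty
"""

!clickhouse-local --query "$SCHEMA;$QUERY"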
{"url":"/archive/","title":"Archive","text":"Archive Here is a list of older blog posts that I wrote. 2016 2015 Retrospective 2014 Data Anonymization Behavioral Analytics Decomposing HappyPancake into components How do you decompose complicated system into components? How Micro-Services approach worked out in production 2013 From .NET to Erlang - Starting Work with Large Free Dating Website Stop overdoing things Separation of contract and implementation in software Many ways for an Emergent Design in a component How to produce a superb software design? How sticking to HTTP and REST makes your life simpler How I Stopped Worrying and Learned to Love the WWW and UNIX Way Implementing C# projections for Event Store Essay on Messaging and Design Patterns Replicating events to remote EventStore One Example of Using Message-Driven Design at Lokad Migrating to EventStore in Windows Azure Helping non technical people understand the data Greg's Event Store and Windows Azure Making Sense of Insane Software Requirements Use Software Design to Aid Your Development Process Domain-Driven Design and Organizational Politics Does your event store only serve one node? Aggregate Design and Security Utility Classes are Lesser Evils Essence of Domain-Driven Design - Re-Explained 2012 Being the Worst Updates Rule of Time Limiting Recommended Reading on Big Systems Management Lifehacks Analysis Paralysis Design Obsessions Design Obsessions Specification Testing For Event Sourcing Using Redis for Event Sourcing and much more Business Amnesia Lokad Topologies Updates on Lokad.CQRS - IDDD Branch Don't Be Depressed, Be The Worst Behind the Scenes of DDD/ES Project - Salescast2 Structure of DDD+ES Solution in Visual Studio Bounded Context is a Team Working Together DDD/ES - Missing Guidance Recent Lessons Learned in Lokad.CQRS Our Ultimate Technological Challenge Technology Demons Essential Reading on Big Data and Persistence Design Observations on Big Data for Retail DDD/CQRS Challenge - Integrating Distributed Systems DDD Summit 2012 Summary - Move Forward by Discarding Complex Tech Processing Big Data in Cloud à la Lokad DDD - Evolving Business Processes a la Lokad DDD From Reality to Implementation Software War Starts with a Map, Context Map Bird's-eye view of a Distributed System - Context Map Anatomy of Distributed System à la Lokad Get Real and Move Forward Building Blocks in CQRS world à la Lokad New Case Study on CQRSGuide.com Less is More Get Better Understanding Importance of Execution How to Deal with Unexpected Problems When NOT to use CQRS? 
Reading List on Big Data Getting Rid of CQRS View Rebuilds Make Code Explicit and Stupid People Don't Think in Tables Handling Big Data in Cloud 2011 Example of Self-documenting Unit Test with Event Sourcing Migrating Legacy Systems to Event Sourcing Handling Multiple Event Streams in Event Store Tech Layer of CQRS Systems: Pushing it Further Store and forward for Event Streams Importance of Good Domain Models Query Multiple Aggregates from Event Stream Roadtrip-2011 Lessons Learned at CQRS/ES Road-trip lessons learned at Experience with CQRS+ES/DDD in Current Project Golden Rule of Technology Vienna CQRS Notes Mindsets in development - Enterprise vs Startup Update on CQRS Tour Around xUSSR - Ukraine Self-Improvement Process Continuous Learning at Lokad Shelfcheck - Cloud Scalability Challenge for CQRS from Lokad Messages and Distributed Systems Distributed Decide-Act-Report Model Use CQRS When Constrained on Developers and Time Why Even Bother Trying Linux Keeping Mind of .NET Developer Open Handling Concurrency in Cloud Views Scalable and Simple CQRS Views in the Cloud 2010 Cloud CQRS Lifehacks From Lokad - Part 2 Troubleshooting Azure Deployments Contributing To Open Source Is So Much Better Than Taking A Class 10 Steps To Become Better .NET Developer Key CQRS Ingredient CQRS Architecture and Definitions Was Your Software Career Like This? My Current Development Environment Top 10 Reasons to do CQRS (in a PDF) What is your project like - Titanic or a street dog? Practical CQRS/Cloud Samples and an eBook: Does This Interest You? CQRS Lifehacks From Lokad Production Command Handlers without 2PC and with Various Levels of Reliability Types of Messages and Events in Reactive CQRS World Scenario-based Unit Tests for DDD with Event Sourcing The Best Way to Learn CQRS, DDD and Event Sourcing Redirect Tcp Connections in Windows Azure Space Travel and Infinitely Scalable Solutions Rx works nicely with DDD and Event Sourcing Windows Azure Storage Can Be Confusing Challenges of the Code Documentation Importance of Tooling and Statistics in CQRS World Microsoft: Lokad is Windows Azure Platform Partner of the Year Salescast - Scalable Business Intelligence on Windows Azure DDDD, CQRS and Other Enterprise Development Buzz-words 2009 Efficient Development, Deployment and Customer Feedback at Lokad Concurrency Programming is Like Thinking in 5 Dimensions Legacy Code should still be Agile Maintenance and Monitoring Infrastructure for the Enterprise Software Systems Why is Cloud Computing important for us? 2008 Parameter vs. Argument Const vs. readonly vs. static readonly in C# .NET Hunger for the integration builds Some tips on writing event handling code in C# .NET Stress testing the stability The power of simplicity You can build a house with big LEGO blocks. Can you add a nice porch? Some side thoughts on the development Long story made short - Project Report for the xLim 1 2007 Lessons Learned - Diagnostics Lessons Learned - Development Methodology"}
{"url":"/beautiful-tech-debt/","title":"Beautiful technical debt","text":"Beautiful technical debt What looks like a technical debt in a software solution, might be the most efficient way for people to deliver business value in a given situation. There is a beauty in that, if observed from a distance. Don't make paying off technical debt your mission. Use it as a guide to understand the landscape and build solutions. Take a look at that beautiful painting \"Pine Tree at Lake Side\" by Imoto Tekiho. The pine tree in front looks nothing like the straight pine trees on the right side of the picture. Yet, this lone pine tree fits the landscape - the barren rock above the water. This crooked shape is the most efficient way for nature to grow on tough terrain. It clings to a rock while withstanding the storms, getting water and nutrients. It is perfect for this environment. Software solutions are like that. They take time to grow, from an idea to a prototype to a lifetime in production. While this happens, the solutions are affected by various forces of nature: team structures, personal incentives, pandemics, or bottlenecks in company processes. These forces affect the shape. Available resources also matter. We all want to build understandable and efficient solutions. Creating a mess isn't in our nature. We prefer straight lines and angles. Nature didn't invent measuring tape - humans did. If there is some \"ugliness\" in a solution - there usually is a cause. Maybe a data scientist could not get proper access to the database, not without a temporary workaround. Maybe a software engineer had to comply with a short-lived regulation. Maybe the lead architect was poached by a competitor before he could finish converging everything towards desired architecture. All these nuances will leave marks on the final product. It is sometimes called technical debt. It is a mark of external forces that had to be dealt with for the solution to survive . There is immense value and beauty in that. Don't just try to pay the technical debt off, but embrace and dig beneath. Dig into the problems , conduct customer interviews with stakeholders, get more data, and validate solutions early. Use existing technical debt to guide your questions and product design. Read More Problem-driven design Solving the wrong problem"}
{"url":"/bitgn/real-time-analytics/","title":"Real-Time Analytics with Go and LMDB","text":"Real-Time Analytics with Go and LMDB I'd like to share a few details about a project I've been working this year: a lean platform for real-time analytics and batch reporting . Project Inception The project started as an internal tool for running ad-hoc reports on business event streams. Existing tools took 5-8 hours to download a billion of events, replay them and produce useful reports. Event replay within 5-8 hours is good enough for warming up local caches on production nodes. However, it doesn't fit well with data analysis which has to iterate through multiple theories and experiments. You will probably forget 80% of the experiment by the time the results are ready. In my spare time I started working on experimental tooling to speed up the process and figure approaches that combine iterative and incremental event processing with occasional full replays. Constraints This project had a few constraints from the start: No multi-node data processing frameworks . Although Kafka and Spark could do the job, they come with dependencies like ZooKeeper and DevOps related overhead. I needed something simple that could efficiently utilize hardware on a single machine. Encryption and compression had to be supported by the infrastructure from the start. While modern file systems could handle both aspects, in some specific cases it would be more efficient to let the application layer handle these concerns. Throughput over latency is a trade-off that allows to avoid a lot of complexity in the code. Analytical pipelines are likely to spend seconds and minutes digging through the numbers anyway, so there is no need to optimize for sub-millisecond latency for the individual events. Simplicity and reuse - an inherent constraint, since the project was initially running on a limited time budget (roughly 15 hours per month). First Version in .NET The initial implementation was mostly done in .NET, since that is the platform I was most familiar with. I evaluated .NET Core, but at the moment tooling was too fragile and important libraries were still missing from the ecosystem. Despite the promise of Linux-friendliness (which halves hosting costs), actually developing in .NET core would've blown my time budget out of the water. Implementation consisted of: Analytical event storage , optimized for throughput and supporting a single writer with multiple readers. Data processing stages responsible for incremental event import, compression, filtering and aggregation; they operated on event storage streams (similar to how Kafka is used in multi-node systems), using durable in-process database for indexes, metadata and non-stream artifacts. Full replay stages which operated on compressed event streams, feeding them to in-memory actors for anomaly detection and fine-grained analysis. Visualization and report generation was done with python (charting) and lisp (html dahboard generation) stages. Due to the nature of Lisp, I inevitably ended up with a DSL for defining tiles and dashboards. That is something worth exploring in more detail later. This overall setup was similar to what you'd do these days with Apache Kafka, Spark and Spark Streaming, but it lived on a single machine. It was able to provide up-to-date visual and numeric answers to complex questions in matter of minutes instead of the hours. This worked because \"heaviest\" data transformations were incremental, while full replays operated on optimized datasets. 
LMDB - The Transactional Heart Almost all of the storage needs were handled with LMDB , which saved a ton of time in development without sacrificing performance. Databases like SQLite, LevelDB and RocksDB were evaluated, but discarded for performance or complexity reasons (out-of-process databases were not considered). Besides, designing data models in terms of lexicographically sorted byte ranges is something I'm very comfortable with (provided the store offers true ACID guarantees). LMDB responsibilities: meta-data of the analytical event store: event chunk headers, event processing checkpoints; indexes, lookup tables, aggregated counters and mappings used by the data processing stages; final reporting artifacts: tables, lookups, time-series and meta-data. Switching to golang Some time along the way, I started porting core parts of the platform to golang for the following reasons: Golang offers lightweight concurrency primitives that work well for real-time data ingestion, aggregation, incremental and batch processing. I wanted to migrate deployments to Linux/Unix to reduce hosting costs and gain access to its ecosystem: tools and knowledge on performance optimization. Golang also works well for fast and efficient API gateways that ingest data into the system or expose resulting data to other systems. The rewrite took some time, but it was quite straightforward. There were some nice performance gains along the way. For instance, the new version of the event storage layer can read event partitions at a speed of ~600 MB per second on a modest Intel Xeon server with an NVMe SSD. This includes decryption and decompression of event data. My Windows Azure deployments were ~10x slower and way more expensive for obvious reasons. First Linux deployment The first deployment of the new version was rolled out recently. It is an analytical platform that captures events from real-time bidding auctions (header bidding and Google AdX) coming from thousands of sessions running in parallel. This data can then be used for real-time insight, experiments and long-term reporting driven by financial indicators. This deployment handles 10M impressions per day across 4 different shards . A few interesting features: real-time dashboards with statistics for CPM, revenue (including Google AdX estimate), ad size performance, bidder performance and responses, placements, user segments and sessions; aggregating and slicing data across placements, bidders, shards and experiments; running experiments (fine-grained A/B testing) and tracking their financial impact in real-time; long-term data retention and analysis which enable tracking revenue indicators per session and user lifetime, bidder behavior changes and the long-term impact of experiments. The storage and analysis layer is designed to support machine-assisted user segmentation and tuning of real-time bidding parameters within these segments. Experiments and revenue indicators provide a tight feedback loop for this process. All this is essential for publishers that want to increase their revenue in a world driven by players with large investments in automation and machine learning. 
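To make LMDB's role concrete, here is a minimal sketch of persisting a processing checkpoint from Go. I'm using the community github.com/bmatsuo/lmdb-go bindings; the binding choice, database name and key layout are my assumptions for illustration - the post doesn't name them.

package main

import (
	"encoding/binary"
	"log"
	"os"

	"github.com/bmatsuo/lmdb-go/lmdb"
)

func main() {
	// LMDB wants an existing directory for its memory-mapped files.
	if err := os.MkdirAll("meta", 0755); err != nil {
		log.Fatal(err)
	}
	env, err := lmdb.NewEnv()
	if err != nil {
		log.Fatal(err)
	}
	defer env.Close()
	env.SetMaxDBs(1)        // a single named database is enough here
	env.SetMapSize(1 << 30) // reserve a 1 GiB memory map
	if err := env.Open("meta", 0, 0644); err != nil {
		log.Fatal(err)
	}

	// Persist the checkpoint of an import stage in one transaction:
	// either the whole update lands, or none of it does.
	err = env.Update(func(txn *lmdb.Txn) error {
		dbi, err := txn.OpenDBI("checkpoints", lmdb.Create)
		if err != nil {
			return err
		}
		var offset [8]byte
		binary.BigEndian.PutUint64(offset[:], 123456) // last processed event position (made up)
		return txn.Put(dbi, []byte("import-stage"), offset[:], 0)
	})
	if err != nil {
		log.Fatal(err)
	}
}

A single Update call wraps the change in a fully ACID transaction, which is the property the data processing stages lean on: checkpoint and index updates either land together or not at all. Big-endian keys and values also keep everything lexicographically sorted, matching the byte-range data modeling style described above.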
Implementation Details The complete tech stack of this solution is rather simple: Ubuntu 16 LTS running on a dedicated server (e.g. an Intel Xeon with DDR4 ECC RAM and SSD RAID on NVMe); Nginx for API intake and SSL termination; Golang and LMDB as the data processing heart of the system; Partitioned analytical event storage with AES encryption of data-at-rest (high-throughput, single writer and multiple readers); Telegraf with InfluxDB and Grafana for real-time display; python for custom reports that don't fit the real-time UI. I find python invaluable for some data manipulation and analysis tasks these days. Its rich ecosystem (e.g. starting from matplotlib and pandas and going up to keras and tensorflow ) can save a lot of development effort. The obvious fragility and complexity side-effects can be mitigated by freezing both the dependencies and the exposed abstractions. What Next? It will be interesting to push this platform to its limits, either with data loads or with challenging ML tasks. So far, event streams running at a rate of 10-30M events per day from thousands of connected devices aren't a big challenge for a single small-ish server with Linux, golang and LMDB. Bear in mind that hardware is evolving rapidly these days. Catching up with these limits can be a challenge.
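A note on the encrypted storage mentioned in the stack above: here is a rough Go sketch of what the read path for one encrypted, compressed event chunk could look like. The chunk layout (a 16-byte IV followed by an AES-CTR encrypted gzip stream) and all file/function names are illustrative assumptions - the post does not document the actual format.

package main

import (
	"compress/gzip"
	"crypto/aes"
	"crypto/cipher"
	"io"
	"log"
	"os"
)

// decryptedChunk wraps a raw chunk reader with AES-CTR decryption and
// gzip decompression. The layout (16-byte IV, then an encrypted gzip
// stream) is hypothetical.
func decryptedChunk(r io.Reader, key []byte) (io.Reader, error) {
	iv := make([]byte, aes.BlockSize)
	if _, err := io.ReadFull(r, iv); err != nil {
		return nil, err
	}
	block, err := aes.NewCipher(key) // key must be 16, 24 or 32 bytes long
	if err != nil {
		return nil, err
	}
	plain := &cipher.StreamReader{S: cipher.NewCTR(block, iv), R: r}
	return gzip.NewReader(plain)
}

func main() {
	f, err := os.Open("part-0001.chunk") // hypothetical chunk file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	key := make([]byte, 32) // a real key would come from configuration
	r, err := decryptedChunk(f, key)
	if err != nil {
		log.Fatal(err)
	}
	n, err := io.Copy(io.Discard, r) // a real pipeline would parse events here
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("decoded %d bytes of event data", n)
}

AES in CTR mode acts as a stream cipher, so decryption composes naturally with Go's io.Reader pipeline and the sequential, throughput-oriented read pattern described earlier. A production format would also need integrity protection per chunk (e.g. an HMAC or AES-GCM), which this sketch omits.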
{"url":"/blog-migrated-to-python/","title":"Blog Migrated to Python","text":"Blog Migrated to Python Once every few years this blog is ported to a new platform. Each migration attempts to simplify writing while preserving the existing content. Here is a brief history of my blogging platforms: LiveJournal - that was my first blogging platform; WordPress - I discovered Linux and wanted to play with WordPress, like other cool kids; Squarespace - I didn't want to maintain the WordPress installation anymore; Octopress - I wanted something less expensive and more flexible than Squarespace; Hugo - I discovered go and wanted to have something faster and less fragile than Ruby site generator; Static site generator using React components - I fell in love with Node.js and React, wanted to have flexible UI components in my blog. It isn't that hard to notice the emerging pattern here: each time a new tool shows up, promising to solve some problems. The platform ultimately gets migrated to new pastures with a greener grass. All the content, including legacy urls and asset locations, is dragged along. A few years ago, I started spending more time with Python, getting to love its opinionated simplicity and poetic elegance. You can guess what happened next. Current Implementation The current iteration ( source code ) uses Python3 with Flask to render all the pages. Flask is an extremely lightweight HTTP server in Python that also implements a few building blocks like url routing, request handling and template rendering. It isn't as heavy and opinionated as Django or Ruby on Rails, but is perfect for a quick API or a simple web site. The structure of the website is determined by the content : blog posts and post series, all spread between the different folders. When the application starts, we enumerate and load all the available content files. Flask application has a generic url handler that renders the appropriate template based on content type: @app . route ( '/<path:path>' ) def url ( path : str ) : value = s . urls . get ( '/' + path , None ) if not value : return \"Not found\" , 404 if isinstance ( value , blog . Story ) : return render_template ( \"story_cover_page.html\" , site = s , story = value ) if isinstance ( value , blog . Post ) : return render_template ( \"post_page.html\" , site = s , post = value ) if isinstance ( value , str ) : return render_template ( \"redirect.html\" , url = value ) HTML layout is done with Jinja templates . I grew to like them over the past years, just like Python itself. Here is, for example, a template for the story page. It lists all posts that make up a story. { % extends \"layout.html\" % } { % from \"macros.html\" import link % } { % block title % }{{ story . title }}{ % endblock % } { % block content % } < section > < p > {{ story . description }} </ p > { % for year , group in story . _items | groupby ( 'date.year' ) | reverse % } < div > < h3 > {{ year }} </ h3 > < ul > { % for post in group | reverse % } < li > {{ link ( post , date = \"\" ) }} </ li > { % endfor % } </ ul > </ div > { % endfor % } </ section > { % endblock % } Jinja markup is more constrained than React components: you can't mix HTML markup and code with the same ease. This made some features slightly more difficult to implement (e.g. \"article tags\" and \"recommended articles\"). So I just threw that functionality away. Maintaining tag lists added mental overhead anyway. The website is static in nature . It could be rendered once and pushed to any cloud provider. 
This eliminates almost all of the maintenance pains and effort. There are no servers to maintain and keep updated. Static html generation is performed by spinning up a test Flask client, requesting all of the existing urls and saving them into files. The resulting folder is then synced to an S3 bucket that is served via the CloudFlare CDN.

cli = app.test_client()
urls = list(s.urls.keys())
urls.extend(['/', '/archive/', '/atom.xml', '/about-me/', '/404.html'])
for u in urls:
    save_url(cli, u)
for k, v in assets.items():
    shutil.copy2(v, 'build/images/' + k)

The website differs from the previous iteration by dropping a bunch of features: Article tags and recommendations. Syntax highlighting on the snippets. Complex and fine-tuned CSS/HTML layout. React.js templates (with complex inheritance). Re-implementation of the literate programming model that allowed injecting snippets from the code files. Async content processing pipelines (to speed things up). Drafts - articles that are rendered but don't show up in the index or home page. Instead of all that, we have a few functions that load content files into a big in-memory dictionary, plus a Flask site that renders that content with Jinja templates. Writing Flow The writing flow itself is evolving as well. I used to write blog posts by editing markdown files in Emacs or Vim, while using the website to preview the results. This introduced unnecessary ceremony and added friction. The git folder was always dirty. Fewer posts got published. I'm currently trying a simpler approach: write the blog posts in a dedicated markdown writer (e.g. 1Writer on an iPad). The polished article can then be added to the site codebase and published."}
{"url":"/btw/","title":"🎙️ Being the Worst","text":"🎙️ Being the Worst In a summer of 2012 Kerry Street offered to start a small podcast about learning software development. It wasn't supposed to be fancy or professionally edited, focusing instead on discovery, coding and mistakes that we were going to make. Years later, we are still pushing, thanks to Kerry and our awesome community. Later, the podcast turned out to be one of the longest and most successful projects I ever participated in. For that I will be gratefull to Kerry forever. In that podcast we started exploring things like Domain-Driven Design, Event Sourcing and CQRS. Somewhere along the way we decided to take on the domain of Getting Things Done , writing yet another task manager. Our GTD implementation was inspired by OmniFocus - an awesome task manager for OSX and iOS. This is the story of that journey."}
{"url":"/btw/2012-08-21-episode-1-the-worst-welcome/","title":"E01 - The Worst Welcome","text":"E01 - The Worst Welcome Audio apprenticeships for the aspiring software craftsman. Kerry and Rinat introduce the show and describe the plan for the road ahead. They are still figuring out how to produce and host a podcast so feel free to cut them a little slack on any episode quality issues. References: Study plan Feedback on current episode - in the comments to this episode post Discuss upcoming episodes - in trello General feedback - in UserVoice Being The Worst on twitter - @beingtheworst Kerry on twitter - @kcstreet Rinat on twitter - @abdullin Also mentioned in this episode: Be The Worst - pattern from the book, Apprenticeship Patterns Lokad.CQRS - in GitHub Book - Implementing Domain-Driven Design by Vaughn Vernon - in Safari Books IDDD book's code sample - in GitHub Greg Young twitter profile"}
{"url":"/btw/2012-08-23-episode-2-messaging-basics/","title":"E02 - Messaging Basics","text":"E02 - Messaging Basics Kerry and Rinat start with the foundational topics of messages and messaging. What is a message? Why do we use them? How do you create them in code? Their conversation references our sample code that is available on GitHub.com . References: Sample Code on GitHub ( E002 folder ) Kerry's Homework Code on GitHub ( E002 homework folder ) Feedback on current episode - in the comments to this episode post Discuss upcoming episodes - in trello General feedback - in UserVoice Being The Worst on twitter - @beingtheworst Kerry on twitter - @kcstreet Rinat on twitter - @abdullin What do you think?"}
{"url":"/btw/2012-08-27-episode-3-commanding-your-words/","title":"E03 - Commanding Your Words","text":"E03 - Commanding Your Words Kerry and Rinat discuss the importance of the spoken words we use and how they impact the messages we create in code. From there, they introduce two critical message types: commands and events. Here's the factory plan, referenced in this episode: Homework: Write a short dictionary of words that reflect the names of the commands and events which could be used to describe a day at this car factory. This is how we define the vocabulary and ubiquitous language of the context being modeled. What are some of the parameters of these words that would make up their definition? Try to write a coherent story about situations that can happen at the factory. Use the names of these commands and events that are in our domain dictionary to ensure that the spoken language has been captured properly within the story. Normal sentences that would be used as part of normal spoken language. Explain how the factory workers would react to these commands and how they would carry them out. References: Endymion poem by John Keats Sample Code and Documentation on GitHub (E003 folder) Kerry's Homework Code on GitHub (E003 folder) Feedback on current episode - in the comments to this episode post Discuss upcoming episodes - in trello General feedback - in UserVoice Being The Worst on twitter - @beingtheworst Kerry on twitter - @kcstreet Rinat on twitter - @abdullin What do you think?"}
{"url":"/btw/2012-08-31-episode-4-event-sourcing-basics/","title":"E04 - Event Sourcing Basics","text":"E04 - Event Sourcing Basics Event Sourcing Basics (a day in the car factory). Kerry and Rinat review questions about designing command and event messages and then introduce the new topic of Event Sourcing. We model one day at the Factory using event sourcing, domain language and concepts discussed in earlier episodes. Here's the car factory implementation diagram, referenced in this episode: Homework: CSharp developers: implement 2 remaining methods of Factory using event sourcing other languages: reimplement Sample E004 in your own language References: Sample Code and Documentation on GitHub (E004 folder) Kerry's Homework Code on GitHub (E004 folder) Domain-Driven Design: Tackling Complexity in the Heart of Software - book by Eric Evans on Amazon Implementing Domain-Driven Design - book by Vaughn Vernon on Amazon - with an Appendix by our very own Rinat Abdullin! Feedback on current episode - in the comments to this episode post Discuss upcoming episodes - in trello General feedback - in UserVoice Being The Worst on twitter - @beingtheworst Kerry on twitter - @kcstreet Rinat on twitter - @abdullin What do you think?"}
{"url":"/btw/2012-09-03-episode-5-aggregating-stories/","title":"E05 - Aggregating Stories","text":"E05 - Aggregating Stories Kerry and Rinat introduce the concept of aggregates and their importance to the design of distributed systems. Then, they describe another way to express the story of an aggregate, through the use of specifications and the testing of use cases. Homework: CSharp developers: Several factory behaviors are missing from sample E005 that were in E004. Implement a few of these missing behaviors (methods) and create the specifications/unit tests for them. Other languages: First, take the pseudo test/specification framework in E005's framework.cs file and reimplement it in your language. Then do the same assignment above but in your own language. Episode References: Lokad User group DDD/CQRS Group Pat Helland's - Life beyond Distributed Transactions (PDF) Behavior-driven development (BDD) - Given, When, Then Sample Source Code and Documentation: Being The Worst code samples on GitHub (E005 folder) Kerry's Homework code on GitHub (E005 folder) Feedback and Contacts: Feedback on current episode - in the comments to this episode post Discuss upcoming episodes - in trello General feedback - in UserVoice Being The Worst on twitter - @beingtheworst Kerry on twitter - @kcstreet Rinat on twitter - @abdullin What do you think?"}
{"url":"/btw/2012-09-10-episode-6-community-code-and-questions-1/","title":"E06 - Community Code and Questions 1","text":"E06 - Community Code and Questions 1 Kerry and Rinat discuss community questions about naming messages, aggregate state, and event implementation. Sample Source Code and Documentation: Being The Worst code samples on GitHub (E006 folder) Kerry's Homework code on GitHub Feedback and Contacts: Feedback on current episode - in the comments to this episode post Discuss upcoming episodes - in trello General feedback - in UserVoice Being The Worst on twitter - @beingtheworst Kerry on twitter - @kcstreet Rinat on twitter - @abdullin Other Community Links: Lokad User group DDD/CQRS Group What do you think?"}
{"url":"/btw/2012-09-14-episode-7-re-factory/","title":"E07 - Re-Factory","text":"E07 - Re-Factory Kerry reflects on his homework experiences with Rinat. Then, they discuss ways to refactor the code to accurately capture the car factory's requirements. A hint of how this ongoing refactoring will evolve us towards a more production-worthy implementation is also mentioned. Episode References: Episode 7 code samples on GitHub Kerry's Episode 7 homework code on GitHub Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-09-23-episode-8-frameworks-over-forms-is-a-design-obsession/","title":"E08 - Frameworks Over Forms Is A Design Obsession","text":"E08 - Frameworks Over Forms Is A Design Obsession Rinat covers some of his previous obsessions with certain tools and technologies [like composite UIs, Object-Relational Mappers (ORMs), and Inversion of Control (IoC) Containers] and reminds us that even the approach we have been learning is also, JUST A TOOL . The focus of your design should be on the essence of the domain that you are capturing and the problems you are trying to solve, not on specific tools or technologies. Homework: Have a look at the source code inside of the Open Source Event Store project Notice the use of various programming languages (C#, C++, JavaScript) as well as how the code reads like a story given the context of the event message persistence domain that it was designed for. Episode References: Book - Implementing Domain-Driven Design by Vaughn Vernon - in Safari Books Plato's theory of Forms Event Store by Gregory Young Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-09-28-episode-9-greg-young-has-a-major-event-in-store/","title":"E09 - Greg Young Has A Major Event In Store","text":"E09 - Greg Young Has A Major Event In Store Kerry and Rinat welcome their first guest, Gregory Young. You may know Greg from his blog, training, or CQRS fame but in this episode we discuss software design, distributed systems, and the new open source Event Store product that Greg recently released. Episode References: Event Store by Gregory Young Event Store Architectural Overview and source code on GitHub Staged event-driven architecture (SEDA) Cohesion in Computer Science Byzantine fault tolerance ØMQ - The Guide Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-10-05-episode-10-dont-break-the-chain/","title":"E10 - Don't Break The Chain","text":"E10 - Don't Break The Chain Kerry and Rinat discuss when they will wrap up the car factory domain from previous episodes and then introduce the \"Don't Break The Chain\" domain. That domain and sample project will be used for the next set of episodes and code samples. Episode References: Source Code for Productivity Project on GitHub Don't Break The Chain Topic in Lokad Discussion Group on Google Groups Don't Break The Chain article on lifehacker.com Implementation of Don't Break The Chain article on lifehacker.com Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-10-13-episode-11-specs-of-the-living-dead/","title":"E11 - Specs Of The Living Dead","text":"E11 - Specs Of The Living Dead Kerry and Rinat discuss the code updates that were made to correct the car factory sample. After that, they introduce the concepts of Value Objects and Strategic Modeling. Episode References: Episode 11 code sample on GitHub Zombies from the movie Night of the Living Dead Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-10-21-episode-12-now-serving-dsl/","title":"E12 - Now Serving DSL","text":"E12 - Now Serving DSL Kerry and Rinat introduce Application Services that host our Aggregates, the Domain Services that they use, and a code-generating DSL tool to help along the way. Episode References: Episode 12 code sample on GitHub Lokad-codedsl utility on GitHub Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-11-01-episode-13-so-uh-we-need-boundaries/","title":"E13 - So, Uh, We Need Boundaries","text":"E13 - So, Uh, We Need Boundaries A homework question takes the conversation to unplanned places. Kerry and Rinat start with clarifying Application and Domain Services, but end up discussing many things. From system design, tactical options like CQRS (Command Query Responsibility Segregation) and SOA (Service-Oriented Architecture), to an attempt to influence a presidential election. Episode References: Architectural Patterns and Styles Service-Oriented Architecture What is CQRS? Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-11-08-episode-14-back-in-the-u-s-s-car-factory-tests/","title":"E14 - Back In The U.S.S. caR Factory Tests","text":"E14 - Back In The U.S.S. caR Factory Tests Kerry and Rinat continue discussing the Car Factory sample domain from Episode 12. Then, they cover its Domain.Test project and the updated specifications and testing approach within. Episode References: Episode 14 code sample on GitHub Lokad-codedsl utility on GitHub Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-11-24-episode-15-the-bits-that-keep-on-giving/","title":"E15 - The Bits That Keep On Giving","text":"E15 - The Bits That Keep On Giving Kerry and Rinat discuss new ways to reuse and benefit from our modeling efforts. This includes using our messages to test any serializer, auto-generating diagrams of our current design, and giving our domain its own command prompt. Episode References: Episode 15 code sample on GitHub Graphviz - Graph Visualization Software The Gift That Keeps On Giving clip from the movie Christmas Vacation Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-12-03-episode-16-adding-value/","title":"E16 - Adding Value","text":"E16 - Adding Value A listener's questions lead Kerry and Rinat into the details of Value Objects. They review several examples of Value Objects and cover some pros and cons of their usage. This very important concept is commonly used to represent the nouns in the sentences of our ubiquitous language. Episode References: Episode 16 code sample on GitHub Matthew's Questions That We Discuss Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-12-07-episode-17-view-projections/","title":"E17 - View Projections","text":"E17 - View Projections Kerry and Rinat wrap-up the Car Factory sample domain with an introduction to View Projections. The Event messages that we generate represent the state of our Aggregates. Projection classes can also listen for these Event messages and generate Views to provide query results to client applications. Episode References: Episode 17 code sample on GitHub Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-12-21-episode-18-git-army-knife/","title":"E18 - Git Army Knife","text":"E18 - Git Army Knife Kerry and Rinat dig into some of the development tools and practices used at Lokad. They specifically discuss Distributed Version Control Systems (DVCS) with a focus on Git's familiar design and our usage of it. Homework: Create a free GitHub account if you don't have one Fork the official Being The Worst sample repository or Lokad.CQRS Add or clarify some comments to anything that may need clarification in the code/comments of the repository Commit your changes to your own local repo and then submit a pull request to us on GitHub to update the master repository with your changes Bonus: Port BTW samples to your own programming language and ask us to add your repo to our GitHub account to share with the rest of our community Episode References: GitHub for Windows or Mac GUI Client Git site with x-plat installers Analyzing Git history blog post by Yves Reynhout and his C# code to do it Graph Theory at Think Like (a) Git Codebase code hosting services Feedback and Finding Us What do you think?"}
{"url":"/btw/2012-12-29-episode-19-the-hermit-an-unexpected-journeyman/","title":"E19 - The Hermit: An Unexpected Journeyman","text":"E19 - The Hermit: An Unexpected Journeyman Kerry and Rinat introduce the topics to be covered and the general direction for the next set of episodes. They discuss the new domain to be implemented, as they officially launch the next expedition. Homework: Get familiar with the \"Getting Things Done(R)\" Domain and learn the most important concepts Capture behaviors and concepts that describe the actual human practices involved Layout the vocabulary and start writing stories (ex: 10 terms/4 paragraphs) Episode References: General Itinerary for Expedition 2 (i.e. what we plan to discuss and learn) Getting Things Done book by David Allen Getting started with \"Getting Things Done\" post by Merlin Mann OmniFocus, GTD, and You (pdf) - The Omni Group Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-01-07-episode-20-staring-at-a-blank-page/","title":"E20 - Staring At A Blank Page","text":"E20 - Staring At A Blank Page Kerry and Rinat discuss approaches you can use to start the domain modeling process in various situations. This helps them determine the first step to take to start designing their new domain. Episode References: Getting Things Done book by David Allen OmniFocus, GTD, and You (pdf) - The Omni Group Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-01-19-episode-21-linguistic-cartographers/","title":"E21 - Linguistic Cartographers","text":"E21 - Linguistic Cartographers Kerry and Rinat start the discussion with a Context Map that has several Bounded Contexts. They use their conversation to refine the Context Map and carve out the initial key words of the Core Domain's Ubiquitous Language. Subscribe via RSS | Subscribe for free in iTunes Episode References: Context Maps and Initial Ubiquitous Language Published Language and Shared Kernel definitions We used Twiddla in this episode to collaborate on the updated Context Map diagram Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-01-29-episode-22-biohazard-blah-blah/","title":"E22 - Biohazard Blah Blah","text":"E22 - Biohazard Blah Blah Kerry and Rinat discuss the problem space, Bounded Contexts within it, Subdomains, and the Core Domain of \"Getting Things Done\"(TM). After the DDD discussion, they dive into the initial approach to structuring the code, exploring the Published Language, and making sense of information flows. Homework: Fork our GTD repository and explore branch for Episode E22 Add a local branch to your fork of the repo (use YourName_E22 as the name of the branch) Try adding more methods, Commands, and Events according to the GTD Methodology . Can you think of a better name for \"Tenant\" as discussed in the episode? Episode References: Codebase for Episode 22 Getting Things Done Domain-Driven Design: Tackling Complexity in the Heart of Software Implementing Domain-Driven Design Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-02-10-episode-23-spaghetti-free-terminology/","title":"E23 - Spaghetti-Free Terminology","text":"E23 - Spaghetti-Free Terminology Kerry and Rinat revisit and correct some DDD terminology they used based on new learning and listener feedback. After they readdress Domain, Domain Model, Subdomains, Bounded Contexts, Published Language, Shared Kernel, and Anti-Corruption Layer, they try to relate these terms to real examples and code. Episode References: Rinat's 'Essence of Domain-Driven Design - Re-Explained' blog post GTD Context Map from E21 that is still valid Codebase for Episode 22 we looked back at Domain-Driven Design: Tackling Complexity in the Heart of Software Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-02-20-episode-24-grambos-got-a-shotgun/","title":"E24 - Grambo's Got A Shotgun","text":"E24 - Grambo's Got A Shotgun Kerry and Rinat dive into the GTD® code and discuss ways to choose words in the ubiquitous language. The new interactive shell allows them to use and shape the domain model as it evolves. In addition, an Event store is introduced to persist messages to disk, and they discuss ways to replay and view these state changes in the console. Homework: Go into the project, run the Shell, and try adding more Commands/Events and Console commands to make it work. Episode References: Codebase for Episode 24 Riak bitcask PDF Redis persistence demystified Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-03-06-episode-25-these-arent-the-projects-youre-looking-for/","title":"E25 - These Aren't The Projects You're Looking For","text":"E25 - These Aren't The Projects You're Looking For Kerry and Rinat discuss changes to the ubiquitous language, the characteristics of lists and projects, and lessons learned from using a similar productivity system. In between, they probably talk too much about some variable names and system implementation details. Homework: Try adding a DefineAction command to the solution, this should include: Adding the console \"shell\" command to enter it in the shell Adding a View that allows you to list all Actions currently inside of a Project Episode References: Codebase for Episode 25 Trusted System: David Allen on twit's Triangulation show #22 Trusted Place: David Allen at Google Star Wars: These aren't the droids you're looking for Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-03-27-episode-26-rinat-gits-committed/","title":"E26 - Rinat Gits Committed","text":"E26 - Rinat Gits Committed In this episode, there is a lot of new code to dig into. Kerry and Rinat cover the newly added features, discuss how the console relates to future UIs and assists with ongoing maintenance, dabble in Event message design, and get into the CQS pattern that was used to implement some of the Entity classes. They wrap-up with how the current message-based approach can be applied synchronously or asynchronously depending on needs, and explain why there are hints of Redis in the code repository. Episode References: btw-gtd Codebase branch for Episode 26 DDD Community site Command–query separation (CQS) Community's Java port of the BTW Expedition 1 (original) Samples Redis Event Store Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-04-12-episode-27-evolving-event-centric-infrastructure/","title":"E27 - Evolving Event Centric Infrastructure","text":"E27 - Evolving Event Centric Infrastructure Kerry and Rinat review some of the lessons learned from the deployment of Lokad.CQRS. Then, they discuss Rinat's blog post about those experiences, and some of his ideas for future improvement. This results in a sneak peek of the new event centric hosting infrastructure that will be used in their GTD sample project. Episode References: Best Infrastructure for Event Sourcing and CQRS Event Store Lucene.net Service Stack Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-04-24-episode-28-learning-lokad-cqrs/","title":"E28 - Learning Lokad CQRS","text":"E28 - Learning Lokad CQRS Kerry and Rinat dig into the Lokad CQRS project that laid the code foundation for their current projects. This open source sample provides a working slice of a production system's registration and user management subdomains that were implemented with Aggregates and Event Sourcing. While they get into the details of its multi-Aggregate implementation, they cover the communication and workflow between these aggregates, how data storage and querying is achieved, and even get into some authentication and authorization options. Episode References: Lokad CQRS ( codebase version at the moment of recording) Stact - An actor library and framework for .NET Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-05-15-episode-29-acting-like-we-get-the-message/","title":"E29 - Acting Like We Get The Message","text":"E29 - Acting Like We Get The Message Kerry and Rinat introduce the Actor in the Actor Model of Computation. They wonder if the Actor's embodiment of communication (via messaging) may simplify the way that they reason about and implement their solutions. They discuss this potential use of the Actor Model in the context of their current usage of Domain-Driven Design (DDD), Application Services, Aggregates, and Event Sourcing. Episode References: The Actor Model (everything you wanted to know, but were afraid to ask) The Actor Model (Wikipedia) Carl Hewitt's homepage Scala programming language Akka toolkit Erlang programming language Stact - An Actor library and framework for .NET Red programming language Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-06-04-episode-30-the-30-thirty-experience/","title":"E30 - The 30 Thirty Experience","text":"E30 - The 30 Thirty Experience Rinat returns home from his trip to the United States which was filled with milestones, memories, and marriage. The guys briefly catch-up on some of the insights that were gained from looking over the fence at the Java ecosystem, and how it has made more progress than .NET in specific areas. They wrap-up with an intro to the DDD Summit, and how the things learned there and along the way will shape the upcoming episodes and code. Episode References: The Scala programming Language Akka toolkit and runtime Play Web Framework for Java and Scala built on Akka Erlang programming language MassTransit - Service Bus and Stact - Actor framework helped by Chris Patterson . Paxos protocols Greg Young's open-sourced Event Store Slav - contributor to Lokad CQRS 'The Blue Book' and DDD Design Pattern summaries by Eric Evans . Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-07-05-episode-31-cloud-clearing-events/","title":"E31 - Cloud Clearing Events","text":"E31 - Cloud Clearing Events Kerry and Rinat choose the event storage to use for their project. They're using Windows Azure, so should they stick with Lokad CQRS event storage, or switch to Event Store from Event Store LLP? They discuss some of the pros and cons of their options, review some scaling approaches, and guess at what a transition may look like. Subscribe via RSS | Subscribe for free in iTunes Episode References: Event Store LLP's open-sourced Event Store product Lokad CQRS sample project Paxos protocols Raft: In Search of an Understandable Consensus Algorithm Basho Riak storage SEDA - Staged Event Driven Architecture C# infrastructure for doing eventsourcing using aggregates from Yves Reynhout Jonathan Oliver's Event storage (NEventStore) Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-07-23-episode-32-questionable-approach/","title":"E32 - Questionable Approach","text":"E32 - Questionable Approach Kerry and Rinat answer listener questions about code syntax, differences between event sourcing and relational storage, and concrete examples of Domain-Driven Design (DDD) concepts. Along the way, your questions lead them to questioning themselves and to consider an alternate approach. Episode References: Question #1 - C# dynamic syntax in sample code BTW sample code related to question #1 Question #2 - Event sourcing and relational databases Rinat's post about helping non-tech people to understand the data Rinat's post about event versioning that was mentioned during question #2 Question #3 - Concrete DDD examples Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-09-11-episode-33-client-side-components/","title":"E33 - Client-side Components","text":"E33 - Client-side Components Kerry and Rinat return from their summer break to discuss their initial explorations into client-side development. This episode, and the two episodes after it, cover a two month period in which Rinat and Kerry went off to write some new client code for their Getting Things Done(tm) sample project. They start this three-part spike with a simple WinForms client, and end it with the beginnings of their cross-platform mobile application. Episode References: BTW Sample Code for E33 on GitHub Staged event-driven architecture Model–view–controller (MVC) pattern Model View ViewModel (MVVM) pattern Event Store Feedback and Finding Us What do you think?"}
{"url":"/btw/2013-10-20-episode-34-model-view-who/","title":"E34 - Model View Who?","text":"E34 - Model View Who? This episode is the second of a three part series on client-side development. Kerry and Rinat discuss the changes that were made to the sample's WinForms client to make it easier to add new features. They review the three main contexts in their GTD sample, the benefits of passing around immutable state, and dig into graphical user interface architectures like MVC and MVVM. This sets them up for the next episode where they apply MVVM to the sample's cross-platform mobile application. Episode References: BTW Sample Code for E34 on GitHub Martin Fowler's article on GUI Architectures Free MvvmCross Training Videos compliments of @slodge - Stuart Lodge MvvmCross code on GitHub @mvvmcross on Twitter Micorsoft's Roslyn compilers Model–view–controller (MVC) pattern Model View ViewModel (MVVM) pattern Feedback and Finding Us What do you think?"}
{"url":"/btw/2014-01-09-episode-35-totally-crossed-out/","title":"E35 - Totally Crossed Out","text":"E35 - Totally Crossed Out This episode is the third of a three part series on client-side development. Kerry and Rinat discuss their initial experiment with applying the MvvmCross framework to the GTD sample's cross-platform mobile application. They review why they selected this approach, the initial code in the solution, and plans to revise their Core Domain and Published Language into reusable Portable Class Libraries (PCLs). Episode References: BTW Sample Code for E35 on GitHub Free MvvmCross Training Videos compliments of @slodge - Stuart Lodge The Tablet Show (podcast) - Stuart Lodge Builds Cross Platform Apps with MvvmCross MvvmCross code on GitHub @mvvmcross on Twitter JabbR chat room for MvvmCross Microsoft and Xamarin Partner Globally to Help You Build Great Apps PCL and .NET NuGet Libraries are now enabled for Xamarin Greg Shackles on Twitter Jon Dick on Twitter Greg and Jon's 'Gone Mobile' podcast Tomasz Cielecki 'Cheesebaron' on Twitter AngularJS Knockout with Durandal Model View ViewModel (MVVM) pattern Feedback and Finding Us What do you think?"}
{"url":"/btw/2014-03-06-episode-36-break-it-down-again/","title":"E36 - Break It Down Again","text":"E36 - Break It Down Again In this episode Kerry and Rinat discuss some of the benefits of decomposing your software into even smaller components. Some may refer to this technique as \"micro services\" or \"system-of-systems\". They end their conversation with a look at how this approach was applied to the Lokad codebase to result in a more evolvable design. Episode References: Design, Composition and Performance Implementing Micro Service Architectures - Video Breaking the monolith: Towards a system-of-systems architecture - Video Model storming - a different approach to collaborative model discovery Feedback and Finding Us What do you think?"}
{"url":"/btw/2014-03-28-episode-37-variety-is-the-spice-of-pancakes/","title":"E37 - Variety Is The Spice of Pancakes","text":"E37 - Variety Is The Spice of Pancakes Kerry and Rinat discuss some of the learning that Rinat and the HappyPancake team did to evaluate their language and tool options for their new project. Rinat makes some observations about his long history with C#, and the benefits of looking at other languages and approaches. In Episode 38, they may even discover new ways that C# can do some things that Rinat says can be hard to do using traditional C# and .NET techniques. They finish up by getting into more of the \"Microservices\" or maybe \"component-based\" approaches that Rinat has experienced recently. Episode References: Dr. Erik Meijer Videos on Functional Programming Fred George on implementing Micro-Services Micro-Services - Java, the UNIX way HappyPancake R&D (Rinat's new gig) Docker Syncspace (collaborative drawing) Java Metrics and a .NET port ServiceStack Event Store Google+ Hangouts SimCity Feedback and Finding Us What do you think?"}
{"url":"/btw/2014-04-18-episode-38-glenn-block-evolves-apis/","title":"E38 - Glenn Block Evolves APIs","text":"E38 - Glenn Block Evolves APIs This is an archive copy of BeingTheWorst podcast Episode 38 by Kerry Street and Rinat Abdullin. Check out the original site for comments, RSS and iTunes feeds. Kerry and Rinat welcome guest Glenn Block to discuss hypermedia Web APIs, and the new book he co-authored on the subject. They get into some of the differences between systems that are designed to take advantage of the constraints of the REST architecture style, with those that are not. What is hypermedia? Should you always adhere to REST constraints and use hypermedia when building a Web API? What are some ways to implement these concepts on the client and server and how do they interact? What are some common implementation mistakes? API security? What can your book help me with? UCWA? And much more. Episode References Book: Designing Evolvable Web APIs with ASP.NET – Harnessing the power of the web GitHub source code for the Web API book Roy Fielding: Architectural Styles and the Design of Network-based Software Architectures Glenn Block on twitter Hypermedia Representational state transfer (REST) Hypertext Transfer Protocol (HTTP) Mike Amundsen apiary Mark Seemann’s Hyperlinkr library on GitHub OpenRasta Lync – Microsoft’s Unified Communications Web API (UCWA) HAL – Hypertext Application Language Microsoft Azure Simple Object Access Protocol (SOAP) Feedback and Finding Us"}
{"url":"/btw/2015-01-26-event-driven-specs/","title":"Event-driven specs","text":"Event-driven specs Last week I continued working on the back-end API server for Getting Things Done task-manager. GTD domain is something we explored for some time with Kerry Street in BeingTheWorst Podcast . This project is a continuation of that journey but with the things learned recently from the 👍 HappyPancake project and DDD community. Use-Cases Specification micro-framework seems to be stable now. I wrote it from scratch, while bearing in mind the limitations and issues we discovered in previous event-driven project. That is how single API use-case looks like: func given_unchecked_task_when_check_then_event () * env . UseCase { taskId := lang . NewTaskId () return & env . UseCase { Name : \"Given new task, when PUT /task with check, then event\" , Given : spec . Events ( lang . NewTaskAdded ( newEventId (), taskId , \"ho-ho\" ), ), When : spec . PutJSON ( \"/task\" , seq . Map { \"checked\" : true , \"taskId\" : taskId , }), ThenResponse : spec . ReturnJSON ( seq . Map { \"taskId\" : taskId , \"name\" : \"ho-ho\" , \"checked\" : true , }), ThenEvents : spec . Events ( lang . NewTaskChecked ( IgnoreEventId , taskId ), ), Where : spec . Where { IgnoreEventId : \"ignore\" }, } } Such use-cases focus on a single specific task: they describe behavior of a module in a declarative way via its public contract : consumed and published events; RPC API requests and responses. As such, these are not exactly unit-tests or integration tests. They have some nice properties when applied with the event-driven design process. I'll talk about that later. Other applications for these use-cases are: printing human-readable API usage stories to impress managers, generating detailed up-to-date API documentation to impress developers, generating system dependency diagrams to scare managers and generating realistic load-tests to stress the heck out of hardware and impress DevOps people. Partial Responses While implementing use-case library, the trickiest part was to figure out how to test only partial responses. For example, an API call might be returning a large task object, however in a specific use-case only three specific fields would be relevant. We would want to test them and ignore the rest of the object. At this point I solved the problem by constructing response objects from maps by hand and comparing them with the actual response. ThenResponse : spec . ReturnJSON ( seq . Map { \"taskId\" : taskId , \"name\" : \"ho-ho\" , \"checked\" : true , }), Of course, one could pass directly a strongly-typed response object instead of a weakly-typed map. That would defeat the whole purpose of API tests, though. We want to test actual responses. The same approach applies to events that we expect and actually get. Verifying Use-Cases Verifying behavior of an application in a clear and non-fragile way - is the primary field of application for these use-cases. I implemented a simple verification utility, which plugs into the unit-testing pipeline of Go. It runs all tests, printing a detailed information for each failure. Behind the scenes, verification runner loads a module into memory, wiring it to the real http server and in-memory event bus. After dispatching events and performing http requests, outputs are compared with expectations. ✗ Given new task, when PUT /task with check, then event Given_events: 1. 
TaskAdded { \"eventId\": \"13bcdee63b9888cb0000000300886534\", \"taskId\": \"13bcdee63b9887dc0000000200886534\", \"name\": \"ho-ho\" } When_request: PUT /task Expect_HTTP: 200 { \"checked\": true, \"name\": \"ho-ho\", \"starred\": false, \"taskId\": \"13bcdee63b9887dc0000000200886534\" } Actual_HTTP: 500 { \"error\": \"Not implemented\" } Expect_Events: 1 0. TaskChecked { \"eventId\": \"\", \"taskId\": \"13bcdee63b9887dc0000000200886534\" } Manually comparing expected and actual responses can be counter-productive. It might be better to work with a clear list of issues, fixing them one by one. Use-case verifier now does exactly that: Issues_to_fix: 1. Expected 'Body.checked' to be 'true' but got 'nothing' 2. Expected 'Body.name' to be 'ho-ho' but got 'nothing' 3. Expected 'Body.starred' to be 'false' but got 'nothing' 4. Expected 'Body.taskId' to be '13bcdee63b9887dc0000000200886534' but got 'nothing' 5. Expected 'Status' to be '200' but got '500' 6. Expected 'Events.length' to be '1' but got '0' 7. Expected 'Events[0].$contract' to be 'TaskChecked' but got 'nothing' 8. Expected 'Events[0].taskId' to be '13bcdee63b9887dc0000000200886534' but got 'nothing' Productivity Golang compiler can build sources and run tests in milliseconds. The rest of tooling is comparably fast as well. We could even perform all these operations almost instantaneously: reformat code save file perform static code analysis for common issues build run unit tests and verify use-cases In fact, I configured F2 to perform all these operations in sequence on single a key-press. This makes the process of implementing API logic rather simple: Pick the next issue in the Issues_to_fix list. Type a few lines of code. Hit F2 . Verify that the issue is gone, then go to 1. If I were doing that with .NET and MSBuild, then I'd need to insert \"Take a break\" between 3 and 4. Next Steps This API back-end is going to provide persistence and scalability to Flux/React.JS front-end, which I'm building in parallel (currently front-end uses in-memory storage). So my next steps would be: Implement all API features needed to wire front-end to back-end. Implement Projects and Contexts . Add some persistence to the back-end."}
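For readers curious how such partial checks can work, here is a small self-contained Go sketch of the idea. It is my own illustration of the technique, not the project's actual spec code: decode the actual body into a generic map and verify only the keys the use-case names, reporting mismatches in the same Issues_to_fix style.

package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// checkPartial verifies that every expected field is present in the actual
// JSON body with the expected value. Extra fields are ignored on purpose.
func checkPartial(actualJSON []byte, expected map[string]any) []string {
	var actual map[string]any
	if err := json.Unmarshal(actualJSON, &actual); err != nil {
		return []string{"Body is not valid JSON: " + err.Error()}
	}
	// Sort the keys so the issue list stays stable between runs.
	keys := make([]string, 0, len(expected))
	for k := range expected {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var issues []string
	for _, k := range keys {
		want := expected[k]
		got, ok := actual[k]
		switch {
		case !ok:
			issues = append(issues, fmt.Sprintf("Expected 'Body.%s' to be '%v' but got 'nothing'", k, want))
		// Comparing formatted values is a sketch-level simplification;
		// a real verifier would compare typed values recursively.
		case fmt.Sprint(got) != fmt.Sprint(want):
			issues = append(issues, fmt.Sprintf("Expected 'Body.%s' to be '%v' but got '%v'", k, want, got))
		}
	}
	return issues
}

func main() {
	body := []byte(`{"error": "Not implemented"}`)
	expected := map[string]any{"checked": true, "name": "ho-ho", "taskId": "13bcdee63b9887dc0000000200886534"}
	for _, issue := range checkPartial(body, expected) {
		fmt.Println(issue)
	}
}

Extra fields in the response are ignored, which is the point: the use-case stays focused on the few fields it cares about while the API object is free to grow.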
{"url":"/btw/2015-01-27-episode-39-in-flux-and-back-to-react/","title":"E39 - In Flux and Back to React","text":"E39 - In Flux and Back to React This is an archive copy of BeingTheWorst podcast Episode 39 by Kerry Street and Rinat Abdullin. Check out the original site for comments, RSS and iTunes feeds. Kerry and Rinat return. The guys catch-up on some of the development that Rinat’s been doing while they were away and set their recording timer to 30 minutes. They discuss that their sample GTD domain, from past episodes, is being implemented with Facebook Flux and React.JS as they set the stage for future episodes. Episode References Rinat’s Talk on ReactJS and Facebook Flux at DDDBE in Belgium GitHub source code GTD Domain in Flux and React.JS React Developer Tools from Facebook for Google’s Chrome browser Feedback and Finding Us"}
{"url":"/building-ai-assistants-course","title":"building-ai-assistants-course","text":"building-ai-assistants-course Building Reliable AI Assistants: Patterns and Practices About · Who It’s For · What You’ll Get · What’s Inside · Pattern Library · Format · Access · Pricing · FAQ · Buy A self-paced course for people who design, scope, ship or oversee LLM-powered assistants. This course is about making sound technical decisions for LLM-powered products across domains, so the systems you build: deliver stable quality stay controllable and testable keep their quality as functionality expands. We’ll look at where systems fail and how reliable solutions are designed. I’ll show you a practical approach to predictable behavior and continuous improvement. That approach comes from my hands-on work shipping AI systems and advising teams on architecture and quality. Buy Course in AI Labs Why I made this course Teams keep running into the same problems with LLM systems: unstable quality, unclear architecture, weak control over output, and too much time spent rediscovering lessons that already exist. I built this course to shorten that path. Inside, I walk through how AI assistants fail, how to diagnose the real causes, and how to use proven patterns to build systems that are more reliable, more testable, and easier to extend. Who this course is for This course is for people who design, scope, ship, or oversee LLM-powered assistants — whether you write the code yourself or guide the team making the build decisions. It is a strong fit once you already have some exposure to LLM-driven products and want a more systematic way to make them production-reliable. Engineers Trace failures to the information flow, reproduce them quickly, and resolve them with proven patterns. Tech Leads and CTOs Design architectures that remain structured as scope grows, with control points for quality, testing, and evolution. Product Leaders Choose feasible AI use cases, define measurable quality targets, and turn vague ideas into scope teams can ship quickly. Founders Pick MVP-friendly approaches, avoid slow dead ends, and set a quality trajectory that supports the next stage of the product. When it’s a mismatch You’re just getting started and haven’t built any LLM-based systems yet. Your focus is exclusively local-model infrastructure (this course focuses on patterns and engineering principles that apply broadly). You want a framework tutorial (LangChain/LlamaIndex setup, indexing walkthroughs). The course focuses on architecture and is framework-agnostic. This course is also a mismatch, if you want to learn more about deploying cutting-edge agents like OpenClaw or building something similar. The reason for that - the course is based on the statistics of successful cases of AI adoption in the business. While AI Agents are hot in theory, in practice they lack a history of successful adoptions. It will take some time to establish proven patterns and practices of shipping trustworthy and reliable AI Agents. The course on AI Agents will have to wait until then. In the meanwhile, if you are really interested in leading R&D around AI Agents - check out BitGN - my platform for autonomous agents and amazing teams building them. What you’ll walk away with Practical methods to diagnose and reduce hallucinations in real workflows. A clearer sense of what makes an AI assistant behave reliably in production. The ability to trace failures to the information flow and fix them with repeatable approaches. 
A working mental model for choosing architectures that hold up as scope grows. More control over outputs through structured responses and guided reasoning. A way to evaluate quality with concrete checks, not intuition. The Pattern Library proven in real-world success cases, helping you choose approaches that hold up in production. What’s inside Module 1 — Foundations for reliable AI assistants We start from a familiar document-assistant scenario, reproduce the failure modes, and work downward until the behavior becomes clear. You will build intuition for how LLMs behave, how context engineering shapes quality, why retrieval quality matters, how hallucinations get triggered, and how structured outputs and custom chain of thought improve control. From there, we move back up to testing, evaluation, trust, and AI case mapping. Module 2 — Pattern Library The second module turns those foundations into repeatable implementation patterns drawn from real successful AI cases. It moves from prompt patterns and knowledge base design to search, workflows, routing, structured data extraction, feedback loops, Schema-Guided Reasoning, and LLM + Domain-Driven Design. For each pattern, I show the task framing, the real constraints, where quality breaks, and what produces stable results in production. Together, these patterns help you recognize recurring problem shapes and reach workable solutions faster. Why the Pattern Library matters Most teams spend too much time rediscovering the same failure modes. The Pattern Library is based on 40+ real AI success cases across 20+ companies that I helped teams ship. It gives you a reusable set of design moves drawn from successful AI implementations. You can adapt them to your domain and move with more confidence when a case grows in complexity. Format Recorded video lessons with chapter navigation and supporting materials. 4+ hours of course video. Self-paced, so you can move through the material on your own schedule. Two practical exercises are included and can be skipped if you are not coding. Access The course is delivered through my platform, AI Labs , and purchase happens there as well. Authentication at the AI Labs is done via Gmail. Personal purchases unlock the course in your AI Labs account. You can buy access for yourself, buy an activation code for someone else, or purchase seats for a team. Purchases for someone else and team purchases are delivered as activation codes, so seats can be assigned later without immediate activation. Pricing Personal access — 1 seat EUR 280.00 Team access 5-seat pack: EUR 1400.00 10-seat pack: EUR 2800.00 Tax calculated at checkout where applicable. Companies can add a billing address and EU VAT ID during purchase. Billing documents are generated automatically and sent by email. Payments are handled via Stripe. Buy in AI Labs FAQ Will this course fit me if I work in Europe, the US, or elsewhere? Yes. The course is built for an English-speaking audience worldwide, and the patterns come from real AI implementations across industries and countries. Do I need programming experience? You do not need to write code yourself to benefit from the course. The material focuses on design, scoping, architecture, and quality decisions. Engineers can apply it directly in code. Is there a practical part? The course is lecture-led and comes with supporting materials. Two practical exercises are included and can be skipped if you are not coding. Can I buy access for another person? Yes. 
You can buy an activation code and give it to someone else. Can I buy access for a team? Yes. Team purchases are available through AI Labs. Team seats are delivered as activation codes, so they can be assigned later without immediate activation. How do I get access? Purchase happens in AI Labs. Authentication there is done via Gmail. Personal purchases unlock the course in your AI Labs account. Still have a question? Write to biz-bot@abdullin.com . This course gives you the foundations, the Pattern Library, and the decision framework for building LLM-powered assistants with more control over behavior and quality. Buy Course in AI Labs"}
{"url":"/child-friendly-hobby/","title":"Let's build robotic arm","text":"Let's build robotic arm This blog post is time-boxed by my sleeping 4mo old son. I most likely have 12 more minutes of uninterrupted time left before he wakes up, so I’ll keep it short. Spoiler alert, Erik slept for 1 hour and 51 minutes! :) Work Situation Since last year, I’ve been working with SoftwarePark - an Austrian software company focusing on turn-key solutions for its clients. At work, I’ve been mostly involved with infrastructure for data engineering and machine learning. Simply put, we are creating an environment for the data scientists to design, push to production and improve their models . Technology revolves around Kubernetes, Helm, Python, Jupyter, Apache Airflow, Dask Distributed and Ceph over S3 with a pinch of Google Cloud. If compared to my previous projects, this work is less pure R&D and more plain engineering. This puts me in a little happy place — with two kids, there isn’t enough time and focus left to, say, work on a consistent distributed application-specific database (like 📦 SkuVault . Domain design in our current project focuses on creating a comfortable and coherent R&D environment for data scientists. In a sense, it feels like OOP at the level of software systems - we are still removing excessive concepts, refactoring and renaming things. New Hobby Along the way, I’ve accidentally found a new hobby to replace software development for fun. It is less demanding for undivided attention, long chunks of undisturbed time and availability of both hands. There is an interesting company called Haddington Dynamics that is building a low-cost high-precision robotic arm . They focus on bringing the cost down, while going for really high precision (measured in 10-50nm). Most of their work is done open-source , except for the FPGA code and the latest designs. Low cost in this case means: from 6800 to 9800 USD. My two other colleagues and I are trying to design and build such an arm from scratch , while achieving similar levels of precision but using less expensive parts and tech. This constraint is where all the fun comes from. Current Status We have built a first tech prototype out of two 3D printed parts (from the Haddington Dynamics STLs), Lego Technic blocks, Arduino and a cheap DC motor. It has an optical encoder that helps to know and maintain its position more or less precisely. However, enormous backlash is killing any hope of establishing even crude haptic feedback. To move forward, we need to get our hands on a speed reducer with smaller backlash (speed reducer is needed, since electric motors operate best when spinning fast). It seems that our best bet is to build a 2-stage cycloid drive with 56:1 gear reduction or so. There is a simple single-stage prototype being printed by our friend right now (and our own 3D printer is en-route from China). In parallel, we are investigating how to improve speed and precision of an encoder. Current designs will get the most out of ATtiny micro-controller. In the future we might try using LVDS IO on an FPGA to work as an Analog-to-Digital Converter (ADC) for an encoder disk. Stellamar did it, Haddington Dynamics did it, so it could probably be reproduced. Similar to how stack flattening works in software design, bringing ADC into the digital logic might allow to make a relatively simple high-frequency converter. 
After all, if we cache all signals produced by the encoder disk, this could be used as a lookup table to speed up ADC sampling (we’ll check the most probable signal levels first). I’m probably being the worst person in the world to work on a nanometer-precision robotic hand with the haptic feedback, however: this is mostly an excuse to learn new skills; we still have a couple of years filled with chunks of time when we have at least one free hand and partially uninterrupted focus; it is an opportunity to do something engaging with the older kids."}
{"url":"/command-query-separation/","title":"Command-Query Separation","text":"Command-Query Separation Command-query separation (or CQS for short) is a design principle devised by Bertrand Meyer. CQS states that every method should either be a command that performs an action (changes state), or a query that returns answer to the caller (without changing the state or causing side-effects), but not both. In other words, functions should return a value only if they are pure (don't make any visible state changes). This convention, if followed consistently, can simplify programs, making them easier to understand and reason about. Following method violates CQS principle, it does too much: func SaveUser ( name string , age int ) * Result { if name == \"\" { return NewError ( \"Name is empty\" ) } if age <= 0 || age >= 100 { return NewError ( \"Age is out of range\" ) } users = append ( users , NewUser ( name , age )) return nil } We can simplify this code by breaking it into two methods. Command : func CreateUser ( name string , age int ) { users = append ( users , NewUser ( name , age )) } and query : func ValidateUser ( name string , age int ) * Result { if name == \"\" { return NewError ( \"Name is empty\" ) } if age <= 0 || age >= 100 { return NewError ( \"Age is out of range\" ) } return nil } We can reuse and test these methods separately or compose them. For example, to print validation errors: func InputChanged ( name string , age int ) { var result = ValidateUser ( name , age ) if ! result . Valid { RenderErrors ( result ) } else { RenderSaveButton () } } Saving user while safe-guarding against invalid input: func SaveButtonClicked ( name string , age int ) { var result = ValidateUser ( name , age ) if ! result . Valid { // protect from programmer error panic ( result ) } SaveUser ( name , age ) } Following CQS principle is a good guideline for a project coding style. However, it might be unfeasible to follow it everywhere. Instead, we could be consistent within a context. For example: accept that UI is messy but keep backend clean. Exceptions CQS works best when it we treat it as a design principle. If used consistently, it helps people understand code better and faster. Yet, there always are exceptional edge cases, where it could be wise to step away from such principle: Operations related to concurrency, where it is important to both mutate state and get the result back: Interlocked.Increment or sync.Add . Well-established data structures, where functions with result and side-effect are common: queue.Dequeue or Stack.Pop It is possible to modify all of these methods to follow CQS principle. Yet, that would introduce additional complexity and deviate from the expected behavior. It might be better to accept the deviation and note it explicitly (e.g. in documentation, inline comments or method names). CQS in Languages CQS is prominent in Functional Programming . Languages supporting it often have this principle baked right into the language design itself. Haskell and F# emphasize functional programming. Learning them would be a good exercise and can make you a better developer. Languages following Object-Oriented Programming approach don't enforce CQS by default. Yet, following these principles can lead to simpler code that is more predictable and easy to reason about. Some ecosystems offer additional tooling to make CQS principle more explicit to the developers. For example, Microsoft Code Contracts introduces Pure Attribute to indicate methods that don't have visible side effects. 
References Command-query separation - Wikipedia Pure Attribute"}
{"url":"/courses","title":"My Courses","text":"My Courses Building AI Assistants: Patterns and Practices This is a recorded online course on building efficiently LLM-driven products in different industries. The course is based on practical experience and AI research with teams in EU and USA. It covers both the fundamentals, as well as concrete patterns and case studies. The course includes ~5 hours of densely packed video materials, private community and a set of practical exercises English version · Русская версия AI+Coding This will be a recorded online course for teams about patterns and practices of software development with modern AI tools. Vibe coding is mentioned there as well, however the content mostly focuses on systematic development approach for the large projects and teams. English version - ETA Q3 2026. Join waitlist to be notified."}
{"url":"/designing-privacy-first-analytics/","title":"(Over) Designing privacy-first analytics","text":"(Over) Designing privacy-first analytics While getting rid of Google Analytics and setting up Lean web analytics for my website, I hit a problem. It is a balance between privacy and gathering feedback. My current setup uses standard web access logs. These include IP addresses and timestamps. That is enough to track individual user actions on the website, regardless of the ad blocker. However, I don't want to record IP addresses or use session cookies , since this involves personal information (IP) and cookie consent. At the same time, I want to see: which pages are interesting for people - to get the motivation to write more. how do people arrive at a specific page and where do they go next - to improve information structure. I don't want to spy on specific people, just to have aggregate feedback. So what if we change common web analytics this way? don't persist IP addresses , not even after hashing; don't use any cookies, fingerprints, or identifiers ; On every page request from the website, send two additional headers: X-Page-Num : is it the first page visited on this site, 2nd or Nth? X-Page-Sec : how much time was spent on the previous page (rounded to seconds)? Server-side, append this information to a daily table. For example: { \"date\" : \"2022-06-13\" , // daily bucket \"referer\" : \"/lean-web-analytics\" , // HTTP Referer header \"url\" : \"/privacy-first-analytics\" , // Requested url \"visits\" : { 0 : 12 , 1 : 2 , 2 : 1 }, // number of visits, by page num \"time\" : [ 1 : 2 , 2 : 10 , 3 : 1 , 6 : 1 , 12 : 1 ] // seconds spent on a page ] Page visits from multiple users (and bots) traveling through the website get aggregated. Without a way to link visits to a session or IP address (because we aren't storing them), it is impossible to figure out who did what. How can we deduce any statistics from that? The number of browser sessions per day? Total count of visits with page num 0 Average time spent on a page per day? Group all rows by URL, then average time . Where do people go from page X ? Find all rows, where referer==X and weight by the number of visits How do people come to page X ? Find all rows where url==X How many people do leave the site after page X ? The number of visits to page X minus number of visits from X is the number of leaves. Gotchas What we can't deduce from the statistics? The number of unique visitors. We can only count unique sessions. Which sequence of pages leads to page X . We know only pages that lead directly to page X , but we can't reconstruct the entire flow. So we can't exactly track the sales funnel. Can ad blockers prevent this setup from working? Only by disabling JavaScript on the website or messing up with the headers. Can this setup be misused? Yes, by recording IP addresses and timestamps, like it is done in web server access logs. How to implement this setup? Here are just a few ideas. First, add Hotwire Turbo (or equivalent) to the website, then add an event listener to include page number and time: var page_num = 1 ; var page_time = Date . now (); ( function () { addEventListener ( 'turbo:before-fetch-request' , async ( event ) => { const headers = event . detail . fetchOptions . headers ; headers [ 'X-Page-Num' ] = page_num ; headers [ 'X-Page-Sec' ] = Math . round ( Date . now () - page_time ); page_time = Date . now (); page_num ++ ; }) })() Server-side, drop standard web access logs . 
Replace the standard logs with custom ones that don't persist IP addresses or timestamps. Push these as updates to a database (e.g. Clickhouse). Don't record individual events - aggregate them immediately. If using a message bus between the web server and Clickhouse , set a short retention period (e.g. a day), so that individual analytics messages don't live long. RedPanda might be a good fit (resource-efficient, Kafka-compatible and passes Jepsen). Or even use NSQ without spill-over to the disk for small cases. NSQ used to handle 10-30M events per day from thousands of connected devices on a single box (read more in real-time analytics with LMDB, Go and NSQ ). Next Let's validate if this approach actually works. We'll take a few shortcuts and use clickhouse-local to build a PoC: Analyze website logs with clickhouse-local ."}
{"url":"/dive-into-fpga/","title":"Dive into FPGA, PCB and 3D printing","text":"Dive into FPGA, PCB and 3D printing In the previous post I mentioned that my colleague Aigiz and I had started a hobby that was particularly friendly to parents of young kids. We are trying to replicate the technology used by Haddington Dynamics in their high-precision robotic arm but to use less expensive parts and processes, where possible. \"Kids-friendly\" comes from the fact that we have years filled with small chunks of time during which our hands might be busy, but cognitive skills and brains are free. I've started typing this blog post while helping my son to hold his bottle, only to finish the post 2 months later. As it turns out, there is a lot to learn, so this hobby is taking us down the rabbit hole (in a good way). We've split our roles: Aigiz focuses on the mechanical design (3D design and printing, milling, power drivers, interop with the stepper motors); I focus on the electrical design and the software (FPGA, PCBs, interop between the high-level code and the HDL programs). Our next milestone is to have a simple high-precision robotic joint with a haptic feedback. This includes designing out and building: speed reducer gear with ratio 1:50 to 1:100; FPGA pipeline to translate analogue signals from the encoder wheel to absolute position and to control the stepper motor with low latency; interop between the FPGA and the computer (could be an ARM running on the FPGA dev board). Mechanical Design Aigiz took the ownership of the mechanical design : everything necessary for a NEMA 17 motor with a speed reducer gear. The progress so far: We have Ender-3 3D Printer that is almost working (one motor driver on the board was dead on the delivery, there is a replacement on the way); we've learned that it is better to pick 3D printers that have CPU and the motor control separately (especially the ones that allow you swapping individual drivers); we read everything available on the internet about the cycloid drives; they are a pain to build without the high-precision equipment and processes (which we don't have at this point); lowered expectations for the next milestone to use the planetary gear or whatever that would work (since size isn't the issue at this point); learned a great deal about the control drivers for the stepper motors and interacting with them; started working with the Fusion 360 (free CAD/CAM software). Next steps: to get the 3D Printer working and print out a planetary gear for the NEMA 17. FPGA and Verilog We started going through the FPGA basics together. Aigiz picked the proper IDE route with his Altera Cyclone IV, while I've started with the Verilog via the Verilator route (free Verilog HDL simulator). It was amazing to discover that simulation (something that doesn't happen frequently in the software design) is an expected norm in the electrical engineering. There is a lot to learn in this field! We didn't get that far along the route, however the journey has already helped to connect a lot of dots in the way modern computers work (and start appreciating all the work needed to produce even a simplest micro-controller). In addition to the Cyclone IV, we have a MiniZed (a single-core Zynq 7Z007S development board) en route to Russia. It has enough computing power and XADC IO in order to drive a single robotic joint or two. This development branch is currently paused while we are waiting for the parts and the speed reduction gear. 
PCBs The original Haddington Dynamics design has little optical boards close to the LED/Phototransistor pairs (for removing noise and calibrating the IR LED intensity). Although we don't need anything like that for the first iteration, PCB design might be helpful to move forward later, so we've taken up the task of rebuilding these PCBs. Current progress: learned about reflow soldering; ordered enough SMT components for 5 optical boards; started designing a PCB in order to have it printed out (finding and/or designing footprint models for the components is the most time-consuming part here). The next step is to finish the PCB design and order a batch of boards. Current Status This is what this milestone looks like at the moment. We aren't even connecting the optical encoder yet, since the cycloidal speed reducer just keeps on jamming. Hence the need for another version of the speed reducer in order to move forward."}
{"url":"/e/df-2024","title":"DataFest-2024","text":"DataFest-2024 Data Fest 2024 Это список ссылок и материалов к моим докладам Data Fest 2024 Knowledge Maps: Как бороться с галлюцинациями в RAG-системах Три паттерна успешных внедрений проектов с LLM под капотом Больше информации про кейсы продуктов с LLM под капотом можно найти в телеграм-канале LLM под капотом . Например: Кейс про захват рынка с LLM Кейс с ChatGPT про 10.000 € в месяц (бизнес-переводчик) Персональный ассистент в компании Автоматическое исправление ошибок в каталогах продуктов Ассистент маркетолога в компании Переводчик со знанием специализированной терминологии Нишевый lead generator для B2B продаж Платформа для автоматизации бизнес-процессов в компании Про свой ChatGPT для муниципалитетов Слайды с презентации я выложу в канале в течение суток. Бенчмарки LLM в первую очередь я публикую тоже в нашем канале."}
{"url":"/erc/","title":"Enterprise RAG Challenge","text":"Enterprise RAG Challenge Enterprise RAG Challenge is a friendly competition that compares different RAG architectures. The goal is to build an AI-driven system that will be able to answer questions about annual reports of the companies. You can find more technical details in this github repository . Round 1 Round 1 was organised by TimeToAct Austria ( read more ). Solution using Checklist pattern with Structured Outputs took the first place. Second place used a classical vector database with LangChain. AIR - teams leveraged my AI Research TTA - teams were a part of TimeToAct community Round 2 Round 2 was organised by TimeToAct Austria ( read more ) and sponsored by IBM WatsonX AI. Teams had to build a solution that would automatically answer 100 randomly generated questions about 100 Annual Reports. Largest PDF was 1047 pages. Some questions required looking up multiple PDFs and comparing companies. Read more: ERC Submission Server (still running) Video: Keynote by Stefan Gillich: The Power of Context - RAG enhanced AI Video: Winner announcement and recap Deep dive from the winner : How I Won the Enterprise RAG Challenge and source code . Below you will find the top leaderboard for teams (regardless of their prize nomination status). This leaderboard focuses on the R&D process and will also be updated with late submissions. You can also jump straight to the deep dive from the winner of the competition : Ilya Rice: How I Won the Enterprise RAG Challenge If you want a canonical competition leaderboard - go to the TAT ERC page . Time - time it took the team to produce the results R - Retrieval Score. Max: 100 G - Generation Score. Max: 100 Score - Final score (R/3+G). Max: 133 🤗 - Team participates in AI R&D community. 🔒 - this is a fully local solution. Click on the table row to read more about the architecture and lessons learned. Show Local Models Only 🏠 Show submissions within hours Team / Experiment Time R/G Score 1. ▶ Ilia Ris 🤗 49 min 83/81 123.7 Ilya Rice Best experiment : Dense Retrieval; Router; LLM reranking; o3-mini Signature : f1d79f Summary : Dense retrieval combined with LLM reranking and SO CoT. Article : How I Won the Enterprise RAG Challenge Source code : Github Models used : o3-mini-2025-01-31 Architecture Ilya Rice solved the problem by making it easy to run numerous experiments before the competition has even started. He created an evaluation pipeline that let him quickly evaluate different architectural solutions. The best solution was also among the fastest ones. The winning experiment had this configuration: PDF Analysis: Documents are processed using a highly modified Docling Library from IBM. Modifications were needed to preserve page references. Router Pattern: First step in question answering flow picks the most suitable agent. Dense Retrieval: The system searches for relevant information based on semantic similarity ( FAISS library and OpenAI vector embeddings ). Parent Document Retrieval: Instead of retrieving only the chunk, full page is loaded to preserve relevant context. LLM Reranking: Retrieved information is re-evaluated and reordered by the LLM. Reasoning Patterns: Improve LLM accuracy within a single prompt by controlling its thinking process with Custom Chain-of-Thought and Structured Outputs. Final Answer generation: The optimized result is generated using o3-mini. Self-Consistency with Majority Vote: Multiple answer variations are generated, compared, and the most consistent one is selected. 
R&D Experiments Total experiments submitted : 11 Other approaches : Dense Retrieval; LLM Reranking; Router; SO CoT; o3-mini Dense Retrieval; Router; SO CoT; llama3.3-70b Dense Retrieval; Tables serialization; Router; LLM reranking; o3-mini Dense Retrieval; llama-3.3 70b Dense Retrieval; llama-3.1 8b Full Context; gemini-2.0 thinking Dense Retrieval; Router; LLM reranking; Self-Consistency; o3-mini Dense Retrieval; Router; LLM reranking; Self-Consistency; llama-3.3 70b What didn't work? Using llama-3.1 8b for reranking Incorporating Full Context with gemini-2.0 thinking Future experiments: Evaluating various local embedding models for fully offline solutions Experiment journal: 16 min → R: 83.9 , G: 72.8 , Score: 114.8 ▲ - Dense Retrieval; LLM Reranking; Router; SO CoT; o3-mini 23 min → R: 81.4 , G: 74.7 , Score: 115.4 ▲ - Dense Retrieval; llama-3.3 70b 49 min → R: 83.8 , G: 81.8 , Score: 123.7 ▲ - Dense Retrieval; Router; LLM reranking; o3-mini 50 min → R: 81.1 , G: 68.7 , Score: 109.3 - Dense Retrieval; llama-3.1 8b 51 min → R: 75.5 , G: 75.0 , Score: 112.8 - Full Context; gemini-2.0 thinking 66 min → R: 83.0 , G: 78.8 , Score: 120.3 - Dense Retrieval; Tables serialization; Router; LLM reranking; o3-mini 22 hours → R: 83.5 , G: 81.8 , Score: 123.6 - Dense Retrieval; Router; LLM reranking; o3-mini 22 hours → R: 80.8 , G: 75.7 , Score: 116.1 - Dense Retrieval; llama-3.3 70b 33 hours → R: 83.4 , G: 79.8 , Score: 121.6 - Dense Retrieval; Router; LLM reranking; Self-Consistency; o3-mini 33 hours → R: 81.3 , G: 79.7 , Score: 120.3 - Dense Retrieval; Router; LLM reranking; Self-Consistency; llama-3.3 70b 2. ▶ Emil Shagiev 🤗 55 min 86/78 121.6 Emil Shagiev Best experiment : LLM_Search Signature : 0a8782 Summary : A multi-step process involving query expansion, efficient search, question answering, and answer finalization. Models used : gpt-4o-mini-2024-07-18 gpt-4o-2024-08-06 o3-mini-2025-01-31 Architecture The best solution didn't use vector embeddings, it leveraged a structured approach: the input query is expanded to enhance search coverage and enable semantic search; relevant pages are retrieved using a cost-effective and rapid LLM; retrieved information is then passed to powerful LLM to generate answers; answers are refined and finalized for presentation. R&D Experiments Total experiments submitted : 3 Other approaches : LLL_Search_2: Similar architecture with added capability for mathematical operations. Experiment journal: 55 min → R: 86.3 , G: 78.5 , Score: 121.6 ▲ - LLM_Search 21 hours → R: 86.1 , G: 77.5 , Score: 120.5 - LLL_Search_2 3. ▶ Dmitry Buykin 🤗 8 hours 81/76 117.5 Dmitry Buykin Best experiment : Dynamic Structured Output with SEC EDGAR Ontologies Signature : 6b0d78 Summary : Dynamic structured output with query expansion and page-focused chunking. Models used : gpt-4o-2024-08-06 Architecture Used SO/CoT approach with ontologies to retrieve relevant information. Key highlights: embeddings and vector databases were not used; dynamic structured output approach combined with SEC EDGAR ontologies for query expansion (SO CoT); utilized CBOW similarity for majority selection across multiple runs, focusing on balancing pages versus tokens during chunking significant effort was dedicated to evaluating PDF quality heuristics to optimize OCR input synthetic tags were implemented to stabilize page detection and assess model quality. 4. 
▶ Sergey Nikonov 🤗 30 hours 85/73 116.4 Sergey Nikonov Best experiment : main v2 Signature : 00c0e1 Summary : For every question, all pages are processed using gpt-4o. Models used : gpt-4o o1-mini Architecture Solution involves feeding all pages of the provided documents into the gpt-4o model for each question. This simple but practical approach ensures comprehensive coverage of the content to extract accurate answers. R&D Experiments Total experiments submitted : 2 Other approaches : Finding the PDFs that correspond to questions, cutting the PDFs by page, running the question against each PDF page by loading the PDF directly into gpt-4o (through the assistant API), scanning all PDF pages for the answer, and combining the answers by simple logic. What didn't work? Using the o3-mini model instead of o1-mini in the architecture. Experiment journal: 5 hours → R: 85.3 , G: 69.0 , Score: 111.6 ▲ - Main 30 hours → R: 85.1 , G: 73.9 , Score: 116.4 ▲ - main v2 5. ▶ ScrapeNinja.net 🤗 23 hours 82/71 112.5 ScrapeNinja.net Best experiment : fixed multiple companies search Signature : 417bbf Summary : Node.js-based architecture utilizing pgvector for efficient data handling. Models used : Gemini Flash 2.0 Gemini Flash Lite 2.0 Flash Thinking Exp Architecture The solution used Node.js for backend operations and pgvector for vectorized data processing. It focused on efficient handling of complex queries and data retrieval tasks. The team utilized: Gemini Flash 2.0 Gemini Flash Lite 2.0 Flash Thinking Exp. R&D Experiments Total experiments submitted : 2 Other approaches : OCR and PG Experiment journal: 20 hours → R: 82.6 , G: 64.2 , Score: 105.5 ▲ - OCR and PG 23 hours → R: 82.6 , G: 71.2 , Score: 112.5 ▲ - fixed multiple companies search 6. ▶ xsl777 🤗 16 hours 79/71 110.9 xsl777 Best experiment : multi-query, gpt-4o Signature : 66ab5c Summary : Structured PDF parsing, metadata extraction, query expansion, hybrid search, reranking, and CoT. Models used : gpt-4o gpt-4o-mini Architecture The architecture integrates following patterns: structured PDF parsing and chunking; metadata extraction; query expansion; hybrid search mechanisms; reranking strategies. It synthesizes document metadata and chunks while utilizing Chain-of-Thought (CoT) reasoning to enhance response accuracy and relevance. gpt-4o and gpt-4o-mini help with high-quality language understanding and generation capabilities. R&D Experiments Total experiments submitted : 2 Experiment journal: 16 hours → R: 79.4 , G: 71.2 , Score: 110.9 ▲ - multi-query, gpt-4o 3 days → R: 80.1 , G: 70.7 , Score: 110.7 - Open source, Advanced RAG 7. ▶ nikolay_sheyko(grably.tech) 🤗 25 hours 81/69 110.4 nikolay_sheyko(grably.tech) Best experiment : nikolay_sheyko(grably.tech)_with_o3_mini Signature : db8938 Summary : Relevant pages are identified and processed to generate answers. Models used : gpt-4o-mini o3-mini Architecture The solution employs a two-step process: first, it identifies relevant reports for a given question and evaluates the relevance of each page asynchronously using the gpt-4o-mini model; then , all relevant pages are compiled into a prompt, and the o3-mini model is utilized to generate the final answer. R&D Experiments Total experiments submitted : 7 Other approaches : Dynamic data extraction with pydantic classes Binary checks per page Parallel question splitting Subquestion generation for multi-entity queries Single-page reference experiments What didn't work? 
Binary checks per page Single-page reference experiments Experiment journal: 55 min → R: 77.2 , G: 51.2 , Score: 89.9 ▲ - grably.tech/with_extra_reasoning_from_different_pages_hacked96160725 25 hours → R: 81.1 , G: 69.8 , Score: 110.4 ▲ - nikolay_sheyko(grably.tech)_with_o3_mini 25 hours → R: 79.7 , G: 60.2 , Score: 100.1 - nikolay_sheyko(grably.tech)_dummy 8 days → R: 80.5 , G: 64.3 , Score: 104.6 - o3-mini-no-restrictions 8 days → R: 80.5 , G: 66.3 , Score: 106.6 - o3-mini-no-restrictions-fixed-names 12 days → R: 81.2 , G: 67.1 , Score: 107.7 - o3-mini-no-restrictions-single-reference 12 days → R: 80.5 , G: 67.3 , Score: 107.6 - o3-mini-no-restrictions-fixed-names-and-boolean 8. ▶ Felix-TAT 🤗 7 days 80/69 109.4 Felix-TAT Best experiment : Gemini-4o Multiagent RAG Signature : a2faff Summary : Multiagent, mixed-model approach with delegation and execution agents. Models used : gemini-2.0-flash gpt-4o-2024-08-06 Architecture The solution uses a multiagent architecture where a delegation manager (OpenAI) splits the user query into company-specific subqueries. These subqueries are processed by expert agents using Google's Gemini flash model, which has access to the entire company PDF in context. The responses are then aggregated and synthesized by an execution agent (OpenAI) to produce the final answer. R&D Experiments Total experiments submitted : 4 Other approaches : Gemini Naive IBM-4o-based Multiagent RAG OpenAI Multiagent RAG What didn't work? Using a single model without multiagent delegation Relying solely on vector database retrieval without full PDF context Experiment journal: 6 days → R: 79.0 , G: 60.3 , Score: 99.8 ▲ - Gemini Naive 7 days → R: 81.7 , G: 47.3 , Score: 88.2 - IBM-4o-based Multiagent RAG 7 days → R: 82.2 , G: 66.0 , Score: 107.1 ▲ - OpenAI Multiagent RAG 7 days → R: 80.2 , G: 69.3 , Score: 109.4 ▲ - Gemini-4o Multiagent RAG 9. ▶ A.Rasskazov/V.Kalesnikau 30 hours 84/67 109.3 A.Rasskazov/V.Kalesnikau Best experiment : multi_agent_ibm_openai Signature : efabd4 Summary : A multi-agent system leveraging LLMs for question answering using similarity-based retrieval. Models used : meta-llama/llama-3-405b-instruct ibm/granite-embedding-107m-multilingual text-embedding-3-small gpt-4o-mini Architecture The solution employs a multi-agent architecture to address the challenge. Initially, it generates a database for the Retrieval-Augmented Generation (RAG) model. Upon receiving a query, the system extracts key metrics such as company, industry, and currency. These metrics are then used to identify the most similar question in the database. The answer associated with this similar question is retrieved and refined using a Large Language Model (LLM). Finally, the system consolidates and presents the answer to the user. R&D Experiments Total experiments submitted : 2 Other approaches : pjatk_team_002: A system that preprocesses questions, retrieves relevant PDF pages using a vector database, and extracts answers with page references using LLMs. What didn't work? Alternative embedding models for retrieval. Different strategies for key metric extraction. Experiment journal: 30 hours → R: 84.0 , G: 67.2 , Score: 109.3 ▲ - multi_agent_ibm_openai 7 days → R: 82.5 , G: 64.0 , Score: 105.2 - pjatk_team_002 10. ▶ Dany the creator 🤗 3 hours 82/67 108.4 Dany the creator Best experiment : gpt-4o-mini + pgvector Signature : ee29ae Summary : Utilized a structured approach to parse and analyze text chunks, creating embeddings and generating questions. 
Models used : gpt-4o-mini Architecture The solution preprocesses text by chunking, generating embeddings with pgvector library, and formulating questions that could be answered by the respective chunks. 11. ▶ SergC 🤗 7 days 77/69 108.1 SergC Best experiment : submission_1 Signature : c0d776 Summary : QE + SO + CoT Models used : gemini 2.0 Architecture The solution uses a combination of: Query Expansion (QE) Semantic Optimization (SO) Chain of Thought (CoT) reasoning to enhance the performance of the Gemini 2.0 model. 12. ▶ Swisscom Innovation Lab 🔒 21 hours 83/66 107.8 Swisscom Innovation Lab Best experiment : Multi-Agent Langgraph-Llamaindex-MarkerPDF-Llama3.3 Signature : debcf6 Summary : A multi-agent system leveraging LangGraph, LlamaIndex, MarkerPDF, and Llama 3.3 for accurate and contextual multi-company query processing. Models used : llama-3.3-70b-instruct Architecture This offline solution uses a multi-agent architecture with: LangGraph for workflow orchestration LlamaIndex for data indexing MarkerPDF for document parsing Llama 3.3 for natural language processing. Solution supports multi-company queries by: extracting relevant entities validating inputs processing each entity individually retrieving and evaluating documents aggregating results for numeric-based comparisons. R&D Experiments Total experiments submitted : 3 Other approaches : Iterative refinement of query processing pipeline Enhanced document retrieval mechanisms What didn't work? Simplified single-agent architecture Direct query-to-response mapping without intermediate validation Experiment journal: 80 min → R: 83.3 , G: 65.2 , Score: 106.8 ▲ - Multi-Agent Langgraph-Llamaindex-MarkerPDF-Llama3.3 21 hours → R: 83.3 , G: 66.2 , Score: 107.8 ▲ - Multi-Agent Langgraph-Llamaindex-MarkerPDF-Llama3.3 13. ▶ fomih 🤗 10 days 83/65 107.4 fomih Best experiment : gemini-flash CoT with question type detection fixes Signature : 60bc28 Summary : Enhanced question type detection for improved accuracy. Models used : gemini-flash 2.0 Architecture The solution utilized the gemini-flash 2.0 model, incorporating a refined approach to question type detection. This enhancement aimed to improve the accuracy and relevance of the responses generated by the system. The architecture involved preprocessing input documents into structured formats, creating knowledge bases tailored to specific question types, and leveraging these resources during the question-answering phase. The system identified the question type and relevant entities, retrieved pertinent knowledge base entries, and generated answers by combining the question with the retrieved data. R&D Experiments Total experiments submitted : 4 Other approaches : gemini-flash CoT with structured output gemini-flash CoT with structured output and small fixes gemini CoT with structured output final What didn't work? Initial handling of 'n/a' cases Fallback processing without structured knowledge bases Experiment journal: 10 days → R: 83.2 , G: 59.9 , Score: 101.5 ▲ - _gemini-flash CoT + structured output _ 10 days → R: 82.9 , G: 62.8 , Score: 104.3 ▲ - gemini-flash CoT + structured output small n/a handling fixex 10 days → R: 83.0 , G: 65.9 , Score: 107.4 ▲ - gemini-flash CoT + so small fixes in question type detection 12 days → R: 83.3 , G: 64.4 , Score: 106.1 - gemini CoT + SO final 14. 
▶ Al Bo 12 days 81/65 105.9 Al Bo Best experiment : albo Signature : 1e89b6 Summary : Docling, Vector, Agent with search tool into documents Models used : gpt-4o Architecture The solution utilized a sophisticated architecture combining document processing (Docling), vector-based representation, and an agent equipped with a search tool for document retrieval. 15. ▶ NumericalArt 8 days 70/70 105.3 NumericalArt Best experiment : Vhck-R0-002 Signature : 32aae7 Summary : Preprocessing questions, raw retrieval, filtering, retrieval, detailed page analysis, and answer generation. Models used : 4o-mini 4o 3o-mini Architecture The best experiment employs a structured approach to information retrieval and answer generation. The process begins with preprocessing the input questions to enhance clarity and relevance. This is followed by an initial raw retrieval phase to gather potential information sources. Subsequently, a filtering mechanism is applied to refine the retrieved data. The refined data undergoes a detailed page analysis to extract precise and contextually relevant information. Finally, the system generates answers based on the analyzed data, leveraging the capabilities of the LLM models 4o-mini, 4o, and 3o-mini. R&D Experiments Total experiments submitted : 2 Other approaches : Parsing text from PDFs only, separate VDB for each document, one chunk equals one page, extract four pages by entity value from question (excluding company name), detailed parsing of extracted pages, asking LLM question with detailed information in context. Experiment journal: 7 days → R: 75.9 , G: 63.3 , Score: 101.3 ▲ - Vhck-R0 8 days → R: 70.0 , G: 70.3 , Score: 105.3 ▲ - Vhck-R0-002 16. ▶ Pedro Ananias 🤗 4 hours 80/64 104.9 Pedro Ananias Best experiment : rag-3w-cot-gpt-4o-mini Signature : d44b72 Summary : A 3-way FAISS MMR Search & Stepped Chain Of Thought RAG Models used : openai/gpt-4o-mini Architecture The solution uses a 3-way FAISS MMR Search mechanism combined with a Chain Of Thought (CoT) approach. FAISS MMR Search involves query expansion, file selection based on exact matches and cosine similarity, and database searching using maximum marginal relevance. The CoT pipeline consists of three sequential model calls with specific prompts for reasoning, formatting, and parsing. This architecture leverages the openai/gpt-4o-mini model for processing. R&D Experiments Total experiments submitted : 5 Other approaches : rag-3w-cot-gpt-4o-mini-hi-res rag-3w-cot-deepseek-r1-distill-llama-8B-fast-fp16 rag-3w-cot-deepseek-r1-distill-llama-8B-hi-res-fp16 rag-3w-cot-microsoft-phi4-14B-hi-res-int8 What didn't work? Using lower resolution PDF extraction for certain tasks Employing fully local processing without cloud integration in some scenarios Experiment journal: 4 hours → R: 80.4 , G: 64.7 , Score: 104.9 ▲ - rag-3w-cot-gpt-4o-mini 9 hours → R: 70.6 , G: 56.0 , Score: 91.3 - rag-3w-cot-deepseek-r1-distill-llama-8B-fast-fp16 9 hours → R: 77.0 , G: 64.6 , Score: 103.1 - rag-3w-cot-gpt-4o-mini-hi-res 11 hours → R: 72.3 , G: 58.0 , Score: 94.2 - rag-3w-cot-deepseek-r1-distill-llama-8B-hi-res-fp16 31 hours → R: 78.1 , G: 59.7 , Score: 98.7 - rag-3w-cot-microsoft-phi4-14B-hi-res-int8 17. ▶ Daniyar 3 days 62/72 104.1 Daniyar Best experiment : Fixed reference page indices Signature : 8bb723 Summary : The architecture utilizes fixed reference page indices for efficient information retrieval. 
Models used : gpt-4o Architecture Solution uses a strategy of fixed reference page indices to enhance the accuracy and efficiency of document parsing and question answering. This approach ensures that the model can quickly locate and utilize relevant information from the provided documents, leveraging the capabilities of the GPT-4o model. R&D Experiments Total experiments submitted : 2 Other approaches : Sliding window PDF page reading with checklists over questions addressed to files. What didn't work? Alternative indexing methods or dynamic page referencing strategies. Experiment journal: 3 days → R: 62.2 , G: 72.9 , Score: 104.0 ▲ - First draft 3 days → R: 62.4 , G: 72.9 , Score: 104.1 ▲ - Fixed reference page indices 18. ▶ RubberduckLabs 🔒 2 days 74/66 103.3 RubberduckLabs Best experiment : RubberduckLabs - RAG experiment attempt 001 Signature : ee7519 Summary : A multi-step LLM processing pipeline for document question-answering. Models used : deepseek-r1-distill-llama-70b:bf16 llama-3.1-70b-instruct:bf16 Architecture The architecture preprocesses documents to generate detailed page-level summaries and extract structured metadata, particularly focusing on financial data. The retrieval process employs a two-stage approach: document selection based on metadata matching; precise page identification using semantic relevance and explicit reasoning. Answer generation utilizes 'Context-Guided Response Generation' combining retrieved contexts with structured reasoning to ensure factual accuracy and traceability. The system maintains explicit reasoning trails and incorporates robust error handling for production stability. R&D Experiments Total experiments submitted : 2 19. ▶ Machine Learning Reply 28 hours 74/66 103.2 Machine Learning Reply Best experiment : ML Reply - Submission 1 Signature : fa34f3 Summary : Integration of Azure Document Intelligence and Azure AI Search. Models used : GPT-4o Architecture This solution utilized a combination of Azure Document Intelligence for document processing and Azure AI Search for efficient information retrieval. R&D Experiments Total experiments submitted : 2 Other approaches : ML Reply - Submission 2 Experiment journal: 28 hours → R: 74.5 , G: 66.0 , Score: 103.2 ▲ - ML Reply - Submission 1 29 hours → R: 74.0 , G: 63.5 , Score: 100.5 - ML Reply - Submission 2 20. ▶ Aleksandr Podgaiko 🤗 3 days 81/62 103.0 Aleksandr Podgaiko Best experiment : smolagent_simple_v1 Signature : 6afedb Summary : Utilized smolagents library with basic PDF extraction and a coding agent. Models used : openrouter/google/gemini-2.0-flash-001 Architecture The solution employed the HuggingFace smolagents library for agent-based interactions, integrating basic PDF extraction using PyPDF2 . The architecture featured a default coding agent equipped with two tools: pdf_search for keyword-based search with contextual display and pdf_content for full-page content retrieval upon request. Additionally, the final_answer tool was customized to adhere to the submission format. 21. ▶ Vlad Drobotukhin (@mrvladd) 🤗 🔒 6 days 68/68 102.3 Vlad Drobotukhin (@mrvladd) Best experiment : Qwen 2.5-72b + Multi-Query BM25 + Domain-Specific Information Extraction + Router Signature : fa77e2 Summary : System combining LLM-based reasoning with optimized retrieval techniques. 
Models used : Qwen-2.5-72b-INT4 Architecture This offline solution employs a multi-step process: start with question analysis to determine the type and domain; generate multiple search queries to maximize recall; relevant pages are retrieved using OpenSearch and processed with domain-specific LLM extractors to build structured knowledge; final answers are synthesized with reasoning and confidence scores. R&D Experiments Total experiments submitted : 10 Other approaches : Qwen2.5 72b + FTS (rephrase query) +SO + CheckList's Qwen2.5 72b + FTS +SO + CheckList's Qwen2.5 + FTS (rephrase query) + SO + CheckList's Qwen 2.5-72b + Multi-Query BM25 + Domain-Specific Information Extraction Qwen 2.5-72b + Multi-Query BM25 (top 15 pages) + Domain-Specific Information Extraction + Router Qwen 2.5-72b + Multi-Query BM25+ Domain-Specific Information Extraction + Router Qwen 2.5-72b-4bit + BM25 + Domain-Specific Information Extraction + Router MagicQwen-4bit + BM25 + Domain-Specific Information Extraction + Router Qwen 72b-4bit + FTS + Domain-Specific Information Extraction 0803 What didn't work? Simplified query generation without diversification Lack of domain-specific term boosting Absence of structured output validation Experiment journal: 3 days → R: 74.7 , G: 59.2 , Score: 96.5 ▲ - Qwen2.5 72b + FTS (rephrase query) +SO + CheckList's 3 days → R: 71.8 , G: 62.3 , Score: 98.2 ▲ - Qwen2.5 72b + FTS +SO + CheckList's 4 days → R: 74.7 , G: 59.2 , Score: 96.5 - Qwen2.5 + FTS (rephrase query) + SO + CheckList's 5 days → R: 69.1 , G: 65.7 , Score: 100.2 ▲ - Qwen 2.5-72b + Multi-Query BM25 + Domain-Specific Information Extraction 6 days → R: 68.3 , G: 68.2 , Score: 102.3 ▲ - Qwen 2.5-72b + Multi-Query BM25 + Domain-Specific Information Extraction + Router 7 days → R: 67.6 , G: 67.4 , Score: 101.2 - Qwen 2.5-72b + Multi-Query BM25 (top 15 pages) + Domain-Specific Information Extraction + Router 8 days → R: 64.6 , G: 62.0 , Score: 94.3 - Qwen 2.5-72b + Multi-Query BM25+ Domain-Specific Information Extraction + Router 9 days → R: 61.9 , G: 63.0 , Score: 93.9 - Qwen 2.5-72b-4bit + BM25 + Domain-Specific Information Extraction + Router 9 days → R: 69.2 , G: 63.2 , Score: 97.8 - MagicQwen-4bit + BM25 + Domain-Specific Information Extraction + Router 10 days → R: 78.4 , G: 63.0 , Score: 102.2 - Qwen 72b-4bit + FTS + Domain-Specific Information Extraction 0803 22. ▶ Ivan R. 🤗 71 min 79/62 101.9 Ivan R. Best experiment : Round 2 submission Signature : b29973 Summary : A multi-step approach leveraging LLMs for question decomposition, search, and validation. Models used : gpt-4o gpt-4o-mini Architecture The solution employs a structured pipeline: document loading using PyPDFDirectoryLoader from LangChain; question decomposition with GPT-4o; multiple OpenAI assistants, each dedicated to a specific company, perform targeted searches using GPT-4o-mini; results undergo answer validation with GPT-4o local FAISS vector store is used for similarity search to collect reference pages. 23. ▶ PENZA_AI_CREW 🤗 7 days 72/65 101.3 PENZA_AI_CREW Best experiment : gpt-4_claude3.5_unstructured Signature : 67ee86 Summary : A multi-step pipeline leveraging OCR, table/image analysis, and knowledge mapping for accurate question answering. Models used : gpt-4-mini claude 3.5 gpt-4o Architecture This RAG pipeline was composed of the following steps: PDF text is parsed using Unstructured library with OCR Tables and images are analyzed using Claude 3.5 Knowledge map is constructed using gpt-4-mini, utilizing Structured Outputs. 
Questions are analyzed in conjunction with the knowledge map using gpt-4-mini with Pydantic schema. Answers are generated by gpt-4o, employing chain-of-thought reasoning and Pydantic schema (SO CoT). R&D Experiments Total experiments submitted : 2 Other approaches : RAG_PNZ_PAYPLINE: OCR with Unstructured, table/image analysis with Claude 3.5, metadata extraction with gpt-4-mini, and final reasoning with gpt-4o. What didn't work? Alternative OCR methods not utilizing Unstructured. Direct question answering without intermediate knowledge mapping. Experiment journal: 7 days → R: 12.2 , G: 11.0 , Score: 17.1 ▲ - RAG_PNZ_PAYPLINE 7 days → R: 72.5 , G: 65.0 , Score: 101.3 ▲ - gpt-4_claude3.5_unstructured 24. ▶ Yolo leveling 25 hours 82/59 101.0 Yolo leveling Best experiment : Marker + Gemini Signature : 31b473 Summary : Convert PDFs to markdown, extract company names, and generate JSON representations. Models used : Surya (OCR) Flash 2.0 Architecture The solution starts converting each PDF document into markdown format using the Marker tool with OCR capabilities. Afterward, the system identifies the company name within the content. In cases where multiple companies are mentioned in the query, the system employs a hallucination control mechanism to determine the most relevant company. The markdown content is then incorporated into the context for the LLM, which extracts and generates a structured JSON representation of the required information. R&D Experiments Total experiments submitted : 2 Other approaches : Gemini 1M pdf \"thinking\" + 4o parser What didn't work? Queries involving multiple companies were marked as N/A in alternative approaches. Experiment journal: 25 hours → R: 76.0 , G: 60.0 , Score: 98.0 ▲ - Gemini 1M pdf \"thinking\" + 4o parser 25 hours → R: 82.2 , G: 59.9 , Score: 101.0 ▲ - Marker + Gemini 25. ▶ ArtemNurm 🤗 7 days 77/61 99.9 ArtemNurm Best experiment : brute_flash2.0&brute_flash2.0 Signature : 46e0e0 Summary : PDF2MD with Flash, relevant data extraction with Flash, the data is sent to LLM with questions using SO (no CoT). All steps include generator-critic workflow. Models used : Gemini Flash 2.0 OpenAI o3-mini Architecture The winning experiment employs a robust architecture leveraging the Gemini Flash 2.0 and OpenAI o3-mini models. The process involves converting PDF documents to Markdown format using Flash, extracting relevant data, and querying the LLM with specific questions using a straightforward approach without chain-of-thought reasoning. A generator-critic workflow is integrated into all steps to ensure high-quality outputs. R&D Experiments Total experiments submitted : 8 Other approaches : brute_flash2.0&CoT_flash2.0 index_flash2.0&brute_flash2.0 index_flash2.0&CoT_4o-2024-11-20 index_flash2.0&CoT_flash2.0 index_flash2.0&CoT_o3-mini-high index_flash2.0&CoT_o3-mini flash2.0_sees_all_content What didn't work? Using chain-of-thought reasoning in 'brute_flash2.0&CoT_flash2.0' did not outperform the winning approach. Concatenating all Markdown files into a single string in 'flash2.0_sees_all_content' was less effective. 
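The generator-critic workflow mentioned in this entry can be sketched roughly as follows (a hypothetical Python illustration; generate and critique stand for arbitrary LLM calls, this is not the team's code):

from typing import Callable, Optional

def generate_with_critic(generate: Callable[[str], str],
                         critique: Callable[[str, str], Optional[str]],
                         task: str, max_rounds: int = 3) -> str:
    # the critic either approves a draft (returns None) or returns feedback
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:  # critic is satisfied
            return draft
        # fold the feedback back into the next generation attempt
        draft = generate(task + '\n\nPrevious draft:\n' + draft +
                         '\n\nCritic feedback:\n' + feedback)
    return draft  # best effort after max_rounds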
Experiment journal: 7 days → R: 77.8 , G: 61.0 , Score: 99.9 ▲ - brute_flash2.0&brute_flash2.0 7 days → R: 77.7 , G: 61.0 , Score: 99.8 - brute_flash2.0&CoT_flash2.0 7 days → R: 68.5 , G: 57.6 , Score: 91.8 - index_flash2.0&brute_flash2.0 7 days → R: 66.4 , G: 56.8 , Score: 90.0 - index_flash2.0&CoT_4o-2024-11-20 7 days → R: 66.3 , G: 57.6 , Score: 90.7 - index_flash2.0&CoT_flash2.0 7 days → R: 65.6 , G: 58.8 , Score: 91.6 - index_flash2.0&CoT_o3-mini-high 7 days → R: 65.9 , G: 59.3 , Score: 92.2 - index_flash2.0&CoT_o3-mini 7 days → R: 71.8 , G: 55.6 , Score: 91.4 - flash2.0_sees_all_content 26. ▶ ndt by red_mad_robot 🤗 🔒 9 days 72/63 99.7 ndt by red_mad_robot Best experiment : qwen32b+bge_m3 Signature : 30f0d1 Summary : PDFs were converted to markdown, vectorized using bge m3, and queried with Qwen 32B. Models used : Qwen 32B instruct BGE-M3 Architecture This offline solution involved processing PDF documents by converting them into markdown format using the Pymupdf library. These markdown representations were then vectorized using the popular BGE-M3 model. Qwen 32B instruct model was used to answer user queries by leveraging the vectorized data for relevant context retrieval. R&D Experiments Total experiments submitted : 5 Other approaches : full open-source + roter agent qwen7b-router-agent What didn't work? Directly querying without vectorization Using alternative LLMs for vectorization Experiment journal: 23 hours → R: 27.2 , G: 54.0 , Score: 67.6 ▲ - full open-source + roter agent 7 days → R: 73.2 , G: 51.0 , Score: 87.6 ▲ - qwen7b-router-agent 9 days → R: 73.2 , G: 59.0 , Score: 95.6 ▲ - ndt by red_mad_robot 9 days → R: 72.9 , G: 63.2 , Score: 99.7 ▲ - qwen32b+bge_m3 27. ▶ Neoflex DreamTeam 🤗 🔒 30 hours 77/58 96.9 Neoflex DreamTeam Best experiment : Simple LLM Brute Force Signature : 34a266 Summary : Utilized a straightforward LLM brute force approach for each page with predefined questions and example answers. Models used : Qwen 2.5 Architecture Solution used Qwen 2.5 model to process each page individually, applying a brute force methodology with a set of predefined questions and corresponding example answers to extract relevant information effectively. R&D Experiments Total experiments submitted : 2 Other approaches : Checklist based RAG What didn't work? Alternative configurations of the Checklist based RAG approach Experiment journal: 30 hours → R: 77.8 , G: 58.0 , Score: 96.9 ▲ - Best run 7 days → R: 67.3 , G: 51.7 , Score: 85.4 - neon_team 28. ▶ nightwalkers 🔒 6 hours 72/60 96.7 nightwalkers Best experiment : nightwalkers-baseline Signature : 356ef4 Summary : Utilized a vector database for efficient document retrieval and LLM for response generation. Models used : deepseek-r1-distill-llama-70b Architecture The team implemented vector database search using embeddings from all-MiniLM-L6-v2 and ibm/granite-embedding-107m-multilingual models. This facilitated the retrieval of the most relevant page and document based on the query. The retrieved information was then processed by the deepseek-r1-distill-llama-70b LLM to generate relevant answers. 29. ▶ Gleb Kozhaev 🤗 32 hours 79/56 95.5 Gleb Kozhaev Best experiment : pymupdf4llm + Structured Output Signature : 1442cb Summary : Utilized pymupdf4llm with structured output and three distinct system prompts/roles. Models used : gpt-4o-mini Architecture RAG solution employed the pymupdf4llm framework, leveraging Structured Outputs to enhance data processing and comprehension. 
Three distinct system prompts/roles were utilized to optimize the model's performance and ensure accurate and efficient results. 30. ▶ AndreiKopysov 🤗 33 hours 76/57 95.3 AndreiKopysov Best experiment : Gemini2.0 and DeepSeek R1 Integration Signature : 574182 Summary : The architecture processes PDF pages using Gemini2.0 and refines responses with DeepSeek R1. Models used : Gemini2.0 DeepSeek R1 Architecture This RAG solution used a two-step pipeline: each page of the PDF document is processed using the Gemini2.0 model to extract relevant information; extracted responses are refined and analyzed using the DeepSeek R1 model to ensure accuracy and relevance. R&D Experiments Total experiments submitted : 2 Other approaches : Reused the same architecture in different configurations. Experiment journal: 33 hours → R: 76.2 , G: 57.2 , Score: 95.3 ▲ - AndreiKopysov 33 hours → R: 76.2 , G: 57.2 , Score: 95.3 - AndreyKopysov 31. ▶ Serj Tarasenko 3 days 82/54 95.0 Serj Tarasenko Best experiment : complicated second Signature : a5cf25 Summary : RAG pipeline with query enhancement and re-ranking. Models used : gpt-4o-mini text-embedding-3-small Architecture The winning solution implemented a Retrieval-Augmented Generation (RAG) pipeline. The process involved extracting content from PDFs, segmenting it into manageable chunks, and indexing these chunks using FAISS for efficient vector-based retrieval. Queries were enhanced with financial terms to improve relevance, followed by a retrieval step that included re-ranking to prioritize the most pertinent information. Finally, an LLM was employed to generate comprehensive answers based on the retrieved data. The source code for this implementation is publicly available. 32. ▶ AAV 7 days 62/62 93.9 AAV Best experiment : Agent+Router Signature : 5e0479 Summary : The architecture employs an agent-based approach with a routing mechanism. Models used : gpt-4o-mini Architecture The solution uses the 'gpt-4o-mini' model in an architecture combining an agent with a router. This design enables efficient task delegation and processing, optimizing performance for the challenge requirements. R&D Experiments Total experiments submitted : 6 Other approaches : Agent Agent + sim search + tfidf What didn't work? Using 'private model' instead of 'gpt-4o-mini' Excluding the router component Experiment journal: 7 days → R: 60.7 , G: 62.8 , Score: 93.1 ▲ - llm1-sim-preselected 7 days → R: 62.9 , G: 62.5 , Score: 93.9 ▲ - llm2-sim-preselected 7 days → R: 62.7 , G: 57.3 , Score: 88.7 - llm2-sim-not-preselected 7 days → R: 61.0 , G: 60.8 , Score: 91.3 - llm1-sim-not-preselected 7 days → R: 25.1 , G: 60.9 , Score: 73.5 - llm1-sim-ifidf-not-preselected 7 days → R: 27.2 , G: 62.8 , Score: 76.4 - llm2-sim-tfidf-not-preselected 33. ▶ AI Slop 🤗 3 hours 80/53 93.5 AI Slop Best experiment : AI Slop Cursor+Sonnet 3.7 Signature : fc3dc9 Summary : Utilized a streamlined approach leveraging LLMs for direct question answering. Models used : gpt-4o-mini Architecture The team employed the gpt-4o-mini model to process and answer questions directly from the provided PDF documents. By utilizing metadata and targeted queries, they efficiently narrowed down relevant information, ensuring accurate and concise responses. The approach avoided complex retrieval-augmented generation (RAG) or OCR techniques, focusing on the inherent capabilities of the LLM. 34. 
▶ RAG challenge Orphist 🔒 63 min 78/53 92.4 RAG challenge Orphist Best experiment : Iterative LLM Prompting with BM25 Signature : e98c1b Summary : The solution employs BM25 for document retrieval and iterative LLM prompting for query expansion and summarization. Models used : gemma-2-9b-it Architecture The solution utilized an architecture combining BM25plus for document retrieval and iterative prompting of the gemma-2-9b-it LLM. The process involved chunking PDF documents for ingestion, storing them in an in-memory local storage, and applying BM25plus for query matching with meta-filters. Due to a last-minute issue with embedding models, the team opted for a non-hybrid pipeline. The iterative prompting expanded the initial query and used a scratchpad for summary collection, culminating in a final prompt to extract the requested information. 35. ▶ Dennis S. 🤗 7 days 81/50 91.0 Dennis S. Best experiment : Deepseek naive questionfilter Signature : 53630f Summary : A question-centered approach leveraging document parsing and heuristic-based analysis. Models used : Deepseek V3 Architecture The solution employs a question-centered methodology to efficiently extract relevant information from documents. Initially, PDFs are parsed using PyMuPDF and Tesseract for OCR when necessary. The system analyzes provided metadata and questions to identify relevant companies and metrics, classifying questions into single_fact or aggregate types. It processes documents in parallel, extracting answers based on the question type, and aggregates results accordingly. This approach prioritizes speed and cost-efficiency. R&D Experiments Total experiments submitted : 2 Other approaches : Deepseek v3 - bruteforce questionfilter What didn't work? Using regex-based logic for question classification Dividing questions into first occurrence and aggregated types without clear pipeline integration Experiment journal: 7 days → R: 79.8 , G: 50.0 , Score: 89.9 ▲ - Deepseek v3 - bruteforce questionfilter 7 days → R: 81.9 , G: 50.0 , Score: 91.0 ▲ - Deepseek naive questionfilter 36. ▶ Slava RAG 🤗 7 hours 65/57 90.7 Slava RAG Best experiment : Slava RAG Signature : 282787 Summary : Embedding: OpenAI text-embedding-3-small, LLM: GPT-4o, Vector Database: Pinecone, PDF Processing: PyMuPDF, Chunk Processing: Custom algorithm Models used : gpt-4o Architecture This architecture combined: OpenAI's text-embedding-3-small for embedding generation; GPT-4o as the primary LLM; Pinecone for vector database management; PyMuPDF for efficient PDF processing; a custom algorithm for chunk processing. 37. ▶ Alex_dao 95 min 68/56 90.7 Alex_dao Best experiment : Alex_Dao_v1_final Signature : 93c0ef Summary : Utilized a kv-index architecture. Models used : gpt4o Architecture The winning solution implemented a key-value index (kv-index) architecture, leveraging the capabilities of the GPT-4 model (gpt4o) to efficiently retrieve and process information. This approach ensured high performance and accuracy in the challenge tasks. 38. ▶ Mykyta Skrypchenko 🤗 31 hours 42/64 85.3 Mykyta Skrypchenko Best experiment : Kyiv-bge1.5 Signature : d5fb15 Summary : Integration of advanced text retrieval and vector database with LLM for question answering. Models used : gpt-4o-2024-08-06 Architecture The solution is a multi-component architecture: Fitz for efficient text retrieval BAAI/bge-base-en-v1.5 Sentence Transformer for embedding generation ChromaDB as the vector database for storage and retrieval OpenAI API for question answering 39. 
▶ F-anonymous 🤗 🔒 5 days 73/47 83.8 F-anonymous Best experiment : Fully local, own DeepThinking Signature : 2a2a1b Summary : Fully local graphRAG with hybrid search and custom-tuned LLM. Models used : Qwen2.5 14b Architecture The solution by F-anonymous is a fully local graph-based Retrieval-Augmented Generation (RAG) architecture. They utilized their proprietary DeepThinking framework in conjunction with a custom-tuned Qwen2.5 14b model. The system integrated a hybrid search mechanism combining vector-based and BM25 methodologies to enhance retrieval accuracy and relevance. 40. ▶ DataNXT 🔒 5 days 54/55 82.6 DataNXT Best experiment : Prototype-RAG-Challenge Signature : 0e942a Summary : Pipeline with specialised prompted LLM calls Models used : OpenAi-4o-mini Architecture The solution utilized a pipeline architecture with specialized prompted calls to the OpenAi-4o-mini model. This approach allowed for efficient and accurate information retrieval and generation. 41. ▶ AValiev 🔒 4 hours 43/60 81.8 AValiev Best experiment : IBM-deepseek-agentic-rag Signature : 493744 Summary : Agentic RAG with type validation, Pydantic typing, Qdrant vector store querying. Models used : deepseek/deepseek-r1-distill-llama-70b Architecture This RAG solution was based on an Agentic Retrieval-Augmented Generation (RAG) architecture. It utilized type validation and Pydantic typing for robust data handling, and Qdrant vector store querying for efficient information retrieval. PDF documents were processed using PyPDF and Docling for accurate text extraction. R&D Experiments Total experiments submitted : 5 Other approaches : openai-agentic-rag IBM-mixtral-agentic-rag granite-3-8b-instruct_rag_agentic deepseek/deepseek-r1-distill-llama-70b_sophisticated_chunking_rag_agentic What didn't work? Alternative LLM models such as OpenAI-gpt-4o-mini and mistralai/mixtral-8x7b-instruct-v01 were explored but did not achieve the same performance as the winning model. Experiment journal: 54 min → R: 43.5 , G: 60.0 , Score: 81.8 ▲ - openai-agentic-rag 3 hours → R: 43.5 , G: 33.0 , Score: 54.8 - IBM-mixtral-agentic-rag 4 hours → R: 43.5 , G: 60.0 , Score: 81.8 - IBM-deepseek-agentic-rag 4 hours → R: 43.5 , G: 48.5 , Score: 70.2 - granite-3-8b-instruct_rag_agentic 34 hours → R: 35.8 , G: 53.0 , Score: 70.9 - deepseek/deepseek-r1-distill-llama-70b_sophisticated_chunking_rag_agentic 42. ▶ bimurat_mukhtar 🤗 🔒 32 hours 36/31 49.4 bimurat_mukhtar Best experiment : bm_v1 Signature : c25e30 Summary : Multi-agent architecture with specialized branches for diverse answer generation. Models used : deepseek-r1 gemini Architecture The solution is a multi-agent architecture inspired by Self RAG, where input PDFs are converted to text, preprocessed, and filtered to extract relevant information. Different branches are utilized to handle specific types of queries, leveraging the strengths of the LLMs deepseek-r1 and gemini. 43. ▶ ragtastic 7 days 4/3 5.4 ragtastic Best experiment : ragtastic Signature : 43d4fd Summary : The architecture leverages the Mistral-large model for its implementation. Models used : mistral-large Architecture The solution used the Mistral-large model to achieve its objectives. The architecture is designed to optimize performance and accuracy, ensuring robust results. Round 3 Round 3 is in planning. We are going to make the R&D process more focused and rewarding for all participants. The second round was won by a team that took the time in advance to prepare a proper evaluation and experimentation framework. 
They simply iterated on various architectures and took the few best ones into round 2. The goal of round 3 is to give such capabilities to everybody in advance. We are planning to prepare a proper evaluation and experimentation framework upfront. We also want to ground the challenge deeper in the business, making insights more valuable and applicable to all participants. Let's see how this turns out. ETA for the next round - May-June 2025."}
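A side note on reading the experiment journals above: the published Score appears to be derived from the retrieval score R and the generation score G as Score = R/2 + G. This is an inference from the printed numbers (for example, 77.8/2 + 61.0 = 99.9 and 43.5/2 + 60.0 = 81.8), not a formula taken from the official challenge rules, so treat it as an educated guess. A minimal Go sketch of the check:

package main

import "fmt"

// score reproduces the leaderboard total from the published retrieval (R)
// and generation (G) scores. The R/2 + G formula is inferred from the
// printed numbers above, not taken from the official rules.
func score(r, g float64) float64 {
	return r/2 + g
}

func main() {
	fmt.Println(score(77.8, 61.0)) // 99.9 - matches the top journal entry above
	fmt.Println(score(43.5, 60.0)) // 81.8 - matches the AValiev entries
}

The same arithmetic holds for every journal row above, which is what makes the guess plausible.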
{"url":"/erlang/basics","title":"Erlang Basics","text":"Erlang Basics Erlang is a general-purpose programming language and a runtime environment. It has built-in support for concurrency, distribution and fault-tolerance. These are my notes taken while learning about its basics and reading through Learn you some Erlang for Great Good! by Fred Hebert . It is available online. The document covers only the very introduction to Erlang. OTP and application design are outside of the scope. Table of Contents 1. Overview 2. Erlang basics 2.1. Getting started 2.2. Immutable variables 2.3. Atoms 2.4. Boolean algebra 2.5. Comparison and Equality 2.6. Tuples 2.7. Lists 2.8. List comprehensions 2.9. Binary Data 2.10. Binary Operations 2.11. Binary strings 2.12. Binary comprehensions 3. Modules 3.1. Defining 3.2. Compiling 3.3. Referencing 3.4. Macros 3.5. Metadata 4. Functions 4.1. Pattern matching 4.2. Guards 4.3. Guard patterns (ifs) 4.4. Case … if 4.5. Choosing between guards, if and case 4.6. Bound and unbound variables 5. Types 5.1. Type conversions 5.2. Type-test BIFs 6. Recursion 6.1. Basics 6.2. Tail recursion 6.3. More recursive functions 7. Higher order functions 7.1. Basics 7.2. Anonymous functions 7.3. More function helpers 8. Exceptions 8.1. Error 8.2. Internal Exits 8.3. Throws 8.4. Catching exceptions 8.5. Catch construct 9. Common data structures 9.1. Records 9.2. Key/Value Stores 9.3. Other data structures 10. Concurrency 10.1. Spawning Processes 10.2. Message passing 10.3. Receiving messages 10.4. Better design 10.5. Timeouts 10.6. Selective receive 11. Errors and Processes 11.1. Terminating a process 11.2. Links 11.3. Traps 11.4. Monitors 11.5. Naming processes 1 Overview Whenever somebody starts talking about building robust distributed systems in .NET, Java, Go etc, then that is the response they can get ( from HN ): …basically re-creating Erlang from the ground up, except half broken and not learning from Erlang's 20-30 year history Joe Armstrong's thesis is said to capture these lessons, but it is 295 page doctorate thesis. A shorter version is in one of his talks . The other sources include Erlang Factory talks, books Learn You Some Erlang, Joe's, etc. They say, that these lessons can be summarized as: CPUs aren't getting faster these days, so we are stuck building distributed and highly concurrent systems. These systems are better to be fault-tolerant. Especially we don't want a single corner-case scenario (triggered by an unlucky user) to bring down system for millions of the other connected clients. This tolerance can be achieved by composing systems from components, isolated to prevent failure from spreading. Isolation can be achieved via: A runtime that prevents memory sharing (OS Processes or Erlang VM, except that Erlang allows to have millions of procs). Proving that memory can't be shared (Rust's compiler can do this, making sure that you will not have data races. This is really exciting). Running in a container, VM or different PC. Keep in mind: a service running on a single machine isn't fault-tolerant. Isolated components need to communicate (they can't share memory), this can be achieved by sending messages. Such approach is called \"actor model\". 
Besides Erlang, there are many libraries and frameworks supporting it: Orleans, Akka etc. Erlang is special, since it has: very light-weight processes which can be restarted very quickly (300 words of memory each, taking microseconds to create); the ability for processes to monitor other processes; communication only via messages which are copied, passed asynchronously and can easily go to another node. In addition to processes and actors, Erlang also comes with: A functional programming approach: no side effects, immutable data, explicit state updates. This makes code more explicit. Open Telecom Platform - a collection of middleware, libraries and tools (including an application server). It can be used to build, monitor and distribute these components. Monitoring and debugging capabilities. You can connect to an Erlang VM node to inspect, trace, debug or even update code. All that without stopping a system. Decades worth of experience of building such systems. For example, Erlang is used heavily by Motorola, Ericsson, T-Mobile, Amazon, Yahoo, Facebook etc. Elixir is an Erlang-esque language with improved code organization capabilities ( source ): Compile-time macros (which work on the AST) Pipeline operator Polymorphism via protocols Mix tool (similar to npm for node.js) It still runs on the Erlang VM How hard is it to start developing in Erlang? 2 Erlang basics 2.1 Getting started Install On OSX brew install erlang . Launch the REPL with erl . Abort with Ctrl+G then q or q(). 2.2 Immutable variables All variables are immutable and named in PascalCase . They are assigned for the first time using the pattern matching operator = . > One = 1. > Two = One + 1. 2.3 Atoms Atoms are constant literals, with a backing 4 or 8 byte integer. They are written in lowercase or with_underscore . They are used in pattern matching. 2.4 Boolean algebra Erlang has the usual boolean operators and , or , xor , not . These operators always evaluate both sides of the expression, while andalso and orelse are shortcut operators. 2.5 Comparison and Equality Equality operators are freaky: =:= and =/= . They test for precise equality. Comparison operators are slightly lighter: == and /= , they are more relaxed. For example: > 1 =:= 1.0. false > 1 == 1.0. true The other comparison operators are just slightly weird: < , > , =< and >= . 2.6 Tuples Tuples group multiple items together: {point,10,5} . You can also destructure a tuple via pattern matching to an unbound variable: > Point = {point,10,5}. {point,10,5} > {point, X, Y} = Point. {point,10,5} > X. 10 Tuples can be nested easily: {point,{10,5}} . 2.7 Lists Lists in Erlang are constructed like [E1,E2,...,EN] and can have different elements inside: [1,2,{point,{3,4}},4.5] . There is a catch. Strings in Erlang are also lists of numbers. If all numbers in a list could represent a letter, then Erlang will print it as a string. To glue lists together use ++ and to split -- : > [1,2,3] ++ [4,5]. [1,2,3,4,5] > [1,2,3] -- [1,2]. [3] These operations are right-associative, meaning that they are done from right to left: > [1,2,3]--[1,2]--[3]. [3] You can use the [Head|Tail] pattern to compose lists (adding a head is fast in Erlang) and also to destructure them, where | is called a cons. In fact, any list can be constructed with [E1|[E2|[E3]]] . Note that we can construct an improper list [2|3] , which will make list functions throw errors. 2.8 List comprehensions List comprehensions allow generating and modifying lists in a concise way: [Expr || Gen1, Gen2, ..., Cond1, Cond2, ...]. 
For example: > [{X, Y} || X <- [1,2], Y <- [1,2], X == Y]. [{1,1},{2,2}] We can also use pattern matching and existing lists: > Weather = [{ufa,sun},{moscow,rain},{odessa,fog},{spb,rain}]. [{ufa,sun},{moscow,rain},{odessa,fog},{spb,rain}] > Rainy = [City || {City,rain} <- Weather]. [moscow,spb] 2.9 Binary Data To express binary data we can use: hexadecimal notation: 16#AACCEE . bit syntax: <<16#AACCEE:24>> (put these bytes into 24 bits of space). For example: % declare a bunch of bytes > Bytes = <<13,234,12,34,41,1,151>>. % Take the first byte and treat the rest as binary > <<First:8, Rest/binary>> = Bytes. In general, a binary segment can be described as: Value Value:Size Value/TypeSpecifier Value:Size/TypeSpecifier Where Size is in bits, if TypeSpecifier is not provided. The latter can be a hyphen-separated list of: Type: integer (default), float, binary, bitstring, utf8, utf16, utf32; bytes is a synonym for binary and bits for bitstring. Sign: signed and unsigned (default), matters only for integer . Endianness: big (default), little and native (from the current CPU). This applies to integer, utf16, utf32 or float. Unit: this is written unit:integer , where the value must be between 1 and 256. It is used for field alignment. Given that, it is trivial to parse a TCP segment: <<SourcePort:16, DestinationPort:16, SequenceNumber:32, AckNumber:32, DataOffset:4, _Reserved:4, Flags:8, WindowSize:16, CheckSum:16, UrgentPointer:16, Payload/binary>> = SomeBinary. 2.10 Binary Operations Erlang has a common set of operators: bsl and bsr - bit shift left and right. band , bor , bxor , bnot 2.11 Binary strings Binary strings store text as flat binaries instead of lists: <<\"Some string\">> . It is hard to pattern-match them, so they are mainly used for text storage. 2.12 Binary comprehensions You can deal with binaries using a special form of comprehension syntax. % Let us define some bytes > Pixels = <<123,34,21,45,102,32,65,61,62>>. <<123,34,21,45,102,32,65,61,62>> % Read them as a sequence of RGBs > RGBs = [{R, G, B} || <<R:8, G:8, B:8>> <= Pixels]. [{123,34,21},{45,102,32},{65,61,62}] % convert these RGBs back to a binary > Bits = << <<R:8, G:8, B:8>> || {R, G, B} <- RGBs >>. <<123,34,21,45,102,32,65,61,62>> 3 Modules 3.1 Defining Erlang modules are represented as files with a bunch of attributes in the -name(Attribute) form. The only required attribute is the name: -module(useless) , it should match the file name without the extension. -module(useless). -export([add/2, multiply/2]). add(A, B) -> A + B. multiply(A, B) -> A * B. 3.2 Compiling In order to compile a module: in the command line: erlc flags module.erl ; in a shell or a module: compile:file(module.erl) ; in the shell: c(useless). ; in an Emacs buffer: C-c C-k (it will also load the module in the REPL ). 3.3 Referencing We call functions from other modules as module:function(args) . In order to import a module's functions into the namespace: -import(Module, [Function1/Arity, ..., FunctionN/Arity]). 3.4 Macros Macros in Erlang are similar to compiler directives. They are defined as -define(MACRO,SomeValue). (where SomeValue is true if skipped), and can be referenced in the code as ?MACRO . We also have some predefined macros: ?MODULE - current module as an atom ?FILE - current file name as a string ?LINE - current line as an integer Macros can be tested with ifdef , else and endif . For example, you could provide a debugging macro: -ifdef(DEBUGMODULE). -define(DEBUG(S), io:format(\"dbg: \" ++ S)). -else. 
-define(DEBUG(S), ok). -endif. ?DEBUG(\"entering some function\") If the DEBUGMODULE macro is defined, then we will have debugging output; otherwise the macro expands to the ok atom. 3.5 Metadata The compiler captures all module metadata and makes it available via the module_info/0 function, for example useless:module_info() . 4 Functions 4.1 Pattern matching Erlang provides pattern matching in the form of multiple function clauses separated by ; . function(X) -> Expression; function(Y) -> Expression; function(_) -> Expression. We can match on lists in the patterns: head([Head|_]) -> Head; head([]) -> []. We can do even fancier stuff, by destructuring in the function head: add({Date = {YYYY, MM, DD}, Increment = {days, Days}}) -> {YYYY, MM, DD + Days}. 4.2 Guards Guards are additional checks that can go into a function head to make it more specific. The syntax is like this: % constrain by Expr1 (must be true) function(args) when Expr1 -> Expression. % Expr1 and Expr2 must be true function(args) when Expr1, Expr2 -> Expression. % Expr1 or Expr2 must be true function(args) when Expr1; Expr2 -> Expression. You can also use andalso and orelse inside guards. They are similar to , and ; but can be nested. The difference is in exception handling: a , or ; guard can still proceed even if one of its expressions fails with an exception, while andalso and orelse cannot. Guards will not accept user-defined functions! This is done to guarantee that we don't get any side effects there. 4.3 Guard patterns (ifs) Guard patterns are defined with if inside functions, and they share the guard syntax. They must have a catch-all clause, though: some(A) -> if A =:= 1 -> one; true -> something_else end. Generally, if expressions were added to get the benefits of the case statement (below) without the need to write its full syntax. 4.4 Case … if Case allows normal pattern matching inside a function. case Arg of Pattern [Guards] -> Expression; Pattern [Guards] -> Expression; _ -> Expression end. In essence, this expression could be rewritten as fun(Pattern) [Guards] -> Expression; fun(Pattern) [Guards] -> Expression; fun(_) -> Expression. 4.5 Choosing between guards, if and case Just use whichever makes the code simpler. Performance differences are negligible. 4.6 Bound and unbound variables Bound and unbound variables behave differently when we do pattern matching: unbound - matching attaches a value to them; bound - matching is an error, unless the new value is the same as the old one. 5 Types Erlang has a dynamic type system . This is aligned with a general belief that a failure in one subsystem shouldn't bring the others down. Erlang worries more about keeping systems running than about trying to have error-free code. It is easier to do hot code reloading in a system with dynamic typing. Erlang is also strongly typed, performing type checks at runtime. For example, 1 + lama would throw an error. 5.1 Type conversions Type conversions in Erlang are implemented with BIFs in the erlang module: > erlang:list_to_integer(\"54\"). 54 > erlang:integer_to_list(42). \"42\" > erlang:atom_to_list(atom). \"atom\" There are many more conversions of this type. 5.2 Type-test BIFs Type-test BIFs are special BIFs that can be used in guard clauses: is_binary , is_atom etc. They help to write declarative code: func(Arg) when is_atom(Arg) -> atom; func(Arg) when is_binary(Arg) -> binary; ... 6 Recursion Erlang has functions that can call themselves - recursive functions. That is the looping construct (aside from list comprehensions) that exists in the language. 
6.1 Basics It is usually advised to start writing a recursive function from a base case (a well-known scenario where recursion terminates). Here is an example of a function that calculates the length of a list: len([]) -> 0; len([_]) -> 1; len([_|Tail]) -> 1 + len(Tail). 6.2 Tail recursion Tail recursion aims to eliminate the stacking of recursive operations by reducing them as they happen. Erlang can optimize tail calls. Let's rewrite our function to transform it into a tail recursion: tail_len(L) -> tail_len(0, L). tail_len(Acc, []) -> Acc; tail_len(Acc, [_]) -> Acc + 1; tail_len(Acc, [_|Tail]) -> tail_len(Acc + 1, Tail). If a function is calling itself in a tail position (the last expression to be evaluated is the function call itself), then the Erlang VM can avoid storing the current stack frame. This is called tail call optimization (a specific case of last call optimization ). Such optimizations make tail recursion useful and avoid wasting a lot of memory. Even if tail call optimization doesn't work, the large per-process stack of the Erlang VM can help to handle some scenarios. 6.3 More recursive functions Just to get into the habit. A repeat/2 function: repeat(0, _) -> []; repeat(N, X) when N > 0 -> [X | repeat(N-1, X)]. and its tail-recursive version: tail_repeat(N, X) -> tail_repeat([], N, X). tail_repeat(L, 0, _) -> L; tail_repeat(L, N, X) when N > 0 -> tail_repeat([X | L], N-1, X). A tail-recursive reverse function (note that there is a BIF lists:reverse/1 ): reverse(L) -> tail_reverse([], L). tail_reverse(Acc, []) -> Acc; tail_reverse(Acc, [H|T]) -> tail_reverse([H | Acc], T). A tail-recursive sublist function: head(L, N) -> reverse(head([], L, N)). head(Acc, _, 0) -> Acc; head(Acc, [], _) -> Acc; head(Acc, [H|T], N) -> head([H | Acc], T, N-1). 7 Higher order functions A higher-order function is a function which can accept a reference to another function as an argument. 7.1 Basics Let's define a map/2 function: map(_, []) -> []; map(F, [H|T]) -> [F(H) | map(F, T)]. If we put it into a useless module and also add inc(X) -> X+1 , then: > useless:map(fun useless:inc/1, [1,2,3]). [2,3,4] 7.2 Anonymous functions Always declaring functions can be boring, hence the inline version. > Fn = fun(A) -> A end. #Fun<erl_eval.6.90072148> > Fn(1). 1 In general an anonymous function can be declared as: fun(Arg1) -> Expr1, Expr2, ... ExprN; (Args2) -> Expr1, Expr2, ... ExprN; (Args3) -> Expr1, Expr2, ... ExprN end In Erlang anonymous functions inherit the scope that they have been declared in. We can have closures (capturing some variables that were a part of the scope). In the shell: > Var = 1. 1 > Closure = fun() -> Var end. #Fun.... > Closure(). 1 You can also assign a named function to a variable: Fx = fun useless:is_odd/2 7.3 More function helpers Just a bunch of common helpers: lists:reverse/1 ; lists:map/2 ( Select in LINQ); lists:filter/2 ( Where in LINQ); lists:foldl/3 ( Aggregate , starting left); lists:foldr/3 ; all/2 and any/2 ; dropwhile/2 and takewhile/2 ; partition/2 - generates two lists: where the predicate matches and where it does not; flatten/1 ( SelectMany in LINQ). 8 Exceptions There are 3 types of exceptions in Erlang: errors , exits and throws . 8.1 Error Errors can be created by erlang:error(Reason) . They will abort execution of the function and return a stack trace with all the arguments. 8.2 Internal Exits These are called with exit/1 . 
They don't have a stack trace and are generally used to pass \"last breath\" information between the processes. 8.3 Throws Throws (created by throw/1 ) are used to control the execution flow. They can also be used for non-local returns in deep recursion (e.g. a deep function would throw an exception for a top-level function to catch and return a default value to the user). 8.4 Catching exceptions Exceptions can be handled with a try...of...catch block: try Expression of SuccessPattern -> Expression; SuccessPattern -> Expression catch TypeOfError:ExceptionPattern -> Expression; TypeOfError:ExceptionPattern -> Expression end. where TypeOfError can be: error , throw or exit , defaulting to throw if skipped. There also is a special catch-all pattern: _:_ , which will handle any exception type. The expression between try...of can be a function or just a whole bunch of expressions: Expr1, Expr2... ExprN . This section is called protected . It is also possible to have a finally block, which can't return any value but is used for its side effects (e.g. closing a file). It is called after in Erlang. The protected section can't be tail-recursive, since the VM keeps a reference to it, in case an exception shows up. Code between of...catch isn't protected and can be tail-recursive, unless an after block is specified. 8.5 Catch construct This is a weird one. The keyword catch can be used alone to capture either an exception or a good result out of an expression: catch Expression . 9 Common data structures 9.1 Records Records are a hack that was added to the language later. They provide syntactic sugar on top of ordinary tuples. 9.1.1 Defining Records Records are defined as module attributes: -module(record). -compile(export_all). -record(creature, { name, color, life=1, attack=0, abilities=[] }). This is a creature with some fields and default values. first_creature() -> #creature{ name=\"Zombie\", color=black, attack=1, abilities=[drain_life] }. If we try running the code, a raw tuple is returned: > record:first_creature(). {creature,\"Zombie\",black,1,1,[drain_life]} Ok, these are the raw tuples. Let's load record definitions into the shell with rr(Module) : > rr(record). [creature] > record:first_creature(). #creature{name = \"Zombie\",color = black,life = 1,attack = 1, abilities = [drain_life]} There are a few Erlang shell helpers to deal with records: rd(Name, Definition) - define records in the shell, as if this were a module attribute; rf() , rf(Name) and rf([Names]) - flush all records, a specific one or a list; rl() , rl(Name) and rl([Names]) - print shell records in a copy-pasteable way. 9.1.2 Reading values You can read values from a record either via pattern matching (as if they were a tuple) or via a helper syntax (which is like an object cast): > Zombie#creature.name. \"Zombie\" Things could get pretty messy if we have nested records (each one has to be cast separately). On the bright side, we get a nice and short pattern matching (without the need to write the full tuple declaration): damage(#creature{life=Life, name=Name}, Damage) when Damage >= Life -> \"Die \" ++ Name; damage(#creature{color=black, name=Name}, Damage) -> \"Suffer you, black \" ++ Name. Then, in the console: > record:damage(Zombie,0). \"Suffer you, black Zombie\" > record:damage(Zombie,1). 
\"Die, Zombie\" 9.1.3 Updating Records Records would be useless, if we couldn't update them: real_damage ( Meat = # creature {life= Life }, Damage ) when Damage < Life -> Remaining = Life - Damage , Meat # creature {life= Remaining }. This syntax is just a short-cut for calling erlang:setelement/3 behind the scenes. Then, deal some damage in console manually and via the function: > ZombieKing = Zombie#creature{life=5}. #creature{name = \"Zombie\",color = black,life = 5,attack = 1, abilities = [drain_life]} > record:real_damage(ZombieKing, 2). #creature{name = \"Zombie\",color = black,life = 3,attack = 1, abilities = [drain_life]} 9.1.4 Sharing records Theoretically, it is possible to share records by declaring them inside a shared header file, which would then be included in the other erlang files: % just some header file -record ( damage , {amount = 1, color}). This block could be included into the original module, and then used normally: -include ( \"header.hrl\" ). black_damage ( Amount ) -> # damage {amount = Amount , color = black}. Now, reload the records and test the command: > rr(record). [creature,damage] > record:black_damage(1). #damage{amount = 1,color = black} However, sharing records between modules like this isn't a good idea. It exposes fragile data structure implementations (guts) to the outside world and is as bad as sharing database between multiple services. A better approach would be to expose getter and setter functions, which would allow manipulating records while hiding the implementation details. 9.2 Key/Value Stores For storing small amounts of data we have a property list ( proplist ) and an ordered dictionary ( orddict ). 9.2.1 Proplist Prolist is a loose bag of [{Key,Value}...] useful for storing small amounts of data. You create it by constructing manually, update in a light-weight manner by appending a new item NewList = [NewItem|OldList] . Then you can use methods in prolists module to lookup, replace, compact etc. Proplists are defined very loosely and are usually used for managing configuration settings. 9.2.2 Orddict orddict module provides a more formal key-value store that works well for up to 75 elements: unique keys that are also ordered; faster lookups; items must respect strict {Key,Value} structure. It is best to avoid manipulating the orddict instance directly and rely on the methods: new/0 and from_list/1 to create; store/3 to save; find/2 , fetch/2 to get; erase/2 to delete. 9.2.3 Dict vs gb-tree dict module exposes the same interface as orddict . It works well for larger datasets and also has fold/2 method. gb_tree is a rebalancing tree with slow insertions (updates and reads are rather fast). It provides more direct access to the data structure (e.g. allows to perform updates without checks) but doesn't have fold/2 (requires using iterators on a subtree instead). 9.3 Other data structures 9.3.1 Sets Erlang has for main modules to deal with sets: ordsets - slowest, but the most simple, implemented as ordered list; sets - same interface as ordsets but scales better and has faster reads, supports =:= operator; gb_sets - same interface, but provides more fine-grained control; sofs - sets of sets, for mathematical reasoning. 9.3.2 Graphs Erlang allows manipulating with graphs via: digraph - construction and manipulation of a directed graph; digraph-utils - navigation, testing of these graphs. Also sofs module allows converting families to the graphs and back. 9.3.3 Queues Queues in erlang are FIFO, provided by queue module. 
Internally they are implemented as two lists: add to the first list; pop from the second list; when the second list is empty - reverse the first one and swap them. The APIs could be split in 3 groups: basic: new/0 , in/2 and out/1 ; advanced: peek/1 , drop/1 , reverse/1 etc; Okasaki API: you probably don't need it. 10 Concurrency Concurrency is when two tasks can run at the same time, but don't necessarily do (e.g. multi-tasking on a single core). Parallelism is when these tasks do run at the same time (e.g. on a multi-core machine). Erlang had concurrency from the very beginning; parallelism was possible by running Erlang on multiple machines. Modern multi-core systems allow parallelism on a single machine. However, your parallel program goes only as fast as its slowest sequential part . Erlang wouldn't be a good fit for all problems. 10.1 Spawning Processes A process in Erlang is just a function that can be scheduled to run via spawn/1 . This function returns a pid in the form of <0.160.0> which can be used to communicate with the process. The process terminates when the function returns. The BIF self/0 returns the pid of the current process. 10.2 Message passing You send messages with the ! (bang) operator, which works in the form of Pid ! hello and returns the message (so that it can be passed to multiple processes). 10.3 Receiving messages You can dump all messages via flush/0 which would simply print them. For real work use the receive expression, which looks similar to case pattern matching: receive Pattern1 when Guard -> Expr1; Pattern2 when Guard -> Expr2; Pattern3 -> Expr3; _ -> Expr4 end. Let's write us a small module: -module(spell). -compile(export_all). cast() -> receive {heal, Amount} -> io:format(\"Healing ~p~n\", [Amount]); {damage, Amount} -> io:format(\"Ouch for ~p~n\", [Amount]); _ -> io:format(\"WTF?~n\") end. This function could be launched either via spawn(fun spell:cast/0) or via a helper spawn(spell, cast, []) , which takes a module, a function and arguments. Once launched, cast would sit around waiting for the receive to get a message. It will process the message and terminate. > S1 = spawn(spell, cast, []). <0.181.0> > S1 ! {boom}. WTF? {boom} The only way to know if the recipient is alive and got the message is by sending a reply. We can do that by packaging the return address into a tuple. cast2() -> receive {From, heal, Amount} -> io:format(\"Healing ~p~n\", [Amount]), From ! \"healing!\"; {From, damage, Amount} -> io:format(\"Ouch for ~p~n\", [Amount]), From ! \"damage\"; _ -> io:format(\"WTF?~n\") end. Then run in a shell: > spawn(spell,cast2,[]) ! {self(), heal, 2}. Healing 2 {<0.155.0>,heal,2} > flush(). Shell got \"healing!\" ok Now we just need to make sure that the process can handle more than one message. We can do that by making it tail-recursive: summon(Life) -> receive {From, damage, Amount} when Amount >= Life -> % no recursion here io:format(\"Dead! ~p~n\", [Amount]), From ! {self(), \"dead\"}; {From, damage, Amount} -> io:format(\"Ouch! ~p~n\", [Amount]), From ! {self(), \"this hurts!\"}, summon(Life - Amount); {From, heal, Amount} -> io:format(\"Healing ~p~n\", [Amount]), From ! {self(), \"healing!\"}, summon(Life + Amount); _ -> io:format(\"WTF?~n\"), summon(Life) end. The process will continue running till we damage it too much. Note that we also pass the state through that recursion. 10.4 Better design The message structure is a bit like an internal implementation detail. 
Do we really need to expose it that much? Let's encapsulate the details. summon_wolf() -> spawn(?MODULE, summon, [3]). fireball(Pid) -> Pid ! {self(), damage, 2}, receive {_, Message} -> Message end. ointment(Pid) -> Pid ! {self(), heal, 1}, receive {_, Message} -> Message end. This hides all the dirty details and allows us to focus on dealing damage: > Wolf = spell:summon_wolf(). <0.289.0> > spell:fireball(Wolf). Ouch! 2 \"this hurts!\" > spell:fireball(Wolf). Dead! 2 \"dead\" 10.5 Timeouts There is a problem. If we send spell:fireball/1 to a non-existent process (or a dead one) from our shell, then it will freeze. We are stuck in a receive deadlock. To work around the issue, receive has an after Timeout construct: fireball2(Pid) -> Pid ! {self(), damage, 2}, receive {_, Message} -> Message after 1000 -> timeout end. ointment2(Pid) -> Pid ! {self(), heal, 1}, receive {_, Message} -> Message after 1000 -> timeout end. after can also accept the infinity atom (in case the timeout is passed as an argument and we want to wait forever). There are two special cases - sleeping ( receive without any patterns) and trying to get messages without waiting ( receive with a zero timeout). 10.6 Selective receive We can perform a selective receive by ignoring some messages (which puts them into a save queue for later processing). This is done via: important() -> receive {Priority, Message} when Priority > 10 -> [Message | important()] after 0 -> normal() end. normal() -> receive {_, Message} -> [Message | normal()] after 0 -> [] end. The approach has a pitfall: selective receive puts non-matching messages into a special save queue, which is then traversed on each message. We can work around that by: providing a catch-all Unexpected variable which will log and discard; storing messages in a min-heap , the gb_trees module or whichever structure is applicable. 11 Errors and Processes 11.1 Terminating a process When a process terminates, it always terminates with an exit reason , which can be any term. If the exit reason is the atom normal , then the process terminated normally. A process can terminate: when a run-time error occurs, with {Reason,Stack} ; by itself, by calling exit(Reason) , error(Reason, [Args]) or fault(Reason, [Args]) ; when an exit signal with a reason other than normal is received. Terminating processes emit exit signals to all linked processes. These kill a process, unless the reason is normal . A process can call exit(Pid, Reason) to emit an exit signal with Reason to the target process. The sender is unaffected. 11.2 Links You can link/1 the current process to another one by PID, making them emit exit signals to each other on termination. Linking is idempotent. Note that in link(spawn(Function)) the spawned process could finish before the link is established. It is better to use the atomic spawn_link/1 . 11.3 Traps A process can be configured to trap incoming exit signals and convert them into messages of {'EXIT', FromPid, Reason} that are put into the mailbox. One exception - if the reason is kill (e.g. from exit(Pid, kill) ), then this will bypass the trap and kill the process. Upon termination, an exit signal with killed is sent to all linked processes (to avoid killing them unconditionally). 11.4 Monitors Monitors are a unidirectional way to check whether a process is alive and to learn when it goes down. A process can set up a monitor via monitor(process, Pid) which returns a Ref reference to the monitor. That reference could be used to demonitor/1 the process. 
When a monitored process dies, we get a message: {'DOWN', Ref, process, Pid, Reason} If the process didn't exist in the first place, we would get the message immediately with Reason set to noproc . 11.5 Naming processes To give a process a name, use erlang:register(Name,Pid) . If the process dies, it will lose its name. Or you can use unregister/1 to do it manually. Then, use whereis(Name) to get the Pid (or undefined )."}
{"url":"/essential-software-2021/","title":"My Essential Software","text":"My Essential Software List of my essential software and services at the moment. On a laptop (MacBook Pro, pre-M generation): JetBrains IDEs: PyCharm and GoLand Firefox (because of Containers for work accounts) and Safari (energy efficient) MailMate - outstanding mail agent that keeps me sane MS Teams - team collaboration (quite buggy) MS Outlook - to accept office invites (mail is handled by MailMate) Vorta and Borg Backup - incremental backups of important files Typora - Markdown editor in which I'm typing this blog post. Apple Notes - gets the job done and surprisingly lean Monosnap - making screenshots and quickly annotating them Anki Desktop - to manage my collections of flash cards for learning iTerm2 - terminal Mobile (Pixel 3a, Android): Firefox (because runs uBlock) AnkiDroid - flash cards client for Anki. This where I learn German vocabulary CoolReader - for reading books in epub and fb2 Telegram and WhatsApp - primary chats CLI (shared across systems via dotfiles ): vim - lightweight editing Emacs - just for the org-mode (timesheets) zsh - just a shell that I'm used to tmux - to keep things running and have multiple tabs in the terminal On a server : Jupyter Notebook - to play with ideas, data and charts PyJournal - web app to extend my brain when working with multiple projects Wireguard - private VPN (mobile and laptop are in that network) Services : Hetzner - dedicated server for apps, VPN gateway and encrypted off-site backups Integromat - automation service used for tracking interesting (for me) tweets and posting them to my Telegram channel Spotify and Netflix DigiCal - because beautiful calendars AWS - S3, Route 53 and CloudFront to host this blog iPad : Adobe Draw - for sketches (discontinued) GoodReader - to read and annotate PDFs"}
{"url":"/event-sourcing/intro/","title":"♻️ Event Sourcing","text":"♻️ Event Sourcing Event Sourcing is a software pattern : we record application state as a sequence of events throughout the time. Events could be then be projected to create application state at any moment in time. We could even derive new facts from the events."}
{"url":"/events/2025-kandddinsky","title":"KanDDDinsky 2025","text":"KanDDDinsky 2025 KanDDDinsky 2025 Here are the links and references for the attendees of my talk at KanDDDinsky 2025: When DDD Met AI: Practical Stories from Enterprise Trenches Slides : PDF 18MB First of all, there is a lot of structured material on Schema-Guided Reasoning: Schema-Guided Reasoning (SGR) SGR Patterns SGR Examples SGR Demo SGR Adaptive Planning Demo of building reasoning SGR agent in 160 lines of Python code is particularly interesting, because it tends to \"clicks\" for some engineers. A team of enthusiasts took that SGR Demo and built an Open Source chat with deep research capabilities and tool use. It got so popular, that even banks started studying and taking it apart. Here is the Github repository . If you want to dive deeper into the core AI research and all insights from the past few years of community research, a lot of that is available in the public materials: Enterprise RAG Challenge (make sure to scroll to the second leaderboard and expand rows of individual teams - there are mini-reports for each one) LLM Benchmark Reports that also include product and architectural insights: latest report , full archive My personal low-frequency newsletter If you want to learn more about patterns and practices of building LLM-driven products in different industries - check out Courses . If you are an enterprise or large business that needs expertise get a project done - go to TIMETOACT Austria (I'm head of ML and Innovation there). Especially, if you are interested in software modernisation projects and enterprise AI projects."}
{"url":"/foundationdb-is-back/","title":"FoundationDB is Back!","text":"FoundationDB is Back! I'm very excited today - Apple has just open-sourced FoundationDB . What is FoundationDB? FoundationDB is a distributed scalable transactional (ACID) key-value database with crazy thorough testing . It was so good that three years ago the entire company was bought by the Apple and shut down. Despite that unfortunate fact, FDB was still used as an unsupported product in companies like: Snowflake - Data warehouse for the cloud. They say that FoundationDB \"is a key part of our architecture and has allowed ... to build some truly amazing and differentiating features\". Wavefront - Cloud monitoring and analytics by VMWare \"has been using FoundationDB extensively with over 50 clusters spanning petabytes of data in production\". Update: SkuVault - online warehouse management system, used FoundationDB as a distributed event store (commit log) holding billions of events and a coordination layer for nodes in the cluster. Obviously, Apple used FoundationDB internally as well, pushing a lot of improvements into it over the years. One of the improvements was the system keyspace sharding which, as Will Wilson mentioned , was the major blocker for the petabyte-scale databases . Another interesting improvement since 3.0 is the introduction of version stamp operations - atomic operations that put the commit version and batch number of a transaction into a key or value. That could help, for example, when you need to push a lot of writes in order to a cluster (e.g while building a consistent and distributed commit log). I used FoundationDB in my projects, too. It is just that good and reliable, even when the cluster runs in the cloud environment with flaky networks and rebooting machines. In fact, it is perfectly designed for that environment because of the rigorous testing. Data Model FDB is a key-value database with a \"weird\" design. You can think of it as a giant sorted dictionary, where both keys and values are byte arrays. You can do your normal operations with that dictionary (e.g. set, get, delete, get range) while joining multiple operations into a single ACID transaction. It is similar to the LMDB - an embedded B-Tree database which I also love and use. It works best, if you use FoundationDB data modeling primitives with it. This interface is very low-level, however it allows you to build up your own data layers on top, while hosting them on a single cluster: object storage; lists; tables; graphs; indexes; blob storage; high-contention queues; distributed commit-logs; pub/sub. In essence, FoundationDB is a database constructor . We've implemented all sorts of layers at SkuVault. Once you get used to the data modeling approach, using traditional specialized data storage systems feels very limiting. At some point in the past I scraped Web Archive copy of FoundationDB to preserve their data modeling recipes . Testing FoundationDB team developed the database inside a deterministic simulation . They abstracted away IO operations like network and disk, which allowed injecting all kinds of faults while running clusters under the load inside an accelerated time. Examples of faults that could be introduced in such environment: network outage; buggy router (dropping network packets, duplicating or delaying them); disk outages (corrupting data, running out of the disk space or slowing down); machine reboots and freezes; human errors (e.g. swapping IP addresses or disks). 
FoundationDB also had a real hardware cluster which had all kinds of faults thrown at it: power outages, network disruptions and disk filling. Deterministic simulation in this context means that you can simulate an entire cluster using a single thread . When a bug manifests itself, you can replay that simulation as many times as you want. It will always be the same. Just keep the initial random seed. Running the simulated system under a custom scheduler also enables you to accelerate the time , just like with any discrete-event simulation . If you know that for the next 10ms nothing interesting is going to happen, you could instantly fast-forward the world to that point in time. Time acceleration means that you can simulate years of cluster life in hours of real time. The FoundationDB runtime is extremely rigorous and probably has an acceleration factor of x10. Simpler approaches can get an acceleration of x100 to x1000. You can learn more about the approach from this brilliant talk by Will Wilson: Testing Distributed Systems w/ Deterministic Simulation . By the way, Kyle Kingsbury (\"Aphyr\") didn't bother running Jepsen against FDB because FoundationDB's internal testing was much more rigorous than Jepsen. Simulation Prototype I've been trying to reverse-engineer and replicate this approach for some time. The research started with a simple CPU job scheduler but quickly escalated to a distributed cluster simulation. You can check out the last prototype which introduces: Configurable system topology - machines, services and network connections. Simplified simulation of TCP/IP . This includes connection handshake, SEQ/ACK numbers and reorder buffers. There is no proper shutdown sequence and no packet re-transmissions. Durable node storage in the form of per-machine folders used by the LMDB database. Simulation plans that specify how we want to run the simulated topology. This includes a graceful chaos monkey. Simulating power outages by erasing the future for the affected systems. Network profiles - the ability to configure latency, packet loss ratio and logging per network connection. The prototype builds upon C# async/await to mimic the benefits of the FoundationDB Flow language and allow writing code that looks like normal parallel async but could be run sequentially in a simulation, rescheduled or erased. Unlike this naive simulation research, FoundationDB is the real thing . It has all the edge cases worked out, internal transaction pipelines tuned and the overall experience battle-tested with huge clusters. Summary I'm very excited that Apple has decided to open source FoundationDB. It is an outstanding piece of engineering that helps to build scalable and self-healing distributed systems. FoundationDB is free to use at any scale. Check it out on github . You can continue reading this story in a blog post about how SkuVault used FoundationDB all these years."}
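To make the discrete-event idea concrete, here is a minimal sketch in Go. It is my own toy illustration under stated assumptions, not FoundationDB's actual engine: a heap of future events stands in for the scheduler, virtual time jumps straight to the next event (that is the fast-forward), and a fixed random seed makes every replay of a fault scenario identical.

package main

import (
	"container/heap"
	"fmt"
	"math/rand"
)

// event is something scheduled to happen at a point in virtual time.
type event struct {
	at   int64 // virtual time in ms
	name string
}

// eventQueue is a min-heap ordered by virtual time.
type eventQueue []event

func (q eventQueue) Len() int            { return len(q) }
func (q eventQueue) Less(i, j int) bool  { return q[i].at < q[j].at }
func (q eventQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *eventQueue) Push(x interface{}) { *q = append(*q, x.(event)) }
func (q *eventQueue) Pop() interface{} {
	old := *q
	e := old[len(old)-1]
	*q = old[:len(old)-1]
	return e
}

func main() {
	// A fixed seed makes every replay identical - the whole point
	// of deterministic simulation.
	rng := rand.New(rand.NewSource(42))
	q := &eventQueue{}
	heap.Init(q)
	heap.Push(q, event{at: 5, name: "start node"})
	heap.Push(q, event{at: 10, name: "send packet"})

	var now int64
	for q.Len() > 0 {
		e := heap.Pop(q).(event)
		now = e.at // fast-forward virtual time: nothing happens in between
		fmt.Printf("t=%dms: %s\n", now, e.name)
		// Deterministically inject a fault after some regular events.
		if e.name != "drop packet (fault)" && rng.Float64() < 0.5 {
			heap.Push(q, event{at: now + 1 + rng.Int63n(100), name: "drop packet (fault)"})
		}
	}
}

Because the loop advances the clock by jumping between events, a year of simulated cluster life costs only as much wall time as the events themselves take to process.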
{"url":"/golang/","title":"Go: The Good Parts","text":"Go: The Good Parts Go is an opinionated programming language initially developed by Google in 2007. It is a statically typed language with a simple syntax , resembling C or JavaScript. It features garbage collection, type safety and large standard library. Go can be statically compiled into a single executable binary, which could target a large number of operating systems (from Linux and Windows to Plan 9) and processors (i386, amd64 and ARM). In HappyPancake project we found that Golang was a good fit for developing event-driven backend services. Other contenders included Scala, C#, Haskell and Erlang. Here is how code can look like in Golang: func ( m * ContactsModule ) handleList ( r * ApiRequest ) ApiResponse { return NewObjectResponse ( & ContactListModel {}) } type ContactsModule struct {} func ( m * ContactsModule ) Register ( r Runtime ) { r . HandleApi ( \"GET\" , \"/contacts\" , m . handleList ) } Good parts Concise and simple syntax, easy to get started with. Good facilities for writing concurrent programs that share state by communicating (goroutines and channels). Good ecosystem for developing backend servers, all major drivers and libraries are generally available. Decent integration story with native code. Lightweight development stack with major IDE-like features provided by command-line tools (and available in different editors). Unified formatting style for the language, provided by gofmt . Compilation is fast even with large projects. Go supports all major OSes and CPU architectures. Lightweight type system with interfaces \"by example\" and type embedding. As a new language, Go attracts a lot of talent and interested developers. There are some interesting projects taking place in this ecosystem. Opinionated parts Golang is an opinionated language which started as an experiment. Some of its design decisions can take time to get used to. This especially applies to developers with a strong background in Object-Oriented Programming. Yet, these opinionated decisions ultimately add to the strengths of the language. Focused Packages Golang ecosystem encourages separation of code into compact cohesive modules which do one thing and do it well: packages . It is common to have packages with only one or two files. This way of structuring applications follows the Unix way of building software. It also aligns well with micro-services architecture. This approach might feel foreign for people coming from ecosystems which favor larger modules (e.g. projects in .NET). Yet it leads to code that is more simple and reusable. Type System Type system in golang is an opinionated one. It includes following features: Data types in form of structs (plain DTOs). Embedding types in each other. Functions and methods (functions with a receiver for dot syntax: response.Render() ). Implicit interfaces \"by example\" (types implement interfaces by providing methods with matching signatures). Lack of generics. Go is unique in the way methods, interfaces, and embedded types work together. Not so good parts You must set and tweak GOMAXPROCS variable for each program which needs parallel execution. You can build web applications and native UIs with Golang, but ecosystem there is rather immature, if compared to the other platforms. Ecosystem can be lacking in unexpected places. For example we had to deal with shortcoming of default SQL layer and patch PostgreSQL drivers. Debugging golang code is neither easy nor reliable. 
Golang has a prominent explicit error handling style; if followed blindly, it can turn code into a spaghetti of error checks. Golang stores all dependencies as folders in GOPATH , e.g. abdullin/chk . Having a single global workspace makes things tricky when you need to work with multiple versions of a library. This is the state of Golang at the moment of writing. Things will improve over time. More Examples This is an example of a simple web server in Go. It maintains a list of accepted jobs, which is served for GET requests. You can POST new jobs to the server. It will process them asynchronously, adding them to the list. We coded this example with Tom Janssens while working on the Happy Pancake project. Neither of us had any prior knowledge of Go. package main import ( \"fmt\" \"net/http\" ) var queue chan string var joblist []string func init() { queue = make(chan string, 10) joblist = make([]string, 0) } func handler(w http.ResponseWriter, r *http.Request) { switch r.Method { case \"POST\": val := r.PostFormValue(\"job\") fmt.Fprintln(w, \"VALUE: \", val) queue <- val case \"GET\": fmt.Fprintln(w, joblist) default: fmt.Fprintf(w, \"Not supported\") } } func projection() { for req := range queue { joblist = append(joblist, req) } } func main() { go projection() http.HandleFunc(\"/\", handler) http.ListenAndServe(\":8080\", nil) } Summary I think Golang is a great language for building fast back-end servers or utilities that work across operating systems. The ecosystem and tooling around Golang are young. This attracts some talent but might create additional problems for certain projects. References An introduction to programming in Go Effective Go A Tour of Go Docker and Go: why did we decide to write Docker in Go? (Slides) SO Post : Erlang vs Go Video on concurrency patterns / principles"}
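The "share state by communicating" point above deserves a tiny illustration. Note that the jobs example reads joblist in the handler while the projection goroutine appends to it without synchronization; giving the state to a single owning goroutine, as below, avoids such races. This is my own minimal sketch with a hypothetical counter, not code from the HappyPancake project:

package main

import "fmt"

// counter owns the state. Nobody touches total directly;
// all access goes through the inc and read channels.
func counter(inc <-chan int, read chan<- int) {
	total := 0
	for {
		select {
		case n := <-inc:
			total += n
		case read <- total:
		}
	}
}

func main() {
	inc := make(chan int)
	read := make(chan int)
	go counter(inc, read)

	for i := 0; i < 5; i++ {
		inc <- 1
	}
	fmt.Println(<-read) // 5 - no locks, no data races
}

One goroutine owns the mutable state; everyone else talks to it over channels, so no mutex is needed.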
{"url":"/happypancake/","title":"👍 HappyPancake","text":"👍 HappyPancake At the end of 2013 Tomas Roos invited me to join the team of HappyPancake, largest free dating site in Sweden. It was initially written in ASP.NET with MS SQL Database server, soon to become a rather complex solution that was expensive to scale. Together with Pieter Joost , our small distributed team of 3 people started redesigning this site towards a simpler design that would be easier to evolve in the future. This is the story of that project."}
{"url":"/happypancake/2013-12-17/","title":"New Team","text":"New Team My starting days in HappyPancake were quite intense and interesting, despite the fact that I could spend only 20 hours per week on the project. I learned a lot of things that would be be completely out of reach for a .NET developer within Microsoft OS. Here are some bullet-points: Google Hangouts work nicely for team collaboration and screen sharing. SyncSpace drawing app is probably the best collaborative white-board for a distributed team. Mindmups are great for collaborative and personal brain-storming. Erlang is a great language for building low-latency and CPU-efficient apps. It has some learning overhead for a .NET guy like me. Golang is a great language with good ecosystem and really good performance. If compared to erlang, golang has a lower learning overhead for a .NET guy. Within these days we invested time in establishing high-level design guidelines for the system to be implemented. High-level goals were: Iterative development with emergent design (we don't know all the answers) Micro-services with support for experimentation and A/B testing (to get these answers and discard wrong assumptions) Base design evolution on reality and measuring it to validate assumptions Ubiquitous language for communication between services : HTTP with JSON/HTML Any language within the service (as long as it runs on Linux) Designing for a distributed team that wants to spend years learning things, experimenting and playing with cool tech. Making this work requires a lot of responsibility and ownership, which have to be factored into the design, as well. We currently believe that Micro-Services approach with a bit of Programmer Anarchy might work well for our case, as a foundation for building things up. For the upcoming week I plan to continue catching up with Golang (currently reading through The Way To Go) and then start drafting a prototype of low-latency Message Bus with Event Store capabilities and FoundationDb backend (codename BroStore )."}
{"url":"/happypancake/2013-12-23/","title":"Language is an Implementation Detail","text":"Language is an Implementation Detail Although everything about working at HPC is interesting, last week was quite peculiar on its own. There was an interesting discussion about use of async pub-sub messaging for communications between micro-services. That's what Fred George does, for example, with event messages. However, command messages have their own value as well, due to the behaviour that we associate with them (only one message handler could deal with command messages, unlike event messages, where there could be 0 or more). Yet, after a bit of discussions with Tomas, we discovered that introduction of command messaging breaks our nice decoupling with regards to ease of upgrades, experimenting or continuous delivery. Besides, if needed, you can always implement command messing within the boundaries of the service. This is possible, since we place a clear separation between high-level design decisions (the ones which talk about how mServices should behave and communicate) and implementation details (which govern how mServices should be actually implemented). For example, here are some high-level design decisions: Protocol for communications - JSON/HTML over HTTP in our case Messaging semantics - async pub/sub in our case Approaches for deployments, versioning them and experimenting with them - rapid iterations, A/B testing, using business metrics as the driver Set of recommended languages and technologies - still evaluating (see below) Design and development priorities - creating fun environment to work in, keeping things real, small and simple Execution and hosting constraints - Linux, clustered in our own DC with geo-replication Additional constraints - low latency and failure-tolerant Curiously enough, we are still iterating through the suitable languages for implementing new version of HPC (while also addressing the design and domain questions). So for this week I'm going to spend more time learning Haskell (in addition to doing dives into Erlang and Golang during the previous weeks). At the same point, our rewrite will probably start in .NET with micro-services design. Reason for that being - .NET, despite it's shortcomings and costs is the language where we all would be most productive and could release initial versions fast. This is crucial for gaining real-world feedback for evolving the system. By the way, if you listen to Fred George, he mentions that at one point 150000 lines of Java code where rewritten in 4000 lines of Closure code (or so). Based on my exposure to Haskell so far, I'd say that C# is almost as verbose as Java in this sense. Then, as the need arises, micro-services will be rewritten in one of more Linux-friendly functional languages to: Save on licensing costs. Improve performance and reduce latency. Make code more simple and concise. Stand on the shoulders of giants, reusing ecosystem and communities of languages we choose. In other words, languages are treated just like implementation details of the system. Even though there are some recommendations and guidelines, developers should be able to choose the tool they want in order to get the job done in the most efficient way."}
{"url":"/happypancake/2014-01-18/","title":"Moving Forward with Golang","text":"Moving Forward with Golang After a couple of iterations we settled for the go language as the primary language for rewrite of happy pancake from C#. Ideally we'll converge on Haskell later (that's something I would really like, due to the powerful type system and high suitability for capturing domain models). However, for the time being the primary language will be go. Reasons for that being: Simplicity of the language and similarity to C# Excellent ecosystem for the development of backend servers Availability of go drivers for FoundationDB Linux development stack (Ubuntu + Sublime/Vim) without large license fees Language is expressive enough for our needs Excellent resources and help tools Why FoundationDB is so important to us would be another blog post later (long-story short: it is like fast Redis with proper clustering support and only one data structure sorted key-value ranges). There are a few downsides of golang that we are going to live with: Concept of workspaces is somewhat messed up (imagine, that you have to work with two versions of a library). However, this is not nearly as bad as dll and nuget hell in .NET world Absence of generics or type inference that would work as such Getting started with golang was rather simple. We went with Tomas through: An introduction to programming in Go Effective go A tour of go All of these resources are an easy read (mostly attributed to the simplicity of the language itself). While doing that I setup an Ubuntu (LTS) with Sublime Text 2 and GoSublime package. Given all that, it was relatively easy to start porting layer code for FoundationDB from python to golang. I'm still running my dev environment as VM on my MacBookAir, although Ubuntu can live fine with 1GB of RAM, unlike Windows VM that had to ask for 2GB. Plus, since Parallels does not work well with Linux VMs, VMWare Fusion is used. While working on layer code, I had also to encounter Python along with its REPL. Syntax was a bit odd in the beginning, but quite simple in the long run. No tutorials even needed. For the next week I plan to finish porting queue and pub/sub layers for FoundationDB from python to golang. We'll see how it goes from there."}
{"url":"/happypancake/2014-02-02/","title":"Starting with FoundationDB","text":"Starting with FoundationDB During the last week at HPC my focus has been on FoundationDB . FDB is a nice NoSQL database which has a bunch of great properties: It stores key-value pairs of bytes, where keys are always sorted. You have usual GET/SET/DELETE operations along with range operations that come from sorted key nature Multiple key operations can happen in a transaction Many advanced operations can be implemented as Layers on top of that storage abstraction. There is even a SQL layer for that. FDB scales nicely as you add new nodes to the cluster Cluster of up to 6 nodes can be used for free in production We (or Tomas, to be more precise :]) managed to get 75k write transactions out of a small cluster we setup at the digital Ocean Setting up a cluster is a no-brainer even for a Linux noob like me FDB handles load distribution automatically, it moves data as necessary, too FDB has client libraries for python, golang, erlang, Node.js and even .NET Their team is extremely helpful and humble You can configure level of replication (e.g.: single, double, triple) before write is ACKed FDB can be configured to store views in memory or on disk, transaction store is always durable I personally really like that FDB is extremely opinionated about what it does (just a replicated transactional key-value storage), but it does this extremely well so far. We are planning to use FDB as our event store and for persisting view models (which will be replicated across the cluster). I'm actually the one having fun with that implementation. Event Storage itself is a simple abstraction, however making implementation work properly with FDB key-value storage is something that requires better insight into inner workings of FDB. Plus, I get to do that in a go language. My next week will focus on getting our full planned stack to play together (in an extremely jacky way), so that we could start developing components for the HPC2. PS: my current development environment looks like this (Ubuntu LTS + \"awesome\" tiling manager + sublime + GoSublime):"}
{"url":"/happypancake/2014-02-08/","title":"Evolving the Stack and learning Nanomsg","text":"Evolving the Stack and learning Nanomsg Last week with HappyPancake was my first full-time week with the team. Time flew fast and left me wishing for more . I explored nanomsg (glorified sockets) and how it is used in golang. nanomsg is going to be our communication layer between components within the app, hence understanding its usage patterns was important. It was extremely exciting to pair with Pieter on go programming exercises (and also picking up some Sublime/Linux tricks along the way). While developing my first golang+nanomsg+FDB prototype, I was quite proud of the first code that was written. It was relatively robust, simple and somewhat performant. During a few next days I realised that it was actually an overcomplicated and under performing piece of software. Pieter shared his process about structuring and expressing ideas in golang. Tomas explained how to make that code brutally simple and yet more performant. That was awesome! Previously it would take me years or months before I could arrive to that breath-taking understanding of how stupid I were. With this team everything happens so much faster . Love it. During the week we set ourselves a goal of building a system which could return uncached reads from the front (reverse proxy) within 25ms under the load of 50000 HTTP requests per second , while degrading gracefully under increased load. This, assuming that we run 3 relatively small app servers, 2 reverse proxies and FDB cluster of 5 nodes. Obviously, throwing more hardware to the system, should scale it out. It is nice to be working on a system, where latency is one of the design constraints and having fun is another one . When I had to leave on Friday evening, Tomas and Pieter were discussing process of getting rid of state in services by pushing it all the way to reverse-proxy (using Lua on nginx to take care of HTTP connections, while preserving true stateless asynchrony over nanomsg on the inside). This approach has a synergy with building resilient system of small systems (aka micro-services architecture) communicating over asynchronous events and continuously evolving (versioning, A/B testing and continuous delivery are among the goals of the goals). Apparently, during the course of the evening, they refined the approach to make it even more simple and robust. I can't wait to see what this idea has turned into. By the way, FoundationDB just hit 2.0 version . They published a golang client library within that release, making use of FDB – a breeze. By the way, as Pieter and Tomas reported, upgrading our test cluster to v2.0 took 4 minutes. ETCD was also bumped to 0.3.0 ."}
{"url":"/happypancake/2014-02-17/","title":"Designing for Throughput and Low Latency","text":"Designing for Throughput and Low Latency For the last week I spent most of the time pairing with Pieter, learning more about our the performance and behaviour of our anticipated stack (for the second version of HappyPancake.com). It was thoroughly interesting exercise in systems engineering. Here is what our anticipated design looks like right now: With this design we want to have 25ms latency of HTTP reads (non-cached, 99% percentile) given the throughput of 50000 requests per second. A/B testing, feature toggling, continuous delivery and live upgrades (with ghost mode and ramp-up) included. Here is a short summary of lessons learned within the last week: Tomas is an absolute beast when it comes to crunching out small open source libraries It is quite easy to publish statistics from an app and then gather them in a nice WebUI for crunching (using client library fsd to publish to local statsD app via UDP. StatsD currently pushes stats to Librato Metrics with delay of 10 seconds). HTTP servers in Go are quite good, but can be a pain to extend or augment Nanomsg is really nice and performant, however the documentation is lacking. Profiling capabilities of Golang are absolutely stunning. Spending a week developing and debugging golang apps, while benchmarking them on a Digital Ocean cluster - teaches you a thing or two about Linux environment. It is awesome. Software engineering is about making theories about how your code will behave in production, then running experiments to validate these theories. You iterate and learn. Pairing up with somebody is an amazing opportunity to transfer knowledge and produce better quality code (I lost track of the number of times I was stunned and humbled by the experience and insight of other team members - so much to learn). We currently use TeamViewer (best image and keyboard sharing experience) and Skype for the voice. Campfire is for group chats (and chat ops). For the upcoming week I'll be working on pushing our stack closer to the desired performance numbers (we don't meat the goal, yet). It is an interesting exercise which forces you to learn a lot and go deep (to the point of tuning the OS)."}
{"url":"/happypancake/2014-02-24/","title":"Containers, virtualization and clusters","text":"Containers, virtualization and clusters Last week was a bit hectic. We are waiting for a bunch of datacenters to provide us with the test access to virtualised and hardware servers. These have to be benchmarked in order to see how they perform. Some time during the week we realised two things: We aren't going to get decent numbers out of the machines on DigitalOcean Apparently DigitalOcean is using some cheap virtualisation environment which massively underperforms compared to VMWare This realisation lead us to the point where we started evaluating dedicated hardware option instead of the VMs. We are going to run Docker containers in them anyway, so there is not going to be any vendor or hardware lock-in. Here are a few notes on that: Dedicated hardware is fast Good virtualisation software adds little overhead on top of HW; bad virtualisation - kills any performance Docker containers add very little overhead (should be way below 10% in our scenario) but help a lot with software compartmentalisation Within the last week I was following the footsteps of Pieter, learning from him and writing my first docker containers. There are a few gotchas, but the entire concept is amazing! It is probably the best IT thing that happened to me since beer. It solves \"works on my machine\" syndrome in most of the cases, making it extremely easy to work with software both locally and in remote environments. The experience is a lot better than Lokad.CQRS abstractions for file and Azure backends that I came up with earlier. Eventually, while setting up the containers over and over again, we came to the conclusion that we want to automate the entire thing. Running a script to deploy new versions requires context switching which Tomas and Pieter don't like (I never been as productive in Linux as these guys, but I start feeling this too). Hence, we are thinking about using either Drone or fleet to deploy and manage containers across the cluster. We will probably be using ubuntu 12.04 LTS for the containers (long-term support and stable code). Trying something like CoreOS for the host OS seems compelling because it is native to etcd (awesome) and fleet. We'll see how it goes. This week is going to be about strengthening our continuous delivery story and getting more numbers from the stack in different configurations. A few more other highlights from the previous week: We switched to Slack from campfire (it is used for persisted chats between the team). Native client is awesome, works so much better than campfire and Skype group chats wrk is an awesome tool for doing load testing while measuring throughput and latencies"}
{"url":"/happypancake/2014-03-19/","title":"Benchmarking and tuning the stack","text":"Benchmarking and tuning the stack I focused on testing our current stack, understanding how it behaves under the load and trying to improve it. We are currently running everything in a cloud environment with VMWare virtualization, setting everything up from scratch at the beginning of the day and tearing everything down at the end of the day. This helps to focus on automation from the very start. Our testing setup is quite simple at the moment: Benchmark Box (2 cores with 4GB RAM) - we run weighttp and wrk load tests from this one. Proxy and Application Boxes (8 cores with 4 GB RAM) - proxy box hosts terminator and web aggregator services, while app box hosts specialized services. FoundationDB Box (2 cores with 5GB RAM) - a single FoundationDB node Each of the boxes is by default configured with: Ubuntu 12 LTS and upgraded to the latest kernel (docker and FDB need that); Docker is installed (with override to let us manage lifecycle of images); ETCD container is pulled from our repository and installed, using new discovery token for the cluster; Logsd and statsd containers (our logging/statistics daemons) are downloaded and installed on proxy and app . Appropriate services are downloaded and installed on proxy and app boxes as containers (our build script creates containers of our code, pushing them to a private docker repository) FoundationDB is installed on fdb box. All services and containers are wired into the Ubuntu upstart (equivalent of windows services management). Whenever a service starts, it interacts with ETCD cluster to publish its own endpoints or get the needed endpoints from it. So for the last week I was polishing these install scripts (refactoring BASH is a fun exercise, actually) and also performing some tuning and optimization of that code. Currently we are using plain bash scripts to set up our environment. However bash scripts are just like imperative languages: they tell exactly what you want to do in steps. I’d see that trying out more functional tools might be beneficial for us in the longer term (ansible, puppet, chef or something like that). We have following baseline scenario right now: We run weighttp load testing tool on bench with keep-alive, 256 concurrent clients, 2 threads (1 per core) and enough requests to keep everything busy for 10 minutes; Each http request goes to terminator service on proxy box. Terminator, running basic http server of golang, handles each http request in a new goroutine. It simply serializes request to a message and pushes it to nanobus (our own thin wrapper library around nanomsg for golang). This will create an http context, which consists of a single go channel. Then goroutine will sleep and wait for the response to arrive on that channel. Timeout is another alternative. Nanobus will add a correlationId to the message and publish it to TCP endpoint via BUS protocol of nanomsg. Semantically this message is event , telling that an http request has arrived. Any subscribed service can get this message and choose to handle it. In our case there currently is a web aggregator service running in a different container and showing interest in these messages. Nanobus in web will grab the message and dispatch it to associated method (while stripping correlationID). This method will normally deserialize the request and do something with it. Currently we simply call another downstream service through a nanobus using the same approach. 
That downstream service is located on another box (for a change) and actually calls a FoundationDB node to retrieve a stored value. When the web service is done with the processing, it publishes a response message back to the BUS socket of terminator. Nanobus makes sure that the proper correlationID is associated with that message. Nanobus in the terminator service grabs all incoming messages on the BUS socket and matches them against currently outstanding requests via the correlationID. If a match is found, we dispatch the response body into the associated go channel. The http handler method in terminator is brought back to life by the incoming message on the go channel. It writes the contents back to the http connection and completes the request. In case of a timeout we simply write back Invalid Server Operation. When I started benchmarking and optimizing this stack we had the following numbers (as reported by our statsD daemon): 12.5k http requests per second handled; 99th percentile of latency: ~18ms (99% of requests take less than 18ms, as measured from the terminator); CPU load on the proxy box: 9 (1-minute average as reported by htop). Here are some improvements (resulting from a few successful experiments out of dozens of failed ones): Replacing BSON serialization/deserialization in nanobus with simple byte manipulation: +1k requests per second, -1ms in latency (99th), CPU load reduced by 1; Switching to the new libcontainer execution driver in Docker: +0.5k requests per second, -0.5ms in latency (99th), CPU load reduced by 0.5; Removing an extra byte buffer allocation in nanobus (halving the number of memory allocations per nanobus message sent): +1k requests per second, -1ms in latency (99th), CPU load reduced by 1; Tweaking our statistics capturing library to avoid string concatenation in cases where the sample is discarded afterwards: +1.5k requests per second, -1ms latency (99th). Hence, the final results are: 18k http requests per second; ~12.5ms latency (99th percentile). Our next steps will be to add more realistic load to this stack (like dealing with profiles, news feeds and messages), while watching the numbers go down and trying to bring them back up."}
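The correlation matching described above boils down to a map of outstanding requests keyed by correlationID, each holding a go channel that the parked HTTP handler waits on. Here is a minimal sketch of that pattern (hypothetical names; nanobus itself handles framing and sockets on top of this):

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// pending tracks outstanding requests by correlation ID. Each HTTP handler
// goroutine parks on its own channel until the matching response arrives.
type pending struct {
	mu sync.Mutex
	m  map[string]chan []byte
}

func newPending() *pending {
	return &pending{m: make(map[string]chan []byte)}
}

// await registers a correlation ID and blocks until a response is
// dispatched to it, or the timeout fires.
func (p *pending) await(id string, timeout time.Duration) ([]byte, error) {
	ch := make(chan []byte, 1)
	p.mu.Lock()
	p.m[id] = ch
	p.mu.Unlock()
	defer func() {
		p.mu.Lock()
		delete(p.m, id)
		p.mu.Unlock()
	}()
	select {
	case body := <-ch:
		return body, nil
	case <-time.After(timeout):
		return nil, errors.New("Invalid Server Operation") // timeout path
	}
}

// dispatch matches an incoming bus message against an outstanding request.
func (p *pending) dispatch(id string, body []byte) {
	p.mu.Lock()
	ch, ok := p.m[id]
	p.mu.Unlock()
	if ok {
		ch <- body
	}
}

func main() {
	p := newPending()
	go func() { // stands in for the BUS socket reader
		time.Sleep(10 * time.Millisecond)
		p.dispatch("req-1", []byte("hello from downstream"))
	}()
	body, err := p.await("req-1", time.Second)
	fmt.Println(string(body), err)
}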
{"url":"/happypancake/2014-04-07/","title":"Change of Plans","text":"Change of Plans Monday came with the change of plans in our team. Tomas and Pieter realized that although our planned architecture looks really awesome (with all that messaging and dynamic component switching) it is too futuristic for our current goals . We want to migrate out of .NET+SQL, for a start. We also want to learn more about our domain before investing a lot of time to optimize it to perfection. We archived our existing prototype code and switched gears to: Single git repository for the entire project Single process application with in-memory components Event-driven domain Our short-term goal is to capture HappyPancake domain in golang in the simplest possible way. Then we'll improve design from there . I spent part of the week working on our EventStore (which is just a layer on top of FoundationDB). After benchmarking it we encountered a very foundational problem: it is hard to append sequentially to a global event stream (transaction log) which is replicated on a cluster . You either get a horrible throughput or you need to sacrifice consistency which affects reading speed. Another alternative is to find something else to sacrifice in a way that has the least possible effect on event-sourced application. This was an interesting tech challenge, neatly isolated and spinning in our minds. We agreed to take some time to think about it before moving forward with the implementation. Today is the day we share our ideas. I also spent some time drafting a simple prototype of basic HappyPancake functionality decomposed into a bunch of event-driven components. It was an extremely rewarding experience to see concepts from C# being expressed in go language. This weekend I went to Chelyabinsk to deliver a talk on software design (masked under the title of \"micro-services in .NET\") at dotnetconf . Tomas was mostly dealing with the UI and UX, while sharing in some papers on algorithms and maintaining the first version of HappyPancake (something we have been spared from). Pieter was reevaluating golang web frameworks while also trying to make them work with PJAX for fast updates in the web UI."}
{"url":"/happypancake/2014-04-14/","title":"Back to Basics","text":"Back to Basics By the beginning of the last week I ported infrastructure for event-driven design (with aggregates, projections and in-memory views) from C# to golang. However, later on it was agreed that going through the event-driven modeling is yet not the fastest and simplest route to the working code. So this code was completely discarded (we could get back to it later) and we started evaluating something even more simple - CRUD approach with CouchDB and MySQL. FoundationDB, does not have any projection or querying capabilities at the moment. This means additional effort required to design and maintain those and might be a premature optimization at this point. While thinking about storage constraints in our design, I've been focusing on messaging functionality for the HappyPancake. Currently we have 150000 messages going through per day with text size up to 2000 characters (served by a large MS SQL database). 20000 users are usually chatting at the same time. Ideally, next version would make this experience more enjoyable and interactive. More messages sent == better for the business. I focused on prototyping a simple chat, where messages and presence notifications are served to the client with long polling http requests. CouchDB and mySQL were evaluated as storage engines at this point. Pieter, at the same time, focused on the storage problem from the perspective of profiles, storing and updating them, serving through http as documents and search feeds. We discovered that our favorite http library in go \"Revel\" can barely serve 4k requests per second due to all the magic it provides (including templates). Bare http server of golang can serve up to uncached 17k requests (to resources with templates) per second on the same machine. So there are some trade-offs to be made. I personally think we could stock to basic http library just fine, since Tomas is pushing extra effort to make our UX and UI extremely simple. CouchDB is a really old document database that has nice master-master replication, support for map-reduce and query engine. It is even used to support some experiments on Large Hadron Collider. To make things even more nice, CouchDB exposes change streams per database, to which you could subscribe from the client code. API is served over HTTP, while the core is written in Erlang. Unfortunately CouchDB didn't fit well to a simple CRUD scenario. Reason being - CouchDB is IO bound, all caching is delegated to the operating system. mySQL was, surprisingly enough, another contender for our storage engine. It previously felt to me that this is a legacy database from the early days of Internet. However, after starting to read \"High Performance mySQL\", I quickly realized that this exactly is its strongest point. This database was optimized and polished by the biggest internet companies in the world. It is rock-solid for both SQL and noSQL. Performance is predictable and tooling is rich. Yet, mySQL can't do miracles if your IO operations are limited by the virtual environments. We can have no more than ~400 operations per second on Glesys machines . 
So all through the weekend I've been searching for articles on clustered messaging architectures at a large scale, trying to figure out the simplest approach that would fit a few constraints: Provide a fast and responsive messaging UX, an implementation of which is capable of serving ~20000 new long-polling requests per second; Have clustering capabilities (multiple app servers handling the load); Work with a relatively slow storage engine, using no more than 10-20 requests per second. Fortunately for us, we can live with the following: occasional write failures are tolerable; cached data is fine in a lot of cases; the systems are not mission-critical. If you think about it (and sleep on it a few nights, too), these design \"relaxations\" allow us to deal with our domain with quite some ease: we can store messages and presence notifications simply in memory (replicated for some degree of reliability), going to disk only for batched writes and cache misses (usually loading conversations that happened quite a while ago). The amount of memory dedicated to the message cache can be tuned to find the sweet spot here. So, at this point, we don't really care about the choice of storage engine for the purposes of messaging, presence and notifications: CouchDB, MySQL or FoundationDB - each one would work just fine. However, I would personally prefer MySQL at this point, since it makes it easier to capture the domain. Some reading Long polling does not totally suck (article) Facebook architecture for 600M users (slides) High Performance MySQL, 3rd edition (book) Google groupcache (project) Getting your first 200MM page views (erlang, video) Goodbye, CouchDB (article) CouchDB, the definitive guide (online book) Besides that, I started reading \"Programming Distributed Computing Systems\" by Carlos A. Varela, which is a very foundational and intense book. Highly recommended."}
{"url":"/happypancake/2014-04-21/","title":"Messaging - Heart of a Social Site","text":"Messaging - Heart of a Social Site At the beginning of the last week I had a simple responsive prototype of a chat server. It was a simple in-memory implementation in go, delivering messages and \"user is typing\" updates instantly over long polling http requests. Obviously, a single server chat application wouldn't be enough for HappyPancake.com. We want to serve live updates to 20000 online visitors (numbers from the current version of the web site), while also adding in some headroom for the scalability and fault tolerance. So the last week was dedicated to search, trials and failures on the way to multi-node clustered chat server . I started by reading a lot about existing chat designs and approaches outside of golang. Erlang and Akka were recurring theme here. So I tried to move forward by implementing something like akka's actor model (Singleton actor patter) in golang while using Leader Election module of ETCD. What is ECTD? ETCD is a highly-available key-value storage for shared configuration and service discovery. It is written in GO and uses RAFT algorithm (simpler version of PAXOS) for maintaining consensus across the cluster. That was a dead-end: Re-implementing akka in golang is a huge effort (too many moving parts that are easy to get wrong) Leader Election module in ETCD is theoretically nice. Yet, in practice it is considered as experimental by CoreOS team. Besides, go-etcd client library does not support it, yet. At some point we even pondered if switching to akka was a viable strategy. However, NSQ messaging platform (along with the other projects from bitly) served as an inspiration of getting things done under the golang. A few more days of consuming information on the design and evolution of social networks and I had an extremely simple working design of a multi-node chat. There were 2 small \"break-throughs\" on the way: You can use basic functionality of ETCD keys (with TTLs and CompareAndSwap) to implement entity ownership across the cluster We don't really need a concept of actors to implement a scalable cluster. Dead-simple semantic of Http redirects would do the job just fine. All chat conversations are associated with one out of N (where N is an arbitrary number) virtual chat groups through consistent hash of the conversation ID. A chat group can either be owned by a node (as indicated by the renewed lease on ETCD key) and be available in its memory. All other nodes will know that because of the ETCD registry and will redirect requests to that node. Alternatively, a chat group can be owned by nobody (in case of cold cache or if the owning node is down). Then a random node (smarter algorithm could be plugged later) would take ownership of that chat group. Why bother with concept of chat groups? Querying ownership of 100000 of chat conversations can be pretty expensive, besides we would need to send heartbeats for each of this conversations. It is easier to allocate N chat groups, where N is a fixed number. This can be changed later, though. Result of all that : a dead-simple chat prototype that runs on 1-K nodes, where nodes are extremely simple and can discover each other dynamically, sharing the load. If a node dies - another one would take ownership of the chat conversation. All chats are reactive. \"user is typing\" notifications and actual messages are immediately pushed to the UX over http long-polling connections. 
More types of events will be pushed over these long-polling channels later. Part of that simplicity comes from the fact that Golang simplifies working with concurrency and message passing. For example, the snippet below flushes all incoming messages to disk in batches of 100. If there were no messages for a second, it also flushes whatever has been captured so far.

var buffer []*Record

for {
	select {
	case r := <-spill:
		buffer = append(buffer, r)
		if len(buffer) >= 100 {
			persistMessages(buffer)
			buffer = make([]*Record, 0)
		}
	case <-time.After(time.Second):
		// No new messages for a second: flush the partial batch.
		if len(buffer) > 0 {
			persistMessages(buffer)
			buffer = make([]*Record, 0)
		}
	}
}

For this week I plan to move forward: Finish implementing proper node failover (currently nodes don't load chat history from FoundationDB) Make nodes inter-connected with each other (we actually need to publish notifications for a user in real time, if he gets a message, flirt or a visit from another user). NSQ (a real-time distributed messaging platform in Golang by bitly) seems like a really nice fit here. During the week I also did some benchmarking of ID generation algorithms in Golang for our event store layer on FoundationDB (not a big difference, actually). Here is the speed of appends (1 event per transaction, 200 bytes per event, ES running on 1 node on Glesys with 5GB RAM, 2 cores and VMWare virtualization; client - a 4-core VM with 8GB RAM):

10 goroutines: 1k per second, 10ms latency (99th)
50 goroutines: 3.5k per second, 12ms latency (99th)
100 goroutines: 5k per second, 20ms latency (99th)
250 goroutines: 7k per second, 35ms latency (99th)

Meanwhile, Pieter was working on profiles, news feeds and registration flows. He was stressing different database engines by uploading data from the existing user base of happypancake.com. There is a lot to learn about the behavior of different technologies in our scenarios. In the end we seem to be converging back on FoundationDB as the primary storage. Tomas was mostly busy with admin, UI design and maintenance of the first version (protecting Pieter and me from the boring stuff). Here is some reading for the week: NSQ Messaging platform (project) Lessons learned optimizing NSQ (slides) About NSQ (slides) About NSQ (video) Service discovery in the cloud (article) Distributed algorithms in NoSQL databases (awesome article)"}
{"url":"/happypancake/2014-04-28/","title":"Event-driven week","text":"Event-driven week The last week started as planned. First, I implemented a persistence for a simple chat service, then moved forward with multi-master design for application nodes. In this design each application node can handle any request the same way. This approach: simplifies the design; does not prevent us from partitioning work between nodes later (e.g.: based on a consistent hashing, keep user-X on nodes 4,5 and 6); forces to think about communication between the nodes. The most interesting part was about the UX flows. For example, in a newsfeed page we want to : Figure out the current version of the newsfeed for the user (say v256) Load X newsfeed records from the past, up to v256. Subscribe to the real-time updates feed for all new items starting from v256 There is an additional caveat. While loading history of activities, we merely display them on the screen (with the capability of going back). However, activities that come in real-time need more complicated dispatch: Incoming messages need to pop-up as notifications and update unread message count in the UI. Incoming flirts and profile visits have to go directly into the newsfeed. Modeling these behaviors lead to some deeper insights in the domain. By the beginning of the week I wasn’t even able to articulate them properly :] Caveats of Event Sourcing Tracking version numbers in a reliable way was also a big challenge initially. The problem originated in the fact that our events are generated on multiple nodes. We don’t have a single source of truth in our application, since achieving that would require either consensus in a cluster or using a single master to establish a strict order of events (like Greg’s EventStore does, for example). Both approaches are quite expensive for high throughput, since you can’t beat the laws of physics (unless you cheat with atomic clocks, like Google Spanner) Initially, I implemented a simple equivalent of vector clocks for tracking version numbers of a state (to handle situation of reliably comparing state versions in cases, where different nodes will get events in different order). However, after a discussion with Tomas we agreed to switch to simple timestamps, which sacrifice precision for simplicity. We are ok with loosing 1 message out of 10000 in newsfeed, as long as it always shows up in the chat window in the end. NSQ For communication tech I picked NSQ messaging platform, since it already has a lot of tooling that boosts productivity. NSQ is used only as glorified BUS sockets with buffering and nice UI. Hence, if Tomas later on manages to push towards nanomsg, we could do that with quite an ease. A nice benefit of using something like nanomsg with ETCD or NSQ is that this system does not have a single point of failure. All communications are peer-to-peer. This increases reliability of the overall system and eliminates some bottlenecks. Micro-services Understanding of micro-services keeps on evolving in an predictable direction. We outgrew approaches like “event-sourcing in every component” and “CRUD CQRS everywhere” to a more fine-grained and balanced point of view. A component can do whatever it wants with the storage, as long as it publishes events out and keeps its privates hidden. Even in a real-time domain (where everything screams “reactive” and “event-driven”), there are certain benefits in implementing certain components in a simple CRUD fashion. 
This is especially true in cases where you can use a scalable multi-master database as your storage backend. Pieter was working on exactly the CRUD/CQRS part of our design, modeling basic interactions (registration, login, profile editing and newsfeed) on top of FoundationDB. This also involved getting used to the existing HPC data, different approaches in FoundationDB and Go web frameworks. Tomas was mostly busy with the admin work, supporting our R&D and gaining more insights into the existing version of HPC (with the purpose of simplifying or removing features that are not helpful or aren't used at all). Plans This week is going to be a bit shorter for me - we have May 1st and 2nd as holidays in Russia. Still, I will try to finish modeling the event-driven interactions for the newsfeed and chat. This will involve the UX side (I still didn't fit transient events like the user-typing notification into the last prototype) plus implementing a decent event persistence strategy. The latter will probably involve further tweaking of our event storage layer for FoundationDB, since I didn't address the scenario where the same event can be appended to the event storage from multiple machines. We want to save events in batches, while avoiding any conflicts caused by appending the same event in different transactions."}
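For reference on the versioning trade-off in the Caveats of Event Sourcing section above: a vector clock keeps one counter per node and detects concurrent versions instead of silently ordering them - this is the precision we traded away for simple timestamps. A minimal sketch, with hypothetical types:

package main

import "fmt"

// VClock maps node ID to the number of events seen from that node.
type VClock map[string]uint64

// dominates reports whether clock b has seen at least everything a has.
// If neither dominates the other, the two versions are concurrent.
func dominates(b, a VClock) bool {
	for node, n := range a {
		if b[node] < n {
			return false
		}
	}
	return true
}

func main() {
	a := VClock{"node1": 2, "node2": 1}
	b := VClock{"node1": 2, "node2": 2}
	c := VClock{"node1": 3, "node2": 1}

	fmt.Println(dominates(b, a)) // true: b is a strictly later version
	fmt.Println(dominates(b, c)) // false...
	fmt.Println(dominates(c, b)) // ...and false: b and c are concurrent
}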
{"url":"/happypancake/2014-05-05/","title":"Reactive Prototype","text":"Reactive Prototype Last week, as planned, was quite short but very interesting. Development of the reactive prototype at some moment hit a complexity point where a dead-simple hacky approach could no longer work. Although go language (with its simple flavor of behavior composition) allowed to go pretty far on that route, in order to move forward, I had to bite the bullet and refactor things from a big ball of mud to a collection of components. That’s when I realized that I already enjoy coding in golang as much as I enjoyed working with C# in Visual Studio with ReSharper after 8 years of practice in it. After that refactoring I was able to move forward with the domain exploration (in case of HappyPancake domain includes both the social networking and technical peculiarities of developing reactive application at a social scale). One of the interesting aspects of the development was the interplay between: reactive nature of this prototype (new notifications are rendered on the server and pushed to the client through http polling feed); different ways of handling the same event from different contexts and screens (e.g.: a chat message would be appended to the conversation in a chat screen but it will show up as a notification box in another screen); different ways of persisting and delivering information to the users (e.g.: chat history is loaded via range read from FoundationDB, while all updates to this history are pushed to the client through the notification feed); focus on reducing CPU and latency for the mobile devices (e.g. last 75 messages in a chat come pre-rendered in the page HTML on the first page request, while new messages are pushed incrementally by appending server-generated HTML to the DOM); our desire to have graceful degradation of the user experience for some of the older mobile platforms (users could still get some chat experience even if javascript does not work at all). At this point, I think, we have a pretty good understanding of the domain around messaging and notification feeds. We have: a bunch of implementations and use cases captured in the tangible and working code; strategy for scaling the performance in a variety of scenarios (with known price to pay for that in terms of complexity or consistency); some understanding of how we would deal with devops later on. Meanwhile, Pieter was working on the other half of HappyPancake understanding and developing interactions around document based flows in the social network - registration, logins, profile editing and reviewing. All with PJAX and basic http handlers (we discarded Revel, since it does too much CPU-intensive magic). Tomas , as usually, focused on backing up our development. He took care of the v1 maintenance, campaigns and also invested in capturing use cases for us to move forward. It was extremely interesting to sync up with Tomas and Pieter occasionally, sharing concerns and discoveries along the road. It felt like getting an instantaneous deeper insight into the problem we are trying to tackle here. Another really awesome part of the last week was about gradual transition from purely technical use cases (consistency, availability and latency issues) to practical use cases that matter to our users (flirts, messages, visits etc). Although technology is an important part of HappyPancake, users are the domain that we ultimately trying to understand and master. 
The upcoming week will be a bit longer than the previous one for me, but still only 4 days (May 9th is another holiday). We plan to start merging my prototype into Pieter's prototype, while moving forward and adding more use cases. I hope to also move forward with newsfeeds. They require a balance between consistency and availability that is different from notifications and chat messages (more like the Instagram news feeds)."}
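Since long polling carries most of the reactive behavior described above, here is a minimal sketch of the pattern (hypothetical names, no framework; the real prototype keys channels by user and replays missed events by version):

package main

import (
	"fmt"
	"net/http"
	"time"
)

// updates fans out server-rendered HTML fragments; a real implementation
// would key channels by user ID and buffer missed events.
var updates = make(chan string, 16)

// pollHandler parks the request until a new notification arrives or a
// timeout fires; the client re-issues the request immediately either way.
func pollHandler(w http.ResponseWriter, r *http.Request) {
	select {
	case html := <-updates:
		fmt.Fprint(w, html) // appended to the DOM on the client
	case <-time.After(30 * time.Second):
		w.WriteHeader(http.StatusNoContent) // nothing new; poll again
	}
}

func main() {
	go func() { // stand-in event source for the demo
		for {
			time.Sleep(5 * time.Second)
			updates <- "<li>someone visited your profile</li>"
		}
	}()
	http.HandleFunc("/poll", pollHandler)
	http.ListenAndServe(":8080", nil)
}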
{"url":"/happypancake/2014-05-12/","title":"Tactical DDD","text":"Tactical DDD I started merging bits of my reactive prototype into the document-driven prototype of HappyPancake that Pieter was working on. While at that, we spent a lot of time discussing the design and iterating over it. It was really cool to see how the structure of the solution shifted focus from technical model to functional model . Previously our golang packages (which roughly map to lightweight .NET projects) contained files grouped by their technical intent (e.g.: controllers, models, documents). This added friction to development: a lot of context switching was required in order to work on a single use case, touching multiple packages; solution structure enforced certain architecture style upon the codebase (when you have folders like models, controllers, views and documents, naturally you will be trying to fit your implementation into these); merge conflicts were unavoidable, since too much code was shared. Over the course of the week, we switched to a different design, aligning packages with use cases. You might consider this to be a tactical domain-driven design (we didn’t touch any of the strategic parts like Bounded Contexts or Ubiquitous language, since our core domain is extremely simple). Golang packages get tightly aligned with our use cases. They either implement cases directly (e.g.: by exposing http handlers to render the UI and process POST requests from the browser) or they help other packages to fulfill their role by providing supporting functionality or structures (e.g. authentication utils, http helper methods, core value objects). Of course, the road wasn’t all about roses and pretty ladies - you can’t just split codebase between a bunch of folders and hope that all will work and make sense . It is never that easy. We had a lot of discussions like : How do we decompose functionality into multiple packages which will work together to implement these use cases? This code does not make any sense, what are we doing wrong? How do we name this thingy? What is the simplest approach to implement these use cases? How can we work together on this functionality? I really enjoyed every minute of these discussions with Pieter, they were focused on the problem domain instead of fiddling around artificial architectural constraints imposed by the overall design. Besides, so far, we were able to resolve these questions and thread the thin line between over-engineered monolith and messy big ball of mud. We are not sure if we’ll be able to walk this path later, yet so far each step led to a deeper insight in the domain of HappyPancake (just like domain-driven design promises). There are a few really cool things about our current design: it is extremely easy to collaborate on the code : there are almost no merge conflicts; we are free to experiment with implementation styles within packages without turning solution into a mess; golang is designed to support small and focused packages, this shows up frequently as yet another tiny and deeply satisfying moment. The most important part is : our code is a reflection of domain knowledge captured in a tangible form. Codebase is structured around that knowledge and not vice versa. In the meanwhile Tomas was busy with administrative work and HPC1. Towards the end of the week he also got a chance to start working on the HTML design of HPC2 in a stealth mode. Pieter and me are both really anxious to see what comes out of this work. 
Also on Friday we were interviewed by a couple of students on the topic of CQRS. I think our joint statement was something like: \"CQRS is a new name for denormalization, with a little recollection of what happened before 2010\"."}
{"url":"/happypancake/2014-05-24/","title":"Emergent Design Faces Reality","text":"Emergent Design Faces Reality Last two weeks were packed. We are working hard to have a limited prototype of the application ready and available for a demo in June. So far things look really good for the schedule! Collaborative design process We chat frequently with Pieter , discussing things small and big: starting from component design to a naming choice of some variable or just a weird gut feeling about some code. I found out that disagreements with Pieter are especially productive and exciting. I’m really glad that he has patience to put through with my stupid questions. Here is one example . A few days ago Pieter started working on profile functionality and began introducing there PhotoUrl fields. That immediately gave me the big shivers, since I considered this to be a misleading design. Profile service was responsible for managing and providing published user information like gender, birthday or name. Photo urls have got to be a different concern! Bleeding them into the component responsible for creating and providing profile info felt like an over-complication, compared to the other components (they are clean and focused). I tried to explain these reasons to Pieter , but that didn’t get us far. He replied that it was ok to denormalize and mock some data within the profile service, since it would help him to get the profile viewing page faster. In response I tried to suggest to create mock stubs for photo urls in a dedicated photo component. This went on for a while. Looking at the code together through ScreenHero didn’t help much either. Some progress started only when we started talking about things in terms “this gives me shivers”, while trying to understand why each other sees things differently. As it turned out, we had different perspectives on decomposition of the components. I had in mind purely vertical responsibility for the profile component, where it would have all layers of an N-layered app along with full responsibilities : creating data, persisting it locally, publishing events, providing HTTP handler for the UX. All that, while focusing on a small and coherent set of behaviors around public user profiles. At the same time Pieter was working with the UX. He was interested in a design decomposition which would give him the component that would focus only on maintaining a cache of all user-related information for the purpose of serving profile pages and providing that information to the other components. That component would have a lot of data, but it would not contain any complex business rules mostly event subscriptions and denormalized read models. Seeing this difference was a huge step. I also needed that component (e.g. when you have a news feed and need to enrich entries in it with beautiful profile photos along with name, gender and age for each user). However, since I wasn’t aware of such distinction in our domain, I actually misused a bunch of components for this purpose. While flushing out boundaries and contracts of this new profile component we also touched it’s interactions with the future components, which are not even available in the current code (e.g.: review and draft). We talked about naming, responsibilities, contracts all things except for the implementation (which would be trivial at that point). 
We even made explicit things like: OK, so we don't have draft and review components in our codebase this week; however, we will pretend that they exist by manually publishing events from them in our 'prepopulateDB' script. Since the other components subscribe to these events, they will not even notice any change when we introduce the actual implementations. And since we model events from the perspective of the consumers, they will be useful. A better and clearer design emerged through this process, things clicking into place like pieces of a puzzle. I find this process truly astonishing: you use the codebase to drive exploration of the domain and also to capture the deeper insight that is obtained during that process. Emerging design is a beautiful side effect of that process. Design constraints Such a process would not be possible without design constraints, which fuel and direct creativity. Here are a few that are important in our case: a distributed development team of three people, working remotely on the same codebase in a single github repository; the mentality of Golang, which forces us to think in terms of tiny packages with short and clear names; the requirement to have a demo version in June and a working Beta in September; a shared belief in the power of simplicity; high performance and scalability requirements, which we must not optimize for right now (since that would put us behind the schedule for the June demo). Optimize for future performance I find it particularly interesting to optimize a design for future performance optimizations, while consciously writing code that is designed for short-term evolvability (and hence is hacky and slow). This forces you to think about isolating this hacky code, preparing it for future replacement and possible optimization strategies. It is almost as if that non-existent better code were written between the lines and continuously evolved every time you touch the component or think about it. It is impossible to forget about it, since the actual code is so inefficient - just like a caterpillar. After a few iterations you end up with a component that is designed: to have high evolvability in the short term; to be optimized in the longer term, with a bunch of strategies available (starting from a denormalized read model up to an in-memory cache across all nodes in the cluster, invalidated by messages). Making it all real All this process is not only fun, but also tightly tied to the real world. Tomas makes sure of that. First of all, he acts as the domain expert and the stakeholder in the product, setting constraints and priorities, sharing insight. He also works on the vision of the product from the user perspective, capturing concepts in the tangible form of HTML templates, which we started merging into the codebase. These HTML templates started showing up a few days ago. They made Pieter and me feel as if New Year came early this year: it is awesome to see a real product instead of a hacky UI; UX easily communicates important requirements that could be missed otherwise (e.g. the \"gender\" symbol and \"is online\" highlight for every author in the newsfeed entry). In the end We keep saying: \"let's see how long our approach will hold before it becomes a problem\", however so far it holds up pretty well. Architecture, technology and other irrelevant implementation details have changed more than once during this period (e.g.: during the last weeks we switched from FDB to CRUD with shared transactions to event-driven CRUD (no event-sourcing, though)). 
The design still supports the growth of understanding and of the product through these minor perturbations."}
{"url":"/happypancake/2014-06-01/","title":"Minimal Viable Product","text":"Minimal Viable Product This week we were pushing forward major features missing from Minimum Viable Product for the demo in June. The progress was quite good, even ahead of the schedule. I attribute that to the design we came up with for the project. Pieter focused on introducing infinite scrolling to our feeds: alerts , news and diary . These feeds are provided by the separate golang packages of the same names. They don't own any data, but rather project incoming events into a local storage (mySQL tables used as key-value storage with some indexing) and expose an HTTP handler to render all the HTML for browsing these feeds. When we ran out of the things to do for the MVP, Pieter switched to implementing draft , register and review packages. Previously we assigned future responsibilities to them and established their contracts in form of events that cross boundary of these packages. These events were mocked by populateDB script and consumed by the other packages. This allowed to refine the design multiple times even before we started coding in this version. Tomas continued acting as Santa this week, working hard on the new HTML templates for the project, while also refining some of the old ones. These templates feature responsive UI, making them ideal for devices with small screen sizes (half of our visitors use them). Later on we could adjust HTML to work nicely for the desktop apps as well. It felt really awesome for me to skin diary , alerts and news with these new templates along with my favorite chat package. This process actually granted additional design insights: we can't generate HTML of the feed items in advance, since we need to embed things like on-line status , current profile photo ; while rendering final HTML for the feeds, profile service is queried for enrichment information dozens of times per render - I had to implement a simple in-memory cache with cluster-wide eviction of invalid items (driven by the events); we no longer could use application-wide long polling feed for updating chat pages in real-time, since this feed had to contain specific HTML templates and behaviors. Long-polling buffers had to be moved to chat , rewritten and enhanced with events like user-read-message . There are still a few missing bits and pieces related to the UI of profile views and chat conversations, but these are going to be easy to implement once we have the HTML templates to fill them in. For the upcoming week I will probably be busy with implementing nav package (it serves navbar html which is reactively highlighted whenever there is some new content for the user to consume). Ideally, we'll also tackle rendering of the ads into the feeds, since this is the most valuable feature for the business. At some point next week we might start enhancing our solution with package-level event-driven tests expressed in the form of specifications. We currently have such tests implicitly (in the form of event streams generated by populateDB script), however there is value it making them explicit."}
{"url":"/happypancake/2014-06-09/","title":"Almost Demo","text":"Almost Demo The error of my ways We are getting closer to the demo date, which is scheduled to be next Monday, and I did a big mistake last week . Instead of thinking for the whole team I focused only on the design and backend development. It would've been better if instead I tried to go out and unload some burden from Tomas , who was swamped with the other things this week. This way we could've avoided over-delivering on the back-end while getting behind the schedule on the UI design (which is usually the most visible and tangible part of any product). I'll try to pick up more skills in HTML+CSS and UI design in the upcoming days to have skills spread more evenly across our small team. Features delivered During the last week I added continuous integration server (using Drone on a VM) and introduced a shared staging server which could be used for demo. That server also has infrastructure for capturing various metrics from our application code, aggregating them and pushing to a nice dashboard UI. I introduced nav package responsible for maintaining navigation menu at the top of the page. Some items on that menu are highlighted in real-time, as new things arrive for the user to check out. Newly registered users now have a newsfeed that is pre-filled with interesting things to check out (as determined by our matching logic). Plus, now it is possible to see photos in the profile pages, like them and send flirts. Ability to register is something Pieter delivered this week, along with draft implementation of review service. He also came up with a really nice implementation for our online service, responsible for maintaining a list of currently active users across the cluster. Retrospective At this point, we have a working pre-alpha version with core functionality either implemented or envisioned in detail. We didn't burden the code with any performance optimizations , keeping it simple and capable of fast evolution. Performance optimizations, if introduced to immature software design, could hinder or prevent its growth to maturity. Technically, the implementation is extremely simple: a single executable application written in golang with mySQL persistence and in-memory bus . It exposes HTTP endpoint serving web pages and could be switched to clustered mode (if put behind load balancer and plugged to a proper messaging platform). This implementation is more developed from the design perspective : it is decomposed into simple golang packages which are designed to work together to deliver use cases of a dating web-site . These packages tend to be small - merely ~300-400 lines of code on average, including HTML templates). Majority of these packages are full vertical slices, encapsulating implementation details starting from the storage model and up to HTML rendering details within the http handlers. Concepts within the code map to the domain model rather well . They are quite focused and simple, thanks to hours spent working over them with Pieter. However, vocabulary could benefit from a better alignment with the business model . As Tomas mentioned, we managed to drift from original domain model during the development process. That is something we could fix after the demo. These design concepts are very prominent in the contracts of packages: published events and exposed services . There are quite a few of DDD Value objects as well. 
The design approach still seems to hold pretty well, although we are getting close to the next strain point: some packages are getting too complicated and would benefit from better tests at the contract level. Something like event-driven given-when-then specifications could work really well here. Adding such tests is something I'm looking forward to after the demo as well."}
{"url":"/happypancake/2014-06-13/","title":"Our First Demo","text":"Our First Demo We finally had our demo last week. As it always happens in practice, nothing went according to the theory . Unexpected problems Two big problems surfaced right before the scheduled demo time. First of all, RAID on one of the production databases (HPC1) suddenly died. This required full attention of Tomas , taking him away from the demo preparations. Second, I discovered that JavaScript part of chat (which I implemented) gets horribly messed up by subsequent PJAX page jumps . Fortunately, disabling PJAX on chat navigation links solved the problem in the short term. In the longer term, I'll need to pick up more Javascript skills. Tomas already recommended me to check out Javascript: The Good Parts . Despite these issues, together with Pieter we cleaned up the HPC2 for the demo. Tomas did an awesome job presenting the product and the vision behind it, which bought us trust from the stake-holders for moving forward. They loved it. We plan to have demos on a monthly basis from this point. NoSQL in SQL During the week we decided to give a try to PostgreSQL , which seems to have a slightly better fit to our needs , than mySQL: great replication story (e.g. \"HotStandby and repmgr\"); mature drivers in golang (if compared to MySQL); binary protocol that does not suffer from legacy issues like MySQL API does; more polished usage experience (if compared to MySQL); there is a book on PostgreSQL High Performance , which looks as good as the one I read on MySQL. PostgreSQL also benefits from being one of the widely used databases (although it probably has fewer installs than mySQL). Replacing MySQL with PostgreSQL was a simple thing, since we use SQL storage mostly for NoSQL purposes anyway. Using SQL for NoSQL gives us the best of the two worlds: mature ecosystem, polished experience and transactions of SQL along with ease of schema-less development from NoSQL. By the end of the week I migrated almost the entire application to PostgreSQL. Design decomposition into small and focused packages (with logically isolated storage) really helped to move forward. Next week I plan to finish the migration and improve test coverage in scenarios that were proven to be tricky during this migration. So far, PostgreSQL feels more comfortable than MySQL. If this feeling proves to be wrong, we could always jump back or try something else. Being the worst on errors and panics Sometime during the week, Pieter brought up the question of using panic vs error in our code. In golang it is idiomatic when functions return a tuple of result and error: func Sqrt ( f float64 ) ( float64 , error ) { if f < 0 { return 0 , errors . New ( \"math: square root of negative number\" ) } You can also issue panic which would stop the ordinary low of control and start going back in the call chain until recover statement is expected or the program crashes. Since I was burned pretty badly with Exceptions in .NET while working with cloud environments at Lokad (everything is a subject to transient failure at some point, so you have to really design for failure), I tried to avoid 'panics' in golang all-together. Instead, almost every function was returning a tuple of result and an error , problems were explicitly bubbled up. This lead to a lot of unnecessary error checking and some meaningless errors that were pretty hard to trace (since errors in golang do not have a stack trace). 
Thankfully Tomas and Pieter patiently explained that it is OK to throw panics even in scenarios which would later require proper error handling with flow control. Initially this felt like a huge meaningless hack, but eventually it all "clicked". Refactoring with this new design insight already makes the code simpler and a better fit for future evolution (which is required by the current stage in the life-cycle of the project). Pieter also started cleaning up the language in our codebase, aligning it better with reality. This is a big effort involving a lot of merge conflicts, but the results are definitely worth it. Becoming a better developer through your IDE During the last weeks I invested bits of time in learning Emacs and customizing it to my needs. One awesome discussion with Pieter on this topic helped me realize the importance of such IDE tailoring for personal growth as a developer. As you probably know, Emacs is almost unusable for development out of the box (vim, even more so). You need to tweak configuration files, pick plugins and wire them together. Most importantly, you need to make dozens of decisions on how you are going to use this contraption for development. That's what I used to hate about Emacs before, thinking that Visual Studio with ReSharper gave me everything that a developer would ever need. I came to realize that setting up your integrated development environment from scratch forces you to become more aware of the actual process of development. You start thinking even about such simple things as the organization of files in a project and how you are going to navigate between them. Or, how you are going to refactor your project in the absence of the solution-wide analysis and renaming provided by ReSharper. Such troubles affect your everyday coding process, pushing the design towards greater decomposition and simplicity. Ultimately, this leads to better understanding. In the end, Pieter got so inspired by our insights that he decided to ditch Sublime and give Vim a try. We are going to compare our setups and development experiences as we progress through the project. I believe this is going to lead to even deeper insights for us."}
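To illustrate the panic-based flow that finally "clicked", here is a minimal sketch of the pattern as I understand it: panic deep inside a request, recover once at the HTTP boundary. The helper names are hypothetical, not our actual code:

package web

import (
	"log"
	"net/http"
)

// check panics on error; deep call chains stay free of err plumbing.
func check(err error) {
	if err != nil {
		panic(err)
	}
}

// recoverHandler is the single place where a panic turns back into an
// HTTP-level error, keeping flow control out of every helper function.
func recoverHandler(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				log.Printf("panic in %s: %v", r.URL.Path, err)
				http.Error(w, "internal error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

The point is that the recovery happens exactly once, at the boundary, instead of error tuples threading through every layer.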
{"url":"/happypancake/2014-06-30/","title":"Scala, Modular Design and RabbitMQ","text":"Scala, Modular Design and RabbitMQ Our system is event-driven in nature . Almost everything that happens is an observation of some fact: message-sent , photo-liked , profile-visisted . These facts are generated in streams by users interacting with a system. Due to the nature of human interactions, there is little concurrency in these streams and it is ok for them to be eventually consistent. In other words: A user is likely to interact with the site through one device and a single browser page at a time. While communicating through the system, users don't see each other and don't know how fast the other party responds. If a system takes 1 second to process and deliver each interaction then probably nobody will notice. The system should feel responsive and immediately consistent (especially while viewing your writes on the profile page and chatting). These considerations are very aligned with designs based on reactive and event-driven approaches. During the last 2 weeks we played with multiple implementation ideas of that: Use replayable event streams for replicating state between modules. Use either FDB-based event storage (which we already have) or the one based on apache Kafka. Use a messaging middleware cabale of durable message delivery across the cluster with a decent failover (read as \"RabbitMQ\") Use a pub-sub system without a single point of failure and relaxed message delivery guarantees (read as NSQ or Nanomsg with ETCD). Each of these approaches has its own benefits and some related complexity: mental - how easy or hard is it to reason about the system; development - how much plumbing code we will have to write; operational - how easy or hard will it be to run it in production. Obviously, we are trying to find approaches which reduce complexity and allow us to focus on delivering business features . Scala Theorem While talking about Apache Kafka Tomas had an idea of switching the entire codebase to Scala and JVM . Java has a lot of big supporters and a large set of great solutions fit for us. A few days last week were dedicated to evaluation on how easy or hard would it be to drop all go code and switch to Scala / JVM. Here are the conclusions: Scala is a nice language , although builds are insanely long slow (if compared to sub-second builds in golang). Porting our core domain code to Scala is not going to be a problem, it could probably be done in a week ( code is by-product of our design process ). Devil is in the details, learning the rest of JVM stack is going to take a lot more time than that (e.g.: how do we setup zookeeper for Apache Kafka or what is the idiomatic approach to build modular web front-end with Java?). In the end, switching to Scala was ruled out of the question for now. Even though this switch has its long-term benefits, it would delay our short-term schedule too much. Not worth it. Besides, Java stack seems to introduce a lot of development friction hurting rapid development and code evolution. These are essential for us right now. RabbitMQ We also switched to RabbitMQ for our messaging needs - Pieter single-handedly coded bus implementation which plugged into our bus interface and worked out-of-the-box. Previous implementation used in-memory channels. 
So far RabbitMQ is used merely to push events reliably between the modules: all modules publish events to the same exchange; each module on startup can set up its own binding and routing rules to handle interesting events. Although we no longer consider using event streams for replaying events as part of the development process, we could still have a dedicated audit log. This can be done by setting up a dedicated module to persist all messages, partitioning them by user id. Modules We spent some time discussing our design with Pieter. One of the important discoveries was a deeper insight into Modules. Previously we talked about our system using components, services and packages interchangeably. This was partially influenced by the term micro-services, which was one of the ideas behind our current design. Some confusion came from that. Instead of \"micro-services architecture\" at HPC we started talking about \"decomposing the system into focused modules which expose services\". These weeks we were able to refine our terminology, clarifying the codebase as well: our application is composed from modules - a tangible way to structure code and a visual way to group design concepts; we align modules at the design level with modules in the code - they share the same boundaries and names; at the design level modules are boxes with associated behavior, and we need them to contain complexity and decompose our design into small concepts that are easy to reason and talk about; in the codebase our modules are represented by folders which are also treated as packages and namespaces in golang; we like to keep our modules small, focused and decoupled - this requires some discipline but speeds up development; each module has its own public contract by which it is known to the other modules; implementation details are private, they can't be coupled to and are treated as a black box. A public contract can include: published events (events are a part of the domain language), public golang service interfaces and http endpoints; there are also behavioral contracts setting up expectations on how these work together. In the code, each golang package is supposed to have an implementation of the following interface - that's how it is wired into the system; all module dependencies are passed into the constructor without any magic.

type Module interface {
    // Register this module in execution context
    Register(h Context)
}

type Context interface {
    // AddAuthHttp wires a handler for authenticated context which
    // will be called when request is dispatched to the specified path
    AddAuthHttp(path string, handler web.Handler)
    // AddHttpHandler wires raw http.Handler to handle unauthenticated
    // requests
    AddHttpHandler(path string, handler http.Handler)
    // RegisterEventHandler links directly to the bus
    RegisterEventHandler(h bus.NodeHandler)
    // ResetData wipes and resets all storage for testing purposes
    ResetData(reset func())
}

Getting the notion of modules right is extremely important for us, since it is one of the principles behind our design process. We think, structure our work and plan in terms of modules.
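For illustration, here is a sketch of how a hypothetical flirt module might wire itself in through such an interface. The types here are compilable stand-ins for the real web.Handler and bus.NodeHandler, not our actual signatures:

package flirt

import "net/http"

// Stand-ins for the real hpc types, just enough to make the sketch
// compile on its own.
type Handler func(w http.ResponseWriter, r *http.Request, memberID string)
type Event interface{}

type Context interface {
	AddAuthHttp(path string, h Handler)
	AddHttpHandler(path string, h http.Handler)
	RegisterEventHandler(h func(Event))
	ResetData(reset func())
}

// Module keeps its implementation details private: its own storage,
// templates and queries stay invisible to the rest of the system.
type Module struct{}

// Register is the single wiring point; all dependencies arrive through
// the constructor and this call, without any magic.
func (m *Module) Register(c Context) {
	c.AddAuthHttp("/flirt", m.postFlirt)  // public HTTP contract
	c.RegisterEventHandler(m.handleEvent) // consume domain events
	c.ResetData(m.reset)                  // wipe storage in tests
}

func (m *Module) postFlirt(w http.ResponseWriter, r *http.Request, memberID string) {
	// validate, store, publish flirt-sent to the bus...
}

func (m *Module) handleEvent(e Event) {
	// project interesting events (e.g. member-blocked) into local storage
}

func (m *Module) reset() {
	// wipe module-private tables between test runs
}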
For the upcoming week we plan to: Clean up the codebase (one module at a time), finishing the alignment to RabbitMQ; Capture and discuss the next HPC features to be implemented (summer vacations are coming and we want to prepare work so that we can keep moving forward even when the rest of the distributed team is offline, taking motorcycle classes or hiking to the top of Mount Elbrus) - this will add more stand-alone modules; Start writing two-phase data transformation tooling to export data from the current version of HappyPancake and transfer it into the event-driven version - this would allow us to validate the design of existing modules and stress-test the system. PS: Why Emacs is better than Vim? Over my entire life I've been searching for a sensible way to handle tasks and activities, both everyday and job-related. Tooling ranged from MS Project Server (for large projects and distributed teams) to OmniFocus (personal todo lists). Earlier this year I discovered org-mode - a set of Emacs extensions for managing notes and tasks in text files. That was the reason for switching to Emacs from Sublime. Recently I caught myself managing some small tasks and notes of the HPC project via org-mode as well. All hpc-related information is stored in a textual hpc.org file kept in the repository with the source code. Anybody can read or even edit this file. Emacs, of course, provides more benefits on top of that mode: the ability to view and manage entries from all org files on my machine; capturing new tasks with a few keystrokes; powerful agenda and scheduling capabilities; exports to a lot of formats; auto-completion, tags, categories, outlining, refiling, filtering etc. For example, here is an overview of my agenda, filtered by hpc tasks: I think I got Pieter thinking about giving Emacs a try, since Vim does not have org-mode (or a decent port)."}
{"url":"/happypancake/2014-07-06/","title":"Distributing Work","text":"Distributing Work A season of vacations starts . This week was the last time when our team was online at the same time. Tomas takes a vacation starting from the next week. Pieter is probably going to take his as soon as he gets through bike exams (wishing him the best of luck). I'll travel to Georgia next week, while working remotely and taking longer weekends. Obviously, we want to stay productive during this period and move forward on our project. There are things that usually require full consensus: important decisions about design, specific feature requirements, everything that involves multiple packages at once. Last week was spent going through these things in advance to make sure we have plenty of non-blocking work queued up for the next month. More Features We have some basic features implemented in the system so far. Software design evolved a bit to support them all while keeping things simple. At this point, if HappyPancake2 were a brand-new product, I'd recommend going live (e.g. in stealth mode) as soon as possible in order to start getting real-world feedback from the users. No amount of testers and visionaries can replace knowledge and insights coming from the real world feedback. Duty of software developers is to make his happen as soon as possible and then iterate, incorporating lessons learned. However, HappyPancake2 is special - it is already used by thousands of users , so there is already plenty of feedback. We know quite well which features are necessary, which could be discarded and which enhancements we could try next. Hence we can keep on working on this project without releasing it . Tomas has all the domain knowledge we need right now. We are planning to introduce these features next: Interests - tags that members can add to their profile, allowing other people to find them by interests (and potentially allowing us to provide better matching); blocking - allowing a member to ignore another one (removing him or her from all search results and blocking communications); online list ; abuse reports on content with admin review queues; favorite profiles . During the week Pieter focused his efforts on developing review functionality , which is one of the most important features in our system. Node.js We are planning to make a slight tech change in our stack by implementing front-end in node.js (which is something Tomas explored last week). This is a relatively small change to the existing system - http endpoints will need to return JSON instead of rendered HTML, so the cost is relatively low. Benefits are: better separation of concerns in our design; ability to use Rendr (render backbone.js apps on the client and the server). This would turn our existing code into back-end with an API, serving JSON requests and streams to node.js front-end. Such separation allows to have more flexibility in UI while introducing a much better testing to the back-end. Behavior Testing Thanks to the switch from HTML endpoints to JSON, I started introducing package behavior tests to our system last week. These tests set and verify expectations about public contracts exposed by packages. This is quite simple to do: Given a set of events and dependencies When we execute an action (usually calling a JSON endpoint) Expect certain assertions to be true. In the longer term I hope to convert these tests to self-documenting expectations (like I did in my previous .NET projects). 
The ability to have up-to-date documentation of the code, expressed in human-readable language, can be a powerful thing for keeping project stakeholders involved. This means better feedback and faster iterations. The code looks like this in golang:

func (x *context) Test_given_nancy_flirts_bob_when_GET_bobs_alerts(c *C) {
    s := run_nancy_flirts_bob(x)
    r := x.GetJson(s.bobId, "/alerts")
    c.Assert(r.Code, Equals, http.StatusOK)

    var m model
    r.Unmarshal(&m)
    c.Check(m.Title, Equals, "Alerts")
    c.Check(m.HasMore, Equals, false)
    c.Assert(m.Items, HasLen, 1)

    i1 := m.Items[0]
    c.Check(i1.Member.Nickname, Equals, "nancy")
    c.Check(i1.Unread, Equals, true)
    c.Check(i1.Member.IsOnline, Equals, true) // since we have allOnline
    c.Check(x.Service.AnyUnread(s.bobId), Equals, false)
}

where the nancy flirts bob scenario is a simple piece of code setting up preconditions on the system:

func run_nancy_flirts_bob(x *context) (info *nancy_flirts_bob) {
    info = &nancy_flirts_bob{hpc.NewId(), hpc.NewId()}
    x.Dispatch(hpc.NewRegistrationApproved(hpc.NewId(), info.bobId, "bob", hpc.Male,
        hpc.NewBirthday(time.Now().AddDate(-23, 0, 0)), "email", hpc.NoPortraitMale))
    x.Dispatch(hpc.NewRegistrationApproved(hpc.NewId(), info.nancyId, "nancy", hpc.Female,
        hpc.NewBirthday(time.Now().AddDate(-22, 0, 0)), "email", hpc.NoPortraitFemale))
    x.Dispatch(&hpc.FlirtSent{hpc.NewId(), info.nancyId, info.bobId})
    return
}

The Truth is Born in Argument I can't be grateful enough to Pieter, who has enough patience to go through design discussions with me when we disagree about something. Talking things through with him is one of the reasons why our design stays simple, clear and capable of future evolution. Design Game A lot of our work resembles a puzzle, where we have to do 3 things: find names and words that let us communicate better (we are a distributed team from different countries); discover ways to break down a large problem into small coherent parts (the team is too small to tackle huge problems); decide on the optimal order in which these parts should be handled (our time is limited and has to be applied to the areas where it will make the biggest impact for the project). The hardest part is deciding which things have to be done right now and which can be deferred till some point in the future. In some cases implementing a feature without all the necessary data at hand can be a waste of time; in other cases, it could lead to a deeper insight required to move forward. We try to optimize the implementation chain a lot - bringing forward the most rewarding and easy features (\"low-hanging fruit\") and deprioritizing ones that are less beneficial for the project. That is an ongoing process, required for applying our limited time most efficiently. For example, previously we pretended to store photos in our system. We simply passed around urls pointing to photos from the original version of HappyPancake. That was a good decision (defer functionality as long as possible), but the time came to implement it. During the last weeks Pieter pushed a new media module and spent some time integrating it with our other services. This brought new insights into how we are going to pass this information around through events.
We also know how we could host and scale such a module in production (deploy to multiple nodes and rsync between them). Anything related to performance is another example of things we deferred. \"Big Data\" So far our development intentionally focused on software design while deferring any potential performance optimizations. Now it is time to start learning about the actual numbers and real-world usage. At the end of the week I went back to Visual Studio to start writing an extractor utility. This tool merely connects to the original database and saves some data in a compact binary representation (a compressed stream of protobuf entities). Then, I started working on the golang code which will scan through that data, producing a stream of events which can be passed to our development project. It is recommended to use such two-step data processing (dump the data store to an intermediary format and then iterate on the data dumps) whenever you are working with large datasets coming from a live system. This decouples the process from production systems, reducing the impact and allowing faster iterations. We started working on the Finland database, which is one of our smaller installations, yet there already is a bit of data to process. For example, there are more than 1200000 messages, taking 230MB in compressed binary form, plus 11MB of member data and 2MB of user flirts. Sweden is 50-100 times larger than that. This might seem like a lot of data, however it is not. Our entire Sweden dataset, if encoded properly, could fit on a single smartphone and be processed by it. It's just large. However, since we didn't introduce many performance considerations into our design yet (aside from keeping it scalable), some tuning will be necessary. I haven't worked with large datasets for more than half a year, so I'm really looking forward to getting back into this field. The real-time reactive nature of the data makes this even more interesting and exciting."}
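A minimal sketch of the golang half of this two-step processing, assuming a gzipped stream of length-prefixed records. The little-endian size prefix and all names are my illustration here; the real dump format may differ:

package dump

import (
	"bufio"
	"compress/gzip"
	"encoding/binary"
	"io"
	"os"
)

// ForEach streams length-prefixed records out of a gzipped dump file,
// calling fn for every raw record. Decoding the protobuf payload is
// left to the caller.
func ForEach(path string, fn func(record []byte) error) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	r := bufio.NewReader(gz)
	for {
		var size uint32
		if err := binary.Read(r, binary.LittleEndian, &size); err == io.EOF {
			return nil // clean end of stream
		} else if err != nil {
			return err
		}
		buf := make([]byte, size)
		if _, err := io.ReadFull(r, buf); err != nil {
			return err
		}
		if err := fn(buf); err != nil {
			return err
		}
	}
}

The caller decodes each record and dispatches the resulting event, which keeps the iteration completely decoupled from the production database.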
{"url":"/happypancake/2014-07-21/","title":"Smarter Development","text":"Smarter Development Shorter Feedback Loop We have a continuous integration server responsible for running tests on code pushed to the repository. It is more diligent than humans and always runs all tests. However, in order to see build failures one had to visit a build page (which didn't happen frequently). Our builds were broken most of the time. I tried to fix that by plugging build server directly to our main chat. All failures and successes are reported immediately. Build stays green most of the time. Working with Finland I spent time trying different strategies to populate our system with Finland dataset. This population happens by generating events from raw data dump and dispatching them to our system. Currently we generate only a subset of events, but that already is more than 1000000 of them, sent at once. If we can handle that, then we stand a chance against Sweden dataset. I focused on news module, which has one of the most complicated and time-consuming denormalization logic: each member has his own newsfeed; each member has an interest in some people (e.g. in females with age between 25 and 30 and living in city X); newsfeed is populated with events coming from other members which are interesting to this member; new members by default will have an empty newsfeed, we need to back-fill it with some recent events from interesting members; if member blocks another member, then events from the blocked member will no longer show up in a newsfeed, existing events have to be removed. My initial implementation of news module was handling events at an astonishing speed of 2-10 events per second. I spend multiple days learning inner workings of our stack and looking for simple ways to improve the performance. StatsD and EXPLAIN ANALYSE from PostgreSQL helped a lot to reach speed of 200-400 events per second. Solution was: push all event denormalization to PostgreSQL server (fewer roundtrips); handle each event within an explicit transaction (no need to commit between steps within the event handling); rewrite queries till they are fast. So far the performance is quite good so we don't need to bother too much about pushing it further so far. Adding more features is the most important thing now. Control is important It is really important for members of our dating web site to know that they are in the control. They should be easily able to block out any unwanted communications. That's why we have block feature - ability to put another member into an ignore list, effectively filtering him out from all news feeds, conversations and any other lists. I started working on that interesting feature only to realize that it has a lot of implications. Each other module would react differently to the fact that a user is being blocked. We need to somehow keep a track of all these requirements. Preferably it will be not in a form of the document, since documents get stale and outdated really fast (keeping them fresh requires time and concentration which could also be spent developing new features). Ideally, these requirements could also be verified automatically. Improving Use Cases I invested some time to improve our module BDD tests, transforming them into proper use-cases. These use-cases: are still written in golang; are executed with unit tests (and provide detailed explanation in case of failure); can be printed out as readable stories; can be printed as a high-level overview of the system. 
Of course, these stories aren't readable by absolutely everybody. That's not the point. Their purpose is to: give a sense of accomplishment to developers, encouraging them to write tested code (me and Pieter); align tests with expectations from the system (helping to make sure that we are testing what is important); provide quick, up-to-date documentation of the API and scenarios for other developers who will be working with the system (Tomas); express behaviors of the system in a way that is not tied to any language (e.g.: Tomas will not need to dive into the golang code in order to consume the API from node.js). The best part is that these nice stories are generated automatically from the code. No extra effort is required (apart from writing a small rendering logic in golang while riding a bus in Adjaria). With this approach it becomes simpler to keep a high-level overview of what is already done in the system and what remains to be done. Simply list the names of all passing use cases per module and you have that kind of overview. Other interesting transformations are also possible (i.e.: dependency graphs between modules, event causality patterns etc). They all provide additional insight into the domain model, allowing us to better understand the code we write and maintain its integrity. It is quite possible that we will completely discard this code once we hit production. The need to maintain the integrity of the domain model will be replaced by different forces by then. My plans for the upcoming days are to keep covering our existing functionality with these use cases and adding new functionality."}
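A sketch of the core idea: a use case is a plain data structure, plus a renderer that prints it as a story. The field names below are illustrative, not the actual HPC types:

package usecase

import (
	"fmt"
	"strings"
)

// Event is a stand-in for the real domain event type.
type Event interface{ Name() string }

// Case is pure data: given events, when an API call, then an expected
// response and follow-up events. Because it is data, the same case can
// be executed as a test or printed as documentation.
type Case struct {
	Title      string
	Given      []Event
	WhenVerb   string // e.g. "GET"
	WhenPath   string // e.g. "/alerts"
	ThenCode   int
	ThenEvents []Event
}

// Story renders the case as human-readable text - the self-documenting
// expectations mentioned in the post.
func (c Case) Story() string {
	var b strings.Builder
	fmt.Fprintf(&b, "Use case: %s\n", c.Title)
	for _, e := range c.Given {
		fmt.Fprintf(&b, "  Given %s\n", e.Name())
	}
	fmt.Fprintf(&b, "  When %s %s\n", c.WhenVerb, c.WhenPath)
	fmt.Fprintf(&b, "  Then status %d\n", c.ThenCode)
	for _, e := range c.ThenEvents {
		fmt.Fprintf(&b, "  And event %s\n", e.Name())
	}
	return b.String()
}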
{"url":"/happypancake/2014-07-29/","title":"Delivering Features and Tests","text":"Delivering Features and Tests Last week was quite productive and exciting. Introduction of use cases into our development cycle worked out really well , helping to deliver tangible features in the absence of tangible UI to target (node.js front-end development is paused till Tomas gets back from the vacation). These use cases so far: serve as high-level behavior tests aligned with the domain language (compact and non-fragile); drive development towards a better event-driven design; produce nice human-readable documentation, as a side-effect; provide really fast feedback cycle. Actually, these use cases are the design . We can probably take them and rewrite the entire system in a completely different language in 2 weeks. And we can do that without loosing any features or scalability options. However, these nice benefits are not as important as the fact the we spent last week developing new features and improving code coverage, while really enjoying the process . Pieter jumped right into the game, picking up on use case development and extending the framework to support edge-cases which I missed (e.g.: testing file uploads or using real HTTP server to allow inspecting raw HTTP requests with WireShark). He already covered drafts module with use cases. Pieter also invested time last week cleaning things across the code-base. I spent the last week both adding use cases (coverage of chat , alerts , news , poll ), fixing bugs revealed by them and adding proper handling of member-blocked and member-unblocked across the system. As of now, we have 33 use cases covering 15 API calls. We know this number exactly, because of a tiny little helper command summary which can print out information about all use cases. With that command (and the power of BASH), one can easily answer questions like: How many use cases are in the system? Which URIs are not covered by the tests? Which events are published or consumed by module X? What are the dependencies between the modules? Which events are not covered by any use case? This self-building knowledge about the system is another reason which makes writing use cases so rewarding. I also took a bite and tweaked our build server to include commit summaries in chat messages posted to Slack. This way, it is easier to observe team progress without going to git repository. This also encourages frequent pushes, since drone picks up only the latest commit in a push. This week I'm going to continue delivering features, covering them with more use cases and also working on the ETL code to extract data from HPCv1 into our new system."}
{"url":"/happypancake/2014-08-02/","title":"Data, Use Cases And New Module","text":"Data, Use Cases And New Module Last week I was simply developing in a pleasing and steady way: alerts - clean JSON API and more use cases to verify its behavior; diary - clean JSON API, more use cases and support of member blocking; chat - more use cases; like - clean JSON API and use cases; favorite - implemented full module , including JSON API, major use-cases, etl and seeding. Data Extraction from v1 I spent some more quality time with .NET/C# last week, adding more data types to our script responsible for graceful data extraction from HPCv1 databases into compact binary representation (GZIPped stream of length-prefixed binary-encoded objects). This representation works very well for further data processing even at the scale of terabytes. So far I extracted data from all of the largest tables in Finland database, writing event generation code for all matching modules and smoke-testing them on glesys. HPCv2 handles that data pretty well, although RabbitMQ gets a little strained while handling 1500000 messages and copying them into a dozen queues. We'll probably need to optimize our message routing strategy a little here. Fortunately, we can simply reuse data from our wonderful use case suite. I'll be on a vacation next week, so we tried to reproduce process of data retrieval (from binary dumps) and event seeding on the machine of Pieter . It worked without issues. Use Cases We are slowly falling in love with use case approach in the codebase of HPCv2. Writing them is a pleasure, and they actually speed up and simplify the development. At the moment of writing we have 50 of them, verifying different behaviors of JSON API for the frontend that Tomas will be working on when he gets back. I added ability to render use cases into a dependency graph, helping to see development results from a different perspective. Visual representation allows your brain to understand code from a different perspective, making it easier to spot new dependencies or gaps in the code. It is easier to communicate, too. For example, while developing favorite module from scratch, its graph looked like this: Later that day, when the module was complete and covered with 7 use cases, it looked like this: This graph is auto-generated from the code via this process: Load a specific module (or all of them), inspecting registrations in the process. Inspect all use cases for input events, HTTP requests and output events. We can do that because each use case is simply a data structure, describing: GIVEN previous events, WHEN we call API endpoint, THEN expect API result and 0 or more events. Print resulting data model in a dot file format for graphviz program to render. Of course, if some dependency is not covered by a use case, then it is not visible on such graph . I consider this to be a good feature, since it encourages me to cover all important cases. Of course, it is possible to graph all modules and their dependencies. That map would be useful for spotting some loose ends or old dependencies. Pieter was busy cleaning up the overall codebase, working on the implementation of draft , review and getting read of some obsolete logic. Next week I will be completely on a vacation, spending time around the Elbrus mountain. If there are any free periods of time (e.g. long bus rides), I'd love to clean up the profile module, adding a clean JSON API to it. Living Documentation Tomas is coming back from the vacation that week. 
He'll probably get back to front-end development on top of our new JSON API. When he does, he can see the living documentation for our current system. First, run ./r story to see all use cases. Second, run ./r summary to see grep-able metadata about the system (mostly derived from the use cases). Third, run ./r graph | dot -Tpng > graph.png to create a dot file for the system and feed it to graphviz. Of course, each output can be piped through other programs like grep to filter, group and aggregate information in interesting ways. This kind of documentation always stays up-to-date and does not need any effort to maintain."}
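A minimal sketch of the last step of that pipeline - printing the collected model as a dot file for graphviz. The data shape is illustrative:

package graph

import (
	"fmt"
	"strings"
)

// ModuleGraph holds what we learn from a module's registrations and
// use cases; names here are made up for the sketch.
type ModuleGraph struct {
	Name     string
	Consumes []string // input events
	Produces []string // output events
	URIs     []string // HTTP endpoints exercised by use cases
}

// Dot renders the modules in graphviz dot format, ready for
// `dot -Tpng > graph.png`.
func Dot(modules []ModuleGraph) string {
	var b strings.Builder
	b.WriteString("digraph hpc {\n  rankdir=LR;\n")
	for _, m := range modules {
		fmt.Fprintf(&b, "  %q [shape=box];\n", m.Name)
		for _, e := range m.Consumes {
			fmt.Fprintf(&b, "  %q -> %q;\n", e, m.Name)
		}
		for _, e := range m.Produces {
			fmt.Fprintf(&b, "  %q -> %q;\n", m.Name, e)
		}
		for _, u := range m.URIs {
			fmt.Fprintf(&b, "  %q -> %q [style=dashed];\n", u, m.Name)
		}
	}
	b.WriteString("}\n")
	return b.String()
}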
{"url":"/happypancake/2014-08-16/","title":"Back from the Vacation","text":"Back from the Vacation It is good to be back from the vacation. Not only you feel rested, but you also get to see all the cool things done by the team. Tomas and Pieter focused on pushing forward seeding utility that takes data dumps from the production systems and transforms it into the events for the second version of HappyPancake. They moved beyond Finland dataset (the smallest one) and started seeding Sweden, which yields more than 500.000.000 events. This allowed to polish code heavily (e.g. memory and connection leaks are detected early this days) with real-world data. I focused on cleanups and cross-cutting concerns this week. Removing member module helped in making modules more decoupled (now almost all data is denormalized by modules in their own storages). Then, to push this even further, I physically separated modules from each other, giving each module a separate database for production and tests. This is a big win for us, since this allows to replace one big and expensive SQL server (running it currently in production) with a bunch of smaller servers, that cost less and are easier to scale. This improvement required introduction of module specification - a data structure describing some static qualities of the module, which are known even before it is registered in the infrastructure. Such specification currently includes: module name (also acting as DB name, if it has one); module DB schema (SQL scripts to create and reset DB contents); use cases that describe behaviors of the module. With such information, we can create databases while initializing environment and pass connections to these databases to modules on registration. This also allows to run use case verification in separate temporary databases, which are destroyed at the end of test run. With improvements from Pieter our auto-generated module diagrams became even more helpful. They give additional perspective into the code, allowing to see missing or odd pieces. For example, here is diagram of the chat module: As you can see, POST /chat/poll is marked as read, since it is implemented but not covered with any single use case. This endpoint serves data to legacy UI served directly by the back-end (to be removed). It is to be removed, hence there was no point in testing it. Red marker serves as concise reminder of that. Same goes for member-typing and member-read-thread domain events, which are subscribed to but are never really used (in a way that is verified by use cases). This is also something that we will need to cleanup, once focus shifts back to the UI. Next week we plan to decide on the road map for implementing our UI. Currently it is served as HTML directly by our back-end, which is not a good thing (modules get complicated). Possible options include: move html generation with all the resources into a separate front-end module (or a collection of them); kill existing UI code and reimplement everything with node.js and rendr; find some other alternatives. In the longer term we want to have rich single-page application that offers smooth experience to our users (feeling almost like a native application). However, implementing it right now might be a long and risky shot. It seems to be safer to capture existing requirements in a simple User Interface (building upon the code that we already have but with better application design) and deliver that to production. We could always improve later. 
Besides the user interface, there is also some more back-end work to do, like: implementing the albums module (public and private photo albums); implementing tags and categories for the profiles; improving performance of the system to let it process the Sweden dataset faster; figuring out profile properties. I personally look forward to getting back to work on the front end, since it is a part of the critical path (in terms of project management) to our delivery. Earlier this week I started reading the book on Single Page Web Applications, only to discover that web app development these days strongly resembles the development of desktop applications. Aside from the specifics of HTML, CSS, JS and various browsers, the design principles and patterns are quite similar. It should be fun."}
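The module specification described above can be as simple as a plain struct; here is a sketch with illustrative names, not the actual HPC code:

package hpc

// ModuleSpec describes the static qualities of a module that are known
// before it is registered in the infrastructure.
type ModuleSpec struct {
	Name     string   // module name, doubling as its database name
	Schema   []string // SQL scripts to create and reset the module DB
	UseCases []UseCase
}

// UseCase is elided here; see the earlier sketches for its shape.
type UseCase struct{ Title string }

// CreateTestDB provisions a throwaway database for one test run from
// the spec; the real wiring would pass the resulting connection into
// the module on registration and drop the DB after the run.
func CreateTestDB(spec ModuleSpec, exec func(sql string) error) error {
	for _, script := range spec.Schema {
		if err := exec(script); err != nil {
			return err
		}
	}
	return nil
}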
{"url":"/happypancake/2014-08-25/","title":"Native Performance","text":"Native Performance The highlight of the last week was switch from pure Go driver for PostgreSQL to a wrapper around native library jgallagher/go-libpq . Pure Go driver seemed like a nice thing in the beginning (no need to bother with native dependencies or header files), however it turned out to bee too immature. Confusing error reporting for connection issues was one of the biggest problems for us. Thanks to database/sql abstraction in golang, I was able to switch to native driver in one day. Cleanups mostly involved our handling of uuid type. Wrapper around native PostgreSQL client library offers similar performance and wonderful error reporting . There were a few issues with our code while trying to run Finland, they were mostly solved by replacing client-side transactional upsert statements with PostgreSQL upsert statements using Common Table Expressions. One rare race condition in chat module was fixed with a tiny smoke tester and a fine-grained row lock. We were able to run entire Finland dataset (except visits) in 2m53s , which is comparable to pure go driver. Sweden takes more time, but it actually runs, without any big issues. As long as our system can handle burst of events coming from the entire dataset, running the usual production load should never be a problem. Aside from the peformance, I spent time implementing albums module (permissions on private photo album), adding proper implementation of the profile (public view of approved user information with any permissions applied). Then, I also went across the codebase cleaning up modules and adding missing use cases. Functionality of these modules is completely covered now: news , alerts , chat , block , likes , poll , flirt , profile , diary , albums , favorite . draft , auth , review lack use cases, so that is one thing we will try to fix this week. These use cases become even more important to use, since we check multiple things while verifying them: verify that all expectations are met; ensure that there are no connection leaks in code being tested; check that all events present in given are listed in module declaration (it is used to route only the needed events to module queues in RabbitMQ) I also removed all legacy HTML stuff from our back-end code. This separation improved it a lot. Hopefully, today we'll decide on our front-end strategy and start implementing this last missing bit. Use cases help to close feedback loop, but there is nothing like a tangible product. Pieter and Tomas were busy this week dealing with performance optimizations of the code and performing these time consuming runs on the production data. I think, they have quite a lot to share about the inner workings of RabbitMQ and PostgreSQL on Glesys datacenter (VMWare virtualization)."}
{"url":"/happypancake/2014-09-07/","title":"Features, Use Cases, Rendr","text":"Features, Use Cases, Rendr Last two weeks went fast. I mostly focused on the backend, rewriting some of our older modules ( register , draft , review and auth ) to match JSON API requirements and covering them with use-cases. Missing edge cases were addressed as well. There were also a few technical features, affecting the entire codebase. First of all, I added role-based security to the system, extending authorization to support various levels of super-users, like admins and reviewers. This required a slight adjustment of use case infrastructure. Second, Tomas asked to get rid of all stored procedures in the code. So I simply reconfigured use case verifier to throw errors on any module that declares stored procedures . A few more hours to clean up the code, and we are guaranteed not to have this problem any more. Third, I introduced a quick way to stress test a single module in isolation by running a special stress utility. This is faster than generating Finland/Sweden events and passing them through the system. Stress utility works by taking all use cases of a module and using them to generate events over and over again. It works automatically and provides really fast feedback. Fourth, I went through the entire codebase, replacing backend-side upsert statements with PSQL upsert queries , while also switching queries to use prepared statements . This helped to resolve a lot of concurrency issues (detected by stress tool) and speed up the execution by 40% on local machine and by 5-7% at the glesys environment. Fifth, I made all event handlers idempotent . This is needed to handle message duplication in cases of network partitions (something that might happen with RabbitMQ). The solution was to simply extend our use case verifier. After performing a normal run, it goes through all the use cases, duplicating one event at a time and making sure that all expectations still match. Then I simply went through the solution, fixing all tests that appeared to be broken. Afterwards this logic was added to the build process, to ensure that new code will stay idempotent. In this case we handle only idempotency, since this was a \"low hanging fruit\". Out-of-order messages is something we don't deal with, yet. Pieter spent most of his time working on the performance and stability of the codebase . He used our Finland and Sweden datasets as a tool to make the system more robust. This also involved improving our RabbitMQ bus to make it reconnect on network failures (golang client doesn't do that out of the box). Tomas was busy with maintenance of HPC1 , tweaking it and moving the entire system to new servers on Glesys . This included migration of all 3 countries, SQL DBs, media files and app servers. Lot's of work to do. User Interface Tomas has a vision of a Single-Page Application (SPA), where we would have rich experience on the clients with the ability to support older browsers. This SPA could even be shipped as a native app to desktops and mobile phones (via something like PhoneGap), offering really nice user experience. At the end of the last week I finally started working in this direction. Our primary choice of technology is rendr from AirBNB. It is a relatively small library that allows to run the same JS code on the server and the client, rendering HTML at either one, depending on the state. Working with rendr actually involves touching multiple technologies: node.js , expresso stack , backbone.js , handlebars.js . 
So I started by picking these technologies one at a time and learning more about them. That's what I plan to focus on during the next week."}
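The duplication check from the idempotency work above is simple enough to sketch: re-run every use case with each given event repeated once, and demand an identical outcome. Types here are stand-ins for the real verifier:

package usecase

// RunFn replays a sequence of events through a module and reports the
// observed outcome (responses plus published events, serialized).
type RunFn func(given []string) (outcome string, err error)

// VerifyIdempotency first runs the case normally, then once per event
// with that event duplicated; an idempotent module must produce exactly
// the same outcome every time.
func VerifyIdempotency(given []string, run RunFn) (bool, error) {
	want, err := run(given)
	if err != nil {
		return false, err
	}
	for i := range given {
		dup := make([]string, 0, len(given)+1)
		dup = append(dup, given[:i+1]...)
		dup = append(dup, given[i]) // duplicate event i
		dup = append(dup, given[i+1:]...)
		got, err := run(dup)
		if err != nil || got != want {
			return false, err
		}
	}
	return true, nil
}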
{"url":"/happypancake/2014-09-15/","title":"Getting started with Node.js, Lazojs","text":"Getting started with Node.js, Lazojs I spent the last week learning Javascript, Node.js and some of the ecosystem . JavaScript : The Good Parts and Node.js: Up and Running were a good introduction into the field. It is surprising how diverse and rich node.js ecosystem is . I think, this could be attributed to the language itself. JavaScript is everywhere, and it is extremely flexible. This creates a lot of diversity for both good and evil. Besides, it is really easy for anybody to publish node.js packages. There are thousands of them, including some real gems. I'm gradually falling in love with node.js for front-end development purposes . There just is too much goodness already implemented for you (if compared to .NET or golang environments, for example). Golang is still quite good for backends , though. My impression of node.js is clearly biased , since I was focusing only on the technologies required for our front-end. Here the list of things I've been learning about: Backbone.js - lightweight model-view library ( great introduction ). Handlebars.js - nice templating library . Grunt - JavaScript task runner . Underscore.js - functional programming helpers . Express - popular web application framework . for node.js (used in rendr) with a lot of middleware available. hapi - another server framework for Node.js (powering Lazo.js). Require.js - an implementation of Asynchronous Module Definition (AMD) , used by Lazo.js to load modules on client and server. Rx.js - reactive extensions for JavaScript , coming from the Microsoft world. Rendrjs - framework to render Backbone.js apps on the client and the server , using Node.js. Lazojs - client-server web framework for building decomposed single-page applications with optimized first page load and SEO compliance. Rendr project turned out to be lacking for our needs. However, Tomas discovered more polished alternative to it: LazoJS from Walmart Labs . By the end of the week we switched focus to LazoJS and started evaluating it. Documentation is awesome in this project. I'll start this week by implementing an infinite scrolling for our news feed in lazo.js (with all the SPA benefits), then will continue adding other front-end features. Tomas was extremely busy with the migration work during the week . Moving the most popular dating web site in a country has a lot of challenges. This process is done for HPC1 for now. This week Tomas plans to enrich HPC1 with additional data capture capabilities . This way we'll be able to serve better news for our members in HPC2. Pieter was away on his well-deserved vacation, driving a new bike all around the Europe. He is with us this week, getting back to making our HPC middleware battle-ready . Welcome back, buddy!"}
{"url":"/happypancake/2014-09-23/","title":"Web development, the good parts","text":"Web development, the good parts Last week was dedicated to front-end development. We made some nice progress there, thanks to LazoJS framework and Jason Strimpel himself. This framework, supported by the entire Node.JS ecosystem, makes development quite simple even for somebody so unexperienced in UI like me. Of course, Node.js has some drawbacks. For example, its performance is worse than the performance of golang code , pushed to the extreme. There also are a lot of bad things in JavaScript development itself : callback hell, memory leaks, relatively poor debugging experience, low quality of some public packages. However, if you consciously stick with the good parts and avoid the bad ones , the experience can be quite rewarding. First of all, since our front-end code is rather simple and stateless, we could always solve some performance problems by getting another machine . This would be much cheaper, than the time of a developer. Virtual Machines are available starting from a few dollars per month. That is probably cheaper than a few hours of development. Second, Node.js has diverse ecosystem of things that help you develop web applications. If you pick a few good ones and stick with them, your life becomes simpler, speeding up the development. Here are some that were adopted last week: ESLint - pluggable linting utility that checks for common styling and syntax issues on JavaScript code. We adopted the style of Felix Geisendörfer for our project. ESLint is a part of our build no w. If there is an extra comma at the end of array or some trailing space - it will fail the entire build. JSFmt - utility for formatting, searching, and rewriting JavaScript . Whenever a file is saved to disk, we automatically reformat it. Reflex - runs a command or restarts a server whenever sources are changed . This reduces development friction a lot. Kudos go to Pieter for finding this one. Flycheck - syntax checking extension for Emacs that runs ESLint in the background. node-inspector - Node.js debugger based on Blink Developer Tools. We use it via the node-debug wrapper. web-mode - HTML template editing mode for Emacs that understands embedded CSS, Handlebars and Javascript. Lazo.js - client-server web framework for building responsive applications. It deserves a separate podcast. BackboneJS, UnderscoreJS and jQuery come in bundled. Besides, there is PhantomJS (headless WebKit scriptable with JS) which could be used to verify UI use-cases. If we make these use-cases runnable against BrowserStack , then we have all operating systems and browsers covered for us. Feature-wise, we implemented infinite scrolling for our feeds, started working on real-time notifications (highlighting menu items which contain new content). Pieter was also implementing the profile UI. Tomas will join HPC2 development effort later this week. Last week he deployed a new like feature to HPC1, to start capturing information about the most popular content (we could use it to populate feeds in HPC2). It was used almost a hundred thousand times within the first 24 hours!"}
{"url":"/happypancake/2014-09-29/","title":"Reactive User Experience","text":"Reactive User Experience Last week I focused on implementing some requirements related to reactive user experience. That is basically about things that make the users feel as if they were working with a native desktop/mobile application. First, menu items in the web interface should highlight as soon as a user gets new information that he hasn't viewed yet (news items, alerts, chat messages). That's much like Facebook experience. Implementing this relatively simple feature was hard in our previous iteration over the User Interface, since we had no notion of events and view life-cycle. Switching to components that communicate through events makes everything simpler in our current iteration. It is currently implemented via 2 components: Poller component, which runs on the client-side. While the user is logged-in, it publishes events about feed changes as soon as they are retrieved from the server (delivered via the long-polling, would probably switch to sockets.io later). Navbar component, which renders status in HTML during server-side rendering (happens on first page load or in case where we don't have push-state) and then subscribes to feed change events on the client-side. Whenever status changes, CSS class of a DOM element is changed accordingly. In case, when user has JavaScript and push-state enabled, there will be only one instance of navbar and poller running in his browser tab, no matter how many pages he visits. Second, menu highlights should persist when you travel between the pages . There shouldn't be even flickering. This is solved by client-side rendering over Backbone. Third, news feed and alerts feed should reactively display \"You've got new messages\" whenever there are new things to read. That is very similar to what Twitter does. Once, I had the poller component, it was trivial to implement this feature. Feed views simply subscribe to poller events on rendering. Whenever there are new items - we show message box. Whenever a view is removed from the DOM, we detach all event listeners and avoid memory leaks. After getting this simple reactivity straight, I started working on the chat. That's what I'll be working on this week. Pieter and Tomas were working mainly on the Profile view, focusing on component reusability (e.g. like and flirt buttons). In parallel, Tomas was working with like feature that was recently introduced into the existing version of HPC1 across all the countries."}
{"url":"/happypancake/2014-10-07/","title":"Feeds, Chat, Online list and CSS","text":"Feeds, Chat, Online list and CSS Last week I focused on the user interface , rebuilding features from the first UI demo. This time the code was much better, though: all benefits of LazoJS framework are built-in; less memory leaks and unexpected behavior (last UI prototype stuck there), for example all views explicitly unsubscribe event handlers before being unloaded from the DOM; more reactivity and interactivity across the system. Long story short, I implemented/polished: diary, news, alerts feeds are now more polished , have same endless scroll behavior; message inbox and conversation views are now implemented and are working reactively ; all feed-based views have consistent ad display (which works for items retrieved from the server on the first page load and items added reactively); there is a new online list view, which displays last active members on the site; it acts as a roulette, never showing the same member twice within the session; the list is updated with a new photo once every 10 seconds. During the week Tomas introduced me to the articles on CSS Architecture . They describe how to design CSS classes and align them with HTML code in order to create web applications that are scalable (people-wise) and maintainable. Suddenly, HTML side started making a lot more sense to me - there is a lot of design process involved there. At the end of the week I went through all of my client-side code, refactoring HTML, CSS and JavaScript to a better design (previously there was no design at all). This way, Tomas will have less work to do, when he starts applying nice UI look to our system. Pieter and Tomas focused on implementing proper component reuse at Lazo level in the profile page . For example, like button is a reusable component which could appear multiple times on a single page. Their plan is to try and apply this approach to a collection views this week (e.g. news or diary feed). This week I'm working from Italy (Bologna). I have a bunch of outstanding tickets regarding UI features I implemented , so that's what I'll focus on. We are getting close to another demo and a release in Norway."}
{"url":"/happypancake/2014-10-27/","title":"To ReactJS and Facebook Flux","text":"To ReactJS and Facebook Flux During the last 3 weeks we evolved our understanding of frontend stack . LazoJS worked quite well in the cases that I dealt with. These were mostly simple feeds like news , inbox or conversation . However, it created some complexity and overhead for Tomas and Pieter as they explored encapsulation of UI elements and their reuse. In short, in LazoJS it was quite complicated to create and apply a reusable component (e.g. avatar or like button). While I focused on finishing features in LazoJS, Tomas and Pieter explored something else - ReactJS framework and Facebook Flux architecture . In the end they discovered that Facebook Flux was a better fit for us in the long term (at least, better than the other previously explored options like custom PJAX, AirBnB Rendr and WalmartLabs LazoJS). So we made a switch and then spent the last 2 weeks porting our frontend to isomorphic flux components from Yahoo. The investment of time was very worth it, I think. What are ReactJS, Flux and Yahoo components? ReactJS is a battle-tested javascript library for building user interfaces. It allows you to decompose your UI into very simple reusable components, which can then be composed together, rendered using Virtual DOM and then synced back to the HTML. ReactJS is very fast and focuses on unidirectional data flow, which makes code much easier to reason about. ReactJS is maintained by Facebook and also used by Instagram, Yahoo and Khan Academy. Flux is a Facebook architecture for building User Interfaces out of predefined building blocks using one-way data flow. It looks very much like CQRS and is known to scale well in large organizations. Flux architecture works very well with ReactJS. Here is what a developer from Facebook says: From first-hand experience, I can say that React+Flux has scaled well to 8+ developers over 800+ JS files and ~60k lines of code in a large single page app here at Facebook. At HappyPancake, we really liked how Flux helps us to decompose UI into components and also solve event cascade hell which is often present in apps based on BackboneJS or AngularJS. Yahoo Flux components are a set of components open-sourced by Yahoo. These components implement various building blocks of Flux architecture for building isomorphic web applications: dispatcher, router and store. Usually, single-page web applications render all HTML on the client-side. This means that the very first page load can take a few seconds or more: we need to load HTML, parse it, load all required JavaScript libraries, fetch the data for the current route, render it into HTML and then update the DOM. This doesn't work well for older browsers as well. Isomorphic web applications work around that by rendering first page on the server, so that user gets a user interface immediately. Javascript will be loaded later, turning the application into a usual Single-page application, with all its smoothness and responsiveness. Why did we switch from to Flux? The goal of the project is long-term. We don't simply want to build a new version of HappyPancake. Instead, we need to build a software that can continue evolving and scaling since the moment we release it. To achieve that, we have to iterate quickly through the possible development options as early as possible, while the cost of error is minimal. 
HappyPancake is a unique project due to a number of factors, so we need to evaluate and pick the combination of options (both technical and design-related) which would be good enough for accomplishing the long-term goals of the business. The switch from LazoJS to Flux is an investment of time. By my estimates, it will pay off within 2-4 weeks: the codebase is polished and supported by very large companies; component reuse is superior; the code is simpler to reason about; CQRS architecture is something that we know very well how to test. There are a few technical challenges that we would need to solve in the upcoming days as well: existing flux demos don't pay very much attention to solution structure; and we got very spoiled by the benefits provided by decomposition of the domain into focused modules. We already have some ideas, but would need to agree upon the conventions; tooling for Flux is great, however there is no equivalent of gofmt/jsfmt for JSX files. We'll either need to wait for one or tweak existing solutions; naming of some building blocks in Flux is very weird. For example, even Facebook developers frequently use the terms actions and action creators (equivalents of command handlers and command outcomes) interchangeably. We need to get used to that. Despite these small drawbacks, I think that Flux is so far the best way to organize development of web user interfaces, where you need to manage the complexity and long-term development effort. At the very least, Flux works in our specific case of rewriting an already existing (and most popular in Sweden) dating web site with a team of 3 developers distributed across Europe. So far we have migrated the profile view, the news feed (which is being renamed to discover, to match its purpose) and the chat to Flux. We need to do a proper 3-way team sync on the work accomplished before fanning out and working on the rest of the features. That's our plan for today and the rest of the week."}
{"url":"/happypancake/2014-11-06/","title":"Project Complete","text":"Project Complete Over the last days we revisited the entire stack, cleaning up the code and improving the design until we were happy with it. We've polished these functionally distinct pieces: Registration flow. Login/logout and authentication details. Discovery feed - equivalent of a facebook newsfeed, with infinite scrolling. Chat , delivering messages and presence notifications reactively. Reactive navbar . Profile view with public and private photos, liking and ability to favorite. That covers only functionality polished from front to the end. We also have backend implemented for the rest of the system (drafts, favorites, online lists, reviewing, alerts, history, flirts etc). Facebook Flux Architecture was a pleasure to work . It is more simple and elegant than any other web UI stack I've worked with before. Composition and reuse of UI components is superb. Flux methodology would be quite familiar to anyone who has dealt with CQRS/ES concepts before. It is very easy to reason about and leads to clean and intuitive codebases. We used isomorphic Flux components from Yahoo to speed up our development (the ones from the next version of Yahoo Mail), basing the initial frontend layout on the chat sample from Yahoo. Bill Fisher (Facebook) and Michael Ridgway (Yahoo) were very kind and extremely helpful. They helped to speed up our development a lot. Design Summary To summarize. Our tech stack and software design went through many iterations. We optimized for: simplicity of the code and design; performance and scalability; ability to evolve, grow and perform A/B testing; ability to develop and maintain with very few people. Of course, there are multiple implementation details that are not present in this bigger picture, starting from design process for frontend components and up to setting up ctags support in IDE. But these are the details. Resulting design looks quite simple: This combination of technologies and tools is optimized for a growing social web site with 600000 members (50% of them - active) and planned expansion to multiple countries. All that - with a limited development effort and resources."}
{"url":"/happypancake/epilogue/","title":"Epilogue","text":"Epilogue It is a pleasure to work with such a codebase. Someone else will have to continue that, though. This project is complete : we've got the foundation up and running. It was a great experience to work and learn in a distirbuted HPC team: Tomas has a unique ability to explore new technologies, learn fast and adapt. His decisions are always based on vast experience and large number of facts, while ultimately driving towards simplicity. Pieter is an awesome team-mate to work with. He learns and accepts changes fast. It was a pleasure to discover and refine project domain towards a deeper insight and simplicity with him. The journey was fun and exciting. We should do it again sometime. I'm definitely interested to get my hands dirty with a similar project."}
{"url":"/happypancake/retrospective/","title":"Retrospective","text":"Retrospective Our project at HappyPancake completed this week. We delivered a simple and scalable foundation for the next version of largest free dating web site in Sweden (with presence in Norway and Finland). Journey Below is the short map of that journey. It lists technologies and approaches that we evaluated for this project. Yellow regions highlight items which made their way into the final design. Project Deliverables Project deliverables included: Deployable full-stack application with major features implemented. Domain model captured in software design (back-end and front-end) and a set of declarative use-cases (acting as living documentation, system outline and behavior test suite). Configured environments for development and continuous integration (docker container). Strategies for further evolution and scaling of the system. Code for migrating existing production deployments to a new version of software. Final high-level architecture is simple to reason about and scale . It was designed to be that way. Logically the entire solution consists from backend modules (represented by golang packages) and elements of Facebook Flux architecture (grouped together in namespaces by naming conventions). Such structure helps to maintain the project as it grows in size and complexity. This design also helps to scale the deployment to handle higher loads. We can scale backend by: moving individual modules to bigger servers; launching multiple instances of a single module; switching storage of an individual module to clustered solution, moving it to bigger servers or even pushing to the cloud. We can scale frontend by simply launching new instances behind the load balancer. Solution structure also provides a natural way to split the work between the developers. Given the established published language (contracts of API and events), we can also bring in more developers, assigning them to work on individual backend modules or frontend namespaces. Lessons Learned Lessons learned : Picking the right technology can reduce the development effort. In my next project I'll try to focus even more on divide and conquer approach - isolate a small part first and then evolve it, limiting the amount of work in progress. It is crucial to establish feedback loop as early as possible, involving all stake-holders. This builds trust and helps to avoid surprises."}
{"url":"/ilya/how-to-build-best-rag/","title":"Ilya Rice: How I Won the Enterprise RAG Challenge","text":"Ilya Rice: How I Won the Enterprise RAG Challenge Ilya Rice: How I Won the Enterprise RAG Challenge From Zero to SoTA in a Single Competition In this guest blog post Ilya Rice describes the approach that helped him build the best RAG and win in the Enterprise RAG Challenge . He took first place in both prize categories and on SotA leaderboard. Source code . Also posted at TimeToAct Austria and on Habr (RU). What is the RAG Challenge about? The task was to create a question-answering system based on annual company reports. Briefly, the process on the competition day was as follows: You're given 100 annual reports from randomly selected companies and 2.5 hours to parse them and build a database. The reports are PDFs, each up to 1000 pages long. Then, 100 random questions are generated (based on predefined templates), which your system must answer as quickly as possible. All questions must have definitive answers, such as: Yes/No; Company name (or multiple company names in some cases); Titles of leadership positions, launched products; Numeric metrics: revenue, store count, etc. Each answer must include references to pages containing evidence of the response, ensuring the system genuinely derived the answer rather than hallucinating. Winning system architecture: Apart from basic steps, the winning solution incorporates two routers and LLM reranking. You can check out the questions and answers produced by my best-performing system here . Now, I'll delve into every step involved in building the system, the bumps and bruises I experienced along the way, and the best practices discovered during this process. Quick Guide to RAG RAG (Retrieval-Augmented Generation) is a method that extends the capabilities of Large Language Models (LLMs) by integrating them with a knowledge base of any size. Development pathway of a basic RAG system includes the following stages: Parsing : Preparing data for the knowledge base by collecting documents, converting them to text format, and cleaning out irrelevant noise. Ingestion : Creating and populating the knowledge base. Retrieval : Building a tool that finds and returns relevant data based on user queries, typically employing semantic search within a vector database. Answering : Enriching the user's prompt with retrieved data, sending it to the LLM, and returning the final answer. 1. Parsing To start populating any database, PDF documents must first be converted to plain text. PDF parsing is an extremely non-trivial task filled with countless subtle difficulties: preserving table structures; retaining critical formatting elements (e.g., headings and bullet lists); recognizing multi-column text; handling charts, images, formulas, headers/footers, and so on. Interesting PDF parsing issues I encountered (but didn't have time to solve): Large tables are sometimes rotated by 90 degrees, causing parsers to produce garbled and unreadable text. Charts composed partially of images and partially of text layers. Some documents had font encoding issues: visually, the text looks fine, but attempting to copy or parse it results in a nonsensical set of characters. Fun fact: I investigated this issue separately and discovered that the text could be decoded—it was a Caesar cipher with varying ASCII shifts per word. This raised numerous questions for me. If someone intentionally encrypted copying of a publicly available company report—why? 
If the font broke during conversion—why precisely this way? Choosing a Parser I experimented with about two dozen PDF parsers: niche parsers; reputable ones; cutting-edge ML-trained parsers; proprietary parsers with API access. I can confidently state that currently no parser can handle all nuances and fully return PDF content as text without losing part of the important information along the way. The best-performing parser for the RAG Challenge turned out to be the relatively little-known Docling. Interestingly, one of the competition organizers—IBM—is behind its development. Parser Customization Despite its excellent results, Docling lacked some essential capabilities. These features existed partially, but in separate configurations that couldn't be combined into one. Therefore, I rolled up my sleeves, thoroughly examined the library's source code, and rewrote several methods to fit my needs, obtaining a JSON containing all necessary metadata after parsing. Using this JSON, I constructed a Markdown document with corrected formatting and near-perfect conversion of table structures from PDF to not just MD, but also HTML format, which proved important later on. This library is quite fast, but still not fast enough to parse 15 thousand pages within 2.5 hours on a personal laptop. To solve this, I leveraged GPU acceleration for parsing and rented a virtual machine with a 4090 GPU for 70 cents an hour for the competition. Runpod turned out to be extremely convenient for short-term GPU rentals. Parsing all 100 documents took about 40 minutes, which, based on reports and comments from other participants, is an extremely high parsing speed. At this stage, we have reports parsed into JSON format. Can we now populate the database? Not yet. First, we must clean the text from noise and preprocess the tables. Text Cleaning and Table Preparation Sometimes parts of the text get parsed incorrectly from PDFs and contain specific syntax, reducing readability and meaningfulness. I addressed this using a batch of a dozen regular expressions. Example of poorly parsed text Documents with the aforementioned Caesar cipher were also detected via regex patterns. I tried to decode them, but even after restoration, they contained many artifacts. Therefore, I simply ran these documents entirely through OCR. Table Serialization In large tables, the metric name (horizontal header) is often positioned too far from the vertical headers, weakening semantic coherence. There are 1,500 irrelevant tokens separating vertical and horizontal headers This significantly reduces the chunk's relevance in vector search (let alone situations where the table doesn't fit entirely into one chunk). Additionally, LLMs struggle to match metric names with headers in large tables, possibly returning a wrong value. Serialization of tables became the solution. Research on this topic is sparse, so I had to navigate it independently. You can google Row-wise Serialization, Attribute-Value Pairing, or read this research paper. The essence of serialization is transforming a large table into a set of small, contextually independent strings. After extensive experiments with prompts and Structured Output schemas, I found a solution that enabled even GPT-4o-mini to serialize huge tables almost losslessly. Initially, I fed tables to the LLM in Markdown format, but then switched to HTML format (this is where it proved useful!). Language models understand it much better, plus it allows describing tables with merged cells, subheadings, and other structural complexities.
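As an illustration, the serialization call can be sketched roughly like this. This is a minimal sketch assuming the openai Python SDK with Pydantic-based Structured Output; the field names mirror the serialized blocks shown below, but the actual prompt and schema live in tables_serialization.py in the repo:

from openai import OpenAI
from pydantic import BaseModel, Field

class SerializedBlock(BaseModel):
    subject_core_entity: str = Field(description="The entity or metric this block describes")
    information_block: str = Field(description="A self-contained sentence with values and units")

class SerializedTable(BaseModel):
    blocks: list[SerializedBlock]

client = OpenAI()

def serialize_table(table_html: str) -> SerializedTable:
    # HTML input lets the model see merged cells and subheaders
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Split this table into contextually independent blocks."},
            {"role": "user", "content": table_html},
        ],
        response_format=SerializedTable,
    )
    return completion.choices[0].message.parsed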
To answer a question like, \"What was the company's shareholder's equity in 2021?\" it's sufficient to feed the LLM a single sentence rather than a large structure with lots of \"noise.\" During serialization, the whole table is converted into a set of such independent blocks: subject_core_entity : Shareholders' equity information_block : Shareholders' equity for the years from 2012/3 to 2022/3 are as follows: ¥637,422 million (2012/3), ¥535,422 million (2013/3), ¥679,160 million (2014/3), ¥782,556 million (2015/3), ¥540,951 million (2016/3), ¥571,983 million (2017/3), ¥511,242 million (2018/3), ¥525,064 million (2019/3), ¥513,335 million (2020/3), ¥577,782 million (2021/3), and ¥1,274,570 million (2022/3). After obtaining a serialized version of the table, I placed it beneath the original table as a kind of textual annotation for each element. You can view the serialization prompt and logic in the project's repository: tables_serialization.py * Despite serialization's fantastic potential, the winning solution ultimately didn't use it. I'll explain why at the end of the article. 2. Ingestion Reports have been converted from PDF to clean Markdown text. Now let's create databases from them. Agreeing on terminology In the realm of search systems (Google Search, full-text-search, Elastic Search, vector search, etc.), a document is a single indexed element returned by the system as a query result. A document could be a sentence, paragraph, page, website, image—doesn't matter. But personally, this definition always confuses me due to the more common, everyday meaning: a document as a report, contract, or certificate. Therefore, from here on, I'll use document in its everyday meaning. The element stored in the database, I'll call a chunk , since we store simply sliced pieces of text. Chunking According to the competition rules, we had to specify the pages containing relevant information. Enterprise systems use the same approach: references allow verifying that the model's answer isn't hallucinated. This not only makes the system more transparent to users but also simplifies debugging during development. The simplest option is to use a whole page of a document as a chunk since pages rarely exceed a couple thousand tokens (although table serialization could expand a page up to five thousand). But let's think again about the semantic coherence between the query and a chunk of document text. Usually, an informational piece sufficient for an answer is no larger than ten sentences. Thus, logically, a target statement within a small paragraph will yield a higher similarity score than the same statement diluted within a whole page of weakly relevant text. I split the text on each page into chunks of 300 tokens (approximately 15 sentences). To slice the text, I used a recursive splitter with a custom MD dictionary. To avoid losing information cut between two chunks, I added a small text overlap (50 tokens). If you're worried that overlap won't fully eliminate risks from poor slicing, you can Google \"Semantic splitter.\" This is especially important if you plan to insert only found chunks in the context. However, the precision of slicing had almost no effect on my retrieval system. Each chunk stores its ID and the parent page number in its metadata. Vectorization Our collection of chunks is prepared; now let's create the vector database—or rather, databases. 100 databases, where 1 database = 1 document. 
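Putting the ingestion step together—per-page chunking plus one index per document—can be sketched as follows. This is a minimal sketch assuming LangChain's recursive splitter, tiktoken for token counting, and the tools named just below (FAISS and text-embedding-3-large); the competition code uses a custom MD-aware splitter configuration:

import faiss
import numpy as np
import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,    # ~15 sentences, measured in tokens
    chunk_overlap=50,  # protects information cut between two chunks
    length_function=lambda text: len(enc.encode(text)),
)

def build_document_index(pages):  # hypothetical: list of (page_number, page_text) for one report
    chunks = []
    for page_number, page_text in pages:
        for i, piece in enumerate(splitter.split_text(page_text)):
            # each chunk keeps its ID and parent page number in metadata
            chunks.append({"id": f"{page_number}-{i}", "page": page_number, "text": piece})
    embeddings = client.embeddings.create(
        model="text-embedding-3-large",
        input=[c["text"] for c in chunks],
    ).data
    matrix = np.array([e.embedding for e in embeddings], dtype="float32")
    faiss.normalize_L2(matrix)  # with normalized vectors, inner product == cosine similarity
    index = faiss.IndexFlatIP(matrix.shape[1])
    index.add(matrix)
    return index, chunks  # one Flat index per document

Now, back to why there are 100 separate databases.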
Because why mix information from all companies into one heap and later try to separate one company's revenue from another's? Target information for an answer is always strictly within a single document. We only need to determine which database to query for a given question (more on that later). To create, store, and search the vector databases, I used FAISS . A bit about vector database formats Databases were created with the IndexFlatIP method. The advantage of Flat indices is that all vectors are stored \"as-is,\" without compression or quantization. Searches use brute-force, giving higher precision. The downside is such searches are significantly more compute- and memory-intensive. If your database has at least a hundred thousand elements, consider IVFFlat or HNSW. These formats are much faster (though require a bit more resources when creating the database). But increased speed comes at the cost of accuracy due to approximate nearest neighbor (ANN) search. Separating chunks of all documents into different indexes allowed me to use Flat databases. IP (inner product) is used to calculate the relevance score through cosine similarity. Aside from IP, there's also L2—which calculates relevance score via Euclidean distance. IP typically gives better relevance scoring. To embed chunks and queries into vector representation, I used text-embedding-3-large . 3. Retrieval After creating our databases, it's time to move on to the \"R\" (Retrieval) part of our RAG system. A Retriever is a general search system that takes a query as input and returns relevant text containing the information necessary for an answer. In the basic implementation, it is simply a query to a vector database, extracting the top_n results. This is an especially critical part of the RAG system: if the LLM does not receive the necessary information in the context of a query, it cannot provide a correct answer—no matter how well you fine-tune your parsing or answer prompts. Junk in → Junk out. The quality of a retriever can be improved in many ways. Here are methods I explored during the competition: Hybrid search: vDB + BM25 Hybrid search combines semantic vector-based search with traditional keyword-based text search (BestMatch25). It theoretically improves retrieval accuracy by not only considering the meaning of the text but also precise keyword matches. Typically, results from both methods are merged and reranked by a combined score. I didn't particularly like this approach: in its minimal implementation, it often reduced the retrieval quality instead of improving it. Generally, hybrid search is a good technique and can be refined further by modifying input queries. At its simplest, LLMs can rephrase questions to remove noise and increase keyword density. If you've had positive experiences with hybrid search, especially regarding potential issues and solutions, please share in the comments. In any case, I had more promising alternatives in mind and decided not to explore this direction further. Cross-encoder reranking Reranking the results of vector search using Cross-encoder models seemed promising. In short, Cross-encoders give a more precise similarity score but are slower. Cross-encoders lie between embedding models (bi-encoders) and LLMs. Unlike comparing texts via their vector representations (which inherently lose some information), cross-encoders directly assess semantic similarity between two texts, giving more accurate scores. However, pairwise comparisons of the query with every database element take too long. 
Thus, cross-encoder reranking is suitable only for a small set of chunks already filtered by vector search. At the last minute, I abandoned this method due to the scarcity of cross-encoder reranking models available via APIs. Neither OpenAI nor other large providers offered them, and I didn't want the hassle of managing another API balance. But if you're interested in trying cross-encoder reranking, I recommend Jina Reranker . It performs well on benchmarks, and Jina offers a generous number of requests upon registration. Ultimately, I opted for an even more attractive alternative: LLM reranking! LLM reranking Simple enough: pass text and a question to the LLM and ask, “Is this text helpful for answering the question? How helpful? Rate its relevance from 0 to 1.” Until recently, this approach wasn't viable due to the high cost of powerful LLM models. But now we have fast, cheap, and smart enough LLMs available. Like Cross-encoder reranking, we apply this after initial filtering via vector search. I developed a detailed prompt describing general guidelines and explicit relevance criteria in increments of 0.1: 0 = Completely Irrelevant: The block has no connection or relation to the query. 0.1 = Virtually Irrelevant: Only a very slight or vague connection to the query. 0.2 = Very Slightly Relevant: Contains an extremely minimal or tangential connection. ... The LLM query is formatted as Structured output with two fields: reasoning (allowing the model to explain its judgment) and relevance_score , allowing extraction directly from the JSON without additional parsing. I further optimized the process by sending three pages at once in one request, prompting the LLM to return three scores simultaneously. This increased speed, reduced cost, and slightly improved scoring consistency, as adjacent blocks of text grounded the model's assessments. The corrected relevance score was calculated using a weighted average: vector_weight = 0.3 , llm_weight = 0.7 In theory, you could bypass vector search and pass every page through the LLM directly. Some participants did just that, successfully. However, I believe a cheaper, faster filter using embeddings is still necessary. For a 1000-page document (and some documents were this large), answering just one question would cost roughly 25 cents—too expensive. And, after all, we’re competing in a RAG challenge, aren’t we? Reranking via GPT-4o-mini cost me less than one cent per question! This approach delivered excellent quality, speed, and cost balance—exactly why I chose it. Check out the reranking prompt here . Parent Page Retrieval Remember how I talked about splitting text into smaller chunks? Well, there's a small but important caveat here. Yes, the core information needed to answer is usually concentrated in a small chunk — which is exactly why breaking the text into smaller pieces improves retrieval quality. But the rest of the text on that page may still contain secondary — yet still important — details. Because of this, after finding the top_n relevant chunks, I only use them as pointers to the full page, which then goes into the context. That's precisely why I recorded the page number in each chunk's metadata. Assembled Retriever Let's recap the final retriever steps: Vectorize the query. Find the top 30 relevant chunks based on the query vector. Extract pages via chunk metadata (remember to deduplicate!). Pass pages through the LLM reranker. Adjust relevance scores for pages. 
Return top 10 pages, prepend each page with its number, and merge them into a single string. Our retriever is now ready! 4. Augmentation Our vector database is set up, and the retriever is ready. With the \"R\" (Retrieval) part of RAG behind us, we now approach the \"A\" (Augmentation) part, which is pretty straightforward, consisting mainly of f-strings and concatenations. One interesting detail is how I structured prompt storage. After trying different approaches across multiple projects, I eventually settled on the following approach: I store prompts in a dedicated prompts.py file, typically splitting prompts into logical blocks: Core system instruction; Pydantic schema defining the response format expected from the LLM; Example question-answer pairs for creating one-shot/few-shot prompts; Template for inserting the context and the query. A small function combines these blocks into the final prompt configuration as needed. This method allows flexible testing of different prompt configurations (e.g., comparing the effectiveness of different examples for one-shot prompts). Some instructions may repeat across multiple prompts. Previously, changing such instructions meant synchronizing updates across all prompts using them, easily leading to mistakes. The modular approach solved this issue. Now, I place recurring instructions into a shared block and reuse it across several prompts. Additionally, modular blocks simplify handling when prompts become overly long. All prompts can be viewed in the project repository: prompts.py 5. Generation The third part \"G\" in RAG is the most labor-intensive. Achieving high quality here requires skillful implementation of several fundamental techniques. Routing queries to the database This is one of the simplest yet most useful parts of a RAG system. Recall that each report has its own separate vector database. The question generator was designed so that the company's name always explicitly appears in the question. We also have a list of all company names (provided along with the PDF reports at the start of the competition). Thus, extracting the company's name from a query doesn't even require an LLM: we simply iterate over the list, extract the name via re.search() from the question, and match it to the appropriate database. In real-world scenarios, routing queries to databases is more complex than in our controlled, sterile conditions. Most likely, you'll have additional preliminary tasks: tagging databases or using an LLM to extract entities from the question to match them to a database. But conceptually, the approach remains unchanged. To summarize: Found the name → matched to DB → search only in this DB. The search space shrinks 100-fold. Routing queries to prompts One requirement of the competition was the answer format. Each answer must be concise and strictly conform to the data type as if storing it directly into the company's database. Alongside each question, the expected type is given explicitly— int/float , bool , str , or list[str] . Each type involves 3–6 nuances to consider when responding. For example, if a question asks for a metric value, the answer must be solely numeric, without comments, currency signs, etc. For monetary metrics, the currency in the report must match the currency in the question, and numbers must be normalized—reports often write something like \"$1352 (in thousands)\" and the system must reply with \"1352000\". How to ensure the LLM considers all these nuances simultaneously without making errors? 
Simply put: you can't. The more rules you give the LLM, the higher the chance it'll ignore them. Even eight rules are dangerously many for current LLMs. A model's cognitive capacity is limited, and additional rules distract it from the main task—answering the posed question. This logically leads to the conclusion that we should minimize the number of rules per query. One approach is to break a single query into a sequence of simpler ones. In our case, though, we can achieve an even simpler solution—since the expected response type is explicitly provided, we only supply the relevant instruction set to the prompt, depending on the answer type. I wrote four prompt variations and chose the correct one with a simple if else . Routing compound queries The competition included questions comparing metrics from multiple companies. Such questions didn't fit the paradigm of other simpler queries, as they required additional steps to answer. Example question: Who has higher revenue, Apple or Microsoft? Let's think: how would a human approach this task? First, they'd find each company's revenue separately, then compare them. We embed the same behavior into our system. We pass the initial comparison question to the LLM and ask it to create simpler sub-questions that extract metrics for each company individually. In our example, the simpler sub-questions would be: What is Apple's revenue? and What is Microsoft's revenue? Now we can process these simpler queries through the standard pipeline for each company separately. After gathering answers for each company, we pass them into the context to answer the original question. This pattern applies to any complex queries. The key is recognizing them and identifying the necessary sub-steps. Chain of Thoughts CoT significantly improves answer quality by making the model \"think aloud\" before providing the final response. Rather than giving an immediate answer, the LLM generates a sequence of intermediate reasoning steps leading to the solution. Just like humans, LLMs handle complex problems better when breaking them down into smaller, simpler ones. CoT helps the model avoid missing crucial details, methodically process information, and reach correct conclusions. It's especially useful when context includes \"traps\" that might lead the model astray. You've undoubtedly heard the iconic phrase, Think step by step . This was one of the earliest attempts to enhance answer quality through prompting. It practically gave rise to fancy \"prompt engineering.\" However, for serious tasks, such generic instructions aren't sufficient. LLMs can sometimes \"fake\" reasoning—for instance, giving an immediate answer and then retroactively justifying it or hallucinating non-existent facts. This issue is particularly common among weaker models like GPT-4o-mini or Llama 8b. To ensure CoT is beneficial, you must clearly guide the model on how to reason. Explain reasoning steps, objectives, and provide examples. Structuring reasoning steps explicitly according to the prompt instructions can significantly enhance rule-following. In RAG, when reports lack the exact requested metric, the context may include similar but non-identical information. Models might attempt to twist the provided data to fit the requested metric. Because models are trained to be helpful, they are biased toward providing some answer rather than admitting ignorance. Trying to counteract this bias by just asking the model to be skeptical does work — but not consistently and not in all cases. 
Instead, prompting the model explicitly to analyze information from different angles or perspectives helps much more. I specifically instructed the model to focus on assessing the compatibility of metrics between questions and context. This significantly reduced hallucinations. Example reasoning with ambiguous context: Question: What is the Research and development equipment, at cost, for Ritter Pharmaceuticals, Inc.? Model's reasoning steps: The question asks for 'Research and development equipment, at cost' for Ritter Pharmaceuticals, Inc. This indicates a specific value from the balance sheet, representing the original purchase price of equipment specifically used for R&D, without any accumulated depreciation. The context (page 35) shows 'Property and equipment, net' at $12,500. This is a net value (after depreciation), and it's a broader category, encompassing all property and equipment, not just R&D equipment. The context (page 37) also mentions 'Accumulated Depreciation' of $110,000 for 'Machinery and Equipment'. This represents the total depreciation, not the original cost, and, importantly, it doesn't specify that this equipment is exclusively for R&D. Neither of these metrics exactly matches the requested metric. 'Property and equipment, net' is too broad and represents the depreciated value. 'Accumulated Depreciation' only shows depreciation, not cost, and lacks R&D specificity. Since the context doesn't provide the original cost of only R&D equipment, and we cannot make assumptions, perform calculations, or combine information, the answer is 'N/A'. Structured outputs Structured Output (SO) is a way of forcing the model to respond in a strictly defined format. It's usually passed as a separate parameter to the API, such as a Pydantic or JSON schema. This guarantees that the model always returns valid JSON strictly adhering to the provided schema. Field descriptions can also be included in the response schema. These don't affect structure but are treated by the LLM as part of the prompt. For example, here's a Pydantic schema for LLM reranking:

class RetrievalRankingSingleBlock(BaseModel):
    """Rank retrieved text block relevance to a query."""

    reasoning: str = Field(
        description=(
            "Analysis of the block, identifying key information and how it "
            "relates to the query"
        )
    )
    relevance_score: float = Field(
        description=(
            "Relevance score from 0 to 1, where 0 is Completely Irrelevant "
            "and 1 is Perfectly Relevant"
        )
    )

With this schema, the LLM always returns a JSON with two fields—the first a string, the second a number. CoT SO The methods described above are ideally combined with each other. During generation, the model has a dedicated field specifically for reasoning and a separate field for the final answer. This allows us to extract the answer without needing to parse it from lengthy reasoning steps. Chain of Thought can be implemented within Structured Outputs in several ways. For example, you could use multiple JSON fields, each guiding the model to intermediate conclusions whose combination leads it to the correct final answer. However, because the logic required for answering competition questions couldn't be described by a single predefined set of step-by-step instructions, I employed a more general approach, providing the model with a single reasoning field and defining the reasoning sequence directly within the prompt.
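For reference, wiring such a schema into a request might look roughly like this. A sketch assuming the current openai Python SDK; query and block_text are placeholder variables, and the system message stands in for the full reranking prompt:

from openai import OpenAI

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini-2024-07-18",
    messages=[
        {"role": "system", "content": "Rate how relevant the text block is to the query."},
        {"role": "user", "content": f"Query: {query}\n\nBlock: {block_text}"},
    ],
    response_format=RetrievalRankingSingleBlock,  # the Pydantic schema above
)
ranking = completion.choices[0].message.parsed
print(ranking.relevance_score, ranking.reasoning)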
In my main schema for answering competition questions, there were just four fields: step_by_step_analysis — preliminary reasoning (the Chain of Thought itself). reasoning_summary — a condensed summary of the previous field (for easier tracking of the model’s logic). relevant_pages — report page numbers referenced by the answer. final_answer — a concise answer formatted as required by the competition. The first three fields were reused across all four prompts tailored for different answer types. The fourth field varied each time, specifying the answer type and describing particular nuances the model had to consider. For example, ensuring that the final_answer field would always be a number or \"N/A\" was done like this: final_answer: Union[float, int, Literal['N/A']] SO Reparser Not all LLMs support Structured Outputs, which guarantee full adherence to schemas. If a model doesn’t have a dedicated Structured Output feature, you can still present the output schema directly within the prompt. Models are usually smart enough to return valid JSON in most cases. However, a portion of answers will inevitably deviate from the schema, breaking the code. Smaller models, in particular, fail to conform about half the time. To address this, I wrote a fallback method that validates the model’s response against the schema using schema.model_validate(answer) . If validation fails, the method sends the response back to the LLM, prompting it to conform to the schema. This method brought schema compliance back up to 100%, even for the 8b model. Here's the prompt itself . One-shot Prompts This is another common and fairly obvious approach: adding an example answer pair to the prompt improves response quality and consistency. I added a \"question → answer\" pair to each prompt, writing the answer in the JSON format defined by Structured Outputs. The example serves multiple purposes simultaneously: Demonstrates an exemplary step-by-step reasoning process. Further clarifies correct behavior in challenging cases (helping recalibrate the model's biases). Illustrates the JSON structure that the model’s answer should follow (particularly useful for models lacking native SO support). I paid significant attention to crafting these example answers. The quality of examples in the prompt can either boost or diminish response quality, so each example must be perfectly consistent with the directives and nearly flawless overall. If an example answer contradicts instructions, the model becomes confused, which can negatively affect performance. I meticulously refined the step-by-step reasoning field in the examples, manually adjusting the reasoning structure and wording of each phrase. Instruction Refinement This part is comparable in labor-intensity to the entire data preparation stage due to endless iterative debugging, proofreading of answers, and manual analysis of the model's reasoning process. Analyzing Questions Before writing prompts, I thoroughly studied both the response requirements and the question generator. The key to a good system with an LLM under the hood is understanding customer needs. Typically, this involves deep immersion into a professional domain and meticulous examination of questions. I'm convinced it's impossible to create a truly high-quality QA system for businesses unless you clearly understand the questions themselves and how to find answers (I'd be glad if someone could convince me otherwise). This understanding is also required to clarify all implicit meanings arising from user questions. 
Let's consider the example question Who is the CEO of ACME inc? In an ideal world, a report would always explicitly provide the answer, leaving no room for misinterpretation: CEO responsibilities are held by John Doe A RAG system would locate this sentence in the report, add it to the query context, and the user would receive an unambiguous answer: John Doe However, we live in the real world, where tens of thousands of companies express information in unlimited variations, with numerous additional nuances. This raises the question: what exactly can fall under the term \"CEO\"? How literally should the system interpret the client's question? Does the client want to know the name of the person holding a similar managerial role, or strictly that specific job title? Is stepping slightly away from a literal interpretation acceptable? How far is too far? Potentially, the following positions could be included: Chief Executive Officer — obviously, just the abbreviation spelled out. Managing Director (MD), President, Executive Director — slightly less obvious. Different countries use different titles for this role (MD in the UK and Europe, President in America and Japan, Executive Director in the UK, Asian countries, and non-profits). Chief Operating Officer, Principal Executive Officer, General Manager, Administrative Officer, Representative Director — even less obvious. Depending on the country and company structure, there may not be a direct CEO equivalent; these roles, although closest to CEO, have varying levels of overlap in responsibilities and authority—from 90% down to 50%. I'm unsure if there's an existing term for this, but personally, I refer to this as the \"interpretation freedom threshold\" issue. When responses are free-form, the interpretation freedom threshold is resolved relatively easily. In ambiguous cases, LLM tries to encompass all implicit meanings from the user's query, adding several clarifications. Here's a real example of a ChatGPT response: Based on the provided context, Ethan Caldwell is the Managing Director , which is the closest equivalent to a CEO in this company. However, he has been formally suspended from active executive duties due to an ongoing regulatory investigation. While he retains the title , he is not currently involved in company operations , and leadership has been temporarily transferred to the senior management team under board supervision . However, if the system architecture requires concise answers, as in the RAG Challenge, the model behaves unpredictably in these situations, relying on its internal “intuition”. Thus, the interpretation freedom threshold must be defined and calibrated in advance. But since it's not possible to define and quantify this threshold explicitly, all major edge cases must be identified, general query interpretation rules formulated, and ambiguities clarified with the customer. Beyond interpretation issues, general dilemmas may also occur. For example: Did ACME inc announce any changes to its dividend policy? Should the system interpret the absence of information in the report as an indication that no changes have been announced? Rinat (the competition organizer) can confirm—I bombarded him with dozens of similar questions and dilemmas during competition preparation :) Prompt Creation One week before the competition started, the question generator’s code was made publicly available. I immediately generated a hundred questions and created a validation set from them. 
Answering questions manually is quite tedious, but it helped me in two key areas: The validation set objectively measures the system's quality as I make improvements. By running the system on this set, I monitored how many questions it answered correctly and where it most commonly made mistakes. This feedback loop aids iterative improvements of prompts and other pipeline components. Manually analyzing questions highlighted non-obvious details and ambiguities in questions and reports. This allowed me to clarify response requirements with Rinat and unambiguously reflect these rules in the prompts. I incorporated all these clarifications into prompts as directive sets. Directive examples: Answer type = Number Return 'N/A' if metric provided is in a different currency than mentioned in the question. Return 'N/A' if metric is not directly stated in context EVEN IF it could be calculated from other metrics in the context. Pay special attention to any mentions in the context about whether metrics are reported in units, thousands, or millions, to adjust the number in final answer with no changes, three zeroes or six zeroes accordingly. Pay attention if the value is wrapped in parentheses; it means the value is negative. Answer type = Names If the question asks about positions (e.g., changes in positions), return ONLY position titles, WITHOUT names or any additional information. Appointments to new leadership positions also should be counted as changes in positions. If several changes related to a position with the same title are mentioned, return the title of such position only once. Position title always should be in singular form. If the question asks about newly launched products, return ONLY the product names exactly as they are in the context. Candidates for new products or products in the testing phase are not counted as newly launched products. The model easily followed certain directives, resisted others due to skewed biases, and struggled with some, causing errors. For example, the model repeatedly stumbled when tracking measurement units (thousands, millions), forgetting to append necessary zeroes to the final answer. So, I supplemented the directive with a brief example: Example for numbers in thousands: Value from context: 4970,5 (in thousands $) Final answer: 4970500 Eventually, I developed prompts for each question format and several auxiliary prompts: Final prompt for Number-type questions Final prompt for Name-type questions Final prompt for Names-type questions Final prompt for Boolean-type questions Final prompt for Comparative-type questions (to compare answers from multiple companies via multi-query routing) Paraphrasing prompt for Comparative-type questions (to initially find metrics in reports) LLM reranking prompt SO Reparser prompt Meticulous refinement of instructions combined with one-shot and SO CoT resulted in significant benefits. The final prompts entirely recalibrated unwanted biases in the system and greatly improved attentiveness to nuances, even for weaker models. System Speed Initially, the RAG Challenge rules were stricter, requiring the system to answer all 100 questions within 10 minutes to be eligible for a monetary prize. I took this requirement seriously and aimed to fully leverage OpenAI's Tokens Per Minute rate limits. Even at Tier 2, the limits are generous—2 million tokens/minute for GPT-4o-mini and 450k tokens/minute for GPT-4o. I estimated the token consumption per question and processed questions in batches of 25. 
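A sketch of that batching, where answer_question is a hypothetical function wrapping the full pipeline (routing, retrieval, reranking and generation) for a single question:

from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 25  # sized against the Tokens-Per-Minute limits estimated above

def process_all(questions: list) -> list:
    answers = []
    for start in range(0, len(questions), BATCH_SIZE):
        batch = questions[start:start + BATCH_SIZE]
        # run one batch of questions concurrently
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            answers.extend(pool.map(answer_question, batch))
    return answers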
The system completed all 100 questions in just 2 minutes. In the end, the time limit for submitting solutions was significantly extended — the other participants simply couldn't make it in time :) System Quality Having a validation set helped improve more than just prompts—it benefited the entire system. I made all key features configurable, allowing me to measure their real-world impact and fine-tune hyperparameters. Here are some example config fields:

class RunConfig:
    use_serialized_tables: bool = False
    parent_document_retrieval: bool = False
    use_vector_dbs: bool = True
    use_bm25_db: bool = False
    llm_reranking: bool = False
    llm_reranking_sample_size: int = 30
    top_n_retrieval: int = 10
    api_provider: str = "openai"
    answering_model: str = "gpt-4o-mini-2024-07-18"

While testing configurations, I was surprised to find that table serialization—which I'd placed great hopes on—not only failed to improve the system but slightly decreased its effectiveness. Apparently, Docling parses tables from PDFs well enough, the retriever finds them effectively, and the LLM understands their structure sufficiently without extra assistance. And adding more text to the page merely reduces the signal-to-noise ratio. I also prepared multiple configurations for the competition to quickly run various systems in all categories. The final system performed excellently with both open-source and proprietary models: Llama 3.3 70b was only a couple of points behind OpenAI's o3-mini. Even the small Llama 8b outperformed 80% of the participants in the overall ranking. 6. Conclusion Ultimately, winning the RAG Challenge wasn't about finding a single magical solution, but rather applying a systematic approach, thoughtfully combining and fine-tuning various methods, and deeply immersing myself in the task details. The key success factors were high-quality parsing, efficient retrieval, intelligent routing, and—most notably—LLM reranking and carefully crafted prompts, which enabled achieving excellent results even with compact models. The main takeaway from this competition is simple: the magic of RAG lies in the details. The better you understand the task, the more precisely you can fine-tune each pipeline component, and the greater benefits you get even from the simplest techniques. I've shared all the system code as open-source. It includes instructions on deploying the system yourself and running any stage of the pipeline. Ilya is always open to interesting ideas, projects, and collaborations. Feel free to reach out to him via Telegram or LinkedIn"}
{"url":"/lean-web-analytics/","title":"Lean web analytics","text":"Lean web analytics I don't like Google Analytics. It is slow and invasive. So I'm removing it completely. I'm still curious about the analytics, because it can answer questions like: Which blog posts and articles are interesting to people? Are there any broken links that I missed? How do people find my content? At the same time, I don't want to call 3rd party services, let companies set tracking cookies or store personal information for mining. Here is the current setup: Caddy - open source web server with automatic HTTPS GoAccess - open source web log analyzer Blog - custom static site generator similar to Pelican Web server Web traffic is served by caddy. It is a lean web server that handles SSL certificates out-of-the-box. /etc/caddy/CaddyFile looks like this: abdullin . com { root * / var / www / abdullin . com file_server encode zstd gzip handle_errors { @404 { expression { http . error . status_code } == 404 } rewrite @404 / 404. html file_server } log { output file / var / log / caddy / abdullin . com - access . json } } This serves contents of /var/www/abdullin.com . It also records structured access logs to /var/log/caddy/abdullin.com-access.json . These logs are rotated and eventually cleaned up. Web Analytics Analytics can be done with goaccess which has caddy plugin. Just install the latest version and execute: goaccess abdullin.com-access.json --log-format CADDY --ignore-crawlers Or you can generate a html report: goaccess access.json --log-format CADDY --ignore-crawlers -o report.html Goaccess configs are located at /etc/goaccess/goaccess.conf . You can enable referral details there. My overrides: exclude-ip MY_IP # comment these out # ignore-panel REFERRERS # ignore-panel KEYPHRASES It is possible to download MaxMind GeoIP database and use it aggregate visits by country or city: goaccess access.json --log-format CADDY --ignore-crawlers --geoip-database City.mmdb Next This approach is nice, but it stores IP addresses and doesn't display user interaction flows. We try to improve things in (Over) Designing privacy-first analytics ."}
{"url":"/llm-benchmarks","title":"Evaluating LLM in business workloads","text":"Evaluating LLM in business workloads Here for the latest LLM Eval and report ? Go to Monthly Reports ! I've been benchmarking various LLM models on business evals for quite some time. These evals are based on a private collection of prompts and tests that were extracted from real products and AI cases. They don't test how well LLMs talk, but rather how accurately they accomplish various tasks relevant for business process automation. A typical user of this benchmark - R&D department of a popular soft drink company. They use it to track performance of various models in business-specific tasks. Questions answered in FAQ below : Which models do you test? Can you test my model via an API that I will provide? Can you share benchmark data to help us improve our model? Why is Claude 3.5 Sonnet ranked so low in Code + ENG? Why is model X so low/high on this benchmark, compared to LLMArena? Why is the model X is ranked too high or too low compared to my expectations? Monthly Reports My findings are documented in a series of monthly reports that are published on TimeToAct / Trustbit websites. Starting from 2025, benchmarks include more complex tasks and leverage Schema-Guided Reasoning (SGR) to solve them: Summer 2025 ( interactive , pdf ): OpenAI GPT-5, Grok-4, DeepSeek, Qwen-3, ERCr3 announcement, SGR April 2025 ( interactive | pdf ) : OpenAI o3, o4 and 4.1, Qwen3, Gemini FLash 2.5 and Pro 2.5, AI+Coding insights March 2025 ( interactive | pdf ): Gemini 2.5 Pro and Gemma, DeepSeek V3 0324, Llama 4 February 2025 ( interactive | pdf ): AI Coding tests, OpenAI o3-mini and GPT-4.5, Claude 3.7 and Qwen versions January 2025 ( interactive | pdf ) - Early preview of reasoning benchmarks v2 Simple benchmarks (v1, pre-SGR) December 2024 - Benchmarking OpenAI o1 pro and base o1, Gemini 2.0 Flash, DeepSeek v3, Amazon Nova, Llama 3.3 our predictions for the year 2025 and o3 ( backup PDF ) November 2024 - Update: Claude Sonnet 3.5 v2, latest GPT-4o, Qwen 2.5 Coder 32B Instruct and QwQ, Plans for LLM Benchmark v2 ( backup PDF ) October 2024 - Grok2, Gemini 1.5 Flash 8B, Claude Sonnet 3.5 and Haiku 3.5 ( backup PDF ) September 2024 - Chat GPT-o1, Gemini 1.5 Pro v 002, Qwen 2.5, Llama 3.2, Local LLM trends over time August 2024 - Enterprise RAG Challenge July 2024 - Codestral Mamba 7B, GPT-4o Mini, Meta Llama 3.1, Mistal Juny 2024 - Claude 3.5 Sonnet, Confidential Computing, Local LLM Trend May 2024 - Gemini 1.5 0514, GPT-4o, Qwen 1.5, IBM Granite April 2024 - Gemini Pro 1.5, Command-R, GPT-4 Turbo, Llama 3, Long-term trends March 2024 - Anthropic Claude 3 models, Gemini Pro 1.0 February 2024 - GPT-4 0125, Anthropic Claude v2.1, Mistral flavours January 2024 - Mistral 7B OpenChat v3 December 2023 - Multilingual benchmark, Starling 7B, Notus 7B and Microsoft Orca November 2023 - GPT-4 Turbo, GPT-3 Turbo October 2023 - New Evals, Mistral 7B September 2023 - Nous Hermes 70B August 2023 - Anthropic Claude v2, Llama 2, ChatGPT-4 0613 July 2023 - GPT-4, Anthropic Claude, Vicuna 33B, Luminous Extended Frequently Asked Questions Which models do you test? Currently I'm testing only models served by Google, OpenAI, Mistral, Anthropic and OpenRouter. This also covers local LLMs (models that you can download and run on your hardware) - any decent model will be served on OpenRouter by somebody. For example QwQ showed up on OpenRouter within a few days after the release. Can you test my model via an API that I will provide? 
I do not test private LLMs or models offered via APIs outside of the list above. If you want to test your model, just make sure it gets supported by OpenRouter and ping me. Can you share benchmark data to help us improve our model? No, benchmark data is private. But if I tested your model, I can provide general feedback on the types of mistakes it commonly makes. Why is Claude 3.5 Sonnet ranked so low in Code + ENG? While Claude 3.5 Sonnet is really good for coding assistance (in chats and IDEs), it frequently fails on more complex tasks like code review, code architecture analysis or refactoring. These are the tasks that are needed in business process automation related to Code+ENG. Why is model X so low/high on this benchmark, compared to LLMArena? LLM Arena is a place where people chat with large language models and ask them questions. Responses are evaluated by people based on personal preferences. Chatty but factually incorrect models have a chance of winning. This LLM Benchmark is designed for business process automation. If a model is incorrect - it will be downgraded. If a model doesn't follow precise instructions - it will be downgraded. If a model talks too much - you get the idea. Why is model X ranked too high or too low compared to my expectations? Because we probably have different cases and use LLMs differently. Tests for this benchmark were taken from AI cases around business process automation or products with LLMs under the hood (companies in Europe and the USA). Below is the matrix of AI cases that I've encountered. The darker the square, the more cases were in that segment."}
{"url":"/llm/","title":"Shipping products with LLMs and ChatGPT","text":"Shipping products with LLMs and ChatGPT Everybody is prototyping and shipping products with ChatGPT these days. I'm going to collect answers to frequently asked questions here."}
{"url":"/llm/engineering-team-advice/","title":"My team has no experience with ML/GPT. How do we proceed?","text":"My team has no experience with ML/GPT. How do we proceed? This is the essence of my advice to engineering teams that need to ship GPT-based products now (without prior experience): Read prompting guide . It covers foundations and sets up basic terminology. Pay attention to Introduction, Zero-shot and few-shot training. Use System/User/Assistant prompts to setup context and provide multi-shot samples. ChatGPT was trained to keep humans happy with chat completions. By using role-based multi-shot we help it achieve the goal. Optimising prompts against a single case is micro-management. Optimise and benchmark against datasets. Don't even think about fine-tuning or training models until you have exhausted multi-shot AND until you have a dataset to train on. Python is de-facto standard for LLM work. Node (JS/TS) has surprisingly decent ecosystem, too. Working with embeddings? Use OpenAI ada-002 to start fast, then switch to SBERT . Check out LangChain early, but don’t dive in deeply. If possible, use OpenAI API instead of Microsoft Azure API for early development with ChatGPT. Working with Microsoft API is an enterprise pain on many levels ( model vs deployment difference, batch size of 1, LangChain doesn’t work with it out-of-the-box). Gather as much data as possible . As soon as possible. You'll benchmark all new prompts and LLM-based features against it. Maybe even fine-tune or train a model at some point later. Architectures and approaches get stale really fast. So it is OK to share generic approaches with your peers, since this could lead to important insights. Generally speaking, you can share everything except for the prompts and LLM benchmark details . Expect that available models will be improving fast. It is OK to use more expensive GPT-4 for prototyping now. It will get you going. Later you could swap to a cheaper local model."}
{"url":"/llm/faq-chat-gpt-quickstart-for-programmers/","title":"ChatGPT quickstart for developers","text":"ChatGPT quickstart for developers This concise article will help you to catch up with LLM, ChatGPT and prompting from zero. If you follow this page through, you’ll know how to craft a multi-shot prompt, and what does that actually mean. If you have OpenAI ChatGPT account , it will help you to move along. Let’s start with a few terms. LLM stands for Large Language Model . Essentially a big fat function that can work with text. GPT stands for Generative Pre-trained Transformer . Do you have an auto-complete on your phone? GPT is the same thing, but more complex. Given some text, it will generate even more text. GPT is a type of LLM that we are interested in. ChatGPT is a famous GPT model that was trained by OpenAI . It is called Chat, because it was additionally trained to follow human instructions and make them happy (the process is also called RLHF or Reinforcement Learning Human Feedback ). GPT versions 3.5 and 4 are currently being served by OpenAI. GPT-3.5 is cheap and good for many tasks. GPT-4 is expensive, but more smart. In ChatGPT (or similar apps) you interact with models by chatting. User requests are also known as “prompts”. You can ask ChatGPT anything that you want. Go ahead and give it a try. The purpose here is to start building an intuition about how it responds to different queries. Here are a few prompt ideas to start the conversation: Help me to plan a birthday party for friend. Ask questions as needed. Explain me LLMs like I’m 10yo. Then ask follow-up questions to check my understanding. write me a golang http server that listens on port 8081 and replies with \"Hello, friend\" to all requests to '/hello' I'm reading a FAQ about LLM and ChatGPT on abdullin.com. Write a tweet about that. Be concise. Prompts are like text functions written in English . If you can imagine and precisely describe text operation, GPT model will do its best to execute it. The process of creating prompts that get the job done is called prompt engineering . Since prompts are like text functions, and software engineers know how to write and test functions, we already know the fundamentals of writing good prompts: Learn patterns and practice them Test and benchmark prompts Having that said, please now go ahead and read these two things, they will save you a lot of time: Prompt engineering vs blind prompting . This article is written by Mitchell Hashimoto (of Hashicorp and Terraform), it will put prompting into the perspective of software engineering. Prompting Guide - it will familiarise you with the techniques of prompting. Read only “Introduction”, “Zero-shot” and “few-shot” prompting. The other content is irrelevant for us now. After you have read through, you should be able to understand what is exactly going on in this prompt: You are a smart agent that extracts time of an event from text. Answer concisely: Q: it is going to rain tomorrow A: tomorrow Q: What do you think about afternoon lunch? A: afternoon Q: Bus leaves at 12:45 tomorrow from platform 12. A: 12:45 tomorrow Q: <USER_INPUT> A: To summarise, this prompt: is designed to extract time fragment from user input; will work in any major human language; “trains” model by setting up 3 examples (also known as few-shot or multi-shot prompting ). The last example has a <USER_INPUT> placeholder. When processing user input, we’ll fill it in and pass to ChatGPT to complete. Returned answer would be the output of our function. 
GPT prompts are the building blocks for adding ML-driven features to products . Here are a few examples of what is possible: Smart chat bot that can answer questions about a company or a product. Automation to extract structured information from emails, documents or voice transcripts. Software that creates product listings from human descriptions and images. Automation to write and deploy a software application based on a human description of it. How do we build something like that? This usually requires chaining multiple prompts together with some business logic (a hand-rolled sketch follows below). The fastest path to achieve that is by using LangChain - an open-source Python library that contains a lot of pre-built functionality."}
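A chain does not have to start with a framework; it is just ordinary code between prompts. A hypothetical two-step sketch (pre-1.0 openai library; the prompts and the email are made up for illustration):

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def ask(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", temperature=0,
        messages=[{"role": "user", "content": prompt}])
    return resp["choices"][0]["message"]["content"].strip()

email = "Hi! Could you send the invoice for March and confirm the meeting on Friday?"

# Step 1: an extraction prompt produces a structured intermediate result...
tasks = ask(f"List the requests in this email, one per line:\n{email}")

# Step 2: ...which business logic inspects before the next prompt uses it.
if "invoice" in tasks.lower():
    print(ask(f"Write a short, polite reply confirming these requests:\n{tasks}"))
```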
{"url":"/llm/faq-chat-gpt-text-analysis-prompts/","title":"ChatGPT is unpredictable in text analysis and extraction. Can this be fixed?","text":"ChatGPT is unpredictable in text analysis and extraction. Can this be fixed? Yes, ChatGPT is excellent in text extraction and analysis. You can tune it to produce predictable results. ✔️ Read prompt engineering guide If you haven't done so already, the next 30 minutes could help to save you a few hours in the next days. Don't waste your time on the entire guide, though. Just focus on: Introduction - all sections Zero-shot prompting Few-shot prompting (or multi-shot) ✔️ Use ChatGPT API Use ChatGPT via API calls, instead of WebUI. This allows you to set a custom system prompt, temperature and provide precise user/assistant prompts. ✔️ Set temperature parameter to 0 Temperature controls creativity (probability of picking less probable token), and we don't want that in text analysis. ✔️ Provide multi-shot examples ChatGPT was fine-tuned on user-assistant interactions, so set up multi-shot prompts by providing it with examples of expected interactions . Use user and assistant content types in ChatGPT API for that. This makes a huge difference! ✔️ Stick to English Yes, ChatGPT can speak many languages. However, it generally is crippled on any language other that English. This is because: it was primarily trained on English datasets it is smarter, faster and cheaper with English, because text tokenisers are based on English vocabularies. Given 100 tokens, you can fit there ~70 words in English and ~20 words in any other language. If you get bad results in non-English language, try these options: write prompts in English Translate your data to English before asking ChatGPT and translate the response back to your language afterwards ✔️ Work with datasets When crafting prompts, don't measure their efficiency on a single test case at a time. Create a dataset of 10-100 items, run prompt against it and measure overall accuracy. If you spot some repeating error - extract an example and add it to the list of examples within your prompt: \"Here is how you should act in this case.\" This helps a lot! By gathering even more data you'll spot cases where the model is not working good enough. Group these cases by types of errors and extract a representative sample (by the way, ChatGPT is good with that) and add yet another multi-shot example. And at every single step - benchmark and measure ! ✔️ Ask for the confidence ratings One thing that helps me a lot in creating stable LLM-driven pipelines - I ask to provide an answer AND a confidence rating. Rating usually comes from 1 to 5, with 5 being the most confident. While running pipelines, I can quickly filter results without re-running expensive and slow prompts. Sometimes only the results with rating of 5 are good, other times I can accept things all the way down to 3 . ✔️ Use LLM prompts to merge and double-check results Let's say you are extracting a single bit of information from multiple conflicting sources. How do you arrive to a single value that will be presented to a human? You can gather these results into a single prompt and ask ChatGPT to merge these results into a single value, with a confidence rating of its own. You can even let extraction script be creative, while the double-checking script can be pedantic and with a great attention to the detail. Summary There are multiple ways to stabilise data extraction and processing pipelines with ChatGPT. 
No matter what you do, please benchmark your prompts and any changes to them!"}
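A sketch of the confidence-rating trick from above, assuming a hypothetical pipe-separated "ANSWER | RATING" output format and the pre-1.0 openai library:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT = """Extract the invoice number from the text.
Reply as: ANSWER | CONFIDENCE (1-5, where 5 is most confident).
Text: {text}"""

def extract_with_confidence(text: str, threshold: int = 4):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # no creativity in text analysis
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    answer, _, rating = resp["choices"][0]["message"]["content"].rpartition("|")
    # Store both parts; filter cheaply on the rating instead of re-running
    # the prompt. A malformed reply will raise here - acceptable in a sketch.
    return answer.strip() if int(rating.strip()) >= threshold else None
```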
{"url":"/llm/how-to-get-into-ml-for-developer/","title":"How to get into ML for a developer?","text":"How to get into ML for a developer? I see two variations of this question a lot: I'm a developer. I see that ML/AI is really popular these days. How do I become an ML specialist and learn how to train LLMs? I'm a SysOp. How do I learn about training neural networks and building LLMs? The best answer for you career? Don't waste time on becoming an ML specialist. Don't learn how to train neural networks from the scratch. Everybody else will be doing that, going through the courses like \"Create LLM in PyTorch in 10 days\" or \"Tensorflow in 30 days\". This feels like an intuitive way to get into the industry. While starting data science and machine learning departments in companies, I've observed the following pattern. It is already fairly easy to hire data scientists or ML specialists . They cost money, but there is a steady supply of them. Everybody is getting into the field this way. ML specialists can build a convincing prototype that business really likes. However, here is what happens next - prototypes goes into a production and suddenly strange things start happening: messy code, broken APIs, OOMs, performance problems. There is a big gap between building a prototype and developing a robust production system. Here are a few things that come as a surprise to ML-only teams : schema versioning (APIs and data schemas) code quality and patterns uptime throughput and latency deployments and scaling integration CI/CD and process automation A/B testing telemetry, monitoring, logging and observability Companies start scrambling at this point, trying to hire ML Engineers/ML Operations: people that don't know how to train a model; people that know enough about models to be able to integrate them into business systems (ML Engineers) and operate them in production (ML Ops); people that can support ML teams and teach them good engineering practices as needed. More people become ML-only specialists, higher is the demand for all-rounded engineers that can support these specialists. So long story short, if you are an engineer with experience in building or operating software systems, don't waste your talent and time on becoming an ML specialist from scratch . Just add a little bit of awareness in ML to make your skills immediately applicable in this domain. You can start by investing your time in a more practical way: take an existing model and learn how to execute a request against it; put it behind an API of your choice; deploy all that inside a docker (ideally, try renting a GPU-based machine for a few hours); repeat that for a few different models, including LLMs; repeat that inside a Kubernetes cluster for extra bonus points; put all the related code to Github. This will give you 70% of the necessary skills needed to support ML teams in bringing their services to production. This might not look like much, but I've been on a hiring side for this kind of role. It is nearly impossible to find somebody that can do that. Also go through this checklist to catch up with LLM from a practical standpoint: ChatGPT quickstart for developers If you get there and need more guidance, don't hesitate to reach out to me in the newsletter comments."}
{"url":"/llm/how-to-segment-text-for-embeddings/","title":"How to segment texts for embeddings?","text":"How to segment texts for embeddings? Let's say you want to build an information storage and retrieval system. There are two major options for retrieving text and documents: full-text search, similar to how Elastic or Lucene do it. You need to know the words in order to find the answer. similarity search, that is based on embeddings and vector databases. Products built with ChatGPT and the other LLMs mostly use similarity search with embeddings , because it finds correct information even if you didn't guess the write words. Results also tend to be better. We'll focus on that approach. So how do we compute embeddings for a text? Normally we split the text into smaller chunks, compute embeddings for each and then store everything in a vector database. How to split text for embeddings? The answer depends on each specific domain. Your basic options are: split by paragraphs split by sentences split by N words, letters or tokens. You can also use a sliding window approach, where each chunk includes a portion of the previous chunk. More advanced options for calculating embeddings: Summarise chunks Extract possible questions or keywords with LLMs You can also check out LlamaIndex for samples of specialised indices that include: List index - just a sequential chain Vector store index - storing embeddings in a dedicated DB and retrieving top-k most similar node Tree index - hierarchical tree with traversal Keyword table index - extracts keywords from each node (text chunk) and builds a mapping Knowledge graph index LlamaIndex plugs into LangChain, making it easier to integrate language models with rich information retrieval. Once you have indexed your corpus in a vector DB with some index, you could pass control to a LLM for answering questions. Here is an example of the prompt chain for a ChatGPT: You need to answer question of a user. List phrases to search for with top-k similarity search in order to answer the question. Question follow. You need to answer question of a user. Information retrieval got these X chunks. Provide an answer now User liked answer to the question X, provide a list of additional keywords for retrieving this answer in similar scenarios Here is a good example of how things could work together beautifully in a single prototype: twitter . Success will depend on domain-specific factors in each case: prompts, terminology, information indexing and retrieval strategies. Experiment!"}
{"url":"/llm/talk-to-your-knowledge-base/","title":"How to talk to your knowledge base?","text":"How to talk to your knowledge base? Let's create a conversational agent that loads our own knowledge base and can chat about it. We'll do that in less than 30 lines of Python code. We'll need a recent python and a couple of libraries: pip install llama-index langchain PyPDF2 Initialise OpenAI key from the environment variable. import openai import os openai . api_key = os . environ [ \"OPENAI_API_KEY\" ] Load folder with our knowledge base into a vector store (I have text files and PDFs there): from llama_index import GPTVectorStoreIndex , ServiceContext , SimpleDirectoryReader docs = SimpleDirectoryReader ( '/Users/rinat/brain' ) . load_data () ctx = ServiceContext . from_defaults ( chunk_size_limit = 512 ) index = GPTVectorStoreIndex . from_documents ( docs , service_context = ctx ) Teach agent how to use this vector store, by turning GPT Index into a Tool : from langchain.agents import Tool tools = [ Tool ( name = \"GPT Index\" , func = lambda q : str ( index . as_query_engine () . query ( q )), description = \"useful for when you want to answer questions about Rinat. The input to this tool should be a complete english sentence.\" , return_direct = True ), ] Create a chat agent. Give it memory, connect to LLM and give it a Tool : from langchain.chains.conversation.memory import ConversationBufferMemory from langchain import OpenAI from langchain.agents import initialize_agent agent_type = \"conversational-react-description\" memory = ConversationBufferMemory ( memory_key = \"chat_history\" ) llm = OpenAI ( temperature = 0 ) agent_chain = initialize_agent ( tools , llm , agent = agent_type , memory = memory , verbose = True ) Let's ask this agent a question: agent_chain . run ( input = \"What does Rinat say about event sourcing?\" ) This will launch an agent chain: > Entering new AgentExecutor chain... Thought: Do I need to use a tool? Yes Action: GPT Index Action Input: What does Rinat say about event sourcing? Observation: Rinat Abdullin states that event sourcing can be used to represent the entire state of an Aggregate as a sequence of Events that have occurred since it was created. He also states that event sourcing guarantees that the reason for each change to an Aggregate instance will not be lost, and that it can be invaluable for the business. He further states that the append-only nature of Event Streams performs outstandingly well and supports an array of data replication options. Finally, he states that the Event-centric approach to Aggregate design can allow developers to focus more of their attention on behaviors expressed by the Ubiquitous Language. > Finished chain. Output will be: Rinat Abdullin states that event sourcing can be used to represent the entire state of an Aggregate as a sequence of Events that have occurred since it was created. He also states that event sourcing guarantees that the reason for each change to an Aggregate instance will not be lost, and that it can be invaluable for the business. He further states that the append-only nature of Event Streams performs outstandingly well and supports an array of data replication options. Finally, he states that the Event-centric approach to Aggregate design can allow developers to focus more of their attention on behaviors expressed by the Ubiquitous Language. Let's check how well agent uses memory and LLM capabilities: agent_chain . 
agent_chain.run(input=\"Answer the last question like I'm 5\") Note that the agent doesn't use tools this time. It relies on data from memory to provide an answer: > Entering new AgentExecutor chain... AI: Event sourcing is like a diary for a computer. It keeps track of all the changes that have been made to a computer, like when something was added or removed. This way, if something ever needs to be changed back, it can be done easily. > Finished chain. And the output is: Event sourcing is like a diary for a computer. It keeps track of all the changes that have been made to a computer, like when something was added or removed. This way, if something ever needs to be changed back, it can be done easily. Notes You can give agents more tools. Here are a few examples: Bing Search, Python REPL, Wolfram Alpha or Zapier. GPTVectorStoreIndex is one of the simpler ways to index information. Depending on your knowledge base, you might need to craft a more capable index. For example, see How to segment texts for embeddings? Obviously, there is no persistence in this sample. You'll need to add that separately."}
{"url":"/lokad-cqrs-retrospective/","title":"Lokad.CQRS Retrospective","text":"Lokad.CQRS Retrospective Lokad.CQRS was a framework for building “complex” business applications running on Azure. It was born out of very limited experience and a desire to have CQRS/ES applications on Windows Azure. I knew only a few patterns back then and tried to use them to solve all possible domain problems. Note: this article requires some previous painful experience with Lokad.CQRS. It wouldn’t make sense to you otherwise (which would be a good thing). This approach worked to some extent. It provided an entry point to building relatively complex distributed systems that run on Windows Azure: hundreds of messages and views, dozens of aggregate roots. At Lokad, we initially built Salescast and Hub with Lokad.CQRS. AgileHarbor delivered SkuVault with it. Many other projects used Lokad.CQRS entirely or reused some parts. Over time, I discovered (in a painful way), that this specific approach had many limitations: Lokad.CQRS made developers focus on low-level implementation details, instead of high-level domain design. it allowed to quickly prototype a software solution, that would be painful to evolve and maintain afterwards. Implementations were tightly coupled to Windows Azure Storage. Approach didn't provide solutions for really important matters like testing, scalability, production logging and performance monitoring. Let's go over the most painful parts! Building Blocks Lokad.CQRS focused on the idea of having a few “well-developed” building blocks. They provided abstractions for storage, messaging and event processing on Windows Azure. Implementations could also be swapped to use local file system or memory, which allowed for faster development or testing (since Windows Azure Storage Emulator was horribly slow back then). Business solutions were to be built out of these building blocks. As it turns out, the ability to swap different storage engines was the only good thing in here. However, even the implementations were very limiting and had many flaws. More than that, the entire premise of developing universal building blocks upon which the different implementations were to be built - was wrong. We’ll talk about issues with the specifics blocks first, then we'll switch to the problems with the design approach. Atomic Key-Value Storage Atomic key-value storage allowed you to store documents by keys and perform atomic updates on them. It had multiple storage implementations: Windows Azure Storage, file system and memory storage. These could be swapped in and out via a configuration setting. Search and indexing capabilities had to be created manually with this approach. Usually this was done by defining a single-key index document, which would be updated on every write. This storage was very good for small projects and rapid prototyping. However, as the project grew, it was very easy to end up with millions of document instances with this approach. Eventually somebody would want to do some bulk operation (like deleting all products for a tenant) and that’s where things would start breaking down: Azure storage (just like any other storage) has a latency for accessing individual blobs. Bulk operations would be slow even with async non-blocking operations. Bulk updates were never atomic with this storage. If something was to go wrong, the storage would end up in a very inconsistent state. While trying to read or write blobs in large bulks, Azure storage would start hitting scalability limits and time out. 
Some timeouts could be handled by retry policies, turning it into a very slow operation (e.g. hours) that might still eventually fail. As databases increased in size, so did many secondary indexes. Any change to the primary document was followed by an update to an index. This update could be slow or even fail, leading to an inconsistency in the data. These days I’d try to avoid introducing such a limiting concept into a project, especially in data-intensive applications. It would be more efficient to pick a proven data storage technology (or a set of them) which fits the domain model very well, then let it handle all the bulk work. By fitting well I mean ease of use and management, simplicity, predictability, and the ability to meet the scalability and consistency requirements of the project. Append-only storage Append-only storage was another building block in Lokad.CQRS. It used Azure page blobs of 512KB to store information. Whenever a system restarted or the blob got filled, a new blob would be created. There were also in-memory and file-based implementations of this storage, usable for demoing and local development. This storage was mainly designed to work as an underlying transaction log for the event stores. It worked well for short transaction logs (e.g. a few GBs), but turned out to be bad in large-scale production . Consider a realistic case of having 20GB of event history. If the store was optimally filled, it would contain about 40,960 separate page blobs. Backing this storage up or simply enumerating it might take some time. Concurrency is another issue here. This append-only storage can only be written to by a single writer (multiple readers are OK), so we can’t use one of the strongest parts of Azure - increasing the number of worker instances. Message Vault Message Vault came as a temporary replacement for the append-only storage. It is a dedicated transaction log storage library for Windows Azure. Files in 100MB chunks are used by default, and a master-follower setup is used for higher availability. MessageVault is available on github . This implementation isn’t perfect (e.g. if a master dies, there would be some delay before the follower starts accepting writes) or battle-tested in multiple projects. However, it is a decent replacement for medium-size event-driven projects running on Windows Azure. Larger projects could switch from MessageVault to Apache Kafka, which is a battle-tested distributed commit log. Semantically MessageVault is similar to Kafka (if used with an infinite retention policy), so the migration should be rather straightforward. Microsoft plans to introduce the Azure Storage append blob , which might be able to replace MessageVault in companies limited to the Microsoft stack. Event Store Lokad.CQRS featured its own Event Store. It was based on the append-only storage and used an in-memory cache to load streams by their id. Atomic stream updates were also possible (using in-memory locks). In theory, this allowed for building apps with event-sourced aggregates. In practice, this specific implementation failed badly in production: A worker node would need to read the entire storage and cache it in memory on startup. That would take some CPU and RAM. Imagine an event store with 20GB of history. We can’t have multiple workers writing to the same event store, so scalability and HA go out of the window. The in-memory cache implementation wasn’t thought through. For example, streams were stored in a structure like ConcurrentDictionary<string, IList<byte[]>> .
Any .NET developer could tell you that storing millions of events there wouldn’t be the most memory-efficient approach. View Projection Framework The View Projection Framework in Lokad.CQRS was responsible for replaying events from the Event Store to rebuild read models (views) and keep them up-to-date. It used atomic storage to persist views and smart logic to detect code changes that would require a total replay. That description should already hint at the problems: Any kind of replay would be slow, since it ends up saving thousands of views to Azure Storage, which might not be able to handle the spike. The change detection logic was smart and hard to reason about. It still didn’t catch all changes, forcing developers to change the code of some projections by hand, just to trigger a rebuild (adding a ToString() overload was the most frequent choice). The framework “encouraged” building a lot of different views without thinking twice. These views could have thousands of documents each (or really large partitioned dictionaries). To speed up the replay in such situations, the framework cached all documents in memory. That used a lot of RAM; as soon as you exhausted it (and the system started hitting the page file), your replay process would be doomed (unless waiting for a few days was OK for you). A workaround for RAM starvation on replay in Azure was to provision a large Worker Role just for the duration of the replay process (e.g. sometimes 112GB of RAM was required), which isn’t a good thing for many reasons. Code DSL for Message Contracts Lokad Code DSL was a sister tool, frequently used in projects that were based on Lokad.CQRS. It generated proper message contracts on-the-fly out of compact definition files. For example, this code: AddSecurityPassword?(SecurityId id, string displayName, string login, string password) would turn into this C# class on-the-fly: [DataContract(Namespace = \"Sample\")] public partial class AddSecurityPassword : ICommand<SecurityId> { [DataMember(Order = 1)] public SecurityId Id { get; private set; } [DataMember(Order = 2)] public string DisplayName { get; private set; } [DataMember(Order = 3)] public string Login { get; private set; } [DataMember(Order = 4)] public string Password { get; private set; } AddSecurityPassword() {} public AddSecurityPassword(SecurityId id, string displayName, string login, string password) { Id = id; DisplayName = displayName; Login = login; Password = password; } } Magic! Back then I believed that code generation was generally bad, but that this case was an exception. After all, nothing should stand in the way of a developer creating new events. As it turns out, this case was definitely bad. Lokad.DSL allowed you to create event and command message definitions really fast, without thinking too hard about them. This was very wrong, as I learned later. Message contracts are one of the most important aspects of the design; as part of an API, they touch multiple contexts. A badly designed message contract could poison every codebase it touches . To make things more difficult, a badly designed event contract could mess up an event-sourced system for years. Another (rather mild) drawback of the code DSL was that it made it harder to tweak and customize message contracts for very specific scenarios. These days, instead of relying on any sort of code generation, I prefer to take my time and write the entire contract by hand. Contracts are a very important part of the design, and they deserve the attention.
One-Way Messaging Lokad.CQRS also featured a one-way messaging framework, using Azure Queues or the file system. The implementation grew to be rather robust, with features like deduplication, quarantines and retries. As a part of Lokad.CQRS, that implementation was misused very badly. This misuse stemmed from my lack of understanding and experience. Here is the worst bit: I advised developers to let different systems communicate by pushing one-way commands to each other. Whenever a system wanted to know the result of an operation, it had to poll a view which would eventually contain the outcome. A client (e.g. a web UI) had to follow these rules as well. In theory, this made the systems more robust in the face of failure (especially if you used Azure queues for the communication and Azure blobs for view persistence). In practice, this complicated everything a lot. Simple operations that needed to know the outcome (e.g. adding a user) required a lot of boilerplate code: events, views, view projection logic, async logic for polling these views until the result arrives. Repeat that many times and you get: a lot of very fragile code; component communication via data (key-value views), which is a very fragile integration point; a codebase where adding a single feature requires changing half a dozen files or more; a chatty implementation which talks to Azure Storage a lot. Any moderate load on a busy day is likely to hit a scalability limit, causing timeouts and failures. These days, I prefer to start by using remote procedure calls (e.g. by exposing an HTTP API) for the majority of client-server communications. These are easy to reason about and easy to use. The entire WWW works with RPC and HTTP APIs - and that is the most complex and scaled distributed system in the world. If there is a justified need for messaging in the project, then I’d try to use an existing and time-proven solution instead of rolling my own. Apache Kafka and RabbitMQ are among the ones that seem to excel at large scales. As for eventual consistency and versioning, these can be solved with well-designed RESTful APIs, an ExpectedVersion HTTP header and browser-level retries (a sketch follows at the end of this article). Failure of the Building Blocks Out of these building blocks, append-only storage is the only one I still use these days. All the other building blocks either have a lot of issues or aren't that useful. That essentially renders the entire Lokad.CQRS useless and plain harmful. The high-level design problem here was in trying to limit an implementation to a few known possible choices. It is the same as telling people: “I don’t know what you are trying to build, but here are tools and materials for that. By the way, be gentle with that sledgehammer, it could break if twisted clockwise”. This is a constraining approach which can lead to a lot of technical debt in later stages of the project. Design Lessons Lokad.CQRS was created with a very short-sighted design approach in mind, a reusable LEGO constructor. These days I'd try to limit the damage I inflict upon developers and avoid writing any widely reusable frameworks. They come with too much responsibility. At most I'd help to set up a design and development process in a single project or a company: Align design with tools and solutions that already exist out there. Focus on high-level design, testability and integration, while deferring implementation details.
Provide developers with the means to understand how their implementation behaves in production, allowing them to optimize and scale it. Allow developers to switch between different tools and clouds, instead of coupling them to a single cloud provider. While working on the design, start by decomposing the solution into separate contexts, identifying the boundaries and capturing them in the code. Code-wise that can be expressed as: A light project-specific framework for defining backend modules with an API. The implementation has to be aligned with well-established frameworks like Scala Play, Node.js Express or .NET Nancy. Tooling to write use-cases for these modules in a coherent and non-fragile way (event-driven use-cases), verify behavior against the working code, and print them as human-readable documentation. There is an example of that in the omni project . Examples of high-level sanity checks that protect code from hacks and rash decisions that are easy to catch. These could be run locally or on an integration server (using code introspection and information captured from the use-cases). Tooling to render human-readable API documentation with samples out of the use-cases, embeddable into the API itself. This is similar to the work we did on the HappyPancake (HPC) project . Make it easy for developers to gather telemetry from the systems running on dev, QA and production environments on many servers. That would include real-time performance metrics and structural logging (Serilog and StatsdClient). Provide guidelines for capturing information, along with a Docker image for setting up a production-worthy Linux server for aggregating this data and displaying it. Provide guidance for developers on how to deploy changes to dev and QA environments (with continuous delivery) and how to handle production upgrades. Provide tooling that involves managers and QA people in the software development process. Expose specifications, real-time behavior, and system design to them via: rendering use-cases to human-readable specifications, allowing QA people to define new use-cases, providing audit logs from the running system in the same language as the use-cases. Infrastructure and guidance for developers building client applications on various platforms (where this is applicable); train them and help them find solutions to their problems. Ideally, provide examples of setups for high-availability deployments, using existing solutions from the real world (e.g. nginx for load-balancing, ZooKeeper/Basho Core/etcd for managing the cluster state, Apache Kafka for a high-throughput event bus). This list is by no means exhaustive. I didn't cover things like A/B testing, development workflow, feature toggling and deployments, integration and unit testing, stress testing, handling on-premise deployments and many more. However, none of that was even considered in Lokad.CQRS. So that, in retrospect, should give you an idea of how lacking and bad that framework was. Summary So, Lokad.CQRS was a big mistake of mine. I'm really sorry if you were affected by it in a bad way. Hopefully, this recollection of my mistakes either provided you with some insights or simply entertained you."}
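As a footnote to the RPC advice above: here is how the ExpectedVersion idea could look on the wire. A sketch with Python requests; the endpoint, payload and the 409 convention are hypothetical:

```python
import requests

# Optimistic concurrency over plain HTTP: the client states which version
# of the resource its change is based on; the server rejects stale writes.
resp = requests.post(
    "https://api.example.com/users/42/rename",
    json={"displayName": "Rinat A."},
    headers={"ExpectedVersion": "17"},
)

if resp.status_code == 409:
    # Someone else changed the user since version 17 - refetch and retry.
    current = requests.get("https://api.example.com/users/42").json()
    print("conflict, user is now at version", current["version"])
```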
{"url":"/m1-viva-and-alien-tech/","title":"M1, Viva and alien tech","text":"M1, Viva and alien tech M1 system-on-a-chip was one of the most exciting tech announcements of 2020 for me. Entry-level Apple computers got a significant improvement in performance and battery life at a fraction of the cost. The jump was even more spectacular than what AMD was able to deliver in the past years. Part of the magic was the fact that Apple has managed to switch its architecture from Intel x86 CPU to a custom ARM-based processor. To achieve that, they had to change many things, starting from the hardware and up to the OS/software implementation details. M1 delivery wasn't without glitches, yet it demonstrated the possibility of delivering differentiating features to customers by: switching from the x86 instruction set (complex and with a bit of legacy) to a customized ARM instruction set; designing hardware in tandem with the software (Rosetta 2 and x86 emulation, reference counting); closely integrating systems and iterating on the integrated solution as a whole. Even if Apple fails to follow-up with M1X and M2 variants in the next year (given their track record with A processors, this is unlikely), this has already got more companies interested in ARM/RISC-V processors and custom CPUs. This heralds potential changes for the world of the PC and modular desktop computers. It is cool to have a model where one is free to pick their components and assemble a working system like a Lego. I've done this on multiple occasions. However, such freedom comes with a hidden organizational cost: vendors have to agree about interfaces and protocols that will link their components together. More components there are - harder it is to reach an agreement. We are talking about things like CPU socket families, PCI-E specifications, power interfaces, physical dimensions, RAM interfaces, buses, frequencies, cooling, etc. With so many moving parts to negotiate and support, the rate of change is inherently going to be slower. And we aren't even talking about the drivers and compatibility. Meanwhile, Apple can alter everything inside their gadget and nobody will care . If the trend stays, there will be more changes in the hardware that runs software: AMD vs Intel, ARM vs x86, even RISC-V vs ARM, FPGA vs ASIC. If patent applications are of any indication, we might see reprogrammable execution units within the CPUs. If the recent employment changes of Jim Keller are of any indication, we are going to see more specialized processing units in the upcoming years. In this new light, I'm interested in gaining a deeper understanding of how software and hardware work together. So, let's learn some hardware and FPGAs, then? However, normal learning paths are boring. One has to learn digital logic, go through Verilog/VHDL, and start building system elements from the bottom: ALUs, buses, memory, logical units. Afterward, build a compiler capable of making your programs run on your hardware. Such \"from the bottom to the top\" trajectory is designed for somebody who starts a long career in hardware design. It is understandable but boring for me. Two small kids and all that. Can we approach the problem in a completely different way? Viva/Azido The announcement of M1 in 2020 coincided with another tech-related announcement. Ocado Group had acquired Haddington Dynamics. 
Ocado focuses on software+robotics platforms for retail, while HD sells a low-cost robot arm called Dexter with unique characteristics (mostly unique precision and sensitivity). I've talked about the Dexter hand before. Let me focus on one aspect for now: The FPGA supercomputer onboard the robot gets 0.8-1.6 million points of precision (CPR) directly on each of the robot’s joints, allowing extremely imprecise parts to produce extremely precise movement. The interesting part is that this custom FPGA wasn't designed using mainstream tooling. It uses an ancient piece of software called Viva/Azido. This software is so special, different, and old. It could be an alien legacy for all we know. Viva was initially developed by Star Bridge around 2000 to build custom hardware accelerators on top of FPGAs. As Forbes wrote in 2003: Star Bridge sells four FPGA-based \"hypercomputer\" models with prices ranging from $175,000 to $700,000. The \"sweet spot\" machine, called the HC-62, sells for $350,000 and contains 11 Xilinx FPGA chips, which cost about $3,000 each. That model will perform 200 billion floating point operations a second. The $700,000 model contains 22 Xilinx chips and can perform 400 billion floating point operations a second, Gilson claims. In addition, customers must license Viva, paying $45,000 per year for a one-person license. In April 2011 the technology was acquired by Data I/O for $2+1 million, rebranded as Azido, only to be written off a few years later: We evaluated changes in Azido projects and projected cash flows which decreased or eliminated our expected future cash flows related to Azido technology’s use or disposition. Based on these evaluations, impairment charges of $31,000 and $2.3 million were taken against this software technology for the years ending December 31, 2013 and 2012, respectively. As of December 31, 2013, the Azido technology net carrying value is $0. Later the technology resurfaced at Haddington Dynamics, where it was used to design the custom hardware for the robotic hand. Over the years, Haddington Dynamics kindly shared everything related to the robotic hand, including the Viva binary and FPGA sources (in the Viva language). Viva is a visual development environment that allows top-to-bottom design while leveraging a large core library of pre-designed building blocks. Viva isn't specifically tailored to building microprocessors and CPUs. It rather aims at building hardware pipelines (specialized accelerators). Azido/Viva is not the easiest way to design hardware. Here are a few limiting factors: there is no community; the only publicly available body of knowledge is a user guide , a few sparse videos, and a rich core library; the software is available as an ancient Windows executable (Borland C++ Builder times); there are partial sources, but not enough to rebuild the binary; there probably is a reason why it isn't widely known today. However, learning Viva/Azido is a unique path that differs from the mainstream approaches . It appeals to me because it is like a puzzle that pretty much nobody in the world is going to bother solving. Besides, the learning path itself is more appealing than a standard Verilog tutorial: there already is an existing design for the Dexter HD hardware. One \"just\" needs to figure out how to parse it, recursively unroll, simulate, and synthesize. Here is a small piece (roughly 2% of the surface area) of that top-level design that drives the Dexter HD robotic hand.
If you are interested in more details, I refer you to the viva-tools project on Github. The purpose of that project is two-fold. First, to collect and archive artifacts related to the Viva/Azido platform before they are lost forever. Second, I want to achieve a deeper understanding of how software and hardware work together by rebuilding a piece of software that could be used to design hardware, simulate it and synthesise it. Current status : I'm able to parse the source files of the Dexter (and the core lib) and render them with decent fidelity. The image of the Dexter design above was done by that script (Viva/Azido itself crashes while trying to open it). There is a small test suite that verifies the correctness of the parsing. You have read this far! If you really dig what Haddington Dynamics is doing, especially with regard to the Viva/Azido stack, don't hesitate to reach out ! As of 2022/2023 they have a ton of work to do and are actively hiring. You need to be able to work legally in the USA and be interested in moving to Las Vegas."}
{"url":"/ml-under-hood/","title":"ML Under the Hood","text":"ML Under the Hood"}
{"url":"/my-backup-process/","title":"My backup process","text":"My backup process My backup flow adheres to these principles: Stick to rule 3-2-1 Snapshot important folders forever Use untrusted remote repositories to store encrypted backups off-site Here is how it is implemented. There are three personal folders on my laptop: ~/keep - all the documents that I should keep forever; ~/hoard videos, whitepapers, podcast episodes and the other stuff I want to keep forever; ~/Dropbox - similar to keep but is also shared within the family. All personal folders are backed up daily via Borg Backup . Borg takes folder snapshots, encrypting and deduplicating chunks. Each daily backup goes to a local NAS (with RAID-1) and to a remote storage. A couple of times per year I plug a portable HDD and back up everything to a borg repository there as well. That should be good enough for the rule 3-2-1: Keep 3 copies of data. Use 2 different types of storage. Have at least one off-site replica. I keep 10 daily, 6 weekly, 14 monthly and 99 yearly snapshots. Since snapshot reuses file chunks, that doesn't take a lot of space. Compression saves a few GB on top of that. NAS itself is used to store family photos. Some of the folders there are configured to be backed up to a remote storage via Borg (same setup). This is less strict and should be cleaned up later."}
{"url":"/my-productivity-system/","title":"My productivity system","text":"My productivity system I've been reflecting a lot about the personal productivity systems. It turns out I actually have one. If writing things down or keeping a journal works for you, then you might also find the story interesting. The reflection was triggered by a couple of tweets. The first one was from Matthias Verraes : Bloggers: \"I'll demonstrate DDD by building a Todo app\" Actual DDD: \"We looked at how people are using Todo lists, and we figured that our model was wrong. Besides TODO and DONE, we needed to afford for POSTPONE, WON'T DO ANYWAY, and TOO LATE. A quick experiment showed that...\" It evolved into a healthy discussion about personal systems, handling weekly TODO lists and aligning them with big aspirations. There is even a mention of a Life OS built with Notion and Miro . I tried multiple similar approaches in the past but always got stuck. I always end up with a TODO list that never gets completed. It makes me feel guilty, so I just back off from the system to never come back. In retrospect, most of these productivity systems need regular reviews to keep working. Reviews need time. Up to one hour for the weeklies. We have two little kids and a pug. Life with little kids is a challenge on its own, but you can also throw in COVID and relocation to Austria for extra fun. Things are better now, but I still remember weeks and months with no spare time and motivation at all. Without time and motivation, tasks and pending reviews start piling up. To get back on track, I would have to deal not only with low motivation but with even more cleaning up to do. That would be impossible. Guilt would grow, and I'd just pretend that the system never happened. However, there were a few small things that kept on coming back to me throughout these years. Healthy exchanges with Laura , Tamara and Beau helped me to recognize these surviving bits and the system behind them. It doesn't work always, but all my major work-related achievements in the past 6 years were supported by it. In retrospect, the approach is simple: the productivity system should adapt to good and bad times; it should be based around writing because that is how I think. The system is a small application. When the motivation is high, it works like a Journal. The system tracks my notes and tasks across various topics. The user experience is designed to present each topic as a story, encouraging me to keep it going. Whenever I neglect the system, it backs off. It hides all the low-level details and incomplete backlogs, showing me a list of stories and an empty canvass. Every time is a fresh start that builds upon past aspirations and achievements. When my motivation is high, it works like a simple Journal I write down my actions and notes in the journal, putting them in one of the topics. Examples: Personal health, Robotic hand, DDD Katas, DSML Book, Wireguard, etc. If there is no matching topic, I create a new one. I don't call them \"projects\" because that feels like an obligation to handle. There is too much \"work\" in that word for me. After the note is added, the system extracts the headers and pushes them to the Week Outline and Topic Outline . Both are high-level. Week Outline tells a story of the things I've done that week. Topic Outline is like a table of contents for the entire topic, telling a story from a high level. The list of Recent Topics also gets updated. These are my aspirations and interests that arise naturally. 
I can do just a little thing like saving an interesting thought or a link, but this is immediately acknowledged by the system and the related story. The system updates all views and shows the latest topic outline. This puts me back into the strategic view and encourages me to think about the next steps. How can I continue the story? If I have time, I could work just a little bit more. If there is no time, I could write a note to myself for the next weekend: \"take a look at this project!\" This will be my TODO that will show up in the week and topic outlines. It will also become a part of the story. When my motivation is so low that I start neglecting the system, it turns into a storybook . Daily notes and TODOs fade away from the Week Outline because they were bound to a specific date. They are still safely stored by the system. I could dig them up if needed. Should I ever open the system after a period of neglect, there will be no guilt. Just an empty canvas and a list of stories from the past. My interests and problems will never go away. Eventually, there will be a break and I'll do something worth remembering. Journal notes are guaranteed to be safe . Immutability, redundant backups , and CAS for the binary data handle that. So if I have a bit of information to store, I can dump it into the Journal without thinking. So I go back to the Journal to add a quick note. All user interactions start with a list of recent topics. This pulls me back to the big picture - these were the past interests and big challenges, at a high level. As soon as I hit \"Save\", the journal presents me with the outline view that includes the last added topic. The last incomplete TODOs will be balanced with the weeks and months of small accomplishments before them. No fluff or guilt, just the big picture . Telling me: \"hey, we had some fun with that story 6 months ago. Do you want to continue it?\" Usually, the answer is yes, even if that is just to spend 10 minutes figuring out what to do next or cleaning up the code. But, hey, it gets me thinking and working on that story. There is comfort in knowing that the system will continue chipping away at problems for years. Should there be a challenging period, it will eventually get me back on track. There will even be a story about all that afterward. To summarize my system : Capture notes and tasks in a journal. When saving them to a topic, extract the headers and build an outline (table of contents). Display the outline in the weekly view; this will pull you back to the big picture. Display the current week by default, hiding past tasks as time goes on. This allows a guilt-free fresh start after a period of neglect. Display each topic as an outline by default - whenever you come back, this will tell the story from a high level. Maintain a list of recent topics to surface aspirations and problems. Use a reliable system with backups. Reliability will make you come back to save things. The system will then show the big picture and nudge you to continue writing stories. If you enjoy writing or leverage it to get through difficult times, then try building a personal productivity system around that. It could provide additional support and comfort on your journey."}
{"url":"/now","title":"Event Now!","text":"Event Now! Current Event KanDDDinsky 2025 Recent Events LLMs in Marketing Hosted by the Data Lab at ESSCA Research, organized by Chaaben Kouki Click here to join the meeting (MS Teams) Links Lokad.com - better supply chain decisions Trustbit - Large Language Models for the companies Joannes Vermorel - blog Chaaben Kouki ESSCA Research Agenda FAQs, materializing the institutional knowledge Security/privacy and AI strategy for business. Generate product listings on marketplace (using text + vision) Automatically fix errors in product catalogues Write high-quality marketing texts for products Intent-driven search on a marketplace (recurring pattern) Questions and responses from audience. Large Language Models in Supply Chain With Joannes Vermorel from Lokad.com Coding with AI - by TrustBit Hosted by private customer Links: Replit - online IDE ChatGPT quickstart for developers How to get into ML for a developer? Extra: Trustbit LLM Benchmark ML under the Hood Newsletter AI in Software Development Kata Hosted By Christian F. and Erol A at Spar ICS, Salzburg. Links: Replit - online IDE ChatGPT quickstart for developers How to get into ML for a developer? Extra: Salzburg Software Craftsmanship Meetup page Trustbit LLM Benchmark"}
{"url":"/post/10-steps-to-become-better-net-developer/","title":"10 Steps To Become Better .NET Developer","text":"10 Steps To Become Better .NET Developer Here's a list of things you might want to learn about in order to become a better .NET developer. Better developers are eligible to higher paychecks , exciting projects and more freedom in their lifestyles. Learn Unit Testing Get familiar with distributed Version Control Systems : Git or Mercurial . Learn about IoC Containers : Autofac , Windsor Container and StructureMap . Contribute to Open Source Projects of your choice. Learn about Continuous Integration and try setting up TeamCity , Hudson or CruiseControl.NET . Learn about NHibernate Object Relational Mapper, but be prepared to outgrow it for NoSQL persistence like CouchDB , MongoDB or RavenDB . Check out the buzz about cloud computing , try Windows Azure Platform and also give a shot to running your .NET App with Mono on Linux ( some free offers to try). Familiarize yourself with messaging architectures and try NServiceBus , MassTransit or RhinoQueues ; then learn about alternative to MSMQ - AMQP and RabbitMQ . Read Domain-Driven Design and apply it in your projects. Get Started with CQRS Architectures and practice it in a learning project. As you can see, the list is not endless. There are just ten distinct topics structured in the recommended order of learning and practicing. Diligently walking through them is certain to get you way above the level of Senior Developer, opening a whole set of opportunities all around the world. And if you happened to learn, understand and master all these steps and still don't have interesting job or project offer - drop me an email ; I might get you in touch with folks desperately looking for developers with this set of skills . PS: On August 2012 we started a new podcast to help aspiring developers. Check it out ."}
{"url":"/post/2022-09-19-blog-migrated-to-obsidian/","title":"Blog fully migrated to Obsidian","text":"Blog fully migrated to Obsidian Last weekend I finally finished my great blog refactor v6! It is such a relief and good feeling, that I really want to share it here. TLDR; is now better integrated into Obsidian. This enables to move articles around, use wikilinks and wikilink-style images and benefit from the knowledge graph. As I side effect, I discovered and pulled aside collections of articles on event sourcing , opinionated tech and robotics . All that done is done mostly without breaking any existing (and very messy links) 🙂 Implementation Details Here is how the website currently works. I write my blog posts in Obsidian. When there is a time to publish, Python script runs. It will: Grab all markdown files and extract YAML-frontmatter from them. YAML contains just a few config entries: uid - unique auto-generated id; title - in case I want to have special symbols in the title; url - canonical (latest) url for the article. Run markdown through Markdown and Pygments (for code snippets). Run through wikimedia link parser to convert links like [[2017-12-17 Black Friday load|Black Friday]] to proper <a href . Generate static html, wrapping markdown content with an html layout ( Flask and Jinja ). Load url_map.tsv and generate redirects from various url versions to current canonical representation. rsync resulting folder to a remote server. Caddy will serve it there. That code used to be open source, but unfortunately isn't any more. Codebase is currently a mess 🤷 and is going to get even more messy. I want to bring back the comments and flatten the stack . That will require another rewrite. Did you know that automatic HTTPS from Caddy is available as a golang library ? However, if you are interested in blogging, setting up a blog or just writing in general, please don't hesitate to reach out to me. Writing is an important part of staying connected, productive and healthy in a remote-first culture. I'd be glad to help! ✌️ ❤️"}
{"url":"/post/2026-02-27-new-experiment/","title":"New experiment in 2026","text":"New experiment in 2026 This year I've decided to run one of the biggest experiments in my life - stop working as Head of ML & Innovation at TimeToAct Austria and take the risk of starting an independent business again. It felt to me that staying in a large company (anything larger than ~20 people feels large in retrospect) is a bigger risk in 2026 than leaving a secure job in an age of turmoil. There are just too many opportunities and ideas to explore, and large companies are notoriously slow to adjust and pivot, especially now. The way a recent PwC report frames it - workflow redesign is the “hidden multiplier” for the companies, while bureaucracy is essentially a tax on speed. Setup The experiment has started on February 1st, 2026, when my contract with TimeToAct ended. Here is the summary of things that happened since then. First, we took a week of vacation. It is pretty much obligatory in Austria, where travel opportunities for parents are always aligned with the school holidays. My wife is a product manager who’s spent more than a decade running remote teams across five continents. She’s exactly the person you want designing processes for a two-person startup in 2026 with clean slate and good tooling. We sat down together at the beginning of the second week and figured together the following schedule for me: Four creator blocks, 2.5 hours each (Mon - Thu) 1 hour for journaling and reflection on Monday 1 hour for maintenance and paperwork 1 hour for community management (while I'm waiting for the kids in the swimming pool) 4.5 hours in total for calls, insights and research 2 hours for hacking 2 slots to go to the gym + my regular Kendo practice with my son 2 hours for review and planning - at the end of the week A few hours to deliberately learn German It was important to balance obligatory tasks and boring routine with the other activities that provide energy, perspective and a chance to create. So, given 4 creator blocks, we've decided to split them between two activities: Create something fun that leverages my experience in creating fun collaborative challenges, presence of an awesome community of engineers and recent hype around personal agents. Finally finish and publish an English course on patterns and practices of deploying LLM in business for high-impact scenarios. We've run that for 3 sprints/weeks so far. First week was mostly figuring out - how do I work and setting up a laptop from scratch. So ~6 days focused on one of these initiatives with some time limit. While the schedule is not prescribed, it creates a welcome and diverse routine. Most importantly, it gives structure and introduces sport as a routine back into my life (something I haven't had time to do for years). Outcomes Here are the outcomes. 1. First boot of new agentic environment First, I've started designing a new challenge for the agents, targeting the simulation shape that corresponds to personal agents, as inspired by the Claw implementations and Homai architecture. By now I'm pretty confident that I will be able to pull things off, and also have an idea of tasks that could be fun to include in the challenge. Plus the new architecture will be able to scale better, to support all the participants running their agents. 
I also managed to solve a long-standing problem: creating and maintaining SDKs for the agents in multiple languages (people in the community tend to code in Python, JS/TS, C# and sometimes Kotlin/Java), all while avoiding the mess that is OpenAPI/Swagger. The answer was Buf/Connect together with Codex, which quickly iterated through a couple of integration approaches until we arrived at a pleasant architecture that also happened to reduce the code surface for me. Fast feedback is pretty important, so yesterday I was able to spawn my first lightweight VM and connect a mock agent to it via the remote protocol that will be used in the challenge. Also, another fun thing. A while ago (in Better performance with DOD) I mentioned as my goal: \"NixOS, the power of declarative VMs and per-project system dependencies.\" As you can see in the screenshot, these things are an essential part of my toolbox right now. Next steps: build up a tiny demo challenge, ship it with the SDKs to people and see what breaks. 2. Setting up a community We've also started figuring out and shaping the actual competition - how to organize the event, set up the timeline, register participants, connect them in cities, stick to the privacy policy and terms of service, introduce a code of conduct, etc. Yes, organizing an event is suddenly a lot of paperwork. Thankfully, ChatGPT and Claude can work as helpful assistants to streamline research, drafting and analysis. Things got slightly out of hand along the way. By the end of the month, we had the BitGN platform, April 11 as the date of the first challenge, and 341 registered engineers from 68 cities around the world (still missing a few continents, though). We also have 5 hubs around the world (while the competition is remote-first, people tend to want to gather and have fun together), with the HQ hub located in Vienna. Next steps: try to get more hubs into the competition. 3. Labs \"Labs\" is the codename for my course platform. This is where I made the least progress (being too excited by the things happening around BitGN). Still, it was something. First, I started working on the new website to serve the course, using the same stack as the one that powers the BitGN community. DB structure, migrations, solution architecture, engineering harness and other tools were first designed in this project (codename \"Labs\") and then also deployed to BitGN. Among the most fun parts: A DB migration framework that just works (meaning that Codex and Claude make no mistakes in evolving the schema, and I have no surprises when I deploy these changes to prod). A zero-downtime deployment process for my stack (it uses socket activation managed by systemd within NixOS to drain connections to the old instance before starting the new one). Login with Google (it turned out to be simpler than I thought, reducing the need for me to deal with lost passwords and account sharing). A fun way to handle legal documents like the ToS and Privacy Policy. Now I can change them in markdown files (if needed) and all users will immediately see the changes and will be required to accept them to continue using the site (acceptance timestamps are recorded); a sketch of this flow follows below. Not that I plan to ever change them, but if there is a legal need - there is a zero-friction way. An engineering harness structure for my projects that makes it easier to track changes, decisions and supporting documents, and to communicate them to the coding agents. 
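A minimal sketch of that legal-documents flow, under two assumptions of mine: the current document version is a content hash of the markdown file, and acceptances are stored per user. All names are hypothetical, not the actual BitGN code:

```go
package legal

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"time"
)

// DocVersion derives a version ID from the markdown content itself,
// so editing the file automatically invalidates old acceptances.
func DocVersion(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:8]), nil
}

// Acceptance records when a user accepted a specific document version.
type Acceptance struct {
	UserID    string
	DocID     string // e.g. "tos" or "privacy"
	Version   string
	Timestamp time.Time
}

// NeedsAcceptance runs on each authenticated request: if the stored
// acceptance does not match the current version, the user is sent to
// the accept page before continuing to use the site.
func NeedsAcceptance(currentVersion string, latest *Acceptance) bool {
	return latest == nil || latest.Version != currentVersion
}
```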
As you can see, most of these changes focus on reducing friction and on letting human effort scale with the technology. These tiny improvements compound. For example, while writing this blog post, I came up with an idea to display a table of new cities that joined our community: It was a fun feature, not worth my time, so it was done completely by Codex (with two follow-up requests from me): Yes, the table could be done better. Yes, the styles are slightly inconsistent. Enterprises can come up with a thousand reasons not to deploy it. But it is good enough in the startup mindset! And since the deployment and DB versioning pipelines are already in place - it went live immediately. We can clean things up later, if needed. While these approaches work best with code (where quality is easier to verify), they can also be applied to other areas of life. 4. Personal knowledge base I always envied people who maintain elaborate and well-designed knowledge bases and personal productivity systems. I have tried far too many times with tools like OmniFocus, Notion, Evernote, DevonThink, Obsidian, etc. Most of these attempts stalled due to my lack of patience in maintaining them. However, there was so much noise around the OpenClaw/MoltBot personal assistant that I had to try something. The noise, I believe, happened because a noticeable portion of the population was able to experience for the first time what it feels like to have a personal AI assistant. And that experience was radical enough to fuel that media wave. I tried installing OpenClaw on a dedicated VM (for security). It felt refreshing in its freedom and ability to try things proactively. At the same time, I didn't like the lack of control and precision. Since I've been benchmarking too many LLMs and architectures, I just know too much about their failure modes. So what do you do when you want to figure something out but don't know how to approach it? You talk to a rubber duck. I had a fun session with ChatGPT Pro, trying to figure out how I could get a feel for OpenClaw's capabilities in my daily workflow, but in a more structured way. The resulting architecture involved a git repository that contains knowledge in markdown format, intermixed with agentic instructions on how to handle things. This repo is browsable with Obsidian and is connected to the remote (cloud) Codex. The structure at a high level looks like this: . ├── 00_inbox ├── 01_capture ├── 02_distill ├── 04_projects ├── 07_rfcs ├── 89_images ├── 90_memory ├── 99_process ├── AGENTS.md └── CLAUDE.MD It was designed by Codex as an attempt to solve my problem: experimenting with agents and giving them more autonomy, while letting me ask weird things and still keep some control. We've tried the following experiments: Automated knowledge capture and distillation. That surprisingly works and is the most used feature. I tend to throw URLs or voice transcripts at Codex, asking it to file the information. It records, files, and distills it into a card (a brief summary). For example, yesterday I had an idea of connecting the concept of a personal agent with the BitGN Challenge, and also my personal OS, and linking it with meta research on agentic trajectories. The idea came in the gym, so I opened Codex on my smartphone, did a memory dump and asked it to file it. The fun part was not the capture or distillation, but an afterthought. 
My Codex environment is configured with SOUL.MD (Be good, seek knowledge) and also an instruction for the agent to connect the dots on its own, exploring synergies and collecting insights. So here is what it wrote, reflecting my own thoughts back in a new light (which gives a few new ideas): <!-- AGENT_EDITABLE_START:reflection --> Novel vs recent cards: this is a concrete design leap from “agent workflow improvements” to a collaboration substrate where merge semantics become first-class product/API behavior rather than a tooling afterthought. The strongest compounding idea is dual-use architecture: one API contract can power both benchmark/challenge telemetry and real household/enterprise assistant operations, reducing duplicated infra bets. Treat semantic edit operations as governance boundaries (not only UX sugar): once operation types are explicit, authorization/reconciliation policies become enforceable and auditable per operation class. <!-- AGENT_EDITABLE_END:reflection --> Yes, I understand that using Codex (which is a programming environment with its own sandbox) to download PDFs, capture sites or reason is overkill and wasteful. But it works for me. The only problem is that this collection of insights lives in a GitHub repository, so it might be an inconvenient place for collaboration. So I've done another interesting thing. I started up Codex and asked it to inspect the repository, the nature of my last changes (the last ~20 commits) and all recently captured insights, and to come up with an idea of how I could keep all the features but make the knowledge base accessible to others (ideally, readable by ChatGPT Desktop/Claude Cowork via MCP). The entire thing had to be usable by the agents (I'm not editing these files by hand, only viewing), while still giving me the ability to review any changes and roll them back as needed. Plus, if there are any conflicts - it should be easy for the agents to handle the merges, not me. So Codex came up with a fun and weird design: a semantic text management repository with a CLI/API interface (pluggable as MCP) that makes it easy for agents to manage text on behalf of their controlling humans. My next step in this experiment is: \"Hey, I like this design. Now go ahead and write me a thorough RFC on how to implement this semantic storage and completely migrate this repository to it.\" I wouldn't be surprised if it succeeds (not after seeing yesterday how Claude Desktop can manage org data in Notion after being connected to MCP). I wouldn't be surprised if that task fails today, because it is too complex to be solved in one step. Either way, this will be another interesting and fun experiment that will teach me something new. And it will potentially create one more tiny compounding effect that will slightly reduce friction for all the other work that happens in the upcoming months."}
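To make the "semantic edit operations as governance boundaries" idea a bit more concrete, here is a speculative sketch - my reading of the design, not Codex's actual output - of what typed operations over such a knowledge base could look like:

```go
package semstore

import "time"

// OpKind enumerates semantic edit operations. Because operations are
// typed (rather than raw text diffs), each kind can carry its own
// authorization policy and merge rule - e.g. two AppendCard ops from
// different agents never conflict with each other.
type OpKind string

const (
	AppendCard OpKind = "append_card" // add a new distilled card
	ReviseCard OpKind = "revise_card" // rewrite an existing card
	LinkCards  OpKind = "link_cards"  // connect two cards in the graph
	RetireCard OpKind = "retire_card" // soft-delete, reversible by a human
)

// Op is a single semantic change proposed by an agent. A human can
// review, replay, or roll back ops individually instead of untangling
// textual merges.
type Op struct {
	Kind    OpKind
	CardID  string
	Payload string // markdown body or link target, depending on Kind
	Agent   string // which agent proposed the change
	At      time.Time
}

// Allowed is the governance boundary: policy is enforced per operation
// class, not per file. Here, agents may not retire cards on their own.
func Allowed(op Op, isHuman bool) bool {
	if op.Kind == RetireCard {
		return isHuman
	}
	return true
}
```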
{"url":"/post/aggregate-design-and-security/","title":"Aggregate Design and Security","text":"Aggregate Design and Security Answering a question from the community: How do you handle security concern of sending aggregate id to the web client? What'd you do to ensure I can't, from the browser or another client, manually send aggregate id that belongs to another client to either get some data about it or send some commands to mess things up? Each specific system scenario would lead to the specific design. If our specific scenario is: business logic is hosted on Application Server (e.g. hosted in Lokad.CQRS) which we control trust; client UI is hosted on Web server UI which we control and trust; Web server accesses application server (by sending commands and polling views) and renders HTML to user's browser. Then the solution to the security concerns problem is: We don't trust client browser anything (except from keeping unique token) we trust Web server UI to authenticate user (ensuring that he is who he claims to be). In this web App we will ensure that client sends only commands to the aggregates he is allowed to access (while checking commands as well). In app service we don't do any deep security checks and just carry out the commands. This scenario works well in simple situations where we have an app server and a few Web UI servers (alternatively with REST API servers), managed by the same team and hosted in the same controlled environment. If we have a different environment (e.g. application server can't trust web server), then a different separation of responsibilities might be needed. For example, we might need to perform authentication within the methods of Application service (checking that user is who he claims to be ), while pushing authorisation down to method calls on an aggregate ( checking that user can do what he attempts to do ). In this scenario, authorisation might be encapsulated within an instance of domain service , passed down by application service to a method on aggregate . This service would have access to permission maps (or any other way of representing role/permission information), aggregate would call it's methods to find out if the specific user can perform requested actions on the requested resources. Naturally, in this case, Web UI must still ensure that we don't send illegal commands to the server. The difference from the first scenario is - we don't trust the web server to be diligent. These and other specifics of project environment, team organisation and trust can have a great impact on development process and product design."}
{"url":"/post/analysis-paralysis/","title":"Analysis Paralysis","text":"Analysis Paralysis How many times did you want to start a new project and implement some really exciting idea? I've been there multiple times myself. Most of my attempts failed in the very beginning, because I was trying to think or plan too much in advance: How do I plan for future extensibility and adding new features? What if I need to switch between databases - how do we abstract this away? How do we scale out for 1000000 users? I need some formal process around features and releases. If project becomes successful and grows, new team members should have no difficulty joining in. These are the most sane of the imaginary requirements that used to come to my mind (more exotic ones included terms: \"Neural networks\", \"Linux kernel\", \"ARM processor support\" and \"should make good expresso\"). All this felt like something good, as if I were planning for every feature and problem in advance. However, in practice this somehow used to turn simple and exciting projects into challenging sets of problems, that had to be solved all at once. Most of the times, these sets were so complicated, that I had to stare them in awe without any slightest idea of where to start and what to do next. This state is often called analysis paralysis (or the worst way of dreaming). As you might guess, almost all of such projects were dropped, while the other half failed later during careful planning and execution. We can wish to be prepared for a lot of problems and features in advance. But do we really need for them to happen all at once? That's really hard to achieve. Life is simple. You can't walk 1000 miles at once. There has to be the first step, and then the one that will come after. There is an approach that helps to move forward with development in such situations (first time I heard it from Gregory Young). It can be really hard for developers, since we all are inherently perfectionists. Instead of trying to plan for the entire project in advance, we just take the smallest bite possible . You can call it a \" prototype \", \" minimum viable product \", \" let's give it a try or \" dirtiest and hackiest code I've ever written \". This attempt will be fast and deal with the core idea . If it fails - it will fail fast; if it makes at least some sense - it will only get better from this point. We can focus on the most painful problem that makes this idea shine (it will be easy to prioritise) and solve it. Then the next one and the next. The idea is just to start walking towards the goal, instead of burning yourself down by an attempt of 1000 mile jump (only to discover that you jumped in the wrong direction). The approach becomes even more valuable, when there are multiple stakeholders involved in the project. It is much easier to arrive at collaborative analysis paralysis, when everybody keeps on throwing their dreams in : \" we want this \", \" it should do this \", \" what if this happens? \". The most simple solution to the core problem provides team with a starting point for discussion and planning. It makes discussions more real, than juggling with wishes and fears in the abstract problem space. This approach also helps to prioritise further progress - you focus on the most painful thing first. Life and projects can overwhelm with problems. Keep it simple and focus on making next most important step. Step by step, you can walk around the world."}
{"url":"/post/anatomy-of-distributed-system-la-lokad/","title":"Anatomy of Distributed System à la Lokad","text":"Anatomy of Distributed System à la Lokad This is an update of my previous post on Building Blocks à la Lokad . It is based on improvements of the understanding and terminology thanks to Vaughn Vernon . It also shamelessly borrows from the subject of my continuous fascination: biology and nature itself. Bounded Contexts Let's start with the high-level overview of a Software-as-a-Service company (SaaS). This company could employ a range of systems needed to deliver its services to customers. Systems will be separate to reduce coupling and reduce the friction, as company evolves and grows (smaller the company is, more important this becomes). Each system is a separate Bounded Context ( BC term comes from DDD), which could have different implementation details: teams, technologies, language, lifecycle, deployment and maintenance specifics. Below are some examples of BCs in a SaaS company. Please keep in mind, that each bounded context is a modeling concept and not an indication of how system is implemented. For example, Client Portal could be implemented as a stateless web farm with ASP.NET MVC3, redis-based state server and cloud-hosted processing logic, accessible by desktop and mobile clients. Or, it can be a single Windows Server with IIS, file-based state and a few console processes running in background. This is similar to how a human body is composed of a set of biological systems , where each one serves specific purpose and can be built from specialized organs, cells and tissues. Events: Forming Digital Nervous System These separate bounded contexts need to communicate with each other. It happens by publishing events by each bounded context to an outside infrastructure. Each event is just a serializable message that complies with certain guidelines and is routed by messaging systems. Each event tells about something that has already happened : Invoice Created or Invoice Possibly Expired . Events streams form digital nervous system of a company, where bounded contexts act as biological system. Image above reflects event flows in an imaginary company. It might look complex in this specific representation, however, this doesn't always need to be perceived this way. While publishing, sender does not even know about existence of its recipients, however recipients can subscribe to any publishers, they are interested in. This traditional PUB/SUB approach simplifies the picture a lot, since we can focus on specific bounded context with its dependencies. It also makes it similar to information flow through biological neuron . Each bounded context can subscribe to events in two distinct ways: by declaring event receptors or with view projections . 1. Event Receptor Event Receptor is a simple declaration of events that the specific bounded context is interested in and will react to by sending commands to application services of this context. In code such receptors are usually grouped together in classes, according to their purpose: public sealed class ReplicationReceptor { // 'Domain' is the name of the primary Bounded Context // in this system readonly DomainSender _send; public ReplicationReceptor(DomainSender send) { _send = send; } public void When(SecurityPasswordAdded e) { _send.ToUser(new CreateUser(e.UserId, e.Id)); } public void When(SecurityIdentityAdded e) { _send.ToUser(new CreateUser(e.UserId, e.Id)); } // more receptor methods skipped 2. 
View Projection A View Projection subscribes to events, which are projected into a view (or persistent read model) that is structured in a way that is easy to query by components within this specific bounded context. In code, projection elements are grouped together in classes based on the view they keep updated. public sealed class InvoiceDeliveryProjection { readonly IDocumentWriter<unit, InvoiceDeliveryView> _docs; public InvoiceDeliveryProjection(IDocumentWriter<unit, InvoiceDeliveryView> docs) { _docs = docs; } public void When(CustomerInvoicePaymentRequested e) { var mark = new InvoiceDeliveryMark { AccountId = e.Id, Created = e.RequestedOn, InvoiceId = e.InvoiceId }; _docs.UpdateEnforcingNew(view => view.Invoices[e.InvoiceId] = mark); } public void When(CustomerInvoiceClosed e) { _docs.UpdateEnforcingNew(view => view.Invoices.Remove(e.InvoiceId)); } Both View Projections and Event Receptors are intentionally made extremely simple and easy to change. This is required, since they are coupled to external bounded contexts which can be controlled by different teams with different levels of agility and technological capability. Actual business behaviors and complex logic reside within Application Services and Tasks, which are safely isolated from change shocks by the projections and receptors. 3. Application Service Application Services are interfaces which are remotely accessible for calls within their bounded context. Normally these calls happen by sending a command towards one of these application services (this command can be either an explicit command message, or it can be carried by the infrastructure in the form of a remote procedure call). So each application service is just a set of command handling methods, which are executed when a specific message arrives. These command handlers are grouped together according to intent and design guidelines. Each one deals with one command and publishes events afterwards. It can call upon various service clients in the process as well. Examples of service clients that are usually used by command handlers: SQL client; NoSQL database client (e.g. key-value store with views that are projected within this bounded context); Connection to an Event Store; Integration endpoint with a 3rd party; Business Logic client; Command sender to an application service within the same bounded context. Implementations of command handlers within application services can differ: Stateless processing of incoming commands into events (functional style). Handling commands by executing certain behaviors and calling domain services (e.g.: CRUD-style Aggregate Roots). Aggregate Roots with Event Sourcing (AR+ES) and Domain Services (favorite). Here’s example code for an AR+ES implementation: public class CustomerApplicationService { // domain services initialized from constructor IEventStore _store; IPricingService _pricing; IInvoiceCalculator _invoicing; // handler for command SettleCustomerInvoice public void When(SettleCustomerInvoice cmd) { DispatchAndMergeConflicts(cmd.Id, cust => cust.SettleInvoice(cmd.InvoiceId, _pricing, _invoicing)); } // skipped other command handlers // helper method that dispatches call to an aggregate root loaded // from event stream. 
If there were any concurrent changes, we’ll // check server changes for merge conflicts and try to rebase our changes void DispatchAndMergeConflicts(IIdentity id, Action<Customer> action) { while (true) { var stream = _store.LoadEventStream(id); var agg = new Customer(stream.Events); action(agg); try { _store.AppendToStream(id, stream.Version, agg.Changes); return; } catch (EventStoreConcurrencyException e) { // verify our changes for merge conflicts foreach (var clientEvent in agg.Changes) { foreach (var serverEvent in e.StoreEvents) { if (ConflictsWith(clientEvent, serverEvent)) throw new ConcurrencyException(e); } } // there are no conflicts and we can rebase _store.AppendToStream(id, e.StoreVersion, agg.Changes); } } } } 4. Task A Task is the last element of a distributed system (à la Lokad). It essentially is a method call that is executed at specific moments in time. It can call upon various service clients (e.g.: query views or check up on 3rd party integration systems) and publish events. A task implementation is generally similar to an application service, except for the trigger part. Here’s an example of a task that continuously checks a list of invoices, detecting invoices that need additional action. var remindIn = TimeSpan.FromDays(15); while(!server.IsCancellationRequested) { var pending = _docs .Get<InvoiceDeliveryView>().Invoices.Values .Where(x => (x.Created + remindIn) < DateTime.UtcNow); foreach (var x in pending) { _events.Publish(new InvoicePossiblyExpired( x.AccountId, x.InvoiceId, x.Created)); } server.WaitHandle.WaitOne(TimeSpan.FromMinutes(15)); } Example In the snippet above, we actually handle a piece of a bigger invoice delivery and reminder process that would probably be implemented in a bounded context using all 4 elements: Event Receptor, View Projection, Application Service and Task. For instance, you can track invoices as part of a customer application service. This application service would use a customer repository and a currency converter service as its dependencies in order to handle commands like: Create Invoice Add Payment To Invoice Expire Invoice etc. Some of these commands will be sent to the application service by receptors of the bounded context in response to events: When Invoice Payment Arrived then tell the customer application service to Add Payment To Invoice When Invoice Possibly Expired then tell the customer application service to Expire Invoice In a picture this would look like: We can also keep track of all open invoices by creating a projection for an Outstanding Invoice View. This view will be used by the Invoice Expiration Tracker task, which will every now and then rescan the list to detect outstanding invoices that were created too long ago. For each one, it will publish an invoice possibly expired event. Purpose The purpose of all this separation is simple: to have a set of simple building blocks from which more complex systems can be composed in a reliable fashion that allows change without much friction and regressions. The rules of such composition should be rather straightforward and hopefully clear. I will be talking about them and the implementation details in following materials. This approach is what I have arrived at (so far) while working on various projects at Lokad. It is also what shapes the future development of our small but rather interesting Software-as-a-Service company. Even if it keeps on growing in complexity, I see no big problem in following up with the supporting infrastructure. After all, this has already been done by nature billions of times. 
As you have probably noticed, I keep on shamelessly borrowing concepts and approaches from her, because it is all well documented, and: Human subtlety will never devise an invention more beautiful, more simple or more direct than does nature because in her inventions nothing is lacking, and nothing is superfluous. Leonardo da Vinci Update This discussion continues in another blog post: Bird's-eye view of a Distributed System - Context Map , which tries to take into consideration the real-world environment around such a system."}
{"url":"/post/behavioral-analytics/","title":"Behavioral Analytics","text":"Behavioral Analytics What is Behavioral Analytics? Behavioral analytics focuses on how users behave, and why. This covers a wide range of applications, starting from online shops and up to connected devices in your smart home. Deep understanding of the behavior allows to optimize for future circumstances, serving customers better or optimizing the use of resources . There is a lot of good that can be done here. For example, a smart restaurant might use weekly visitor patterns, combined with information about reservations and some forecasts to plan for the daily cooking routine and product replenishment. Smart online store could provide personalized product recommendations, just like Amazon does (or do it even better). Smart houses can help to reduce energy consumption or warn on various dangerous situations (e.g. unexpected spike in energy consumption at night or unusual person entering from the window). Design Requirements Behavioral analytics is one of these areas where multiple interesting software requirements are present at the same time: responsive and high-load (we don't want to degrade performance during the holiday shopping rush); personalized (we need to track events at level of a single user); large scale data mining is required, since we have to process all the accumulated date in order to create new models, verify them and track their performance); user interactions have to be captured and quantified properly in order to have some data to process. All this might sound high-tech and complicated, however it is quite easy to start approaching such problems. Devil is still in the details, though. It all starts with domain events. Role of Domain Events User interactions with a system could be represented as a sequence of events in a stream. For example, in online store we could have such sequence: {{% img src=\"events.jpeg\" %}} The methodology of capturing such events is well-known and documented around the DDD related articles: event-storming, domain modeling sessions with experts, event-driven use cases etc. The story of 👍 HappyPancake covers all benefits of that in great detail. If a system emits events, these could be captured and persisted in an event store, suitable for further processing and integration. Towards a Deeper Insight With event store, it is quite easy to start using events for greater benefit (aside from integrating different modules and 3rd party systems). First of all, we could to set up a dashboard with various reports derived from such events and updated in real-time. Or easier yet - we could project events to some OLAP system and let managers slice and dice the data as they see fit. Even Excel has such capabilities. {{% img src=\"system.jpeg\" %}} Given this initial insight, domain experts could come up with interesting ideas for improving user experience and getting more revenue out of the system. Then, we could run batch processing across the event streams to verify new theories or fine-tune existing models. These could later be expressed in form of rules that will run in real-time and interact with users, reacting to their behaviors. Back in my economics R&D days I used to run evolutionary algorithms across historical datasets to capture dependencies between available data. This information was used to remove less relevant data from datasets and build refined models. They were used for forecasting and running various \"what if\" scenarios. 
Not only could we execute various rules in real time (with extremely low latency), but we could also adjust their behavior and verify them through A/B testing. This approach, perfected by Amazon, involves splitting the user base into multiple groups and giving each group a slightly different experience. The nature of the experiment is recorded along with the user behavior captured in events. These are later compared to pick the most efficient approach. With A/B testing you could verify various theories: What is the best location for an advertisement on a page? Which promotion offers for users buying product A drive more sales? Which room brightness is the most comfortable for people in the evening? This evolutionary process of iterating over theories, models and real-time rules is a reflection of the usual software development process in a slightly different field. The purpose of this process is still the same, though: gain deeper insight into the domain. One might even claim that the principles of domain-driven design could still apply here. To be continued Behavioral analytics and its application to various aspects of our life is something that has interested me for a long time. I hope to continue this topic in a series of posts. If you have some comments, insights or interesting ideas to share, please don't hesitate to get in touch."}
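One common way to implement the group split (a sketch of mine, not from the original post): hash the user id together with the experiment name, so assignment is deterministic and sticky without any extra storage.

```go
package abtest

import (
	"fmt"
	"hash/fnv"
)

// Variant assigns a user to one of n experiment groups deterministically:
// the same user always lands in the same group for a given experiment,
// while different experiments split the population independently.
func Variant(experiment, userID string, n uint32) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s:%s", experiment, userID)
	return h.Sum32() % n
}

// Example: show a promotion banner to half of the users and compare
// the behavior events of both groups afterwards.
func ShowPromoBanner(userID string) bool {
	return Variant("promo-banner", userID, 2) == 0
}
```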
{"url":"/post/behind-the-scenes-of-dddes-project-salescast2/","title":"Behind the Scenes of DDD/ES Project - Salescast2","text":"Behind the Scenes of DDD/ES Project - Salescast2 A few posts back I dropped a screenshot of Visual Studio solution structure for the Salescast2 of Lokad. Jonathan Oliver (author of EventStore and also partner in crime in the distributed podcast ) immediately left following comment: Ouch! That's a lot of projects. Granted, it's not as many as NServiceBus, but the pain in .NET land I'm feeling right now comes from compile times and JIT startup times on anything over 3-4 projects. I personally, I wouldn't consider Salescast2 solution to be really large, given what this baby can do. So let's do a quick tour, shall we? Salescast2 is a platform for pulling diverse sales data from retail networks (lots of it), performing some rather smart data processing (that's what we specialize in at Lokad), producing business intelligence reports and passing resulting reports back. Here's a case study and product page about the previous version of this platform. This sounds like something simple, so far. However, there is more. We pull data either via custom adapters or rather flexible database schema which could be located on Oracle, mySQL, MS SQL or PostgreSQL server. Each of these databases can have not only rather different setup (e.g. different columns for inventory products), but also different performance. For instance compare clustered Oracle setup versus some tiny mySQL database running in a shared environment. Obviously, Salescast2 should be able to automatically identify type of the database, it's configuration and performance capabilities, adjusting to all that . There are a lot of things that can go wrong. Some of these are critical (products table does not exist), while some are less important (exports table exists but is not writeable). Problems should be handled automatically if possible , exposing helpful information about the nature of information if something goes wrong. Printing exception stack is not enough. Solving all these problems was achieved via use of Event Sourcing applied to Domain-Driven Design . Granted, it was not easy, but fortunately we had our domain model established in the first version. Server-side functionality for all that is hosted in 5 projects: Contracts - message and persistence contracts, language of our system; Domain.BC - actual code that coordinates everything, starting from discovering projects and up to generating excel reports; Adapter.BC - messy code of SQL integration; Wires - some infrastructure-level implementations (e.g. sending emails or persisting documents); Worker - console to run app server locally, it also acts as a worker role that can run on Windows Azure cloud. That's it. This is the only project-specific code of the application server that runs the system . All the rest is less important. However database integration is only part of the big picture. This server code does much more . We also need to aggregate data, clean it up, run some analysis, calculate some formulas, start even more complex process of forecasting, retreive results, run additional formulas, arrange available data and skip missing bits, producing nice Excel reports or exporting data back to customer's database. Fortunately, during this process we don't need to deal with the most complicated part of analytics (producing classical and quantile forecasts). This aspect (and core of our technology ) is handled by a separate set of systems. 
Still, while doing all these calculations, we need to handle relatively big data. 300k products and 80m sales entries were the initial target for a single project (a customer can have any number of projects). AFAIK, that's more than 500MB of data if stored in compressed format. We achieved that, and also did a bit more: This dataset can be processed on a single thread on a laptop in under one hour (the old version used to require a day and a large server); obviously, in production we can use as many threads as we need. Data processing (e.g. map-reduce) uses efficient streaming operations that require a rather small amount of memory. The entire system can natively run either on a file system without ANY external dependencies except .NET (xcopy deployment is fine) or it can be deployed to Windows Azure, using its capabilities efficiently. One of the reasons behind that: SQL Server and especially the Microsoft Azure stack are horribly slow on my tiny MacBook Air, which has to run a bunch of Visual Studios in a VM with 1.5GB of RAM. So they had to go, in order to provide a comfortable development experience. Nobody regrets the decision. In order to achieve that, we used some of the old storage abstractions from Lokad.CQRS and also enhanced them with a few new ones (to be ported to open source). Obviously, this is still just the tip of the iceberg. This server, in addition to pure processing and business logic, also hosts some code for the Web UI projects, providing denormalized read models (views), which are convenient to bind to ASP.NET MVC views. This server tracks any changes in its own code and rebuilds views automatically, should it be required. It also accepts commands from various clients and provides additional integration capabilities with 2 more systems. These Web UI clients on their own are quite simple (since all the complicated logic is done on the app server). They add a few more projects to the solution: Admin.BC - projections and read models for the Admin UI; Admin.Web - Admin UI with ASP.NET MVC; Client.BC and Client.Web - idem for the Client UI. Both Web UI projects, obviously, display relatively nicely both in desktop browsers and on mobile devices. Although the UI design itself is not as nice as it could've been (if it turns out to be worth it - we could just hire a PHP or Ruby team to redo that UI in a few weeks). The rest of the projects in this solution are just support: Cqrs.Portable and Cqrs.Azure - copies of the Lokad.CQRS source code to provide the actual infrastructure capable of doing big data, cloud integration and all the other things; Dsl - a helper util to generate command and event classes on the fly; Admin.Deploy, Worker.Deploy, Client.Deploy - just deployment projects for Azure; Audit - a copy of the event stream viewer from Lokad.CQRS; a bunch of test projects (not enough). 
Obviously, there is a bit more: Capability for zero downtime upgrades; Capability for immediate backup and replication of data (event streams make it easy); Almost-infinite scalability (processing partitioning is dead easy here and storage is partitioned by default); Low-friction evolution - it is easy to add new behaviors or functionality to the application server; UI changes are never a problem - the server will rebuild the views as needed (and all queries take just one round-trip to Azure storage); Any server-side failure can be easily reproduced most of the time, including the complete state of the corresponding business object at that moment; Writing and evolving long-running business processes is only marginally more complex than dealing with projections (and these are handled automatically); Ability to fail over between clouds and availability zones (read-only mode and full functionality). That was something I really missed for Salescast on February 29th. So, all in all, I think this big-data BI platform has some decent capabilities packed into a rather small and simple project (even though it includes its own infrastructure and tooling in the source code). As you probably already know, we hate friction and unnecessary complexity at Lokad. Knowledge and experience gained during its development were pushed into the A+ES chapter of Vaughn's IDDD book (the chapter is not visible on Safari yet), into the Lokad IDDD Sample, and are being used to gradually improve the Lokad DDD/ES Sample Project. So far I'm pretty happy with how this little project went (it will be released to the public soon). Massive kudos, respect and thanks to the Lokad team for pushing effort in this direction. Plus all the people behind the CQRS/DDD and ES communities. What to do next? If you are a developer interested in this technology - please feel free to do what you want with the materials from this blog on Salescast or the source code from any of the Lokad Sample Projects (no stupid restrictions apply). We'll be updating them in the upcoming months to share some more code and practices. Don't hesitate to ask questions and share problems. Hopefully this little bit of sharing could help you to move forward personally. If you are an existing customer or a developer with some Lokad-specific questions - please don't hesitate to drop me an email. I'm especially interested in cases when some stupid big-data query takes hours to compute or when you need to budget millions of euros just to store and process your data. The latter is a huge waste of resources, you know (probably it all fits on a single laptop or even a smartphone). If you are a Lokad team member - we're just getting started :) By the way, do you know what the most difficult part of this journey was? There were two problems: Actually understanding what this product really has to do and how (Domain-Driven Design helped here); Stepping away from all these widely accepted technologies, practices, databases and service buses; then unlearning them and doing something much simpler. This journey was clearly worth all the effort. And if I can do one single thing, I would like to encourage you to try a similar journey. It does not need to be the same stack or set of approaches - just explore new directions and share the results. On such a journey, not only can you find something new for yourself, but you can also serve the greater good by helping to push the community forward. That's how we all learn, move forward and push the state of the art together."}
{"url":"/post/being-the-worst-updates/","title":"Being the Worst Updates","text":"Being the Worst Updates This is an update post to being the worst podcast on software design, which we started with Kerry Street at the end of this summer. A few things have changed since then. Module 1 (or Season 1, if you wish) is almost finished . In the second one we will switch to slightly more complicated domain (Factory is too simple). Further episodes will base on previously covered material with additional focus on: Production experience of a SaaS project run by one man or a small team; Collaboration between team members (or with external outsourcing parties); Collaboration between multiple sub-domains; integration with external systems; Deeper level of Domain-Driven Design; Patterns of Client UI development (including mobile clients, web UI). We'll see how it goes. So far, we are still the very worst and excited about staying this way. Growing community and support it provides - encourage us to keep moving forward personally while sharing the lessons learned: 250 subscribers milestone reached and still counting. Tonight I plan to record first episode of Being The Worst in Russian with Anton Vinogradenko . It will mirror english version, benefiting from its (relatively) coherent viewpoint, heavily commented samples and all additional reference implementations that will be added later. Tom Janssens contributed sample code in erlang for one of our episodes. There is work in community to provide Java equivalent for that, as well. Sharing knowledge became even deeper part of Lokad approaches (which are presented in this podcast). The same materials are reused in development training. Additionally our new Lokad Data Platform initiative shares theoretical foundation with BTW podcast. It is currently being introduced to largest retail companies in Europe as a way to enable high-speed data integration between formerly locked systems. Approaches and experience behind Data Platform will be covered in later modules of \"Being The Worst\" along with all these \"big data\", \"real-time\", \"cloud computing\" and \"business intelligence\" topics. Besides, I'm personally just curious, what would happen to the development world if some educational equivalent of \"Advanced Distributed Systems Course with DDD and Event Sourcing\" would be made available for free to the community (along with coherent set of additional study materials, assignments and samples). This can be an empowering social elevator for the people from undeveloped regions or poor families . What do you think about the podcast so far?"}
{"url":"/post/birds-eye-view-of-a-distributed-system-context-map/","title":"Bird's-eye view of a Distributed System - Context Map","text":"Bird's-eye view of a Distributed System - Context Map In one of my previous posts on Anatomy of Distributed System à la Lokad we were zooming into the design of a system that could be built as a part of SaaS company. Such system was built with and around CQRS/ES concepts. This approach looks like a nice by-the-book design: bounded contexts that are based on a uniform design and connected together to deliver some functionality. This is the kind of drawing that could be produced by an ivory tower architect or a consultant that is \"too expensive to ever touch the codebase\". However as we know, nothing exists in isolation from the real world . Every entity (either biological or digital) is always connected to much larger environment (and is affected by it). In software projects, for instance, you would have other departments to cooperate with, customers to serve and partners to integrate with. You might also have Microsoft and Apple OS to curse about. Even political rivalry between departments A and B in an organization could matter (especially if they share a database). So there is an immensely complex and diverse ecosystem that surrounds every single software project . There are different technologies, languages, concepts, mentalities and resource constraints. These things do not necessarily belong to the project directly, but could have an extremely strong impact on risks and costs, potentially leading to the difference between success and failure . Good software developer will always consider the most important of these strategic factors, while making tactical decisions. One of the ways to take some of these factors into consideration is by changing perspective of your system and by looking from bird's-eye view at it and other systems that it is linked to. Let's zoom out of our CQRS/ES-based set of systems in a department (which was discussed in Anatomy post ) and look at a bigger picture. We'll include other departments, some of the customers and partners. This view will be based on \"Context Map\" approach of DDD but slightly expanded to cover ecosystem within and around the company. I will keep CQRS/ES-based bounded contexts marked with orange border, while \"external\" ones will have borders of other colors (just like countries on a map). Ok, now that's a subtle change. We get a lot more variety in colors and shades, which represent differences: outside elements have technologies and context specifics that differ greatly from ours; our own elements turn out to have different shades of CQRS/ES implementations as well. In other words, each bounded context is different. These differences can be caused by a huge amount of internal and external factors. For example, consider these two contexts, which happen to be owned by different organizations: Obviously, these differences would make different approaches more or less successful than the others . And it actually is perfectly OK to have different approaches and technologies living in the same organization or a project, if this is justified by the environment. Unless, of course you are a developer from a Soviet Russia that has to employ absolutely identical set of tools everywhere (i.e. SQL everywhere, NHibernate everywhere, NServiceBus everywhere etc). A few vivid examples of ignoring the strategic context, while charging into the battle were clearly demonstrated by Napoleon and Hitler. 
They came over to visit Russia and got stuck there for the winter, with extended and fragile supply lines. So, please, don't repeat mistakes already covered by history books (these are the most shameful ones) and: Take into consideration the real world that surrounds the project, along with the trends and risks it holds for the future. Split your environment into a set of bounded contexts that can be distinguished by a shared language, organizational boundaries or technology manifestations. Map the terrain to make it more understandable. Before applying CQRS/ES (or any other tech) to a given bounded context - consider and compare it with other approaches, their risks and costs. Delay non-essential decisions as much as possible. Push for approaches that allow you to capture feedback from real life as fast as possible (lower development friction, faster iterations, simple architectures). Keep it simple and separated (KISS). Let me reiterate. Domain-Driven Design, with its approach of bounded contexts that can be joined into a strategic context map, is one of the ways to visualize and represent the environment surrounding and affecting a project. This approach can help to take all important factors into account and figure out the tactical details of implementation, architecture and technology. Sometimes the CQRS/ES approach will be the best fit; more frequently, however, you will have a different set of conditions that require a different solution to the challenge ahead. Pick your weapons wisely, for they might affect the outcome of the battle and the war."}
{"url":"/post/black-friday/data-oriented-design/","title":"Better performance with DOD","text":"Better performance with DOD Time flies! Black Friday Experiment has been running for 4 months, according to my git commit history. When I was starting, I wanted to scratch my itch and try improving on the past event-sourced designs. I'm quite happy with the code performance so far. It is faster than the LMDB-based inventory engine I used to work on in the past. In 3 minutes I can replay a story of a growing company: 500k nested locations, 498k SKUs, 690k orders, more than a million entities and events. All orders are resolved against strongly consistent availability engine. 1 thread, running in a local-only mode without remote event store replication. With EC replication, the speed will be the same. If we go consistent, then the latency/throughput will depend on the streaming/storage engine setup. Curiously enough, this exercise quickly moved from \" exercise in event sourcing \" to an \" exercise in writing inventory constraint solver engine \". In order to make things fast, I had to finally start thinking about how data is laid out in the memory and how CPU cores go through that . If you think of it, classical Object-Oriented-Programming (OOP) lays its data in memory like Spaghetti bolognese - objects and pointers are all mixed up. This is fine until you want to do some intensive computations with your data. For example, detecting early, if moving items between two warehouses will negatively affect any pending orders. Given that you have a few hundred thousand locations and products already in the system. If your object graph is done in OOP, then CPU pre-fetcher will have jump back and forth between random memory addresses. Performance is probably the reason why deep neural networks are never implemented in OOP style . If each neuron and axon were represented as a distinct object with its fields, then exact memory location of a given neuron could be non-deterministic. Instead, neural networks are represented as arrays, matrices and tensors. This happens to be more natural and fast for CPU or GPU to work with. Fortunately, there is a good old programming approach called Data-Oriented Design . It says: \" Dude, do whatever you want, just structure your data to use CPU efficiently \". One approach for that is about laying out data in continuous blocks of memory , so that it will be easy for CPU to prefetch all required data and just chew up through it. Just refactoring the codebase to achieve that was good enough for the performance gains I mentioned earlier: group all inventory locations for a given product close in memory; represent them as structs and arrays. // ProductStock captures stock quantities and reservations // of a single product across all locations and warehouses. // it represents a tree that is flat in memory. The root // always contains totals. // We can easily fit stock availability for 100 products in CPU cache // allowing us to resolve availability or run kit calculations type ProductStock struct { // location IDs in the system locs [] int32 // pointer to the parent location. -1 for root parentIdx [] int16 // reservation and availability counts reserved [] int32 onHand [] int32 } This means: whenever the code execution touches any stock, then all locations for that product are probably already in the CPU cache. So calculations across multiple locations will be cheaper. Main memory reference is as expensive as 20x L2 cache references or 200x L1 cache references. 
The impact of the resulting code is interesting - I need to maintain an event stream, durable aggregate state, and in-memory state that follows data-oriented design. The complexity is manageable by treating in-memory objects as immutable. All of these states are considered transient - only the event stream matters . However, for the sake of performance, we need to cache things close to the processor; always replaying the event stream would be horribly slow. There could be more interesting patterns to play with, like Entity-Component-System from game development. But the current implementation is good enough. No new features are going to be added to the codebase. I just need to make the existing specs pass. Ultimately, the final version of the specs and a spec runner will be made open-source as a part of the Trustbit katas. As for my Black Friday implementation - I'll keep it closed source for now. The domain is a toy project, but the engine, tooling and infrastructure are good enough for production and simple enough for a full handover to another team. Perhaps it will come in handy one day. Next : NixOS, the power of declarative VMs and per-project system dependencies ."}
{"url":"/post/black-friday/experiment/","title":"Black Friday Experiment - Report 1","text":"Black Friday Experiment - Report 1 This is a first report on the Black Friday experiment : let's rebuild and improve event-sourcing designs that I have worked on in the past decade. about the experiment; which features I wanted to keep and introduce; how the code looks like (features and specs); performance; simulation; tooling and development experience; possible next steps. About the experiment Black Friday is a special day in a retail business - a great deal of yearly revenue is made on that day . It also is a challenging day for engineers and operations. There is lot of stress on sales and inventory systems . Load goes up, cloud providers go down, things go haywire. This uncovers weird bugs, breaks assumptions, and teaches a lesson about cascading failures (under the Black Friday load failures never happen alone). In this world, upgrades have to happen with zero downtime under a sustainable load . So I wanted to see if I could design an event-sourced system to handle the Black Friday load while being simple to maintain and extend . At least, more simple than what we had achieved at SkuVault . Another way to look at the experiment - is about trying to scale event sourcing implementation to the extreme in performance, capabilities, and ease of supporting multiple engineering teams. From this perspective, developer tooling and experience become very important . In the recent past, I've worked on projects where finding and onboarding new engineers is a difficult process. This could relate to the domain (data science and machine learning in international logistics), but some of the lessons apply well to any software project. Picking the battles Here are the features that I want to keep from my past experiences: Event-sourcing - because that simplifies development and zero-downtime upgrades under sustainable load. Event-driven specifications to define and manage features that are linked to the user stories. Fast test runs - being able to run 1000s of specs per second on a local machine. Simulated development environments - be able to run a server or the whole cluster locally, if needed. High availability - killing any single server (Load balancer, application, or database) shouldn't bring the system down; Zero downtime for any operations. Here are the features I want to introduce: Tenant isolation . Each tenant has to run in its own tiny VM, while allowing to pack thousands of tenants on a single node. Previously: multiple tenants from a single partition would share a process. Design for live tenant migration and resource allocation . Previously - tenants were fixed to their partition. All tenants in a partition shared CPU, RAM, and storage. Use golang language - it is easy to pick up, has fast build times, and compiles to a single static binary that can be easily deployed. Previously: C# .NET Use gRPC to define APIs and contracts - because there is so much tooling around that. Previously: code-first contracts and custom IDL for service definitions. Use SQLite for the aggregate state - because there is so much tooling and ongoing work. Previously: LMDB database which has great performance but a smaller ecosystem. Design for event store compaction , since I want to keep primary event streams small and enable fast schema migrations. Previously: event streams were continuously growing without a bound. Enable per-tenant customizations , to reduce overall system complexity. 
Previously: features were developed and deployed for everybody, even if they were needed only by a few customers. Enable per-tenant encryption and encrypt all data (in flight or on disk) with a tenant-specific encryption key.

Codebase so far At a high level, the architecture looks like this: there is a single-tenant application that contains a single event-sourced aggregate. It is possible to have multiple tenants per application, and multiple aggregates, too. However, this increases code complexity and discards some nice capabilities. At this point I just want to see how far "one event-sourced aggregate per tenant" can take me. As you can see, the tenant has two interfaces: a request/response interface for API and Web; and publishing events to the event stream for replication (these events are also loaded by replicas). That interface would be my testing boundary for features. I have re-implemented the specification testing framework in golang. Specifications follow the common "Given events - When request - Then expect outcome and events" pattern. Check out this section on Event Sourcing for more details. Specifications test aggregate logic and look like this:
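(The actual snippet was a screenshot in the original post and did not survive the export to text, so below is a hypothetical reconstruction. The real framework is closed-source; every identifier here - Spec, Given, When, Then - is illustrative only, not the project's actual API.)

// Hypothetical sketch of an event-driven specification;
// all names below are illustrative.
package specs

type Event interface{}

type LocationAdded struct {
    Id     int64
    Name   string
    Parent int64
}
type LocationMoved struct{ Id, NewParent int64 }
type MoveLocation struct{ Id, NewParent int64 }

type Spec struct {
    Name       string
    Given      []Event     // events already in the tenant's stream
    When       interface{} // the API request under test
    ThenError  error       // expected failure, if any
    ThenEvents []Event     // expected newly produced events
}

var MoveContainerToZone = Spec{
    Name: "move container into the unloading zone",
    Given: []Event{
        LocationAdded{Id: 1, Name: "Unloading"},
        LocationAdded{Id: 2, Name: "Container-17"},
    },
    When:       MoveLocation{Id: 2, NewParent: 1},
    ThenEvents: []Event{LocationMoved{Id: 2, NewParent: 1}},
}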
When specs run, they produce output like this on failure: I started with specifications because they are more important than the actual implementation. They define the behaviors of a service, capture edge cases, and prevent regressions. If you have good tests, then you can reliably refactor the actual implementation. You can merge, split or even completely rewrite the actual codebase. See High Availability and Performance for more details. There is a portable demonstration of this approach on Replit. It contains a subset of specs and a spec runner. The last two specs are intentionally broken. The unit of development in this approach is not an aggregate, but a feature. Features can relate to specific Web/API methods (e.g.: create location, move location, reserve inventory) or be more fine-grained (e.g.: run a custom script when reserving stock for an incoming order). It all depends on the user stories. I usually keep one feature in a single file , with a separate specifications file nearby. That file would normally be a single method that covers everything from API/Web to DB access to event generation and application. Let's take a look at a simple feature. It is an API method that allows moving a location under any new parent location. For example, we could have an incoming container that is being moved to the unloading zone in a warehouse. All logic happens within a single transaction that is managed by the framework. Within that transaction we are free to: Query the database . Queries are extremely cheap and generally take microseconds each. Make decisions and generate new events . Events will be applied to the database immediately, also taking microseconds. Return with an error at any point . This will abort the transaction and roll the state back. Return a valid response . The framework will take care of appending events to the event store and completing the transaction. The feature-oriented project layout works surprisingly well. It is a relief from projects where adding a single field required changing half a dozen files. Or a dozen, if that involved entities living in different aggregates. In order to see how well this approach handles business complexity, I'm adding features that weren't on the table before (in my past projects): tenant-specific scripts; flexible inventory location structure (e.g. warehouse-bin or region-warehouse-zone-row-shelf-bin with movable containers and pallets); the ability to reserve stock at any given location level (this reduces item availability); the ability for tenants to create inventory reports that provide a fast and consistent view into product quantities at predefined locations (e.g. by warehouse or in all containers-in-transit). We'll see where this breaks. This is what custom scripts look like , by the way. They are installed into the aggregate via a normal event, which allows them to be testable without any extra scaffolding.

Performance Test performance is good. I can run more than 10 thousand specs per second (using the perf command to run existing specs in a loop):

> make && bin/bf perf -sec 5
Matched 30 specs out of 30
Speed test with 4 core(s) for 5s
OPERATION    TOTAL   OPS/SEC  KHZ   SEC PER OP
run spec     64050   12810    12.8  78.064µs
apply event  172935  23716    23.7  42.164µs
executed 64050 specs

This also includes specs that involve custom tenant-specific scripts in Lua. That logic has a bit of overhead to start a new Lua VM, load its state, run custom scripts, and return control to the API call. Tests run on an empty database (using a combination of :memory: , WAL, and transactions to keep things fast), so they don't indicate overall performance in the real world. To keep development grounded in reality, I started building tooling for stress tests. Here are the first results of a simple simulation scenario run with the stress command. The numbers are a testament to the performance and efficiency of golang and SQLite .

> make && bin/bf stress
Start simulation
Setup simulated network
Dialing 'sim'
Serving on 'sim'
DURATION  DB SIZE  LOCATIONS  PRODUCTS  ON-HAND  RESERVED
432 ms    4.4 MB   511        1000      4934     16109
422 ms    4.5 MB   1022       2000      9486     32597
424 ms    4.7 MB   1533       3000      14116    48624
424 ms    4.9 MB   2044       4000      18717    64874
430 ms    5.1 MB   2555       5000      23356    81562
433 ms    5.3 MB   3066       6000      28643    98009
429 ms    5.5 MB   3577       7000      33492    114569
426 ms    5.7 MB   4088       8000      38785    130953
428 ms    5.8 MB   4599       9000      43605    147313
435 ms    6.0 MB   5110       10000     48865    164267
436 ms    6.2 MB   5621       11000     53694    180734
431 ms    6.4 MB   6132       12000     58368    196699
477 ms    6.6 MB   6643       13000     63468    213129
509 ms    6.8 MB   7154       14000     68682    229472
527 ms    7.0 MB   7665       15000     74143    245777
452 ms    7.2 MB   8176       16000     79378    261941
452 ms    7.4 MB   8687       17000     84434    278851
459 ms    7.6 MB   9198       18000     89124    295007
493 ms    7.7 MB   9709       19000     93776    311512
559 ms    8.0 MB   10220      20000     98624    328051

Simulation The stress test tool above uses a simulated network connection. This is possible because the gRPC SDK in golang relies on the net.Listener interface:

// A Listener is a generic network listener for stream-oriented protocols.
type Listener interface {
    // Accept waits for and returns the next connection to the listener.
    Accept() (Conn, error)
    // Close closes the listener.
    Close() error
    // Addr returns the listener's network address.
    Addr() Addr
}

Even an onion could match this interface, if it knew how to return an instance of net.Pipe() (a synchronous, in-memory, full-duplex network connection). net.Pipe() allows testing the whole stack without actually touching network devices, until needed. This is as far as the simulation goes in this project, though. I couldn't figure out how to control the golang scheduler and simulate the passing of time, like it was possible in .NET (see Simulate CQRS/ES Cluster ).
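For reference, here is a minimal sketch of such in-memory wiring. It uses the bufconn helper that ships with grpc-go (an in-memory net.Listener built on exactly this idea); the empty server is just a stand-in for the real tenant API:

package main

import (
    "context"
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/test/bufconn"
)

func main() {
    // In-memory listener: no sockets, no ports, full gRPC framing.
    lis := bufconn.Listen(1 << 20)

    srv := grpc.NewServer() // a real setup would register tenant services here
    go func() {
        if err := srv.Serve(lis); err != nil {
            log.Println("serve:", err)
        }
    }()
    defer srv.Stop()

    // The client dials through the same pipe instead of the network stack.
    conn, err := grpc.DialContext(context.Background(), "sim",
        grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
            return lis.DialContext(ctx)
        }),
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithBlock(),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    log.Println("connected over simulated network")
}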
Tooling and development experience While specs are enough for most of the development, in some edge cases I need more insight into the application state . The explore command helps here: it runs a single test specification, exporting the final state into a standalone database. This is good for fixing weird bugs, tuning queries or even developing new features. After starting to use this command, I noticed a nice side effect: GoLand picked up on the schema (because I used it to run some queries) and started intelligently completing and highlighting inline SQL. Speaking of the development experience, the full cycle of building the code and running tests isn't that fast . It takes roughly 3 seconds on my laptop (Intel MacBook). I blame my CGO dependency (SQLite) for that. Running go install -v doesn't help.

> time make build test
make build test  2.96s user 0.88s system 112% cpu 3.427 total

I also took a look at running a self-hosted server (with gitea ). It had a server-side post-commit hook that would trigger the build inside a VM and send any failures to Telegram. So whenever I git push , a success or failure notification arrives within 3 seconds. It was an interesting experience.

Next steps I have a few ideas of where to take this next. Below is a list without priorities. There are too many ideas, and only one or two will get a follow-up. Switch the spec runner to work through the simulated gRPC interface. This will make it possible to compare multiple different implementations of the same features, even if they are written in different languages. Christian Folie is looking into an F# implementation! Explore adding more domains to the same codebase (JIRA, an ML pipelines library, a community blog with comments). Recover experiments with Terraform and FoundationDB to deploy the golang codebase in a cloud environment against a clustered event store. Launch the stress command and then start manually killing nodes at random. Continue the exploration of Firecracker VM (it is awesome) to streamline the process of building the application as a bootable VM image. Then check out what it takes to run a few hundred of them on a single server. What would the CI/CD process with stages look like? Improve the performance a bit more: prepare and cache SQL queries, switch the Lua VM to native LuaJIT, and play with request batching (trading latency for throughput). Add a web UI and web specs, so that features are more realistic. Set up a proper demo user map with an issue tracker and CI/CD cycle, then bring over a few friends to see what they think of the development experience. I'm personally interested in the Firecracker VM line of research. What do you think?"}
{"url":"/post/bounded-context-is-a-team-working-together/","title":"Bounded Context is a Team Working Together","text":"Bounded Context is a Team Working Together Let's talk about architecture-level decisions that drive development and evolution of systems. As Stacy noticed recently: I notice your architecture is at the BC level. I certainly can see some advantages. But I wonder about things growing into a big ball of mud as requirements dictate more aggregates or larger ones. By designing at the aggregate root level, things stay small an agile as the business requirements grow and change. Although the tradeoff is more message chatter between them. Just wondering what drove you to the BC's as components, rather than aggregate roots? I used to focus just on aggregates (taking the route of \"your aggregate is your bounded context\"). This looked plausible in theory, but it didn't work for me in practice that well. Reason behind that - Aggregates are the most important building block of my systems, since they capture core business concepts and deal with complex persistence tasks (via event sourcing). However, focusing only on aggregates can lead to less efficient designs , that lack some synergies and capabilites. Such approach encourages putting too much responsibility in one place, which significantly raises complexity and cost of change (if you seen recent \"Avengers\" movie, you have an idea of what happens, when you put a team out of outstanding heroes - most of the time they argue and deliver collateral damage). Besides, this actually leads to rather chatty implementations. Aggregates, no matter how significant they are, still need to cooperate and exchange information (e.g. consider the case with \"Customer\" and \"Invoice\" aggregates, where old invoice has to be cancelled, whenever an override invoice is issued for the same customer). Solution to that problem of mine was to start thinking in terms of bounded contexts, rather than individual aggregates . Bounded Context (BC) is part of a domain model, unified by a common ubiquituous language, shared and connected concepts. It is like an organ of a human or a plant - composed from multiple separate cells, but with distinct boundaries (or, you can consider BC to be a country, if you prefer my war analogies ). As such, Bounded Context consists of multiple distinct building blocks that work together to simplify achieving common goals of capturing core business concepts (and aggregates are one of these building blocks). Implementation-wise, Bounded Context is a highly specialized team of focused soldiers (e.g. A-team ) that work together. They tend to be deployed together as well (which aligns them well with the development and operations). Software solution can be composed rom multiple bounded contexts. Each arcitecture style has it's own set of building blocks. Common guidelines for picking them are: blocks have to be relatively simple and coherent on their own in theory; development, management and evolution of this building blocks should be practical . In the case of doing Domain-Driven Design a la Lokad , we have arrived to the list of building blocks: Application Services - groups of command handlers, defined on a single class. Most of the time, these command handlers are responsible for loading aggregates and executing their methods (although, other uses are quite common, as well). 
Aggregates implemented with event sourcing ; they capture behaviors and core business concepts in an expressive way that frees us from bothering about complex persistence, its testing and versioning. View Projections - event handling classes that denormalize events into persistent read models called "views". These views are completely disposable and can be automatically rebuilt by the server if the corresponding projection code changes. They are used for bringing information together in a way that is easy to consume by aggregates, clients, tasks etc. Event Ports - simple event handling classes that declare event subscriptions and act as an ACL (previously called "Event Receptors", but I switched the name to match "Ports and Adapters" by Alistair Cockburn). They make boundaries and relations explicit. Tasks - long-running processes (in essence, just a while(server.IsRunning) { DoSomething(); Sleep(server, 5.Mins()); } ) that allow the system to actively invoke actions, keep track of timeouts and schedule actions. In practice, not every bounded context has all of these blocks (for example, my client BCs tend to have only view projections). However, having them at hand (along with the available practices of development, deployment and operations) is what makes a lot of scenarios possible: from a simple SaaS billing system up to a self-diagnosing big-data platform for integration (which integrates directly via databases with flexible schema) and business intelligence. Please keep in mind that this choice of building blocks is by no means something final, officially approved or certified (hell, I'm merely bringing together and exposing the knowledge of more experienced practitioners). It is just a selection that has (so far) proven to work best in Lokad and some other small result-oriented companies. These are companies where software designers and architects have worn the burden of responsibility for their decisions and choices for years. This is exactly the environment where operations crush and crumble nice theoretical ideas, forging other solutions that aren't that pretty. Yet, in practice, they work. Each block has a whole story behind it. I hope to share it eventually (a dude called Vaughn Vernon even strongly suggests writing a separate book on that). However, what matters the most right now: this is the minimal set of blocks that we would currently need to build a system in a case that requires the application of Domain-Driven Design (with or without capabilities for cloud deployments and big data processing). More is certainly possible, but that would be less. This approach to structuring Bounded Contexts will certainly evolve and change as simpler approaches are found and new tools show up in this area of knowledge (some of the future improvements in the development landscape are already accounted for).

Takeaway Aggregates are just one of the building blocks for expressing Bounded Contexts in code. Aggregates implemented with event sourcing shine in capturing rich behaviors that might require both complex persistence AND rapid evolution of the underlying schema. Domain-Driven Design and the use of the other building blocks is an essential prerequisite for this approach to succeed."}
{"url":"/post/building-blocks-in-cqrs-world-la-lokad/","title":"Building Blocks in CQRS world à la Lokad","text":"Building Blocks in CQRS world à la Lokad Once you have broken your systems into separate subsystems (bounded contexts) and introduced Command-Query Responsibility Segregation inside each one, there are multiple paths to take from there. Each path depends on your background, familiar technologies and available developers. Two Primary View Points I think that all views on building distributed systems in .NET world can be organized in two groups. Developers from Udi Dahan/NServiceBus world tend to structure systems with CRUD aggregates (persisted in SQL or RavenDB) and SOA Services, implement business workflows with sagas that look almost like aggregates. Choice of persistence usually is restricted to transactional systems (MSMQ being the star here). There is a smart usage of caching and internet infrastructure to deal with scaling. People with preference for Greg's ideas tend to express business behaviors with event-sourced aggregates, avoid use of \"orchestration\" term and structure flows with either state machines or stateless document sagas. Views are usually rebuilt by replaying events. Choice of persistence does not matter and transactions are much less important. à la Lokad I'm personally closer to Greg's side, but a lot more constrained (startup environment and hybrid hosting environments took their tall). Primary differences (for the latest projects) is that I do not use any sagas (there is no such term) and there are no future messages. Let's highlight primary domain building blocks that are used in post-CQRS world according to Lokad: Aggregate Roots Workflows Projections Domain Services Processes Aggregate Roots Aggregate Roots with event sourcing (or AR+ES) - have complex behaviors, are easily tested and persistence - ignorant. They serve as consistency boundary and are partitionable by Id. When command arrives to the server, it is passed to the command handler, which loads the aggregate root (and any relevant services) and executes an action against that aggregate root. That action results in events that both change state of the aggregate and are published as messages to all subscribers. One of the simple mental patterns of implementing AR+ES entities revolves about splitting state and behaviors in two distinct classes: aggregate state and behavior. Aggregate State class contains structural representation of AR+ES, that can be mutated only by applying events to it. public class CustomerAggregateState { public CustomerAggregateState(IEnumerable<IEvent<IIdentity>> events) { foreach (var @event in events) { Apply(@event); } } public int Version { get; private set; } public bool ConsumptionLocked { get; private set; } public void Apply(IEvent<IIdentity> e) { RedirectToWhen.InvokeEventOptional(this, e); Version += 1; } public void When(CustomerLocked e) { ConsumptionLocked = true; } public void When(CustomerUnlocked e) { ConsumptionLocked = false; } // ETC... } Actual aggregate class contains behaviors that are usually executed in response to commands sent. In order to carry them out, aggregate uses it’s own state and any available domain services. Any changes are passed down to state as events. They will also be saved in unit of work, then - committed to event store (which will publish them afterwards). 
public class CustomerAggregate
{
    Action<IEvent<CustomerId>> _unitOfWork;
    CustomerAggregateState _state;

    public void LockCustomerForAccountOverdraft(IPricingModel pricing)
    {
        if (_state.ManualBilling) return;
        var balance = pricing.GetPaymentThreshold(_state.Currency).Convert(d => -d);
        if (_state.Balance > balance)
        {
            Context.Explain("Balance {0} is above threshold of {1}. Don't lock",
                _state.Balance, balance);
        }
        else
        {
            LockCustomer("Overdraft");
        }
    }

    public void LockCustomer(string reason)
    {
        if (_state.ConsumptionLocked) return;
        Apply(new CustomerLocked(_state.Id, reason));
    }
    // ETC

    void Apply(IEvent<CustomerId> e)
    {
        _state.Apply(e);
        _unitOfWork(e);
    }
}

The command handler deals with bringing all this together:

/// This command handler can be replaced by a set of lambdas
public class CustomerHandler
{
    IAggregateStore<CustomerId, CustomerAggregate> _store;
    IPricingModel _pricing;

    public void When(LockCustomer c)
    {
        _store.Update(c.Id, a => a.LockCustomer(c.Reason));
    }

    public void When(LockCustomerForAccountOverdraft c)
    {
        _store.Update(c.Id, ar => ar.LockCustomerForAccountOverdraft(_pricing));
    }
    // ETC...
}

Domain Services IPricingModel from the code above is actually an example of a domain service . It is acquired by the command handler and then passed to the aggregate root to provide all sorts of rich functionality. If Aggregate Roots (with their complex behaviors and advanced structural persistence) are the brains, then domain services are the muscles. Here are some examples: index lookups; pricing calculators; mail messaging; integration with payment systems.

Workflows Workflows are the cornerstones of interactions in our bounded contexts (or sub-systems). They subscribe to all sorts of events that happen in the environment, both outside and inside the bounded context, and define immediate reactions to these in the form of commands that will be sent to players inside.

public sealed class BillingWorkflow
{
    readonly IFunctionalFlow _flow;

    public BillingWorkflow(IFunctionalFlow flow)
    {
        _flow = flow;
    }

    public void When(CustomerBillChargeAdded e)
    {
        _flow.ToCustomer(new WriteCustomerInvoiceForBill(e.Id, e.BillId));
    }

    public void When(CustomerInvoiceWritten e)
    {
        _flow.ToCustomer(new RequestCustomerInvoicePayment(e.Id, e.InvoiceId));
    }

    public void When(InvoicePaymentReceived e)
    {
        _flow.ToCustomer(new AddCustomerInvoicePayment(e.CustomerId, e.InvoiceId,
            e.GrossAmount, e.PaymentCode, e.PaymentId));
    }
    // etc
}

This code is extremely simple and serves one purpose and one purpose alone - to explicitly define the events that this bounded context reacts to. These reactions will then be carried out by our command handlers, which will load the appropriate aggregate roots (brains) and let them operate domain services (muscles), while remembering what happened for future generations. The explicit differences from sagas are: The term "saga" is completely overloaded. Initially it was introduced for managing long-lived database transactions and then got hijacked by Udi Dahan and NServiceBus. Sagas usually allow complicated business logic structured around internal state. Workflows usually don't have any logic and serve merely as a way to explicitly define the events which this Bounded Context subscribes and reacts to. If we bring all this into one picture, this is what shows up:

Projections OK, this picture is nice, but dealing with events outside of Aggregate Roots can be complicated (unless you enjoy querying event streams). That's where projections come into play.
They allow projecting event streams into any structural representation (a view, or persistent read model). This view is eventually consistent and persistence-ignorant (in other words, it can live in any key-value store with decent consistency guarantees). Projections are usually way too simple to require any testing.

public class CustomerInvoicesProjection
{
    IAtomicWriter<CustomerId, CustomerInvoicesView> _writer;

    public CustomerInvoicesProjection(IAtomicWriter<CustomerId, CustomerInvoicesView> writer)
    {
        _writer = writer;
    }

    public void When(CustomerInvoiceWritten e)
    {
        _writer.UpdateEnforcingNew(e.Id, i => i.AddInvoice(
            e.InvoiceId, e.Totals.Total, e.Header.CreatedUtc));
    }

    public void When(InvoicePaymentReceived e)
    {
        _writer.UpdateOrThrow(e.CustomerId, i => i.AddPayment(
            e.InvoiceId, e.GrossAmount, e.PaymentId));
    }

    public void When(CustomerInvoiceClosed e)
    {
        _writer.UpdateOrThrow(e.Id, i => i.CloseInvoice(e.InvoiceId, e.Reason));
    }
}

[DataContract]
public class CustomerInvoicesView
{
    [DataMember(Order = 1)]
    public IDictionary<long, CustomerInvoice> Invoices { get; set; }

    public CustomerInvoicesView()
    {
        Invoices = new Dictionary<long, CustomerInvoice>();
    }

    public void AddInvoice(InvoiceId invoiceId, CurrencyAmount total, DateTime createdUtc)
    {
        Invoices.Add(invoiceId.Id, CustomerInvoice.Create(invoiceId, total, createdUtc));
    }

    public void AddPayment(InvoiceId invoiceId, CurrencyAmount grossAmount, string paymentId)
    {
        Invoices[invoiceId.Id].ApplyPayment(grossAmount, paymentId);
    }

    public void CloseInvoice(InvoiceId invoiceId, string reason)
    {
        Invoices[invoiceId.Id].Close();
    }
}

On our diagram they would look like this: The service (that provides query capabilities over the view) could be located in the client (Web UI) or inside a server-side bounded context. For instance, I could maintain an eventually consistent list of all outstanding invoices in my payment processing BC, so that when a payment comes in, I could map it to the invoice by reference number.

Processes These 4 building blocks can be combined in various ways to define a system that passively reacts to external events (which could be caused by a user) and publishes responding events outside. How do we proactively do something to drive the world around us? Let's code up a virtual user that runs loops over the projected data and carries out actions when it finds something interesting. These actions will come either as events published to everybody ("FraudTransactionDetected") or commands to a command handler within the same bounded context ("TryClosePendingInvoice"). In some other teams, people would use "sagas with state machines and future messages" to implement this simple user. In our case, this is just a simple process that accesses some service in a loop. The service can be our own read model or some 3rd-party system that we are polling.

while (!token.IsCancellationRequested)
{
    try
    {
        var doc = _reader.GetOrNew();
        foreach (var cmd in GetOverdraftSuspects(doc))
        {
            _endpoint.ToCustomer(cmd);
        }
        token.WaitHandle.WaitOne(waitPeriod);
    }
    catch (Exception ex)
    {
        Context.Debug(ex);
        // to avoid overloads
        token.WaitHandle.WaitOne(TimeSpan.FromMinutes(5));
    }
}

But, unlike sagas, this process is not a pain in the neck when it comes to debugging, maintenance and upgrades in the real world.

Bounded Context All these 5 blocks from above can be combined together within a bounded context to express any kind of desired behavior. For instance, let's say we are defining a BC for managing the access of users to some cloud service.
This BC would be responsible for: telling the service fabric which users are granted access to it (and with which keys); receiving fine-grained usage statistics from the service fabric; aggregating these statistics on a periodic basis into consumption bills for the actual billing process. Such a BC would have the following components: a workflow that subscribes to outside account management events (from other BCs) and passes them as commands to internal command handlers (which will have aggregates for consumption tracking); these command handlers would also call a domain service for the service fabric management API to add/remove access keys for users; there will be a continuously running process to poll the management API of the service fabric for any new consumption details and pass them to consumption aggregates; there will be a projection that maintains a list of all accounts along with their respective billing periods; this projection will be scanned once in a while by a process that looks for accounts that should have their consumption bills settled. For each of these, it will send a command to an internal aggregate, instructing it to close them. Somewhere in bounded contexts far, far away, there could also be: a billing process in some other bounded context that is interested in events about closed consumption bills (using them to charge the customer's balance); a Web Admin UI that displays a view with the last 500 resources consumed and the top consumers (kindly populated by the corresponding projections); a Web client UI that displays for the customers a detailed breakdown of the resources they have consumed within the last billing period; etc. But these are different bounded contexts, with a different story, purpose, tech requirements and combination of these 5 core building blocks.

Update This article has been improved upon with: Anatomy of Distributed System à la Lokad"}
{"url":"/post/business-amnesia/","title":"Business Amnesia","text":"Business Amnesia At Lokad we focus on big data analytics and consulting for the retail. This is about doing some small things (sometimes simple, sometimes not) that would companies to be more efficient in their business (e.g.: reducing lost inventory, getting better service level, optimizing replenishment for an inventory of 500k products, etc). More often than not, I observe two patterns: First, companies have really refined and deep business models . They are really good in what they are doing. Second, business processes of these companies tend to be hindered by their software and it's limitations coming from the previous century. This hindrance comes in following forms: Business practices in a company don't even have a chance to evolve and adapt simply because legacy software is so fragile and limiting. Something that was once designed to support and serve now resists change by being expensive and risky to change. A lot of money and resources is wasted inefficiently to work around such limitations. Companies can still be extremely profitable, but a few billions of euros less than they could make. A lot of vital business data is lost irreversibly, simply because before for years it was too expensive or impossible to persist it within the existing setups. So we are arriving at an interesting paradox here. On one hand, a company can be really good in what it is doing in business. On the other hand it can continuously suffer from the progressing amnesia - loosing fine-grained information about actions and decisions linked to individual customers, supliers, products or promotions. The fact that you can't process your business data today, does not mean, that you will not be able to benefit from it later. It might be useless or it might increase your revenues. However, if you don't save it - you will loose it forever. Imagine what a sad world we would live in, if health records of every patient were discarded. Detailed history of business operations is such a set of health records. It can be used to find problems or explore opportunities for improvement, which would be unique for each unique business. You just don't throw out such vitual information. Such general approach to discarding extremely valuable data can be even more surprising, given the fact that these days it becomes increasingly cheap to store data for future reference . Prices start from 15 cent per GB per month (any cloud provider) and go down to 1 cent per GB per month ( Amazon Glacier ). Keep in mind that 1 year of sales history of 1000 supermarkets can easily fit on a single smartphone. More often than not, you don't need any rocket science to store your business history for the purposes of the learning and analysis. So please, do it - this a cheap and low-risk way to potentially derive additional value later. Business amnesia, on the other side, can be an extremely risky thing, especially in competitive markets."}
{"url":"/post/challenges-of-the-code-documentation/","title":"Challenges of the Code Documentation","text":"Challenges of the Code Documentation Here's the interesting problem. There are numerous situations, when code contains a lot of important information . This important code can change really frequently too. Let's say that we need to relay this important information for somebody who is not intimately familiar with the codebase. For example: Researchers depending on the conventions and transformations in some data pumping project. New users being introduced into some project via articles with a lot of samples. Managers , requiring knowledge of some business constants and rules. 3rd party Developers , that have to integrate with some API, while having the access to the latest samples, restrictions and constraints. Needless to say, that important code pieces could be scattered across multiple projects, adding friction to people that need to have a look at them fast. We do want to have this friction at minimum ! This way we increase the chances that some questions could be resolved by looking at the documentation, instead of wasting time and potentially involving somebody else into this quest for the answers. Saved time essentially translates in reduced expenses and faster reaction of an organization (resulting in improved ability to compete on the market). There also might be some important contextual information about this code. It might or might not be valuable for the certain party, but developers would want to write it somewhere (enabling them to forget details and free Brain RAM for the other tasks). Comments usually help here, but they have to stay with the code and are limited to the plain-text (no graphs, images, tables or even bold). One common way of relaying this information (in some specific context) is to document the code in external docs , while including the latest snippets. However, the code tends to change a lot . This is especially true for fast-paced environments with tight feedback loops and low-friction development (and deployments). So we have got ourselves a problem here: we either need to waste time and concentration on updating the documentation after every significant code change (i.e.: a few times a day); or we have to accept the fact that the documentation is out-of-date and essentially useless; or we have to include links like: \"for the actual details look in the method DoomsdayMachine.RefreshWorld() and any other methods it might call\" . We'll also need to remember to update the links, should the class be renamed or moved. One logical solution is to have auto-generated documentation that could be compiled from some text, while automatically linking to the code sources. And it has to survive refactoring and class renames. I know that Lokad researchers use LaTeX with some scripts for such tasks. However the whole LaTeX thing looks a bit of overkill here, plus I'm not sure it can bind to some MSIL-level markers within the .NET code, while providing common publishing functionality. Ideally this would work like this: Project has documentation files stored and versioned side-by-side with the sources (ideally in the same solution). These documentation files are expressive enough to contain graphs, images, tables and all the other nice publishing things, while referencing some code blocks in the project. Editing the documentation would be WYSIWYG-friendly , while the original document format would be friendly to the version control (and seeing the changes). 
Changing the original code (i.e.: adding a few lines at the beginning of the file, or moving a method around) should not break the documentation. Whenever needed (or continuously on the integration server), these separate doc files are assembled and rendered to the desired publishing format (i.e.: online docs or PDF). Any document-level compilation problems are detected immediately (i.e. when building the documentation). Does anybody have similar problems and ways of solving them? What do you think?"}
{"url":"/post/cloud-cqrs-lifehacks-from-lokad-part-2/","title":"Cloud CQRS Lifehacks From Lokad - Part 2","text":"Cloud CQRS Lifehacks From Lokad - Part 2 Let's talk about some more practical lessons learned at Lokad, while running and evolving Cloud CQRS system deployed at Windows Azure. In the previous Cloud CQRS Lifehacks article we've talked about: \"Real-time\" CQRS Server Debugger Immediate Email Error Notifications Detailed Error Logs Domain Log History Excel + Domain Log Exploring your own domain In this article I would like to continue the topic and focus on the recent tricks that helped to improve performance, scaling, stability and development efficiency of our systems. Publish Performance Stats in Domain Events This is one of the best things I've done recently: Putting as much performance statistics into the domain events as possible (YAGNI does not apply here). Making these stats human-readable within the domain log . By the way, note how we are using human-readable translations of digits as well. \"5 minutes\" might be less precise than \"00:05:17.1803\", but it surely is much easier to read. This saves a few brain CPU cycles along the way and really helps in the long run. These performance stats are available within seconds after being processed in the cloud. They are captured in the code manually via a few simple helper classes (no voodoo magic here): using (stats.MeasureMs(SyncStat.UploadDatasetPart_UploadMs)) { UpsertTimeSeriesParallel(client, task.Dataset, history, false); } using (stats.MeasureMs(SyncStat.UploadDatasetPart_CommitMs)) { operation.MemorizeDatasetInput(history); operation.Commit(); } These stats aren't actually human-readable in the real world, since they are encoded with ProtoBuf-net serializer (fastest and most compact .NET binary serializer). However, for the display purposes I'm using Json formatter to make message contents look readable. This required defining a few custom Json Converters for classes that do not look nice in plain JSON, but this essentially was it. Use Consistent Color Coding It is really amazing, how a simple color coding can speed up data comprehension by a human brain. In essence visual elements (when they are not excessive) serve as mental hints or some sort of the index; they allow our conscience to skip actual reading and see high-level patterns. My life became a little bit simpler after color coding was added to the domain log. These colors directly map to the legend from my previous CQRS articles like: Domain-Driven Design, Event Sourcing, Rx and Marble Diagrams Theory of CQRS Command Handlers: Sagas, ARs and Event Subscriptions The legend is: Real-time domain log monitor follows the same pattern as well: Essentially with this approach, I do not need to read names of the commands (and recall their intents) in order to see the bigger picture. Optimize Management UI for the Touch This was unexpected, but recently I had to do a bit of monitoring via an iPad connected to a work-station via RDP. I don't have a habit of carrying a laptop around on weekends, while iPad is rather lightweight and is usually within the reach. However, native desktop interfaces are not necessarily a good fit for working with via Tablet devices. All this can add a bit of friction to the experience: context menus; keyboard shortcuts; unnecessary text-based controls blocking operations. You can really feel the friction as you struggle with your own UI on iPad. Natural urge is to reduce this friction in the areas, where you spend a lot of time. 
Surprisingly enough, while the design gets "optimized" for touch interfaces, it also becomes more clear and explicit, serving as an explanation of itself.

Monitor Stale Messages Unexpected things can happen in software systems. In distributed cloud deployments they always do happen. Actively looking out for them and reacting immediately is the only way to handle the unpredictable. In addition to reporting poison messages directly to the email inbox, I've recently added another life-saver (as it turns out) feature to our systems. There is a scheduled task that checks all queues every few minutes. When it finds any with stale or delayed messages (essentially, messages that were not processed within the designated amount of time), an email is dispatched to support. A message can be stale for a number of reasons: full queues, threading issues, deadlocks, low performance etc. However, if you know that such issues will be detected and reported fast, a certain amount of peace and tranquility will come to you. Of course, all these lifehacks are still separate hacks that do not compose a true self-healing and self-tuning Cloud CQRS system , but we will get there eventually. Stay tuned! PS: The next article in the series is: Helpful Domain Logs of CQRS . It takes the color-coding idea one step further."}
{"url":"/post/command-handlers-without-2pc-and-with-various-levels-of-reli/","title":"Command Handlers without 2PC and with Various Levels of Reliability","text":"Command Handlers without 2PC and with Various Levels of Reliability In the last post we've tried to start reasoning out CQRS theory by making first parallels between Rx for .NET, almost-infinitely scalable systems as advocated by Pat Helland . I've been thinking a lot about this recently and so far the logic seems to hold. Let's continue with talking about persistence and message dispatch in command handlers. We will do that without worrying too much what exatly resides within the command handler. Valid options would be: aggregate root with a state repository; aggregate root with event sourcing; saga. Note that we are not talking about relational databases for the persistence in this scenario. Working with RDB would require 2PC, distributed transactions and all sorts of complex stuff. We don't need this. I would like to express my thanks to the DDD/CQRS group for helping to clarify this logic and especially Jonathan Oliver . Command Handler Command handler is a process that resides within a partition and knows how to apply incoming commands (delivered via the durable message queue) to entities that reside within the partition. Commands always have a recipient (we need to be able to route them to the partition), so applying commands is a matter of: loading the entity uniquely identified by it's identity which is mentioned in the command (from memory cache, snapshot, persisted state or event stream); applying the command to the entity (by calling a method); in case of success we persist changes to the store and acknowledge command to the incoming durable message queue (actually deleting it). Change persistence within the partition could happen in form of: event sourcing - we save events that describe the change and allow to reproduce it; essentially that's domain-aware transaction log that keeps all the information. state change - we just discard the intermediate information and persist only the latest state to the store, appending it. In both cases we need to persist information about the incoming command(s) that caused this transaction. This way we will be able to ensure command idempotency . The latter allows us to handle rare but possible scenarios when: environment duplicates some messages (this can happen in the cloud or partitioned systems). process crashes between the moment we've committed the change and ACKed the incoming message. Obviously appending changes to the disk is not the only outcome of the command handler. It might need to send messages (i.e.: aggregate publishes events and saga sends commands). This is done by the async process that runs in parallel and checks the storage for any newly committed changes. If it discovers them, then the associated messages are also published. In case of the event sourcing we simply publish all events, in case of the state persistence we just look for the explicit messages. Anyway message dispatcher within the partition will know how look for messages to send. By the way, incoming command message might actually include multiple commands. This could be used in CQRS scenarios when we want to achieve all fail or succeed together behavior (command composition is a lazy way to avoid writing large commands for each combination that might need to be atomic). In this case all caused changes are perceived as a single unit of work and are committed together as one. 
If the changes can't be persisted in a single write operation (i.e.: they are too big or the storage does not handle atomic updates), you can first write all the changes in as many passes as you need. The write can then be finished with the SHA1 signature of the changes. This way the transaction is considered completed (by the event dispatcher or anybody else who's reading it) if the signature is correct. Otherwise we have an abandoned transaction that just needs to be cleaned up (or ignored).
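As a toy illustration of that signed-write idea (Go, with an in-memory buffer standing in for the store; not a production log format):

package main

import (
    "bytes"
    "crypto/sha1"
    "fmt"
)

// appendChange writes a change in as many passes as needed, then finishes
// the transaction with the SHA1 of everything written so far.
func appendChange(log *bytes.Buffer, chunks [][]byte) {
    h := sha1.New()
    for _, c := range chunks {
        log.Write(c) // each pass may be a separate write operation
        h.Write(c)
    }
    log.Write(h.Sum(nil)) // the signature marks the transaction as complete
}

// verify re-hashes the payload; a mismatch means an abandoned transaction.
func verify(record []byte) bool {
    if len(record) < sha1.Size {
        return false
    }
    payload, sig := record[:len(record)-sha1.Size], record[len(record)-sha1.Size:]
    sum := sha1.Sum(payload)
    return bytes.Equal(sum[:], sig)
}

func main() {
    var log bytes.Buffer
    appendChange(&log, [][]byte{[]byte("CustomerLocked"), []byte("{id:42}")})
    fmt.Println("complete transaction:", verify(log.Bytes())) // true
    log.Truncate(log.Len() - 3)                               // simulate a crash mid-write
    fmt.Println("after torn write:", verify(log.Bytes()))     // false
}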
Reliability and Redundancy We can have two levels of reliability and redundancy in this scenario (we are talking about the simplest scenario, where changes are written to disk and we need to handle reliability and redundancy explicitly). As long as we stay within the partition (a single machine or a small cluster), we can have 100% logical redundancy. Basically, we can persist a change to multiple locations at once. The transaction will be considered completed if the change is valid in all locations. And since all locations are within the partition, reads are guaranteed to be fast and consistent. One can implement this manually or just trust RAID to do it. If one of the writes fails, then the entire transaction is considered failed. This approach allows us to have a variable level of reliability in the persistence (we pick that level). In fact, one should be able to move critical processes to highly reliable partitions (dedicated hardware and data centers) and keep non-critical entities in the commodity environment (i.e.: in the cloud). The second level of reliability comes at the moment we cross the partition boundaries . We would eventually want to do this, since data is safer when stored in multiple locations (earthquakes, floods, political issues etc). However, since we cross boundaries, the reliability would be eventual, giving us a small time window (from a few milliseconds up to seconds or longer) during which hot data will reside only within a single partition. Basically, RAID arrays and redundant writes are just needed to reduce the risk of fresh data corruption within this time window and increase the chance of successfully replicating it somewhere else. Replication comes here in the form of yet another simple async process that merely picks up our changes and sends them across the partition boundaries to long-term storage (preferably there will be multiple locations). If our recipient is not available at the moment (network failure, repartitioning in process, or 2012), then all we can do is raise an alert and retry shipping the changes outside of the partition boundary till we succeed (the infrastructure might also provide alternate shipping locations). By the way, the very process of shipping changes across the partition boundaries could also be used to replicate them across the globe, feeding various read models (they are eventually consistent anyway). Basically, we could combine these two simple approaches (writing to multiple locations within the partition and shipping changes outside of the partition) to achieve the desired degree of reliability and fault tolerance. Besides, we can always position various partitions (based on their business importance) in environments with various reliability settings to achieve the maximum ROI and handle risks in a cost-effective manner. There is also a third approach , which involves sending data in a transactional manner across the partition boundaries on the actual write. This one is slightly more complex and would probably require using some storage engine for the purpose. However, in essence, this would work like this: the storage sequentially writes the value to multiple locations within various partitions, sending an ACK only if the last write is successful (or there is a quorum, or whatever tech is used to figure out the decision across the consistency boundaries). I doubt this is much better than the 2-step redundancy approach described earlier. This simplicity and the ability to tune reliability settings (and failure probabilities) on a per-partition level is what I like about the approach. Simplicity allows us to scale and evolve; variable-level reliability allows us to be really efficient and cost-effective. NB: please keep in mind that for me this is just a theory at the moment (logically derived from the properties of distributed and scalable environments and some cloud experience on Lokad projects). I haven't tried this specific approach in production (this might be fixed soon, though). The whole purpose of this exercise is to figure out and understand the underlying theory, so that future systems will be developed without architectural decisions that block almost-infinite scaling by enforcing certain constraints. All feedback is welcome! Now if only somebody could help me figure out how to distribute events to subscribers in a partitioned environment without feeding everything to a single message bus (which complicates things and creates a single point of failure). I just can't figure out the logical transition from the published event (going outside of the partition) via the subscription and into the command that has a recipient and goes inside the partition."}
{"url":"/post/concurrency-programming-is-like-thinking-in-5-dimensions/","title":"Concurrency Programming is Like Thinking in 5 Dimensions","text":"Concurrency Programming is Like Thinking in 5 Dimensions Concurrency is about thinking in 5 dimensions at once. Let's consider a single code class to be a 2D space. In the end, it's flat and fits on one display. File organization (either flat or with multiple folders) add another dimension - third one. These are static dimensions, just like the height, width and depth in the real world. Fourth dimension, time, has it's own analogue in the code world - execution of the code within a single thread. This is were the code comes to life and shines in its dynamics. Concurrency brings another dimension to the mix - the fifth one, since you have to deal with multiple threads being executed. They are just like parallel dimensions in the real world. That brings us somewhat close to 5D. So delivering reliable concurrent code is about thinking in 5 dimensions: it takes a bit of time and exercise to get used to to it; it is rather hard to visualize; you may want to keep the basic logical pieces as simple as possible in order to manage and evolve the entire system efficiently in your head. Adding distributed systems could pump the dimensionality count up a little bit as well, increasing the complexity and excitement of the delivery. NB: thinking about an entity (software system, in our case) from completely different angles helps to find simpler solutions to existing problems and challenges in the code."}
{"url":"/post/const-vs-readonly-vs-static-readonly-in-c-net/","title":"Const vs. readonly vs. static readonly in C# .NET","text":"Const vs. readonly vs. static readonly in C# .NET Here are the differences between C# .NET const , readonly and static readonly fields. Constants : Static by default Must have compilation-time value (i.e.: you can have \"A\"+\"B\" but cannot have method calls) Can be used in attributes Are copied into every assembly that uses them (every assembly gets a local copy of values) Could be declared within functions The compiler performs some optimization by not declaring any stack space for the field Readonly instance fields : Are evaluated when instance is created Must have set value by the time constructor exits Static readonly fields : Are evaluated when code execution hits class reference (i.e.: new instance is created or static method is executed) Must have evaluated value by the time static constructor is done You really do not want to put ThreadStaticAttribute on these (since static constructor will be executed in one thread only and it will set value for its thread; all other threads will have this value uninitialized) Class that has only constant or readonly instance fields is considered to be prepared for the concurrency challenges of parallel computing, since it has no mutable state. BTW, there is a simple unit-test to enforce that your classes are immutable . Additionally you might be interested in some tips on writing the event-handling code in .NET C# from the theory category."}
{"url":"/post/continuous-learning-at-lokad/","title":"Continuous Learning at Lokad","text":"Continuous Learning at Lokad I've been asked a few times recently about how Lokad approaches development from the high-level perspective. After giving it some thought, I've assembled a small article that provides a bit more detail on the subject. I hope you find it interesting or at least amusing. While reading, please keep in mind that these are merely personal thoughts or a person who has just started to learn the trade. If you need some solid professional advice or a guidance on the matters at hand - please go to Microsoft, Oracle or any consulting company. Lokad , once again, is a tiny SaaS company focused on providing forecasting services, while using some bleeding edge tech to deliver them in an efficient and cost-effective way (as in \" much cheaper and smarter, than any competitor would be able to come up with \", \" scalable beyond the wildest dreams of any customer \" and \" don't trust just my words on that and try-it-yourself-for-free \"). Heavy research background helps us a lot here to stay efficient and move forward. This works because the majority of people of the company got involved into PhDs or secondary education at some point. PhD is a nice practice of continuous self-guided research and study process. Continuous learning is actually what we are doing on everyday process here (this is in addition to work done by the interns and people who are actually working on their PhD within the walls of the company). You'd be amazed to know, how much you would learn from you work in a continuous manner, if you perceived every task at hand as an opportunity to learn and improve . This approach is reflected in our formal practices (i.e. writing weekly internal blog posts on challenges encountered, things accomplished and planned for the next week). This is also reflected in informally adopted practices as well. Company culture encourages blogging (internal and public) which is a nice way to structure your own thoughts and do the planning (for example, blogs of Joannes , Matthias and Christoph ). Talks and presentations are also encouraged along with sharing non-core knowledge with the community in various forms (i.e. See our open source strategy ). Even development projects are shaped around vigorous learning approach as well. In development we try our best to avoid heavy planning and management approach (waterfall, PM, Agile, XP, Scrum or whatever other letter combination that is being sold by consultants these days). The reason being - we don't really care about the process or formalities of dev planning and management (as it is usually advocated by the thought leaders). The only trustworthy guide for us here is the real world - being able to deliver services and software that do the job, survive the beating and stress in the cloud environment, help to move the company move forward. They have to require as few man-hours as possible, of course. No matter how good development practice is, it does not guarantee that the project will have a financial success in the real world. However, if the project did succeed and is capable of evolving efficiently, nobody would really care about the \"properness\" of the development practice employed to achieve that success (aside from ivory-tower architects and managers). 
These are the additional reasons (fine print for the previous statements) why our ignorance of any "proper formalized development planning" actually works out for us: we are still very disciplined in what we do at Lokad and how we approach planning and management; it just happens that we use approaches that are closer to the scientific fields, with a lot of experimenting and an unpredictable future; projects are deliberately broken down and structured to make this approach work; we avoid development and technology choices that would conflict with this approach; multi-hat roles and a lot of interpersonal communication help to soften the rough edges as needed. Let's dive into a bit more detail. Projects and systems are broken down into smaller tactical apps (Joannes wrote a bit more about the tactical approach ). Actual development is preceded by a series of rapid prototypes (spikes), if there are some unanswered technical questions. These prototypes stay forever in the trunk of the originating project for future reference. Usually the first version of the system is just a prototype that has gone too far. We don't involve any external consulting companies whenever we need to solve an unknown dev problem. First, this would be hideously expensive and slow, compared to our own standards; second, I doubt it would bring us anywhere (i.e. given what is preached by consultants in enterprise and cloud systems, for example). Instead, the usual scientific research process is used, tapping into the infinite practical knowledge shared by the community (all, who have kindly and patiently answered our numerous questions on forums and emails - I thank you!). Results are structured, implemented and shared back with the community where possible, in the form of open source projects, blog posts and published articles. The latter is not only a way to give something back, but also a way to gain additional insights, feedback, ideas and inspiration to move forward. First project iterations are pushed to production as soon as possible (often with few or no tests). This helps us to get real-world feedback and correct the project at the stage where correcting the project or business idea (or even discarding them altogether) is at its cheapest - in the very beginning. After the first release, new discoveries and insights are incrementally incorporated into the codebase in the order of importance. This importance is prioritized by the business in the form of return on investment (ROI) in the mid-term (adjusted according to the priority of the project and its positioning). Such factors as "reducing development friction" or "refactoring" play at the same table as "business features", since they all determine the capabilities of the project to survive, adapt and evolve with the minimum effort possible. I've seen some remarks by thought leaders along the lines of " who put developers into the position to decide what technology they should use? " (the context was about event sourcing or doing any major refactorings at the cost of short-term feature development). Well, at Lokad we have a clear separation of concerns. Business decides and determines what should be done ; it comes up with the challenges and problems; it picks the direction (and is also encouraged to share wild dreams and things that may seem impossible to do these days). However, the question of " how it should be done to achieve the specified result with the maximum efficiency possible " is handled by development.
It is the responsibility of development to find the fastest and cheapest route towards the business objective , suggesting alternative routes if needed. Hence, such things as introducing new tech, if it really proves its worth, are never a problem. Here are some examples that quickly made their way in, because they were a tool for solving some problem at the time: elastic cloud deployments; distributed version control systems; different IOC containers; new continuous integration platforms; new versions of SDKs or frameworks etc. These days we are moving into distributed cloud engines, event sourcing, new types of UI. A lot more cool stuff is considered down the road as well. However, development does not run anarchic in its work. We don't pick every single technology or jump at every opportunity. As you recall, it is the task of the dev to solve the business problem in the most efficient way . "Most efficient" should, obviously, apply not only at the scope of a single project, but cover the whole project portfolio of a company, considering its well-being, managing the risks and optimizing returns over both the present and the foreseeable future. This forces Lokad development to be self-disciplined and self-constraining. This also makes it an interesting challenge to balance business problems with the technical and resource constraints , overlapping them over the major paths that the future might take. Probabilistic planning and balancing of resource pools and tech choices can send your head spinning sometimes (try visualizing the effect of a choice over the possible scenarios of the future, weighted by their mathematical expectation of ROI and colored by the distance from the currently selected course ), but it allows development to be really efficient with resources and time , while helping to set what is believed to be the best course for the company. Here are some examples of the self-inflicted development constraints that would seem to be illogical choices if considered within the scope of a single project only. However, they are really justified (as we want to believe) when the entire business is factored in along with the major future scenarios and our real-world constraints: we still use svn and mercurial for version control (we know that svn can't even be compared to a dvcs, and that git is more flexible than mercurial); one of the current projects under work still uses ASP.NET web forms for the UI (although it's a pure drain of development time and effort); we continuously push effort into open source projects (well, this is easily financially justified); we make the effort to make systems at least 100 times more scalable than the current actual stress (almost-infinite scaling is obviously the ultimate goal); we gradually migrate from SQL to NoSQL and event sourcing (easily justifiable, but for some reason it's so hard to do in some other companies); sometimes developers are assigned to projects and tasks that hold no real business value (aside from letting them learn and gain experience in preparation for projects planned down the road); a lot of the time we refrain from using some tech approaches and solutions that would make our life easier (these tend to be the very same things that led us to a big bowl of complex and expensive spaghetti code in the past). By the way, below are some widely accepted names that we are either moving away from or ignoring altogether. Please keep in mind that this is a highly opinionated list .
We were deeply in love with a lot of these items before (and still share a deep appreciation for their authors). But we eventually discovered that in our hands they do a lot more damage than good (this is probably caused by my lack of technical expertise and the lazy approach of trying to do as little work as possible). NHibernate or any ORM; all mocking frameworks; anything more complex than basic NUnit for the unit testing; anything that requires an SQL database to work with; Azure compute and storage emulators for the cloud development, Azure bus and caching; Reusable frameworks (getting rid of the Lokad.Shared libraries was a major pain) Any IOC container that is more complex than Autofac (although ideally we would boil down to lambda-based IoCs); Silverlight, WPF or any other successor that would show up in Windows 8; MSMQ, or anything that has to do with DTC and System.Transactions; anything that has to do with Aspect-oriented programming and code weaving. In addition to selective ignorance of widely used technologies, we try to find our own way around architectures and development principles as well. Modern enterprise development by the books (i.e.: as in N-Layer architectures or patterns and practices ) might work for large enterprises that can dedicate multiple developers (even dozens) to work on a single project for multiple months (even years). We don't have this luxury and should seek something dead-simple, dead-cheap and extremely efficient (for examples, see the CQRS and xLim series of articles within this blog). This process of trial and error does not end with development alone and covers the other areas of running a business - marketing, sales, customer support, product planning etc. The most interesting pieces are reflected in articles of this blog, as well as the blogs of the other members of the Lokad team . As you have probably guessed, I find deep satisfaction and pride in working in such an interesting environment. Man should be challenged, and all challenges should be answered. By the way, we are on the lookout for young talents and researchers who are interested in learning and working in such a productive and efficient environment, continuously learning and pushing the state of the art. Not just for the money, but also because it is fun and helps to make the world a better place. Internship opportunities are offered in Paris, France (developers, statisticians and sales/marketing) and Ufa, Russia (strong focus on dev, but other options might also apply). And there is always a possibility of negotiating something more permanent if we get along really well. Even if you just need advice or have a question in the field of forecasting, organizing and running start-ups, or efficient development (CQRS/DDD/ES/Cloud etc), please don't hesitate to ask a question in our community or drop us a line in private. We will answer."}
{"url":"/post/contributing-to-open-source-is-so-much-better-than-taking-a/","title":"Contributing To Open Source Is So Much Better Than Taking A Class","text":"Contributing To Open Source Is So Much Better Than Taking A Class Let's say you want to really learn cloud computing, unit testing, inversion of control or CQRS. You can either: pay for a course, read some books, do exercises and get a certificate at the end; participate in an open source project or two in your field, getting real-world practical knowledge and being forever recorded as a contributor; nothing to pay and skills will be up-to-date. Guess which option will score more victory points, when you apply for a job or start building your own ISV startup? My personal biased bet will be on Open Source , so let's talk about it. BTW, my many thanks to Muneeb for sending a question on the topic. Also thanks to Jonathan Oliver for helping to clarify the title. Open Source projects are something we all know about - projects where people collaborate to develop a product, providing it at no cost along with documentation and source code . In .NET community you probably came across a project like NUnit, Autofac, NAnt or NHibernate. These are open source. Their analogues on the other platforms are usually open source as well. You can get the source codes, build them and run. Certain restrictions may apply , depending on the licensing terms. However generally you can do pretty much what you want as long as the credit is given and fair sense is applied. We can clearly see the commercial benefit of Open Source Software (OSS) in various projects - you don't need to pay licensing costs and usually have faster development model. However, there is much more in OSS! In my previous post of 10 Steps To Become Better .NET Developer there was this seemingly simple line: Contribute to Open Source Projects of your choice. Actually this was one of the most important items in the self-improvement list. Let's talk how exactly open source projects can help you to become a better developer. If you would like to recommend a .NET Open Source , or need help in finding one that fits your field of interest - please read towards the end of the article and leave a comment . Learn from Open Source First of all, you can benefit from the Open Source by just downloading the source code and learning how smart people get things done in software . I loved reading through and being inspired by NAnt , OmeaReader of JetBrains , MindTouch and StructureMap , when I was just starting with .NET. Such an activity helps to: practically see how people organize code and do all sorts of small things: name classes and variables, throw exceptions, write sanity checks and document code; learn about testing projects, building and integrating them - this is real-world stuff; see various ways to organize projects, supply documentation and version dependencies; practically understand what steps are actually involved in delivering a project. Of course, there are no perfect projects or developers. After checking out a few similar projects you will notice that they tend to have different strengths and focuses. Some might have better codebase, the other - better documentation, third - more efficient development environment and faster feature deliveries. By comparing them you will see tradeoffs made by people running them. They had to focus on the most important things (in their opinion) while using limited resources at hand. Resources are usually limited by their own free time and motivation. 
And since the Open Source project is active and being used, you can already say that they were successful in making these choices and executing them. Hence, you can learn a lot. Besides, if you focus on a set of open source projects in a certain niche, you will be able to learn the technology and the primary principles behind. You will practically see the choices and trade-offs being made in the code. For example , less than a year ago I knew little about messaging and nothing about service buses. Yet, when there was a need to build a service bus for Windows Azure with a specific set of requirements, it was a rather straightforward process. I simply had to: Learn the theory and read a lot of articles; Find open source projects dealing with the similar problems (NServiceBus, Mass Transit and Rhino Queues), reading through them; Start working on Lokad.CQRS , while adapting the most important ideas from these projects to the specifics of Windows Azure; whenever I had a problem at hand, there were already 2-3 options of solving it. The end result is that now I can say a couple of words about using and building service buses in cloud computing environments and on-premises (esp. in CQRS architectures and Windows Azure). Plus this brings real business value at Lokad. As you can see Open Source projects can give you a lot merely by providing real-world something to learn from. However you can gain a lot more by actively participating in them. Participate in Open Source Participating in Open Source project allows you to: actually practice your development skills: applying design patterns, writing unit tests, managing complexity and doing all sorts of other things essential for any project; get feedback on what you do and potentially some helpful guidance - it will come from the people involved in the project and hence having practical experience in the field; this is far more better than spending time in university (for which you will need to pay) and listening to teachers that don't necessarily have up-to-date practical knowledge and experience; learn and practice things like: issue trackers, wikis, version control systems and change management, integration tests; you will also learn how to use them in order to work with the other people spread all across the globe. How Can You Get Involved in OSS project? First you need to pick projects in the field that really interests you . The \"really interests\" part is extremely important, since personal motivation can make a huge difference between personal endeavor pushed to success and something we just spend time on and eventually put aside. Second, you learn as much as you can about it : read documentation (and especially the FAQ); download the source code, try building it and running; walk through the samples, if there are any; sometimes other people will blog about the project - google these articles up. Already at this point you probably have some questions or came across an odd behavior. Share these discoveries ! ask questions in the community; blog about successful and not-so-successful experience (while doing that - please try to be more professional, than I tend to be, say, when writing about Windows Azure); try to figure the problems on your own, propose solutions and submit patches. 
I can already tell you that almost every single open source project (i.e.: .NET framework or tool) would love to have: better documentation and tutorials; samples; unit tests and small bug fixes; small features that some people want but don't have the skills to do; people using their projects and providing detailed feedback; So you can check out issue trackers (they could also be named as tickets, bugs or feature requests) in a project. They will list problems and feature requests that project owners would love to be done. Yet for some reason they are not done; the reason being - lack of time to do that. So if you step up for the challenge and offer your help - you'll be a hero of the day. If you are just a beginning developer and don't see something you can handle, just go the community and say the words: \"Hi, I love your project and would like to contribute. Tickets are too complex for me at the at the moment, but I really want to learn this field. How can I help?\". I would be surprised if you'll be able to get away with less than few things to do and recommendations. The truth is: owners of these projects run them in their spare time and will gladly accept any help you can offer . Besides that, they are passionate about their projects and just love to talk, teach, help and share (how many university teachers do love answering questions on evenings and weekends?). Community members usually are no better either - they just love talking about the project, field and improving it, while doing all sorts of things to make this happen. Since there always are some really smart people out there (with a practical experience), this is way better than your average class for learning, sharing and having some fun. That's the nature of the open source. So if you would love to learn some practical skills (and understand that universities might just serve some outdated theory), feel free to check OSS out. For example , in .NET you can start with: For Inversion of Control: Autofac IoC Container ( 22 tasks in the issue tracker) and Windsor Container (more than 50 unresolved issues ) For Cloud Computing and Windows Azure: Lokad.Cloud for Windows Azure ( 23 tasks in the issue tracker) For Continuous Integration: Cruise Control .NET (more than 200 open issues ) For Unit Testing: AutoTest.NET (just 3 issues , but they will have more, plus participating there can get you a beta of Mighty Moose ) or NUnit (38 open bugs in 3.0 version) For service buses: NServiceBus and MassTransit NB: This list is by no means complete . It just serves the purpose of giving the overall perspective. If you want to recommend an open source project to participate or look for one in a specific field - please, read to the end of the article and leave a comment . After participating in Open Source projects, you can also take the next step - start your own . There are always places in .NET community and outside, where the tooling is less than perfect or there only are expensive options. By taking this step, you will learn an additional set of skills which a lot of companies look for in resumes (we definitely do at Lokad ). This will also will provide you with additional background for managing development projects at companies or starting your own business. In fact, an open source project you are passionate about, can become the foundation for the business. But that's a story for another post. Go for it! 
So if you liked what you heard - I encourage you to go and take a closer look at open source projects in the fields you are passionate about. You could learn a lot of practical and inspiring things, get a good line for the resume (actually companies tend to look for active contributors in the open source projects that they rely on) and give something back to the community. The latter will give you this warm fuzzy feeling of doing good stuff and being part of some bigger effort. I also strongly encourage you to comment below if you : look for an open source project in a specific field with a friendly and welcoming community willing to help you to learn and contribute back ; please indicate field of study/interest . run an open source project in some field (or know such projects) that would love to get more help and interest; please indicate focus of the project and give a few samples of how a novice can help . Contact info would also help. You can also retweet this article and share it with people and communities to encourage diving into open source projects (you might have one or two specific ones in your mind already) and getting real-world experience while helping them to move forward. Links for sharing are below. I'll keep this article updated. PS: As you can see, an innocent-looking item of \"Contribute to Open Source Projects of your choice\" in 10 Steps To Become Better .NET Developer turns out to be far more exciting and valuable than it might look on the surface. I'm planning to dive in a few more topics from this list later as well. Stay tuned !"}
{"url":"/post/cqrs-architecture-and-definitions/","title":"CQRS Architecture and Definitions","text":"CQRS Architecture and Definitions CQRS principle, in it's core form, merely introduces separation of reads from writes. This simple approach brings following benefits: denormalized query persistence is optimized for the reads (which usually make up the most of the persistence IO) resulting in better performance and user experience; we can optimize our read side for the needs of the UI (i.e.: fetching dashboard for the user in a single query) which will result in better development experience and less risk of breaking something on the write side. read side can be put on some cloud storage, which is inherently optimized for the reads, could be partitioned, replicated and even distributed via CDN; by offloading data reads to synchronous queries we automatically increase the performance of the write side - now it has lower stress and lower probability of hitting a deadlock (which you should still account for). Deeper introduction and more learning materials are available for study in CQRS Starting Point What about things that you hear in any CQRS talk: commands, events, DDD, eventual consistency and almost-infinite scalability? These are distinct architectural patterns with their own benefits and peculiarities. These patterns play so nicely with CQRS principle (separation of reads from the writes), that they are often perceived as one thing . So when we say \"CQRS\" this often means: \"CQRS Principle\" + \"Message-driven architecture with commands, events and queries\" + \"Domain-Driven Design Methodology\". This combination is one of the most frequent variations of \"CQRS Architecture\" (sometimes Event Sourcing is included there by default as well). Success of this variation is the reason why there is so much buzz and hype around the original CQRS term. So here's what we have here: CQRS - buzz-word that could mean a lot of things; also name of the \"cult\" . CQRS Principle - principle that dictates separation of reads from writes in your system. CQRS Architectures - specific architectural designs based upon the CQRS Principle and a few other time-proven methodologies and approaches. They usually come with a clear evolution path enabling migration of live application to more elaborate design, if needed. DDDD (Distributed Domain-Driven Design) - one of the CQRS Architectures , as presented by Greg Young. It is based upon \"CQRS Principle\" + \"DDD\" + \"Message-based architecture\" + \"Event Sourcing\". Documents on the CQRS Info site cover this in greater detail. Obviously, multiple architectural designs could be established on top of CQRS Principle, DDDD is just one of them (see my CQRS Roadmap for some other possibilities). I believe, when the CQRS book finally comes out, a few designs will be covered there. Meanwhile, here are the options that I'm aware of: DDDD - CQRS Principle + DDD + Message-based architecture + Event Sourcing Cloud CQRS Architecture - CQRS Principle + DDD + Message-based architecture + Elastic Scaling + NoSQL persistence on top of cloud storage. CQRS with Relational DB - CQRS Principle + Service Bus (NServiceBus/MassTransit) + ORM (NHibernate). Each category has it's own variations and possibilities as well. For example, while evolving your legacy application towards the DDDD Architecture, you could have at some point Event Store and RDB+ORM persistence, where the latter could be dropped in some areas of the system a bit later on. 
In fact, an enterprise system built with a real-life approach to CQRS would probably contain a mix of various architectures. We would be picking whatever makes the most sense for each specific business case and subsystem (basically optimizing for the highest mid/long-term business value for the development effort invested). Better categorization of real-life CQRS architectures, their peculiarities, challenges and recommended evolution paths would probably require a bit more research. This could include gathering feedback from people practicing various CQRS flavors and structuring it on top of the theory with the formal methodologies used in universities during PhD studies (for both learning and presenting). Maybe some time later I'll be up to this challenge. Meanwhile, you can stay tuned to this blog and also the RSS feed of CQRSInfo.com (it will get more lively a bit later)."}
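For the record, the core principle itself fits into a few lines of C#; the interfaces below are an illustrative sketch, not part of any specific framework:

using System;
using System.Collections.Generic;

// Write side: commands mutate state and return nothing to display.
public interface IOrderWriteService
{
    void PlaceOrder(Guid orderId, Guid customerId, IList<OrderLine> lines);
    void CancelOrder(Guid orderId, string reason);
}

// Read side: queries hit a denormalized store shaped for the UI,
// e.g. a customer dashboard fetched in a single lookup.
public interface IOrderReadService
{
    CustomerDashboardView GetDashboard(Guid customerId);
}

public class OrderLine
{
    public string Sku;
    public int Quantity;
}

public class CustomerDashboardView
{
    // precomputed fields the dashboard needs, nothing more
    public int OpenOrders;
    public decimal OutstandingBalance;
}

Everything else discussed above (messaging, DDD, event sourcing) layers on top of this split without changing it.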
{"url":"/post/cqrs-lifehacks-from-lokad-production/","title":"CQRS Lifehacks From Lokad Production","text":"CQRS Lifehacks From Lokad Production So far I've been talking purely about CQRS theory and attempts to settle it down in a logical way for projects of various scalability and reliability requirements. Here's what the recent posts were talking about (check CQRS and xLim sections for even older articles): The Best Way to Learn CQRS, DDD and Event Sourcing Scenario-based Unit Tests for DDD with Event Sourcing Domain-Driven Design, Event Sourcing, Rx and Marble Diagrams Command Handlers without 2PC and with Various Levels of Reliability Theory of CQRS Command Handlers: Sagas, ARs and Event Subscriptions Now it's time to switch back to the real world and this wonderful thing called production. I've seen a lot of CQRS/DDD/ES theory articles and abstract snippets out there (and attempted to contribute to this myself), but I can't recall any posts describing real-world production systems, their problems, challenges and various life-hacks. By the way, if you know such articles or happen to share your experience, please drop a comment or twit me. I'll be sure to include reference to such material, so that everybody could benefit. Real-world systems are rarely pretty, they tend to contain a lot of hacks and miss potentially good improvements. Yet, they are real and they keep on surviving the best test out there - \"natural selection\" or \"survival of the fittest\". Theory can only prove itself to be correct by being implemented in such system and surviving long enough to evolve further. I'll start by sharing a few recent CQRS-oriented development discoveries that helped me a lot on the battle-field within the last two weeks. There will be less of nice diagrams and more of screenshots exposing ugly UIs that I hack for myself. I'm probably reinventing the wheel here by trying something done already done by the other people. If you have something to add from your experience, please - share in the comments or in your blog. This will benefit the community immensely. \"Real-time\" CQRS Server Debugger Essentially this is just a mere console that aggregates and gets all messages (events and commands alike) from the production system. Actual system could be deployed somewhere in the cloud and have access is protected by the HTTPS+DevKey authorization. Since we are reading the past history (with the slight delay) from the cloud persistence (inherently scalable and optimized for reads), this has little impact on the performance (close to none). At the same time it provides almost real-time insight into distributed server-side processes as needed. Each line is basically a string representation of a single message. Some messages get \"ToString()\" overloads to make them simpler to read. Other's just print their name. public override string ToString() { return string.Format(\"Send mail ('{0}' to {1})\", Subject, To); } For those of you that have been working with Windows Azure and using Trace display of the Azure Dev Farbic, this is almost the same experience. But it works with the production deployments in the cloud and I use it a lot more than IntelliTrace in Azure. Error Notifications One of the crucial differences between ivory-tower theoretical architectures (which might look good in spikes and lab tests) and abused production deployments is the simple fact: unexpected problems happen in real world . Since they are unexpected , we can't avoid them 100%. 
Yet we can do our best to learn about problems as soon as possible, have the information to fix and be able to deploy the fix as soon as possible. Wiring email notifications to the poison queues is the simplest way to learn about errors fast. Such emails tend to help stabilize system really fast. This works especially well with the fragile integration points or external 3rd party systems that start behaving badly (timing out or returning unexpected results). If you start investigating issue as soon as possible, there is a chance to still catch such system at the crime scene. This makes it's easier to isolate the issue and prevent it from happening ever again by adding some smart behavior (i.e.: saga). Error Log Details Once there is information about error, you'd probably would want to fix it fast (preferably before this starts having impact on the customers). Detailed error log, referenced by the email notification, could help. BTW, you might be tempted to send the entire exception details by the email. I strongly recommend to avoid this path, since it could accidentally (just like the recent padding oracle in ASP.NET) expose confidential information outside the system. It's better to provide unique message identifier in the exception notification, while keeping the actual details in the persistence in a secure way. Once you've got exception details from the secure storage, you just need to copy exception stack trace, paste it to ReSharper (\"Explore Stack Trace\" feature) and jump to the failing point in the codebase. Domain Event History Sometimes information about the exception (no matter how detailed it is) is just not enough to solve the mystery of the problem at hand. Full domain message log (which comes native with the CQRS architecture) and append-only persistence are one of the best tools for post-mortem analysis . At Lokad we tend to record all available information into the events, just because it is extremely easy to do. This includes (but is not limited to): performance statistics (time spent, number of records processed); explicit details about the security context; denormalized information from the AR. This yields following benefits at almost no cost: easier to write view denormalizers; we've got perfect audit logs; easier to track performance statistics and tendencies of the system. The latter part is extremely important, since CQRS systems tend to be rather complex and dynamically evolving (just because it is so easy to evolve them without hitting any complexity barriers). This forces the system to encounter various real-world problems and scalability limitations as it rapidly grows from the prototype and into the scaled full-blown solution integrating with dozens of various internal and external systems. As long as we track all information in the events, we could data-mine captured knowledge for hints of problems yet-to-happen. Reporting over domain log will help us here. Excel + Domain Log Microsoft Excel 2010 is one of the best tools for analyzing complex data with just a few lines of code. We can take our domain log, write some denormalizing queries, run them through the history and dump resulting state directly into the Excel spreadsheets for charting, pivoting and looking for trends and potential problems. Since such reports are extremely easy to create and run, this encourages for exploration and experimenting and leads to better understanding the system. 
In the end assumptions about the real-world behavior (in my case they tend to be off-scale, especially when I'm trying to assume bottlenecks and performance impact of some things) are replaced with simple knowledge. For example, in order to mine all history for the report of all mail messages (presented above), one would just need to write a query like: var messagesPerDay = history.Messages .Where(m => m.Content is SendMailCommand) .GroupBy(me => me.GetCreatedUtc().Date, c => (SendMailCommand)c.Content) .Select(x => new { Date = x.Key.ToString(\"yyyy-MM-dd\"), Internal = x.Count(m => m.To.Contains(\"lokad.com\")), Public = x.Count(m => !m.To.Contains(\"lokad.com\")), }); Infrastructure, reflection and some OpenXML will do the rest. Exploring your own domain Sometimes, in order to resolve the issue we would need to get really hacky and send raw messages directly to the system via some endpoint (at least SSL + dev key are recommended to secure such endpoint). Home-grown UI utils, organizing commands and events in a nice way, will help to navigate all the messages and automate sending the right ones. Another use for such functionality is to resend the last failing message from the poison queue back to the command handler, after deploying fixes to production. I used to rely on such functionality a lot while fixing various integration issues. What do you think? What hacks and tools do you use to evolve your systems past new scalability and feature requirements? PS: If you are interested, you might also the next article in the series . It shows how to \"teach\" Visual Studio a new language (in our case - DSL for specifying message contracts) PPS: You can also jump directly to the next part of Lokad Lifehacks ."}
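As a small illustration of the error-notification advice above (persist the full exception securely, email only an opaque identifier), here is a hedged C# sketch; both interfaces and the address are hypothetical:

using System;

public interface IErrorStore { void Save(Guid id, byte[] message, string details); }
public interface IMailSender { void Send(string to, string subject, string body); }

public class PoisonMessageNotifier
{
    readonly IErrorStore _store;
    readonly IMailSender _mail;

    public PoisonMessageNotifier(IErrorStore store, IMailSender mail)
    {
        _store = store;
        _mail = mail;
    }

    public void OnPoisonMessage(byte[] message, Exception error)
    {
        // Full stack trace and message body go to secure storage only.
        var errorId = Guid.NewGuid();
        _store.Save(errorId, message, error.ToString());

        // The email carries just an identifier - never the exception
        // text, which could leak confidential details.
        _mail.Send(
            "ops@example.com",
            "Poison message " + errorId,
            "A message hit the poison queue. Error id: " + errorId);
    }
}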
{"url":"/post/cqrses-road-trip-lessons-learned-at-networkua/","title":"CQRS/ES Road-trip lessons learned at","text":"CQRS/ES Road-trip lessons learned at NetworkUA conference at Kharkiv was amazing! It felt a bit unofficial and practical all way through. Attendees, organizers and final feedback were just great. I couldn't even wish for a better start for our CQRS/ES road-trip . There was a outstanding panel of speakers - Sander Hoogendoorn (no frameworks), Ayende Rahien (no-SQL), Greg Young (DDD/CQRS). Note, how all these talks are focusing on keeping things simple and practical. I was presenting on event-centric development (we had the same setup with Greg as in Vienna, where he was doing crash course which I follow up with small bits of practical experience and applying all that in various projects). Obviously, my talk was far worse than presentations of Greg or Sander (well, that's just 2nd time I'm doing this presentation, as opposed to 50th of Greg :), but certain things come only with the experience. Next time it will be better. Sheer lack of time felt a lot: rushing through slides to fit into 45min talk. I was probably talking too fast. But we did catch up with some Questions and Answers in the corridor afterwards (blocking the entire corridor with a small crowd for quite a bit :) Lessons learned: Prepare core structure of the presentation and work it out (i.e.: enough for 20 minutes of a talk). Identify pieces of presentation that could be gradually added or excluded, depending on the lack or abundance of time. Make sure that you are heard well (not all conferences will have head-set mics). I need to finish the FarleyFile sample on CQRS/ES (or, at least, finally publish it as it is). Of course, in addition to the actual presentation-related skills, there were quite a few things learned during the after-party and unofficial hang-outs. But that's more technical (I'll blog about some things later, as promised). Speakers arrangement was amazing - small group, passionate about development and all agreeing on learning and keeping things simple. This made the final panel quite interesting as well - sharing motivation, inspiration and vision of future. This actually makes an interesting rule for the future, when deciding whether to participate in a conference or not. Feedback after the conference was incredible. People were sharing emotions and thanks to the organizers. Well, I'm still receiving emails and DMs. The most important thing: developers are interested in exploring, trying and learning. So they probably got something out of the conference. If I were to help in organizing European CQRS conference some time later (as discussed in the community) - I know much better now how it should look like to be educational, inspirational and also beneficial for the community's progress. Fun included. Thanks to Akvelon and Sigma Ukraine for providing an amazing atmosphere. To speakers - for providing the company, sharing ideas and experience (the morning is a bit rough after the last after-party :). To Microsoft - for providing a great framework and tooling for exploring all these ideas. In short - I've learned a lot already, making this CQRS/ES road-trip really educational already. And this is just the beginning. We'll have an Event-centric IT Cafe in Kharkov today. Then on Monday - we'll start to move slowly in the direction of Kiev towards Event-Centric weekend (depending on how things work with the developers out at the cafe)."}
{"url":"/post/data-anonymization/","title":"Data Anonymization","text":"Data Anonymization Various organizations, such as hospitals, social agencies or commercial companies, sometimes need to release potentially sensitive information to the other parties. They might want to do that in order to gain additional insight into the domain or run a specific research for which they don't have resources. While releasing such information, privacy of people and businesses should be protected. We don't want risk the chance of somebody stealing that data and abusing information in it. This is achieved via data anonymization process - removal of all personally identifiable information while preserving patterns essential for research. Anonymized data can then be shared with the people outside of organization without endangering confidentiality or leaking business secrets. {{% img src=\"data-anonymization.jpeg\" %}} For example, consider a business domain that captures all relevant events. These events, after anonymization process, might look like the one below: { date: \"2014-09-04T23:02:00\", group: [ { ref: 1, delta: 1 }, { ref: 2, delta: 11 }] } As you can see, there isn't much personal information there. This event tells only that on September 9th something happened, involving two groups of records, with some deltas. It is just a data point now. Can you guess, which event or type of business we are talking about? This is how good anonymized data might look like. It contains enough information to run the necessary behavioral analytics , however all personally identifiable information was erased from it. This prevents cross-referencing of that information and tracing it back to the origins. This anonymized event was produced from the original event by: erasing all human-readable information; reducing precision of times or numeric values; replacing identifiers and tags with sequential numbers (1,2,3); altering field names to decouple event from the domain; distorting some values by introducing random noise (perturbation). For the reference, original event could have looked like this before destructive data anonymization: { tenant : \"contsco-ebay\", id : \"contsco-2024-09-10-0334\", time : \"2014-09-04T22:57:55\", status : \"pending\", shipping : { country : \"Russia\", zip : \"450075\", line1 : \"Ufa ul.Zorge 66-61\", to : \"Abdullin Rinat\", carrier : \"usps\" }, products: [ { id : \"ksm6573er\", name : \"Kitchen Aid 6573 Empire Red\", quantity : 1, price : \"200 EUR\" }, { id : \"ka1234\", name : \"Plastic shield guard\", quantity : 10, price : \"7.99 EUR\" } ] } In some cases, original domain could capture events on a really fine-grained level, allowing to apply data generalization - merging multiple events together into a more generic one. For example, we could sum all product sales in a day, producing a table of daily sales. This significantly reduces information quality but still keeps statistical analysis possible. Leaking sensitive information Diligence is required in data anonymization process. It is possible to leave some personally identifiable information in the data , even though it might not look like that at first sight. For example, hospital records without patient names but with birthdays and cities can be cross-referenced with other data sources, potentially leaking identities. In this case, it is better to weigh the risk of disclosure and invest extra effort into data anonymization. For example, birthdays could be randomly shifted by 30 days or rounded down to years. 
City names could probably be discarded, unless geographical location is required by the research. Losing valuable information It is possible to apply too much anonymization and discard valuable research information in the process . For example, aggregating individual sales into daily sums leads to loss of information on the shopping habits of customers. Researchers will not be able to analyze them, coming up with models for promotions and personalized discounts. Instead of aggregation, it could be possible to hide individual customers behind artificial numeric identifiers (1,2,3...). If extra caution is needed, it would be possible to reduce the risk of cross-referencing by applying various data transformations. For example: replace each sale of \"Google Nexus 5\" with 1 \"Luxury phone\" and 3 \"post cards\"; whenever a person buys 1kg of candies, multiply the amount by 2. Applying these transformations allows to keep important patterns in the anonymized data without making it useless for research. These are oversimplified examples, of course, but they can serve as an example of the general approach. We remove some confidential data and 'encode' the rest in research-friendly way, while keeping the key secret . Summary Data anonymization allows organizations to share private information with external researchers. This process has to be done carefully, balance kept between the two extremes: leaking personally identifiable information and removing bits valuable for the research. There are no general rules of thumb here."}
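A minimal C# sketch of the transformations discussed above (stable sequential reference numbers, reduced time precision, perturbation); the method names, bucket size and noise levels are made up for illustration:

using System;
using System.Collections.Generic;

public class Anonymizer
{
    readonly Dictionary<string, int> _refs = new Dictionary<string, int>();
    readonly Random _noise = new Random();

    // Replace an arbitrary identifier with a stable sequential number.
    public int MapRef(string originalId)
    {
        int r;
        if (!_refs.TryGetValue(originalId, out r))
        {
            r = _refs.Count + 1;
            _refs[originalId] = r;
        }
        return r;
    }

    // Reduce time precision by rounding down to 5-minute buckets.
    public DateTime ReducePrecision(DateTime time)
    {
        var bucket = TimeSpan.FromMinutes(5).Ticks;
        return new DateTime(time.Ticks / bucket * bucket, time.Kind);
    }

    // Distort a quantity with small random noise (perturbation),
    // keeping it positive so the record stays plausible.
    public int Perturb(int quantity)
    {
        return Math.Max(1, quantity + _noise.Next(-1, 2));
    }
}

Note that MapRef must keep its mapping for the whole export, so the same customer or product always maps to the same number; otherwise the behavioral patterns are destroyed along with the identities.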
{"url":"/post/ddd-evolving-business-processes-a-la-lokad/","title":"DDD - Evolving Business Processes a la Lokad","text":"DDD - Evolving Business Processes a la Lokad As you already know, there are multiple ways to express any given core business concept in the code via domain modeling (we discussed this topic in previous article ). These ways usually depend on the architecture style selected for the bounded context, in which we are currently working. For now, let's focus on one of such domain concepts: long-running business processes . In a cloud-based SaaS company, we could have following business processes (among many other): if invoice has a non-zero amount and was not paid within 15 days, then send customer a reminder. if customer balance stays below -5EUR for more than 30 days, then issue a lockdown. if distributed computing process has not finished processing all data batches within 1 hour, then restart it once (except cases, when it was already restarted - then issue a termination alert) As you probably already noticed, these examples share a few similarities : they are aware of the passing of time and deal with it; these processes express rather complex precondition that is based on current state of the system and leads to one or more then outcomes. Let's assume that we are dealing with a distributed system, where information about current state is shared with events. In such case, our business process might resemble a piece from complex event processing and would look this from the logical perspective: How can we implement this \"Business Process\" box? There are multiple alternatives, depending on the architecture style you have chosen. For example, you can use a state machine, where each instance of state machine would correspond to a specific process instance that you are tracking. Events would then be used to navigate an instance of the state machine across the nodes. It will also use external timer service to send messages \"to future\" (where message is put on hold till certain time comes). State machines are good for formalized domains. You can learn more about such approaches in the materials provided by Gregory Young and Udi Dahan. However, when we are dealing with business processes, that are rich with fuzzy logic, uncertainty and also happen to evolve rapidly, then a more simple solution might be needed. Especially, when you have almost no development time to spare. What is the most simple solution in case with locking customer balance for overdrafts? For instance, we can project all events to a view, which will track all active customers that used our services and went below the threshold at some point. Then our execution will be responsible for regularly checking this view and sending \"Lockdown\" to every customer that had his balance below the threshold for too long. This component would also need to keep in mind that certain customers require special handling and investigation before being locked out, while others can be locked right away. Naturally, these rules will be changing really often. What is the fastest and most flexible way to implement such component in a rapidly growing and changing environment? You simply wire view to the UI, attach a button to send \"lockdown command\" and ask a person from the business department to spend half an hour per week processing all late customers . This will save dev department hours on implementing these complex execution rules, testing them and then changing (as business discovers new corner cases). 
Essentially we let the rules evolve and change in the environment that shapes them: in the minds of business managers. In other words, at this point we avoid large development effort with a little bit of human time . Please, keep in mind, that once business processes are established and we have so many cases, that manually processing them takes too much time (that should be a profitable company by then), we can always rewrite these lockdown rules as a continuously running server-side task (rules would be mostly established by then). We could still keep the projection and a corresponding view. At this point we invest a fixed amount of development to automate a large portion of manual work . This gradual evolution of business processes is currently the recommended approach within Lokad.CQRS architecture style for delivery of non-formalized and rapidly changing business rules."}
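A minimal C# sketch of the projection feeding that manual-review screen, assuming made-up event and view names:

using System;
using System.Collections.Generic;
using System.Linq;

// One row per customer currently below the threshold; the UI lists
// these rows next to a manual "Lockdown" button.
public class OverdraftRow
{
    public string CustomerId;
    public decimal Balance;
    public DateTime BelowSinceUtc;
}

public class OverdraftProjection
{
    const decimal Threshold = -5m; // EUR, from the business rule above
    readonly Dictionary<string, OverdraftRow> _rows =
        new Dictionary<string, OverdraftRow>();

    // Assumed event: fired whenever a customer balance changes.
    public void WhenBalanceChanged(string customerId, decimal balance, DateTime utc)
    {
        if (balance >= Threshold)
        {
            _rows.Remove(customerId); // recovered - drop from the review list
            return;
        }
        OverdraftRow row;
        if (_rows.TryGetValue(customerId, out row))
            row.Balance = balance; // keep the original BelowSinceUtc
        else
            _rows[customerId] = new OverdraftRow
            {
                CustomerId = customerId,
                Balance = balance,
                BelowSinceUtc = utc
            };
    }

    // Candidates for a manual lockdown review.
    public IEnumerable<OverdraftRow> LateCustomers(DateTime nowUtc)
    {
        return _rows.Values
            .Where(r => (nowUtc - r.BelowSinceUtc).TotalDays > 30);
    }
}

The UI would simply render LateCustomers(DateTime.UtcNow) with a button per row; the fuzzy judgment about who actually gets locked stays with the human until the rules settle.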
{"url":"/post/ddd-from-reality-to-implementation/","title":"DDD From Reality to Implementation","text":"DDD From Reality to Implementation This is yet another post in a series that were triggered by fruitful discussions with Vaughn Vernon over content for his DDD book. I think, one of the sources of confusion in DDD/CQRS world is that we often mix terms and concepts that belong do absolutely different layers (and that we don't know how to go from one to the other). Let's start by introducing the following separation: {{% img src=\"ddd-from-reality-to-implementation/2012-04-17_DDD-reality.jpg\" %}} For the new readers, DDD stands for Domain Driven Design, which was introduced by Eric Evans in the book with the same name. CQRS stands for Command-Query Responsibility Principle, which is often associated with architecture styles for implementing systems with DDD and optional Event Sourcing. The term was coined and explored by Greg Young . Reality Reality is that thing around us, which we perceive through our senses and continuously try to understand. In the context of business and software, reality contains core business concepts which are important for the competitive advantage of our business. We want to capture them and then somehow express in code for automation purposes. For example, business concepts could involve things as: Customer Registration Process Customer Subscription Invoice Invoice Payment Cycle I'm taking examples from the environment of Software-as-a-Service (SaaS) company, since that's what I'm mostly familiar with. Domain Model As we learn more about reality and business concepts, we could distill our understanding into the domain model , which contains all things that are relevant and important in the current situation. For the sake of simplicity, we will break down the entire model into set of bounded contexts (BCs) which are separate by the natural boundaries we've discovered in the real world. In SaaS world we could highlight BCs like: Customer Subscriptions and Billing Client Portal Reporting Product 1 Product 2 Cloud integration etc. Each of the bounded contexts in this model stems directly from our understanding of the reality and the natural boundaries that we have identified ( read more ). If we dive inside one of these bounded contexts, we could discover more fine-grained concepts: Ubiquitous Language Aggregate and Aggregate Root Consistency Boundary Business Process Please keep in mind, that these are purely logical concepts, that have (yet) nothing to do with the implementation and all the less important details! The process of identifying BC boundaries can take into consideration things like: teams, skills, available resources and technologies. However, at this level we still don't care about technical details like : frameworks, databases, message middleware, service buses etc. We just create foundation for making conscious choice later down the road. Architecture Styles and Implementation Only after we have identified bounded contexts, we can focus on each BC and start thinking about implementation matters, while considering project specifics. Result of this exciting process would be a choice of key elements for the specific bounded context: development process - how do we organize and manage our development. architecture style - how do we structure and design software implementation. technology stack - what technologies do we use and how do we get them resource allocation - how do we get resources (budgets, people, knowledge) for the project delivery. 
For instance, if our teams are familiar with the SQL/RavenDB and NService Bus, we can pick architecture style described by Udi Dahan ( blog ), where: aggregates are persisted with SQL+NHibernate or RavenDb; command handlers and application services are hosted by NServiceBus; business processes are implemented with NServiceBus sagas; consistency boundaries happen within the transaction scope; views are created either by in-memory events or via projected audit logs; development process will be aligned towards Waterfall or Agile. If environment requires event sourcing or teams are hyped with AR+ES architectural style of Greg Young ( blog ), we can: persist aggregates with event sourcing in event streams; host command handlers in custom message dispatchers that use something like AMQP or direct socket communication; business processes are mainly implemented with state machines hosted within event handlers or via document-based state machines (where state is persisted in messages); views are disposable and are projected from event streams to whatever technology that is needed; consistency boundaries are within aggregate; use Agile development process. At Lokad we are mostly using Lokad.CQRS architecture style ( sample ) that is derived from Greg's but is fine-tuned to: fit cloud computing environments while supporting on-premises deployments; reduce development friction and development effort at the cost of higher requirements for team skills and discipline ; support rapid domain evolution in rapidly changing business environment. This architecture style involves following technical choices: aggregates are persisted with event sourcing in event stream; message handlers are hosted in custom message dispatchers provided by Lokad.CQRS sample project (with adapters for on-premises and cloud deployments); business process transitions are implemented as part of the aggregate behaviors (they could be triggered by user interactions, stateless event handlers sending commands in response to events or tasks that sending commands on a schedule); views are disposable and are projected from event streams, using dead-simple key-value persistence for the majority of cases (with adapters for on-premises and cloud); consistency boundaries are per entity (aggregate or view instance); rapid development process is used for multiple releases per week/day. Obviously, these are just a few options of implementing a given bounded context. There are more architecture styles available out there (and each architecture style can have multiple implementation options and variations). DDD Modeling Process Now, for the most fun part. This trajectory from reality to architectural style is just a happy path scenario that happens only in dreams. In reality you might need to iterate from reality to implementation multiple times : {{% img src=\"ddd-from-reality-to-implementation/2012-04-17_DDD-iterate.jpg\" %}} These iterations are the foundation of the approach to explore and capture core business concepts in domain model (or, one of the approaches). Approach is totally attributed to Greg Young. We start by sketching out domain model without even trying hard to be precise from the start. Then we try to implement it in the code using the most hacky approach possible . Lokad.CQRS style with file-based persistence and messaging works for us, because it is designed for rapid iterations and has the least friction (I'll need to do a quick video on that process). 
One of the goals here is to build a set of unit tests that use specifications to verify behaviors of aggregate roots with event sourcing (AR+ES). These specifications can be printed out as use cases in human-readable way. More often than not, a lot of problems and questions show up during this implementation phase. At this point we usually forget about the implementation (while keeping our use cases) and go back to the domain experts with these questions (more often than not, for me this boils to going and talking to the mirror). Then we adjust the domain model according to the lessons learned. Depending on the nature of the adjustment, implementation is either discarded completely or refactored to fit (AR+ES specifications are usually kept between these iterations, to make sure that we don't loose any captured use case). The process is repeated till we distill our domain model to the point that it captures all required business concepts in a way that can be easily implemented in the code. At this point we can print out specifications, reconfigure implementation to use adapters for production environment (e.g. Azure), push it to production and try to call it a day (only to have next challenge handed to us). PS: If you are interested in this topic, next article in the series might interest you: DDD: Evolving Business Processes a la Lokad ."}
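A minimal C# sketch of how such a specification can be laid out as data (all message names are invented for illustration):

using System;

// Invented messages for a "lock customer for overdraft" use case.
public class BalanceChanged { public int Id; public decimal NewBalance; }
public class LockCustomer   { public int Id; public string Reason; }
public class CustomerLocked { public int Id; public string Reason; }

// A specification is just three parts: events that already happened,
// the command under test, and the events we expect to be produced.
public class Specification
{
    public object[] Given;
    public object   When;
    public object[] Then;
}

public static class LockCustomerSpecs
{
    public static Specification LockOverdrawnCustomer()
    {
        return new Specification
        {
            Given = new object[]
            {
                new BalanceChanged { Id = 7, NewBalance = -20m }
            },
            When = new LockCustomer { Id = 7, Reason = "overdraft" },
            Then = new object[]
            {
                new CustomerLocked { Id = 7, Reason = "overdraft" }
            }
        };
    }
}

A tiny test runner can rebuild the aggregate from Given, execute When and assert that the produced events equal Then; the very same triple prints as a human-readable use case, which is what makes these specifications worth keeping between modeling iterations.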
{"url":"/post/ddd-summit-2012-summary-dddesign/","title":"DDD Summit 2012 Summary -","text":"DDD Summit 2012 Summary - Some time ago I was honored (probably because of some sheer mistake) to be invited to DDD Summit 2012 organized by Eric Evans. The event took place in Portland over the course of last week and it was an absolute blast for me. From left to right: Cameron Purdy - VP of development at Oracle [ Blog , twitter ] Daniel Gackle - CoFounder at SkySheet Randy Stafford - A-Team at Oracle [ About ] Jimmy Nilsson - Factor10, Author of Applying DDD Design and Patterns [ Blog , Twitter ] Rebecca Wirfs-Brock - Wirfs-Brock and Responsibility-Driven Design [ Wiki ] Vaughn Vernon - Consultant at ShiftMethod, Author of the next DDD Book [ Blog , Twitter ] Andreas Brink - Software developer & coach at Factor10 [ Blog ] Alberto Brandolini - Avanscoperta , DDD Instructor and master of tomato [ Blog , Twitter ] Eric Evans - Father of DDD and founder of Domain Language [ Twitter ] Dan Bergh Johnsson - consultant and partner at OmegaPoint [ Blog , Twitter ] Paul Rayner - DDD Instructor at Domain Language [ Twitter , Blog ] Martin Fowler - needs no introductions [ Wiki , Twitter ] Patrik Fredriksson - Consultant at Citerus AB, DDD Instructor [ Blog , Twitter ] Rinat Abdullin - accidental visitor, Tech Leader of Lokad In essence, we were discussing various aspects of DDD in small focus groups and all together, going out on hikes and dinners, talking, talking and talking. There was a wide variety of topics covered during these three days. For me, this was an amazing opportunity to dive deeper into DDD nuances, practical experience and ideas shared by the leading thought leaders. Outcomes and Commitments Long story made short, there are some outcomes of this event that should benefit of the community (including some of my personal commitments). There is going to be more activity on the DDDCommunity.org site , that might have felt a little neglected over the course of last year (with a lot of good materials being hidden by accident). Things will start improving within the next few weeks (more regular updates and more structured community). I personally hope to contribute some small practical experience bits of community building from projects and groups on DDD/CQRS (i.e. CQRSGuide and CQRS Beers with awesome communities around EU) , Lokad projects and Distributed Podcast , maybe even bringing all these closer together. Practical samples are a big thing for demonstrating DDD. We plan to bring them together on a github and refresh a little bit (may be even adding Scala/Akka implementation!). My own Lokad.CQRS Sample Project is essentially a sample of DDD (with all the latest tech stack, cloud portability, practical things and even some DDD modeling toolset) that currently lacks the most important thing - proper domain model. I plan to fix this last problem by the mid of June, in addition to committing to things I've been delaying far too long (i.e.: articles and videos about the sample project). More articles and materials about the Domain-Driven Design are expected be published; some of the topics were already outlined by the summit participants. There are plans to establish better environment for reviewing and shepherding such materials, which should definitely help. I personally plan to continue writing about various aspects of DDD/CQRS+ES. 
Hopefully these new articles will start making slightly more sense, due to their focus not only on disposable technical details (which CQRS+ES are, despite all their coolness and cloud portability), but also on the strategic and design decisions governed by DDD. Besides, some more peer pressure is expected on them :) There are even more cool plans for pushing DDD and the state of the art in the community, however those are long-term. Meanwhile, you can stay tuned for updates via #DDDesign (I'll make sure that any official DDDCommunity news is auto-published on Twitter via this tag as well). How does this look?"}
{"url":"/post/dddcqrs-challenge-integrating-distributed-systems/","title":"DDD/CQRS Challenge - Integrating Distributed Systems","text":"DDD/CQRS Challenge - Integrating Distributed Systems Let's have a look at the relatively simple DDD/CQRS challenge in integrating elements of a distributed system composed of a different bounded contexts and deployed across different hosting environments . Let's imagine a small Software-as-a-Service company which provides some subscription-based service while charging customers per consumption on pay-as-you-go basis. Software infrastructure of such company could consist of only 3 bounded contexts (a major oversimplification on my part, bigger view might be more complicated ): Subscriptions - subscription management system, that keeps track of all customers, their active plans, billing information, invoices, monthly service consumption and available login keys. This system is architected as NoSQL solution with event sourcing and is deployed on a dedicated server (with plans to redeploy it to Azure some time later). Cloud Services Integration - massively scalable set of services deployed in Windows Azure (e.g. using some big data processing design ). Among the other things, these services expose API to 3rd party companies and even products of the same company. This API is secured by user tokens, which are replicated from the subscriptions BC. This project is stable and does not change frequently. Product 1 - a new product being delivered by the company. It is developed as a standalone set of systems that enhance user experience, using Cloud API. This product leverages authentication and user management capabilities from \"Subscriptions\" and interoperates with API. Here are some examples of the interactions between these system: If new user is added to the subscriptions, it's auth credentials should be immediately (within 1-2 seconds) replicated to Cloud Services, to enable access via API. If customer's account is locked out due to balance overdraft, then all related users should be locked out of the API. When services consumption is detected in the API, it should be within 5 minutes reported to subscriptions portal. Naturally all these systems have to work independently in such way, that if one of these is down, the rest will continue doing their part (at the very least by providing read-only UI, at best - doing everything that is not dependent on the other systems). For example, if subscriptions are down for maintenance or Cloud Services and Product 1 should continue working as they were (all pending changes should be replicated after system comes back online). Additional constraints: Resulting design (with inherent implications) should be relatively easy to explain to a Junior dev. It should also be relatively straightforward to deploy and run systems both locally (xcopy deployment of .NET code) and in the cloud. systems should be able to change independently and rapidly as they follow their individual DDD evolution paths (for example, weekly releases with new business processes but without breaking any relations). no more than 3 people per project to develop and maintain it. Note, that we are focusing here only on the integration between the systems. Internal design of each system might affect such integration, but is less relevant in this case. Still it would be nice, if integration patterns shared natural affinity with internal design of each bounded context (this tends to create systems that are more robust and practical). 
How would you approach this problem?"}
{"url":"/post/dddd-cqrs-and-other-enterprise-development-buzz-words/","title":"DDDD, CQRS and Other Enterprise Development Buzz-words","text":"DDDD, CQRS and Other Enterprise Development Buzz-words Update: there is a CQRS Roadmap , that was written at a later moment of time and does better job in bringing these buzz-words and terms together, while explaining the benefits and relations. Check it out ! Let's do a brief overview of DDDD, CQRS and other related buzz-words in the domain of enterprise development. I'll try to give a simplified explanation, highlight some logical relations and provide links for further reading . By the way, if you are interested in practical side of applying CQRS to the .NET platform and Windows Azure - check out Lokad CQRS project, which has a lot of samples and tutorials. We'll walk over such terms as: Domain Driven Design (DDD and DDDD) Command Query Responsibility Separation (CQRS) Event Sourcing and Audit Logs Messaging and integration Basically CQRS + DDDD are just a group of patterns, design principles and approaches that happen to work quite well together, especially in complex large-scale enterprise solutions. Although they happen to help in simpler scenarios as well. Since CQRS and DDDD are often seen together with the other specific development patterns, this sometimes creates an additional confusion, making it hard to distinguish and understand different logical concepts. In the DevExpress thread the discussion (which inspired this article) started with the event sourcing. Let’s do the same here. Event Sourcing describes a concept of persisting application entities (i.e.: aggregate roots) as sequence of events that create and alter them: AccountCreated BillingAddressChanged CustomPropertyAdded AccountCharged AccountSuspended etc By its nature, the simplest implementation of event sourcing only needs two tables (table with primary keys and another one with serialized events) and represents full audit log of all the changes . When we load the entity, we simply replay all events since the beginning. Obviously, such storage in the classical SQL world would be less than performant, if we regenerate large entity for every operation. However we can simply keep the entity in memory (they are not that large anyway), use snapshots and distribute aggregate roots between the machines. Martin Fowler has written extensively about event sourcing and focusing on events in general. One can create audit logs without the event sourcing . For example in the billing subsystem, where accounts are allowed to have balances, balance could be defined as an append-only table, just like in the real-world accounting. In order to modify the balance, append a row with the change amount and new value. Optionally accountant information and operation name could be included. In order to get the actual balance – simply: SELECT TOP 1 * FROM [Balance] ORDER BY [Id] DESC WHERE [AccountId] = ? Although for the majority of the display purposes (which happen to be more than 80% of the cases) accessing denormalized query table would work perfectly without hitting the DB hard even in \"Show total balance by account\" types of the reports. CQRS uses this approach heavily and will be discussed later. Software will always have bugs and problems. Keeping extra information around (audit logs is one form of that) helps to make life less painful and issues - less expensive. 
Release It!: Design and Deploy Production-Ready Software is an incredible read on the subject of delivering and handling real-world solutions. The Aggregate Root concept (along with terms like Bounded Context , Ubiquitous Language or Value Object ) comes from the world of Domain-Driven Design (DDD), which defines a way of modeling business entities in the process of software design, development and evolution. Basically, it links together business concepts and the evolving software model. Its principles start from the principles of thinking and communicating and go up to the allowed logical relations between entities. Jonathan Oliver has gathered a nice overview of materials introducing developers to the world of DDDD and CQRS. The fourth D in DDDD comes from Distributed . It, as the name implies, brings us closer to everything that is distributed, large-scale and cloudy. By a coincidence, the principles of Command-Query Responsibility Segregation (CQRS) happen to address some problems that DDDD faces. The very principle of representing business changes and processes in the form of events, commands and messages is close to the concepts of messaging systems , middleware and, as far as message processing is concerned, service buses . Concepts of message-driven architecture help to decouple complex systems and processes (this usually happens within the boundaries of a service layer), while making them more reliable and easier to comprehend . Enterprise integration (especially in the world of unreliable systems) also depends heavily on messaging. The Enterprise Integration Patterns book is a must-read for everybody interested in the subject. The Advanced Message Queuing Protocol Specification is also an eye-opening material (it's not boring). CQRS as a concept is a way of architecting systems that attempts to deal with some of the frustrating problems of delivering enterprise software : Performance bottlenecks and scalability Concurrency conflicts, their resolution and prevention Data staleness Complexity of the design, development and maintenance CQRS attempts to deal with these problems by reevaluating constraints and assumptions that we've been considering true and valid for the last X0 years. This allows us to rethink core principles and the architecture. In an oversimplified manner, CQRS separates commands (that change the data) from queries (that read the data). This simple decision brings along a few changes to the classical architecture with service layers, along with some positive side effects and opportunities. At the micro-development level, Command-Query Separation says that a method can either be a query (returning data to the caller) or a command (changing the state), but not both. In other words, asking the question should not change the answer. So if we go deeper, Command-Query Responsibility Segregation is about development principles, patterns and the guidance to build solution architecture on top of them. Udi Dahan has a brilliant paper on Clarified CQRS . He also blogs frequently on the subject, along with Greg Young (who is supposed to be writing a book on DDDD and CQRS) and Jonathan Oliver . Here's one of the simple architectural overviews with some details. All data presented to the user is stale anyway, since we don't keep the UI in constant sync with the database, refreshing every millisecond. And since that is OK, let's use it. Querying data for presentation and accessing it for changes do not have to use the same resources. 
In fact, since querying happens in 80%-90% of the cases (people read often, modify less), let's take some time (a few seconds, for example), denormalize the data heavily and publish it to some easily accessible locations. The presentation UI will just have to do SELECT * FROM [MyCustomersView] WHERE… Instead of updating entire entities (i.e. reading and writing the entire BillingAccount when just the email changes), let's send a ChangeBillingEmailCommand to the server. Before sending this command, let's use our query tables to verify that the email address is unique and correct. This should give us a 99.99% probability that it will be accepted. It's OK. So send the command and move on. When the command lands on the server - place it into the queue. This way it'll be processed even if the server is off-line, too loaded or encounters an eventual deadlock. We could additionally distribute work between multiple servers, which is becoming embarrassingly easy in the era of cloud computing. When the command is being processed, double-check the validation and business rules. In the unlikely case that something is wrong with the command data - send the user a message (e.g. the email address has been changed by somebody else). The flow of commands could be saved somewhere, forming a sort of auxiliary audit log, JIC. After the command is processed - take a little bit of CPU cycles to update our query tables to make them easier to consume by presentation and validation logic. Here we just make sure that all these complex joins have to happen only once, keeping reads (which happen more often than writes) extremely fast. In fact, since storage is cheap, we can aggressively use query tables (i.e. query per view per role) and also distribute them. Domain commands and events could be used to split complex systems in order to reduce complexity or distribute the load as well. Middleware messaging systems or even rich service buses could be used here. Cloud bursting scenarios apply easily here as well. Commands, events and other types of messages happen to correlate to terms from the Ubiquitous Language of the domain (they actually form it), so DDD could be used in the process of communicating over, architecting and evolving the enterprise system. Persistence does not really matter here , so we can ignore it, while using rather rare patterns like event sourcing (giving us full audits and simplifying replication) or document databases. Actually, query data does not need to be in a relatively expensive RDB at all. CQRS and Event Sourcing also simplify implementation of flexible entity models with various custom fields and properties that are often defined at run-time and used in layout and drag-n-drop designers by the end-users. CQRS significantly simplifies the introduction of business intelligence into enterprise solutions . BI helps to make better decisions that make the business better (reducing expenses and increasing profits). As you can see from this brief overview, there are quite a lot of different patterns and ideas composing the domain around CQRS and DDDD . The common trend is that they generally have a synergy effect, where using a few of these patterns might create additional benefits or simplify the design. Yet, as in any real-world project, it is strongly advised against taking all these principles and dumping them into the architecture blindly . It'll probably hurt. 
It's more beneficial to learn, understand and practice all of these, but use only the ones that fit the project (feeling free to adapt them as needed). Please keep in mind that this overview is brief and quite a few important problems were left out along with their solutions. Some of these problems are listed below (to give you an idea of the scope), while answers can be found in the materials referenced in the article. How do we handle failing commands from the user's perspective? How can we handle delays in the updates of the query data from the UI perspective? Data validation vs. business context rules in CQRS Resolving database deadlocks at the command-handling level. Elastically scaling processing capacities. How do we apply CQRS principles in the world of AJAX and web applications; Smart Clients? How and where do we host command handling services? etc. If you are interested in more articles on this topic, you can stay tuned by subscribing to this blog via RSS . Note that this document starts outlining the scope of the xLim 4 body of knowledge on efficiently building flexible and light distributed information management systems. There is also a separate page aggregating information on CQRS . All comments, thoughts, questions and any other feedback are welcome and appreciated. So, what do you think? Related Posts: Command-Query Responsibility Segregation CQRS - Validation and Business Rules CQRS - Automatically Visualize and Document Your Solution"}
{"url":"/post/dddes-missing-guidance/","title":"DDD/ES - Missing Guidance","text":"DDD/ES - Missing Guidance Over the course of last few days I received a lot of feedback on the IDDD sample for Aggregates with Event Sourcing (ES) published. I tried to address some questions either by replying or by adding more documentation to the actual project (which resulted in commits like this one ). However, there still is a set of questions that can't be addressed within that specific sample - this will overload sample and hijack the original intent of showing how a single aggregate works with event sourcing . As it seems, the missing guidance currently includes following topics : Added : High-level overview of DDD/ES architecture (a la Lokad) and building blocks in various deployment configurations. Overview of available case studies; influence by DDD and Event Sourcing. Added : Reality vs Domain Model vs Implementation; capturing domain model via coding exercise and evolving it towards deeper insight; features of good domain models. Added : Detailed overview of building blocks that can be used for DDD/ES architecture; how to model, design, test and maintain each block individually and as a part of bounded context: Application Services Aggregates with Event Sourcing View Projections Tasks Event Ports Dealing with remote services (or integration points) that can fail, timeout or require case-by-case automated performance adjustment (e.g. throttling); Inter-aggregate relations, sources of truth, eventual consistency and compensating actions; Capturing long-running business processes in a way that both matches Domain-driven Design (DDD) and is generally simpler to deal with in practice than sagas; Integrating multiple bounded contexts together (e.g. when you need to share some information between); Added : Developing Clients UIs: composition of UI and data, eventual consistency and task-based UIs, specifics of targeting multiple platforms (Web, Desktop clients, mobile apps), authorization; Maintenance, debugging, release and failure management of systems implemented with DDD+ES approach; Addressing specific scalability and performance requirements; cloud deployments. Added : Multi-tenancy, tenant-specific customizations and processing. Deployments and maintenance of multi-tenant systems. Added : Building occasionally-connected systems. Concurrent editing, conflict resolution strategies and merging. Feedback to user. Added : Migrating legacy systems to DDD/ES architecture; PROs and CONs; justifying risks and costs. Added : Appendix for .NET developers: Structuring Visual Studio Solutions Developing ASP.NET MVC Web applications Windows Azure practices for DDD/ES systems Helpful tools and frameworks Does this list of topics cover all black spots, or are there any other areas that need some coverage ?"}
{"url":"/post/decomposing-happypancake-into-components/","title":"Decomposing HappyPancake into components","text":"Decomposing HappyPancake into components Last night we had a really productive pairing session with Pieter , discussing the design of HPC2 and trying to decompose it into a bunch of components aligned with the use cases. Components are represented by golang packages composed of a bunch of small files each. Examples of such packages are: auth register profile notify poke etc This maps to C# projects in .NET world but with a lot less ceremony, which feels really good. Terms \"domain\" and \"bounded context\" came up in the discussion along with the question to how they relate to these components. I think, these terms represent much higher level of abstraction which is simply absent from our case. Domain of HappyPancake is quite simple for now. It is just a dating social network with a bunch of small features. Massive kudos go here to Tomas who is continuously pruning features and keeping core product lean and focused. He pushes this to the level I've never experienced before in projects. We can say that HappyPancake, at this point of its life-cycle, has only one domain and a single bounded context. Later on we will probably see things like \"ad integration domain\" and \"mobile notifications\" showing up. Domain of HPC2 is initially captured via a set of use cases. We have to complete these in order to deliver a first beta. While working with Pieter yesterday, we tried to map these use cases to golang packages. As he said: Packages should reflect functional model, not technical. With this approach we apply Domain-Driven Design principles at the strategic level (identifying domains, establishing vocabulary, dealing with domain boundaries and interactions), while design of the domain itself is refined with tactical approaches (related to \"micro-services\" movement these days): decompose domain into components; components work together in order to implement use cases; components expose a public contract (e.g.: events, http handlers, service contracts) and have a private implementation; it is impossible to get component map right from the first attempt; deeper insight into the domain will emerge as we evolve components and their boundaries. Benefits of domain decomposition into small components (represented by golang packages in the code) were summarized nicely by Pieter last night: packages can be built and tested in isolation; less merge conflicts (since developers usually work on use cases and these are aligned with our packages); packages map to the use cases (functional model), which makes it easier to reason about things; names become more verbose or more refined (e.g. compare hpc.handlers.ProfileHandler vs hpc.profile.handler ); components can grow independently and be developed in collaboration; there is much less context switching while working on a use cases (no need to jump across the solution); golang tooling is focused around packages and supports this kind of development really well ( godoc , looper ). In words of Tomas : Opinions opinions opinions, none should be considered authoritative for your domain and your context... Such an approach has an additional benefit (as apposed to layered implementation of the domain) - each component exists in its own context, enforced by the packaging design of golang. This helps to be more opinionated in the implementation of each component, without suffering from syndrome of \"pattern over-application\"."}
{"url":"/post/design-observations-on-big-data-for-retail/","title":"Design Observations on Big Data for Retail","text":"Design Observations on Big Data for Retail Change of technologies and approaches tends to bring a lot of challenges and problems (which eventually turn into \"lessons learned\"). This is especially true, when you probe paths that are not common. Curiously enough, as Charles de Gaulle once noted, such less common paths are also the ones where you are likely to encounter much less competition. At the moment of writing, one of current projects at Lokad is about rewrite of our Salescast product, which is a cloud-based business intelligence platform for retail (see tech case study ). This rewrite features better design which captures core business concepts at a deeper level . This allows to achieve simpler implementation, better cloud affinity and scalability, while discarding such technologies like IoC Container, SQL and NHibernate ORM. If you are interested in reasons for discarding these technologies: SQL - too expensive and complex for dealing with bit data in cloud; ORM - complex and unneeded; IoC Container - I prefer simple designs that don't need it. Obviously such mess as WCF, WWF, Dynamic Proxies, AOP, MSMQ etc - are also something I try to avoid at all costs. One of the side effects is that this system no longer needs complex setups for local development : message queues, event stores, documents, BLOBs and persistent read models are stored in files. We are using event sourcing for the behavioral elements of the system, while \"big data\" number crunching is based on a different approach . This approach has an interesting side effect that I didn't expect. If anybody in the team discovers a problem in some complex data processing pipe (or any other logic, including business rules, map-reduce step, report generation etc), with exception bubbling up, then in order to reproduce the exact state of the system on a different machine : Stop the solution. Archive data folder and send it to responsible person faulty for the problem (usually me). Responsible person unarchives data folder and starts the solution. Exception will bubble up. You see, when exception bubbles up in the development environment, the message still remains in the message queue (as a file in a folder). So when we transfer all data to another machine and start the solution - system will try to pick that same message up and reprocess it. Since all data dependencies are included in the data folder, this will lead to the same exception showing up. Obviously, production deployment of such system is quite different (using cloud-specific implementations for data storage, messaging and event streams), yet principles would still work. This happens because I mostly store data either in append-only structures (BLOBs for large data and event streams for behavioral domain models) or this data is irrelevant (persistent read models that are automatically rebuilt from event streams). I'm using Lokad.CQRS Sample Project as a baseline for developing this and similar systems. Here are a few more technology-specific observations: TSV + GZIP is quite good for storing large non-structured streams of data in table form and with little effort (plus, you don't need any tools to view and check such data); When you need decent performance while storing sequences of complex structures with little effort (e.g. 
a sequence of object graphs), then Google Protocol Buffers (prefix-based serialization) offer a fast approach (wrap it with GZIP and SHA1, if there are repetitive strings); when it is worth spending a few days to optimize storage and processing of big data to insane levels (e.g. for permanent storage), then some custom case-specific serialization and compression algorithm can do magic (rule of thumb: this might be needed only in 1 or 2 places); do not optimize till it is really necessary ; quite often you can save a massive amount of time by avoiding optimization and simply using a bigger virtual machine on the cloud (which is cheaper); whenever possible stream big data through memory , as opposed to loading huge datasets entirely. You'll be surprised how much data your small machines will be able to process; You don't need expensive licenses and hardware (e.g. Oracle, IBM, Microsoft setups usually offered by consultants) to store and process thousands of stores with years of sales history. Likewise, you don't need large teams or big budgets to get the thing ready and delivered. A lot of that can be avoided with the appropriate design . Especially if that design factors in not only technological and organizational factors, but also shares affinity with the business model of a company.
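As promised above, here is a minimal sketch of the TSV + GZIP observation (the file name and columns are invented for illustration): rows are written through a GZipStream and read back one line at a time, streaming through memory rather than loading the dataset.

using System;
using System.IO;
using System.IO.Compression;

// Minimal sketch of the TSV + GZIP approach; file and columns are invented.
public static class TsvGzipDemo
{
    static void WriteRow(StreamWriter writer, params string[] cells)
    {
        writer.WriteLine(string.Join("\t", cells)); // one TSV row
    }

    public static void Main()
    {
        // Write: rows go straight through the compressor to disk.
        using (var file = File.Create("sales.tsv.gz"))
        using (var gzip = new GZipStream(file, CompressionMode.Compress))
        using (var writer = new StreamWriter(gzip))
        {
            WriteRow(writer, "store-1", "sku-42", "2012-05-01", "3");
            WriteRow(writer, "store-1", "sku-43", "2012-05-01", "1");
        }

        // Read: stream one row at a time, never loading the whole dataset.
        using (var file = File.OpenRead("sales.tsv.gz"))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                Console.WriteLine(line.Split('\t')[1]); // e.g. project the SKU
        }
    }
}"}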
{"url":"/post/design-obsessions/","title":"Design Obsessions","text":"Design Obsessions I had a lot of failures in my past development experience. Most of them were caused by being completely obsessed by some cool technology or a trick. These things were so appealing that desire to use them became the central idea of an application. Among the failures I had in my development, these ones were caused by design obsessed with some technology: Design driven by principles of UI composition and Flexibility , where you build ultimately flexible CRM system with any number of fields, queries and forms. Inversion-of-control-container driven design, where you design a system by dropping a large pile of services, controllers and managers into the IoC container, and then letting it resolve a complex graph. ORM-driven design , where you design your \"business objects\" and the rest of the system is wired almost automatically. CQRS-driven design , where you take this principle as architectural guideline and end up with a complete mess of messages and views. Lesson learned - if central idea of your design is about technology, then such system will become a slave of this technology . All advantages and limitations of such technology will eventually become forth and strike you really hard. If you start your system design by assumption of using a certain framework, database or tool - you are already paying a tribute to this obsession. It is unavoidable to some extent, since we are limited by knowledge and capabilities of our development teams. However, we can reduce bad side-effects by trying hard to focus on the idea that is worth becoming the center of your application . As you probably have guessed, this idea is about solving the real-world business problem you have at hand (granted that this problem is worth solving). Examples of such problems are: helping business to optimize it's pricing strategies across hundreds of thousands of products to increase turn-over and reduce amount of inventory that is thrown away; enabling a company to serve millions of its customers better by allowing behavioral analysis of each individual and suggesting healthier and cheaper products; helping a hospital to serve it's patients better by providing more efficient ways to diagnose patients, schedule available resources or collaborate on information about treatments and medications. Technologies, stacks and approaches are merely replaceable tools that help to support such solution (even if tech is as cool as cloud computing, event sourcing or $YourCurrentlyFavoriteTechnologyHere$). Pick them consciously and don't let them become the core idea behind design of your solution. Such obsessions are among the most expensive ones . If you have too many - you can even end up with a severe case of analysis paralysis . While designing systems we try to use all cool tech we love. Design obsession with solving business problems is better. Update: if you want to hear a bit more on the subject and my mistakes - check out Episode 8 of Being The Worst Podcast"}
{"url":"/post/distributed-decide-act-report-model/","title":"Distributed Decide-Act-Report Model","text":"Distributed Decide-Act-Report Model Yesterday I gave a talk on CQRS with Windows Azure at Copenhagen GOTO Conference ( slides, PDF ). Mark Seemann (who was the very reason of me coming to Copenhagen) provided invaluable feedback on the talk. He really liked the analogy for explaining relations between commands, events and views in the cloud architectures. So let's go over it in this blog post. Let's think, how some real-world organizations might function like. With some imagination you can identify 3 roles: Managers, that run organization; they read paper reports or call assistants, decide and issue orders for the workers below them to execute. Workers, that receive orders, act upon them (where they can and have the resources) and notify various departments about the job done. Assistants, that gather together all these notifications, mails and memos into various reports , making them available to anybody, who has to make the decision. Obviously, the entire iterative process of decide-act-report takes some time. It is not instantaneous, because humans are slow. However, this somehow seems to works in the real world. Companies seem to make right decisions that guide them through the ever-changing business world. They even manage to grow into large organizations (with more complex structures). In short, this structure - works. Now, take a look at the image below. Does this \"Decide-Act-Report\" resemble a bit the classical distributed architecture implemented with CQRS in mind? In this world, users are the managers, who decide , what to do in the UI. They use the latest reports available to them in form of Views (aka Read Models or Projections ) in a way, that makes it simple to make a decision. User interface captures their intent in the form of command messages , that order server to do some job. Servers, then, work hard to do the job, acting upon the commands messages sent to them. Upon the completion (or any other outcome), notifications are sent to all interested parties in form of events messages published via pub/sub. View event handlers ( Projection Hosts ) receive these notifications, building Views to report their data to the user. They work even harder to keep these up-to-date, updating them upon every event message. Since these reports are kept up-to-date, any interested party can query and an immediate result, without the need to wait for the report to be computed. Everything is rather straightforward, as you can see. At the same time, some of the analogies from the real world can still apply. For example: There could be multiple managers, operating the same organization at the same time == multiple users can work concurrently with an application. If there is too much work, you can hire some more workers == if there are too many commands, you can add more servers. Actual reports can be copied and distributed around the organization, JIC if manager needs them right now == you can spread read models around the globe to keep them close to the client (or even keep them at the client). Manager, workers and reporting assistants could be in the same building or they could be spread across the continents, while exchanging mail between each other == distributed application with messaging can have all components as in a single process or it can spread them across the data centers. 
So, again: User - looks at views, decides and issues commands; Command Handlers - receive commands, act upon them and publish notifications; View Handlers - receive interesting notifications and update views, immediately reporting them to the interested parties per request. Does this analogy make things a little bit simpler in the distributed CQRS world with users, command handlers and event handlers?"}
{"url":"/post/does-your-event-store-only-serve-one-node/","title":"Does your event store only serve one node?","text":"Does your event store only serve one node? In reply to the question from Stacy: Does your event store only serve one node type? Or does your event store serve any and all nodes? I did it in various ways, since different scenarios might require different deployment strategies. Here are some cases that worked out. 1 event store for multiple subdomains , hosted within the same worker process. Event Store is hosted in the same worker process as well (with in-memory cache) and accessible via direct calls to application services from these subdomains. Worked nicely. 1 event store per node for our scalable FTP proxy for Azure project. Each node writes to it's own event store (hosting the engine). Event consumers (if there are any) join events together. Technically, in this case one could have one central event store, accessible by multiple nodes, but I didn't want to bother with complex deployment at this phase of the project. Plus, more dependent servers you have - higher risk of failure and problems. Multiple distinct applications, hosted in different nodes, with each of their own event stores . They can exchange information by sending commands or by pushing views with well-known schema to a well-known location. Event stores are kept private. So generally I treat event store as something private to the application. In single-node projects - one event store per application, in multi-node environments - one event store per node (or have multiple nodes share the same event store in case of some partitioning and load balancing). Obviously, this ratio does not involve various replication scenarios (e.g. master-slave), where replicas don't count towards the event store score (they still contain the same information)."}
{"url":"/post/domain-driven-design-and-organizational-politics/","title":"Domain-Driven Design and Organizational Politics","text":"Domain-Driven Design and Organizational Politics Domain-Driven Design can often lend a helping hand in diffusing a tight political situation in a conservative organisation (to our benefit). Consider case, when business processes of company are backed up by a bunch of IT teams responsible for different software systems. More often than not, these IT teams would be not so friendly to each other, trying to protect their data and software from any unfreindly influence, changes or even access to the data. This creates noticeable friction for any new initiatives, which need access to this data or simply integration with the software. At Lokad we've seen this pattern on more than one occasion. One way of solving this situation is to leverage Domain-Driven Design methodology to identify most important factors, risks and stakeholders at play (see context mapping). Then, once we identified and prioritized separate areas, specific development methodologies can be applied to shift the odds each battle to our favor. For example, teams fighting for their projects, can reduce cost of change and focus on real problems by evolving their domain models in collaboration with domain experts. Scalability issues (quite common in legacy CRUD domains) can be worked around by applying patterns like Domain Events and Command-Query Responsibility Segregation . This would put such teams in more favorable position, compared to other teams. Business owners like those who deliver fast and build trust. More favorable position can be leveraged to gain more influence, personal freedom or other bonuses. Another example would be about fighting off teams of SQL Database Administrators who resist any change in database schema. Let them have their database for reporting purposes, while persisting everything internally with event sourcing. We'll simply propagate our own changes to their databases via projection of events to SQL tables. It will not be our fault if SQL database can't keep up with the performance of event-sourced backend. Then, at some point, we could simply offer to replace entire pretty expensive Oracle cluster with a bunch of Redis servers running on some commodity hardware without the need of expensive SQL tuning. Such approach can massively reduce costs in a company, which is a strong polical leverage for further improvements. In other words, DDD helps to come up with consistent strategy for dealing with complexity, friction and inefficiency in organisations. Various tactical patterns and architectural styles could be applied locally to support this strategy in different specific situations."}
{"url":"/post/dont-be-depressed-be-the-worst/","title":"Don't Be Depressed, Be The Worst","text":"Don't Be Depressed, Be The Worst I just got this comment to my old post on 10 Steps to Become Better .NET Developer : I just graduated, and i am willing to become a VB. Net developer, but after saw your posting, i found myself is nothing, depressed. Ok, first of all: I'm sorry that my post made you feel this way. I'm just a guy that still learns software development (and feels really stupid for a number of reasons on a regular basis) and so please don't take my words for granted. Second, technologies are pretty much irrelevant and replaceable (they never teach relevant technologies at universities, anyway). Third, at the university you have gained the most important skill for you future career - ability to learn, reason and move forward . This will allow you to become whomever you want to, even when you find your life goals expanding in the upcoming years. Fourth, please read this article Be the Worst , which was shared with me by Kerry Street last night. Feeling like you are the worst is a good position to move forward on your learning path (and definitely much better than feeling like you are the best). Fifth, good luck. If you have any questions, please, don't hesitate to ask and reach out . I'll try to help. Besides, .NET has a marvellous development community that is always willing to help, if you are willing to move forward with this trade. PS: BTW, I graduated from an economic department of a university in the middle of Russia. They taught us how to plan a cattle herd, theoretically command a military unit and manage a Soviet Republic. My first .NET classes were self-taught by reading a text-book while spending summer working on constructions sites of Baltimore in USA. I think, that was pretty low for a start of .NET career (not that I got that far from there since then). You can do much better."}
{"url":"/post/efficient-development-deployment-and-customer-feedback-at-lo/","title":"Efficient Development, Deployment and Customer Feedback at Lokad","text":"Efficient Development, Deployment and Customer Feedback at Lokad A while ago I wrote about deployment and updates of the desktop applications and outlined the technology we use at Lokad for rapid and efficient delivery of various updates to the customers. Let's expand on the topic a little bit more, showing how a small company could create quite a bit of business value via some efficient development practices. There will be a small questionnaire at the end of this article as well. Development and Deployment Experience Current application deployment and publication loop for Lokad Client Applications works like this: when developers commit new features or fixes to version control repository and hit a special build button on integration server , following happens automatically: All unit tests , integration and code quality tests are executed against the latest codebase (build breaks if these fail). New version of install package (and optional downloads) is deployed to publicly available location. Download page is updated with the links to the new version. Existing users of an application get a nice \"Update Available!\" notification in the corner of their toolbar. Clicking on it launches semi-automatic download and upgrade process. Note, that we are not using ClickOnce technology here, since it has important limitations we couldn't live with. Everything took a bit of time to setup once and then it is just started flowing. Customer Feedback Experience Recently, in order to enhance customer experience, yet another simple feature had been added to the client applications: integrated feedback reporting. Simply put, customer could click a button and send a message to the company. It will immediately show up in the customer support system. Additionally, all unhandled exceptions (should they happen) could be reported, too. Feedback reports (if customers decide to) could include following anonymous information: system descriptor of the software (version, OS version); versions of the assemblies loaded into the AppDomain; latest portion of application log; statistics of some performance counters showing client-side information about interactions with our servers (number of communication failures, retries, amount of data transferred); exception counter statistics (information about unique exceptions handled by the application along with their counts). Note, that sources for the SystemDescriptor , ExecutionCounter and ExceptionCounter are available in Lokad Shared Libraries . Actual feedback reporting shard is shared with the community in the Lokad SDK Feedback system had been implemented as an infrastructure shard within the shared codebase. This implied IoC-friendly component-driven design and allowed to efficiently reuse all components in multiple applications (even in these that didn't have any IoC infrastructure around). All improvements and fixes to the shared components, in such scenario, automatically benefit all applications that use them. Additionally, since the feedback system is implemented as an independent infrastructure shard, it can be used to trap and report possible exceptions that take place on the application startup (even before the Inversion of Control Container is setup). 
Later, when the application and composite UI elements are ready, we switch the stand-alone feedback handler for the integrated handler (one that uses the application's Viewspace and action policies, for instance). This reduced feedback friction for the customers and made issue reports a lot more useful and helpful for the development team. Since all feedback reports get routed to the internal customer support system, we could feel the change really fast, resolve problems and provide solutions more efficiently. New Challenge - Update Flood This kind of automatic and tight integration between development, deployment and customer support spawns new possibilities and brings forth new challenges. First of all, it definitely affects the way we think about development and plan future changes. Second, since it is so easy to deploy an update, we've got to start thinking about protecting customers from being flooded with a number of updates (software should solve problems, not create new ones), since theoretically we could fire a deployment update after every single verified fix that lands in the trunk. Here's how one of the implementation scenarios looks right now. All software updates (packaged as a complete version, as usual) could be described with: a version; a download url; tags (Critical, RC, Test, Major etc); an optional description. The primary download page is automatically updated with the links only when \"Critical\" and \"Major\" versions are released. Customers are informed about the Major updates to the installed software in the usual manner (i.e. toolbar notification). Less important updates are not shown to them explicitly but could be pulled (i.e. from the menu), if a customer needs that. This scenario should work when we are facing a minor fix affecting only a single customer. We could deploy an update via the normal procedure, but it does not make a lot of sense to push updates to every single customer. Implementation should be rather straightforward: instead of updating a simple txt file on the server for every update, we could update an XML file in the RSS/Atom format with the build version, url, description and tags. By default every application filters out all non-Primary updates, displaying the rest on the toolbar (one possible shape of such a filter is sketched below). This way we deliver important functionality to customers, while saving them from the barrage of less important builds. Customers might explicitly look into the update stream, picking and firing the specific update they need (i.e. when informed about the fix by customer support). Beta testers and partners might update application settings to be notified about RC and Minor deployments. Developers and testers might switch their filtering settings to include Beta and Minor updates. Potential synergy effects: Updates could be subscribed to via a mere RSS reader and mashed with company news (automagically). Since we are already tracking Delivery-Build labels for issues, this information could be pasted into the description dialog, providing customers and testers with more context (could be available from the application and from the web/RSS). It is possible to push the idea even further and let customers set a simple filter in their application (\"inform me when ticket #X is fixed\"). Then, whenever the next deployment is detected, a simple regex could scan the description for the ticket number, raising a friendly alert whenever it is mentioned there. 
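Here is a small sketch of what such a descriptor and filter might look like (all names here are illustrative assumptions, not the actual Lokad implementation):

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch: an update descriptor as published in the feed,
// plus the client-side filter.
public sealed class UpdateDescriptor
{
    public Version Version;
    public string DownloadUrl;
    public HashSet<string> Tags = new HashSet<string>(); // "Critical", "Major", "RC", "Minor", "Test"
    public string Description;
}

public static class UpdateFeed
{
    // By default only Critical and Major updates raise the toolbar alert;
    // beta testers and developers widen interestingTags to "RC" or "Minor".
    public static IEnumerable<UpdateDescriptor> Notifiable(
        IEnumerable<UpdateDescriptor> feed,
        Version installed,
        ISet<string> interestingTags)
    {
        return feed.Where(u => u.Version > installed
                            && u.Tags.Overlaps(interestingTags));
    }
}

The same predicate could also run a regex over Description to implement the \"inform me when ticket #X is fixed\" idea.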
If we have a critical update at hand (i.e. a breaking API change or an important security vulnerability) we can block the entire app and insist on updating. Everything might get even more interesting as extensible and composable application architectures (with extensions being developed and delivered concurrently) get into the picture, allowing fine-grained and yet automated customer-company interactions. Summary In this short article we've walked over some development practices adopted at Lokad. These practices help the company to resolve customer problems and provide new solutions to their business challenges. In short, the outlined practices include: efficient software deployment and updates; efficient customer support and feedback collection; considering synergy effects and automation, whenever possible and feasible. Of course, there's a lot more to efficient development than these few practices. I'll address them eventually. You can subscribe to this journal to get all the updates. Voting for the next article to be written is also possible. As always, all feedback on the article is welcome and appreciated. Here are also a few questions, if you are willing to share your experience with the community around this Journal: How much time does it take for a customer to report a problem and get an update in your development environment? Does your company share code with the community? Are you a customer of Lokad? What do you think about this development approach? How could it be improved? Are you a competitor of Lokad? What do you think? ))"}
{"url":"/post/essay-on-messaging-and-design-patterns/","title":"Essay on Messaging and Design Patterns","text":"Essay on Messaging and Design Patterns Use messaging to tackle complexity of reactive applications. Development ecosystem is a constantly changing battlefield, affected by various forces: software vendors, technological improvements, buzzwords and new ideas. One of recent changes involves transition towards reactive applications . Application requirements have changed dramatically in recent years. Only a few years ago a large application had tens of servers, seconds of response time, hours of offline maintenance and gigabytes of data. Today applications are deployed on everything from mobile devices to cloud-based clusters running thousands of multicore processors. Users expect millisecond response times and 100% uptime. Data needs are expanding into the petabytes. Let's explore one of the ways to approach reactive designs. We'll talk about a specific flavour of in-memory messaging which is present in open-source software projects like: EventStore from Greg Young and his team. Getting Things Done Task Manager from BeingTheWorst Podcast . Data Platform Sample from Lokad These systems implement in-memory messaging to tackle complex reactive domains. Benefits are: Break down some application functionality into separate components; Improving testability of these components and the entire application; Explicitly expressing and handling time-based concepts, which would be hard to deal with otherwise; Delivering new features incrementally without disrupting existing codebase; Simplifying concurrency; Delivering systems that can degrade gracefully under load, instead of failing completely. What is Messaging? Messages are named data objects designed to capture some concepts and ideas . In code they could look as simple as that: public class RegisterCustomer { public string FirstName; public string LastName; public string Email; } Messaging itself is about design where we send messages between components to drive system in reactive way. This is much like using emails to drive business workflows in a company. Similar to emails, passing messages is non-blocking - we fire message without any expectations about when it will be handled and by whom exactly. This definition seems to be both obvious and too vague. Understanding \"What messaging is?\" does not give a slightest clue on how to apply it and gain some benefits. Blindly implementing system where any component could send anything to everybody is likely to end up in a complicated mess. Actually, this happened many times before. Eventually developers started noticing common patterns that were present in successful projects. These messaging patterns were small, simple and focused enough to be useful and reusable. They helped to structure complicated software and make it understandable for outside developers. Eventually they got catchy names, too. Example of Some Messaging Patterns Working Together Let's try to gain better understanding of messaging by taking a look at design patterns for messaging used in EventStore, GTD Task Manager and Lokad Data Platform: Within this specific flavour we can identify distinct building blocks with different roles and capabilities. Queue Main Controller Publishing Bus Subscribing Components Queue in this diagram is an in-memory message queue aggregating all incoming messages. Messages might potentially come from different threads. 
Should the system be under load, the queue is the place that will hold messages until they can be processed. The Main Controller is the main message-handling class, responsible for taking messages from the queue one by one and reacting to them. It serves as the main entry point for messages and the system coordinator. The main controller can sometimes be implemented as a finite state machine (FSM) which would handle messages differently in different states. For example, we might discard all external requests while the system is in StartingUpState or ShuttingDownState , while passing them through to the dedicated handlers in WorkingState . In this design, even though messages come from different threads, they will be processed by the Main Controller on one thread. This is a perfect synchronisation point. Of course, if we find out that certain operations take too much time (e.g. CPU or IO) we could route related messages down to dedicated handlers which would have their own pool of threads. The Bus (publisher) maintains a list of subscribers interested in different messages. This list is usually defined at application startup and stays immutable from then on. When a new message is passed down from the controller to the bus, it will be immediately (and synchronously) handed over to each subscriber. For example, if we are implementing an event-driven reactive desktop application, various view controllers could be implemented as components that: subscribe to interesting events on the bus; update their corresponding views in response; put UI messages back on the main queue when the user clicks buttons, enters text or interacts in any other way. This would allow developers to add more features to the system by implementing new controllers (along with the corresponding views) and plugging them in. More Messaging Patterns There could, obviously, be other, more specialised messaging patterns like timeout managers, process managers, forwarders or reply envelopes. Each comes with a well-defined role and place in the overall design. Enterprise Integration Patterns might be a good start for learning more about established terminology and time-proven techniques.
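To make the flavour concrete, here is a heavily compressed sketch of the queue, main controller and publishing bus (the real implementations in EventStore or the GTD sample are richer; treat these names as illustrative):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class InMemoryBus
{
    readonly Dictionary<Type, List<Action<object>>> _subscribers =
        new Dictionary<Type, List<Action<object>>>();

    // Subscriptions are wired up at application startup and never change.
    public void Subscribe<T>(Action<T> handler)
    {
        if (!_subscribers.TryGetValue(typeof(T), out var list))
            _subscribers[typeof(T)] = list = new List<Action<object>>();
        list.Add(m => handler((T)m));
    }

    // Synchronously hand the message over to every subscriber.
    public void Publish(object message)
    {
        if (_subscribers.TryGetValue(message.GetType(), out var list))
            foreach (var handler in list) handler(message);
    }
}

public sealed class MainController
{
    readonly BlockingCollection<object> _queue = new BlockingCollection<object>();
    readonly InMemoryBus _bus;

    public MainController(InMemoryBus bus) { _bus = bus; }

    // Any thread may enqueue; under load, messages simply wait here
    // until they can be processed.
    public void Enqueue(object message) { _queue.Add(message); }

    // One thread takes messages one by one: the synchronisation point.
    public void Run()
    {
        foreach (var message in _queue.GetConsumingEnumerable())
            _bus.Publish(message); // react, then fan out to subscribers
    }
}"}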
{"url":"/post/essence-of-domain-driven-design-re-explained/","title":"Essence of Domain-Driven Design - Re-Explained","text":"Essence of Domain-Driven Design - Re-Explained This question came from the latest episode of Being The Worst: Linguistic Cartographers from Johan. In that episode Kerry and I tried to apply DDD to the exploration of new domain. First of all thanks for a great podcast series, I’m always looking forward to the next episode! I’m just a little confused when it comes to bounded contexts, domains and services (SOA term). From what I understand when listening to an episode of the “Distributed Podcast” with Udi Dahan, bounded contexts in DDD terms are more or less interchangeable with services in SOA terms. If I remember it correctly Udi also said that you should only use CQRS in those BCs/services that are in a collaborative domain. In DDD terms the BC is a linguistic boundary (am I right?) which makes sense when I look at your context map. What I don’t really understand is what Udi means when he says that CQRS should (only) be used in certain BCs . To me it seems like CQRS may be used in a certain domain (like a core domain) that resides inside a certain BC. Other domains in the same BC may or may not use CQRS. Have I understood it correctly or am I misstaken? Sorry if the question is a bit off topic. We will be talking about some of these concepts in later episodes (some already recorded), but let's get this straight in writing. We'll try to stick to the original definitions as heard from Eric Evans, Vaughn Vernon and Gregory Young. Here's my interpretation, which tries to stay as close to the origins as possible. We apply Domain-Driven Design to help solving problems in some vast and complicated problem space . This problem space is called Domain or \"the Business\" - how people actually do things in the real world. While doing DDD, we try to divide the large problem space in smaller and more manageable regions, which can be conquered separately. We identify these regions just like how cartographers identify places - by looking at some shared traits. In our case: language used by experts, organisational boundaries, existing common knowledge etc. These regions are called bounded contexts . We identify existing solutions (applications, components, projects) and plan our new solutions. Usually existing solutions are developed without the regard to bounded contexts. Ideally, new solutions would fit precisely into the problem regions we identified. However existing solutions can overlap or have various sorts of ugly shapes. These solutions are called subdomains . The process of first identifying problem space and currently existing solutions, then, mapping it to some picture is called context mapping . It is extremely important, since it provides you with the strategic overview of your battlefield, helping to prioritise and make decisions. Context Map is not a view of the future, but rather a current state. One of the most important traits used in identifying boundaries is language of the problem space, choice of terms and words, which are used by experts. We would try to use the same language while talking with them, with domain experts and all the way through the solution process. This will drastically reduce the confusion. Language is called Ubiquitous Language . The idea is to use this language as a way to explore our Domain and, with the help of the Domain experts, capture it in a Domain Model - a useful abstraction of the business, captured in the code. 
While building new subdomains for the identified bounded contexts, we try to develop them in such a way that the solution is tightly linked to the original problem. This will make it more robust, understandable and evolution-tolerant; real, in other words. Ideally, the language of the problem space (along with the boundaries) will find its way into the solution, including the names of projects, classes and methods. The Being The Worst podcast talks about the practical side of such development quite a bit. We can't address all problems at once, so different solutions will have different priorities. They will also have different possible implementations. Hence, we can differentiate between subdomains: Core domain - the most important subdomain, which is essential for the business. Without it the business would fail. If you ever need to pick the first solution to implement - start with the core domain. Supporting subdomain - a subdomain which is less valuable for the business than the Core domain. Without it the business can maybe even survive for some time. But it still is quite important (it supports the core domain); it also is specific to the domain and has to be developed. In this case, for some reason, we can't buy existing software or a component to solve the problem. Generic subdomain - a subdomain which is less valuable for the business than the Core domain. It also is generic enough to allow buying it off the shelf (unlike a supporting domain ). While building solutions for the specific problems, we can optimise our solutions by picking the best tooling, approaches and methodologies from those available. For example, CQRS with Event Sourcing might be a good fit for the Core Domain, while a CRUD SQL system can be the best fit for another. These different methodologies will not conflict or confuse people, since we explicitly keep them within the subdomain boundaries. Also, while talking about Context Maps, we can identify relations between different subdomains , based on the real-world situation. These might include politics, organisational boundaries, personal distrust between teams or tight budgets. Considering these factors can help to identify potential issues in advance, while also baking work-arounds and contingency plans into the actual software being developed. We can also identify potential integration and extensibility points. Here are some of the common terms used: Published Language - linguistic elements (e.g. command and event contracts) that are frozen and made visible outside of the subdomain. This way other subdomains can interact with it. Shared Kernel - a part of the subdomain implementation that is frozen and extracted (so that the others can reuse and integrate with it better). Anti-Corruption Layer - code which explicitly protects the logical model of a subdomain from messy logic or changes in the other subdomains it has to integrate with. Once again, the Essence of Domain-Driven Design is not about some patterns (i.e.: repositories, value objects, event sourcing etc), it is about looking at complex real world problems, learning how to break them down into smaller pieces and then solving them in the most efficient way. DDD teaches methodologies that help to move along this process. Ideally, for each bounded context, there will be one subdomain that fits it perfectly. Although this might work for green-field projects, in reality the situation can be different. This is especially true in legacy systems. 
In legacy systems, the entire problem space can be treated as one big and complex problem, with a few solutions thrown here and there without any explicit reasoning or boundaries. In this case we can have one messy and confusing bounded context with a bunch of subdomains stepping on each other's toes. DDD also provides guidance on gradually getting out of this mess. Please keep in mind that these definitions are my own personal interpretations, presented in simplified form. If you want to go to the roots - check out the Blue Book by Eric Evans. PS: And no, Bounded Contexts are not interchangeable with services in SOA terms. The former is the problem, the latter is the solution. Update: there is an episode of the Being the Worst podcast which goes into more detail on this topic."}
{"url":"/post/essential-reading-on-big-data-and-persistence/","title":"Essential Reading on Big Data and Persistence","text":"Essential Reading on Big Data and Persistence In my previous post we've discussed some design considerations for handling big data in retail . Let's continue from here. Joannes Vermorel has just completed a really interesting whitepaper on storing sales data in retail. He outlines a few rather simple principles that allow to store 1 year of detailed sales history of 1000 stores on a smartphone . Both the white paper (PDF) and source code are shared by Lokad on github. I'm not claiming, that this is a production-ready scenario, since it is missing things like continuous replication (to another smartphone), checksumming and BI capabilities. However the point here is that SQL server or generic No SQL server might not be necessarily be the best fit for this situation. Curiously enough, in scenarios when companies need to store similar amounts of sales history, they don't take simple and rather cheap approaches like this one. Instead, consultants sell them rather expensive Oracle, Microsoft (put any company in big data field) software and hardware setups that still fail to keep up with the throughput of the data . For some reason, if you can write 50000 ticket receipts per second to a file (where each receipt usually contains a dozen products), this does not necessarily mean that you can have the same throughput inserting rows to your favorite SQL database cluster. So why do we even use them? I don't hold anything against SQL (or any other relational storage), except the fact that SQL DB is being sold as a silver-bullet for cases, where it is clearly not applicable. And I hate to see huge amounts of money wasted in a useless way (at least, donate them to a charity or noble cause instead). By the way, check out this great paper by Erik Meijer and Gavin Bierman: A co-Relational Model of Data for Large Shared Data Banks . It provides nice insight into the nature of relational (SQL) and document (Not Only SQL) persistence options. So why do we keep on applying expensive sub-optimal solutions to problems that do not fit them ? Probably, because \"nobody get fired for buying IBM\", while trying some non-conventional approach and failing is more risky to your career. However this will not necessarily hold true in the next years. Economic and technology forces are too strong. Just read this amazing white paper from Pat Helland , which was written way back in 2007 (and don't get surprised if you find a lot of things that look like modern principles behind event sourcing and domain-driven design). I do not intend to criticize SQL databases or any other product, but rather to give broader perspective - they are not the only data persistence solutions out there. There are more options. And sometimes, a few specialized lines of code can beat a generic product both hands down (simply because they can be more tailored to the problem, than a product would ever dream to be)."}
{"url":"/post/event-sourcing-a-la-lokad/","title":"Event Sourcing a la Lokad","text":"Event Sourcing a la Lokad I've seen quite a few ways of doing Aggregate Roots and event sourcing. Here's the dead simple approach that works for me. For the sake of simplicity, we logically separate aggregate root (event generation from commands, given certain internal state to make the decisions) from the aggregate state (building state from the events). public interface IEvent {} public interface ICommand {} public interface IAggregateState { void Apply ( IEvent e ); } public interface IAggregateRoot { void Execute ( ICommand c ); } Given that, an aggregate state looks like a view handler , while aggregate root itself resembles a command handler (from the scenarios, where ES is not employed at all). That's actually the point. Let's define a simple aggregate, that tracks opened bills for the customer. First, we define contracts for our commands and events (I'm using a VS T4+ANTRL Combo here to avoid writing message contract classes by hand or bloating this post): let customerId = CustomerId CustomerId; CreateBill? (customerId, DateTime startDateUtc) BillCreated! (customerId, DateTime startDateUtc) CloseBill? (DateTime closeDateUtc) BillClosed! (customerId, DateTime closeDateUtc) AddServicesToBill? (int serviceCount) ServicesAddedToBill! (int serviceCount) Then, we proceed to write the aggregate state. It should know, how to build itself from the events. public sealed class BillAggregateState : IAggregateState { public CustomerId Customer { get ; private set ; } public DateTime Opened { get ; private set ; } public DateTime Closed { get ; private set ; } public int Services { get ; private set ; } public void Apply ( IEvent @event ) { RedirectToWhen . InvokeEvent ( this , @event ); } public void When ( BillClosed e ) { Closed = e . CloseDateUtc ; } public void When ( BillCreated e ) { Customer = e . CustomerId ; Opened = e . StartDateUtc ; } public void When ( ServicesAddedToBill e ) { Services += e . Services ; } } The only \"magical\" place here is the RedirectToWhen helper, which is actually quite simple ( see gist ). Given the state, we can define our aggregate as: public class BillAggregate : IAggregateRoot { readonly BillAggregateState _state ; readonly Action & lt ; IEvent & gt ; _observer ; public BillAggregate ( Action & lt ; IEvent & gt ; observer , BillAggregateState state ) { _state = state ; _observer = observer ; } void Apply ( IEvent e ) { _state . Apply ( e ); _observer ( e ); } public void Execute ( ICommand c ) { RedirectToWhen . InvokeCommand ( this , c ); } public void When ( CreateBill bill ) { Apply ( new BillCreated ( bill . CustomerId , bill . StartDateUtc )); } public void When ( AddServicesToBill c ) { Apply ( new ServicesAddedToBill ( c . ServiceCount )); } public void When ( CloseBill e ) { Apply ( new BillClosed ( _state . Customer , e . CloseDateUtc )); } } Having said all that, here's how the \"event sourcing magic\" actually works: IEnumerable < IEvent > givenEvents = ...; IEnumerable < ICommand > whenCommands = ...; // load state from the event history // or, if you have snapshot - load it here first // we will not do the latter here var state = new BillAggregateState (); foreach ( var e in givenEvents ) { state . Apply ( e ); } var thenEvents = new List < IEvent > (); var ar = new BillAggregate ( thenEvents . Add , cs ); foreach ( var c in whenCommands ) { ar . Execute ( c ); } // do something with the events that were produced. 
// for example - append them to the history and then publish in async // or do both at once and face 2PC return thenEvents ; That's basically it. Note that we are not relying on any frameworks, code or interface definitions outside the scope of this article. A few caveats: aggregate identities are carried outside of the commands/events and passed by the message bus via strongly-typed message contexts (see the Lokad CQRS PDF for detail). Aggregates don't care about their own identity. versioning and version checks are not within the scope of this article, yet they could be added to the snippet above as needed. all commands that come in are joined by a logical transaction; obviously your message bus must support command batching in order for this to work. This is the current approach of Lokad to Event Sourcing in the distributed world. It's likely to evolve a bit further, if we find ways to make it even simpler and more straightforward. BTW, the situation gets even more interesting if we assume that: all messages (commands and events alike) carry a unique identifier that is used at all steps of message processing to enforce message deduplication (required for repartitioning or cloud environments in general). entity identities (i.e.: aggregate root identifiers) that are carried in the transport headers, not only simplify our contracts (while still being exposed to the domain code in a decoupled way), but also provide a simple foundation for message routing and aggregate re-partitioning. if we keep track of the message causality (event X was caused by command batch Y) in the transport headers along with client vectors, this provides us with the foundation to do partial message ordering (for the cases where cloud environments are really stressed and tend to mess up order a lot). This topic is continued in the post on Tape Storage , which serves as the persistence foundation for event sourcing and also enables fully portable systems to be developed and deployed with Lokad.CQRS ."}
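The RedirectToWhen helper above is only referenced via a gist. As a rough illustration of the idea (my reconstruction, not the original code), a reflection-based dispatcher could look like this - it finds a When method whose parameter matches the exact runtime type of the message and invokes it:

using System.Reflection;

public static class RedirectToWhen
{
    public static void InvokeEvent(object instance, object @event) { Invoke(instance, @event); }
    public static void InvokeCommand(object instance, object command) { Invoke(instance, command); }

    static void Invoke(object instance, object message)
    {
        // look up When(SpecificMessageType) on the target object
        var method = instance.GetType().GetMethod(
            "When",
            BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance,
            null,
            new[] { message.GetType() },
            null);
        if (method == null)
            return; // messages without a matching handler are simply ignored
        method.Invoke(instance, new[] { message });
    }
}

A production version would cache the MethodInfo lookups per type instead of calling GetMethod on every message.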
{"url":"/post/event-sourcing-aggregates/","title":"Event Sourcing - Aggregates","text":"Event Sourcing - Aggregates A quick way to understand aggregates that use event sourcing is to consider the following analogy. Imagine a car parts factory that is being managed by a director. This director commands this factory by sending instructions from his office down to the production floor. These instructions could be: recalibrate all machines; make that guy work 16 hours tomorrow; ship order to customer A. Now! Each command is carried over to the plant and executed by workers (as diligently as possible). As this happens, papers and memos are filled (since we need to inform the headquarters about what has really happened). So each of these commands can have various outcomes, which depend on the current state of the plant. For example, when we send command to _recalibrate all machines , we can get response that all 10 machines were re-calibrated . Another response could be 5 machines were re-calibrated, 3 were found to be broken and 2 were actually found to be missing`. Same with the other commands. When we try to make a worker work 16 hours he might agree or (if he is a Russian), he could just say some rich and powerful words to the management. When we are trying to ship all orders, one of the following sequence of events might happen: given that all order items are prepared - orders get shipped; given that some items are missing but can be replaced with equivalent parts - order gets shipped with some replaced parts and then also record about that replacement. given that order can't be fixed by replacing certain items - an apology gets written. Let's bring in the analogies between an aggregate root and a factory: factory - aggregate ; command to factory - command ; given state of a factory - an internal aggregate state ; occurrence of some event - event ; Let's implement aggregate for this factory without using any external frameworks. First we start by defining interfaces for our elements: event, command, state and aggregate. public interface IEvent {} Then, we define contracts for our events (I'm using a VS T4+ANTRL Combo here to avoid writing message contract classes by hand or bloating this post). Events would look like this: MachineRecalibrated : IEvent ( Guid factory , int machineId , DateTime calibrationDate , double accuracy ) MachineReportedStolen : IEvent ( Guid factory , int machineId , Guid policeReportNumber ) MachineReportedBroken : IEvent ( Guid factory , int machineId , Guid incidentReportId ) OrderShipped : IEvent ( Guid factory , int orderId , ShippingReport report ) OrderCantBeShipped : IEvent ( Guid factory , int orderId , ShipDenialReport report ) WorkerWorkShiftExtended ( Guid factory , int workerId , int hours ) WorkerUsedRussianCurses ( Guid factory , int workerId , string message ) WorkerRefusedToExtendShift ( Guid factory , int workerId ) Given these events, let's start by implementing an aggregate called FactoryAggregate . In the following method we hold all behaviors required to recalibrate all factory machines . Note, that here we are taking a dependency on IServiceGuy . public void RecalibrateAllMachines ( IServiceGuy guy ) { var id = _state . FactoryId ; foreach ( var machine in _state . Machines ) { var report = guy . RecalibrateMachine ( machine ); switch ( report . State ) { case CalibrationReport . Missing : Apply ( new MachineReportedStolen ( id , machine . Id , report . Id )); break ; case CalibrationReport . 
Broken : Apply ( new MachineReportedBroken ( id , machine . Id , report . Id )); break ; case CalibrationReport . Recalibrated : Apply ( new MachineRecalibrated ( id , machine . Id , report . Date , report . Accuracy )); break ; default : throw new UnsupportedReport ( report ); } } } Here is how another method can be implemented: public void MakeWorkerPullLongShift ( int workerId , int hours , IRussianCurseGenerator gen ) { var worker = _state . GetWorker ( workerId ); if ( worker . AgreesToPullHours ( hours )) { Apply ( new WorkerWorkShiftExtended ( _state . FactoryId , workerId , hours )); } else { Apply ( new WorkerRefusedToExtendShift ( _state . FactoryId , workerId )); } if ( worker . IsEnraged && worker . IsRussian ) { var curses = gen . GenerateCurses ( worker . RageLevel ); Apply ( new WorkerUsedRussianCurses ( _state . FactoryId , workerId , curses )); return ; } } So far all is straightforward. But we have 2 unknown items here - _state and Apply . These are two sides of the same coin. _state reflects the current state of things at the factory right now. Apply is used to publish events while also updating this official state of things. // private variable that holds an observer delegate readonly Action < IEvent > _addToUnitOfWork ; void Apply ( IEvent @event ) { // Pass the event to state (and let it update itself) _state . Apply ( @event ); // Add event to the unit of work _addToUnitOfWork ( @event ); } Then, we proceed to write the aggregate state. It should know how to apply the given events. public sealed class FactoryState { public void Apply ( IEvent @event ) { RedirectToWhen . InvokeEvent ( this , @event ); } public MachineList _machines ; public MachineList _brokenMachines ; public void When ( MachineAdded e ) { _machines . AddNew ( e . Id , e . InitialCalibrationDate , e . InitialAccuracy ); } public void When ( MachineReportedStolen e ) { _machines . Remove ( e . MachineId ); } public void When ( MachineReportedBroken e ) { var m = _machines [ e . MachineId ]; _brokenMachines . Add ( m ); _machines . Remove ( m ); } public void When ( MachineRecalibrated e ) { var m = _machines [ e . MachineId ]; m . CalibrationDate = e . CalibrationDate ; m . Accuracy = e . Accuracy ; } } The only \"magical\" place here is the RedirectToWhen helper, which is actually quite simple ( see gist ). As you can see, commands are executed against the aggregate. If needed, the aggregate uses additional services. While executing commands, the aggregate employs complex behaviors which are based on the current state. This execution results in changes to the state, which are expressed as events. Event Sourcing Magic Let's see how all these come together in a message handler that accepts command messages and passes them to the aggregate store. public void Consume ( RecalibrateAllMachines cmd ) { var guy = _container . Resolve < IServiceGuy > (); _factoryRepository . Execute ( cmd . FactoryId , c => { c . RecalibrateAllMachines ( guy ); }); } public void Consume ( MakeWorkerPullLongShift cmd ) { var generator = _container . Resolve < IRussianCurseGenerator > (); _factoryRepository . Execute ( cmd . FactoryId , c => { c . MakeWorkerPullLongShift ( cmd . WorkerId , cmd . Hours , generator ); }); } Having said all that, here's how the \"event sourcing magic\" actually works: public void Execute ( int id , Action < FactoryAggregate > when ) { var streamName = \"factory-\" + id ; var givenEvents = _stream . 
ReadAllEvents ( streamName ); // load state from the event history // or, if you have snapshot - load it here first // we will not do the latter here var state = new FactoryState (); foreach ( var e in givenEvents ) { state . Apply ( e ); } var thenEvents = new List < IEvent > (); var ar = new FactoryAggregate ( thenEvents . Add , state ); // execute actual command when ( ar ); // do something with the events that were produced. // for example - append them to the history and then // publish in async or do both at once and face 2PC _stream . AppendEvents ( streamName , thenEvents ); } That's basically it. Note that we are not relying on any frameworks, code or interface definitions outside the scope of this article. A few caveats: aggregate identities are carried outside of the commands/events and passed by the message bus via strongly-typed message contexts (see the Lokad CQRS PDF for detail). Aggregates don't care about their own identity. versioning and version checks are not within the scope of this article, yet they could be added to the snippet above as needed. all commands that come in are joined by a logical transaction; obviously your message bus must support command batching in order for this to work. How is an Aggregate Created? The event stream for an aggregate is created by sending a command to an aggregate that does not exist yet. This command could look like: new CreateUserCommand { Id = Guid . NewGuid (), Name = registerModel . Name , Login = registerModel . Login , Password = registerModel . Password } Then, somewhere in the handler we could have: public void Consume ( CreateUserCommand cmd ) { var service = _container . Resolve < EmailCheckService > (); _factoryRepository . CreateAndExecute ( cmd . Id , c => { c . CreateNewUser ( cmd . Id , cmd . Name , cmd . Login , cmd . Password , service ); }); } The only difference between the Execute and CreateAndExecute methods is that the latter does not throw any exceptions if the stream does not exist before the call (and it might actually throw an exception if the stream did exist). Note that in advanced cases of user registration we might want to start by creating a RegistrationAggregate (to track all associated information and the actual process of registration), which would then pass control to various instances of CustomerAggregate , SecurityAccountAggregate and UserLoginAggregate . These details depend on your domain and are subject to discussion with domain experts. Advanced Scenarios Above is a simplified approach to event sourcing that is generally described and explained in articles on the subject. However, there are ways to reduce development friction and increase the expressiveness of the code, if the event-centric approach a la Lokad is used. There the situation gets even more interesting if we assume that: all messages (commands and events alike) carry a unique identifier that is used at all steps of message processing to enforce message de-duplication (required for repartitioning or cloud environments in general). entity identities (i.e.: aggregate root identifiers) that are carried in the transport headers, not only simplify our contracts (while still being exposed to the domain code in a decoupled way), but also provide a simple foundation for message routing and aggregate re-partitioning. 
if we keep track of the message causality (event X was caused by command batch Y) in the transport headers along with client vectors, this provides us with the foundation to do partial message ordering (for the cases where cloud environments are really stressed and tend to mess up order a lot). This topic is continued in the post on Tape Storage , which serves as the persistence foundation for event sourcing and also enables fully portable systems to be developed and deployed with Lokad.CQRS . Message Builders Sometimes our messages will be so large that composing them via constructors would create fragile and non-readable code. If such an instantiation happens too often for a single event, we can improve code readability in this specific case by introducing event builders. An event builder is a stateful class that provides a helpful and expressive syntax to create an instance of an event. It can look as simple as this: // TODO: add a snippet"}
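The builder snippet is left as a TODO in the post above. Here is a minimal sketch of what such an event builder could look like; the InvoiceIssued event and the fluent method names are invented for illustration:

using System;

// Hypothetical large event that is painful to construct directly
public sealed class InvoiceIssued
{
    public readonly Guid Invoice;
    public readonly string Customer;
    public readonly decimal Amount;
    public readonly DateTime IssuedUtc;

    public InvoiceIssued(Guid invoice, string customer, decimal amount, DateTime issuedUtc)
    {
        Invoice = invoice;
        Customer = customer;
        Amount = amount;
        IssuedUtc = issuedUtc;
    }
}

// Stateful builder that reads like prose at the call site
public sealed class InvoiceIssuedBuilder
{
    Guid _invoice = Guid.NewGuid();
    string _customer;
    decimal _amount;
    DateTime _issuedUtc = DateTime.UtcNow;

    public InvoiceIssuedBuilder For(string customer) { _customer = customer; return this; }
    public InvoiceIssuedBuilder WithAmount(decimal amount) { _amount = amount; return this; }
    public InvoiceIssuedBuilder IssuedOn(DateTime utc) { _issuedUtc = utc; return this; }

    public InvoiceIssued Build()
    {
        return new InvoiceIssued(_invoice, _customer, _amount, _issuedUtc);
    }
}

// Usage: new InvoiceIssuedBuilder().For("Northwind").WithAmount(120m).Build();

The builder stays readable even as the event grows more fields, while keeping the event contract itself immutable.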
{"url":"/post/event-sourcing-projections/","title":"Event Sourcing - Projections","text":"Event Sourcing - Projections Projection is an important concept while building event-centric systems. At the same time, it is extremely simple. Projection is about deriving current state from the stream of events . For instance, consider a situation, where a stream of events is published out by a server to all subscribers. These events are related to user registrations and look like: UserAddedToAccount { UserId = 150 , AccountId = 47 , Username = \"spam-me-not\" , RegistrationToken = \"27fa3h…\" } UserVerifiedEmail { UserId = 150 , Email = \"rinat.abdullin@gmail.com\" } UserRenamed { UserId = 150 , NewName = \"abdullin\" } We can attach a subscriber to stream of these events to project this stream into a persistent read model, used to serve user details in a Web UI. Final read model could look like: { \"UserId\" : 150 , \"AccountId\" : 74 , \"Username\" : \"abdullin\" , \"Email\" : \"rinat.abdullin@gmail.com\" , \"RegistrationToken\" : \"27fa3h...\" } And projection logic could look like this (in pseudocode): if ( e is UserAddedToAccount ) { store . Save ( new UserView { UserId = e . UserId , AccountId = e . AccountId , Username = e . Username , RegistrationToken = e . Token }); } if ( e is UserVerifiedEmail ) { var view = store . Load < UserView > ( e . UserId , Lock . Enforce ); view . Email = e . VerifiedEmail ; store . Update ( view ); } // etc As you can see, this pseudo-code is rather straightforward (and quite boring). It probably uses something like ORM to do the actual persistence. However, there is an important concept behind this simplicity. Given the stream of events, we can project them to any structural representation . Structural representation here refers not only to the schema of a read model, but also to the implementation details of how this model is stored and accessed. Here are some possibilities: SQL database (MS SQL, mySQL, Postgres or any other relational database, including cloud counterparts like Azure SQL or Amazon RDS); NoSQL database (CouchDB, MongoDB, RavenDB or any other document database); No database (serialized files on a file system, including cloud counterparts of Amazon S3, Azure Cloud Storage or Rackspace CloudFiles); No persistence (blazing-fast in-memory store, which is rebuilt from local event stream whenever server is rebooted). Obviously, in either way these read models will always be kept-up-to-date . Since, we are projecting new events to the read model, as soon as they come in. In a sense such read models are nothing more than an almost perfect cache (the one that is updated as soon as the change comes in). Now consider the fact that we can multiplex event stream and have multiple similar projections subscribed to it in parallel. This is the reason, why we can have almost-infinite scalability on read side with our projected read models. Another reason why projections are important: exactly the same concepts apply not only to the persistent read models, but also to aggregate roots implemented with event sourcing. An aggregate root can have it's own private state, which is projected from the same events that it creates . These similarities help to make event-centric distributed systems to look more simple and consistent. Terminology Let's get our terminology straight. Projecting is process of converting (or aggregating) a stream of events into a structural representation. 
This structural representation (which is being updated as we traverse the stream) can go by many names: persistent read model, view or state. The process of projecting is executed by a set of event handlers , which essentially are just methods executed whenever a specific type of event comes in. These methods perform CRUD operations upon the persistent read model. For convenience, we will group all event handlers into projection classes based on the read model that they are working with. Simple Scenario Let's explore writing and managing projections in a simplified scenario (which actually works in practice for the majority of projects in the cloud and on premises). It is based on the assumption that each read model is a single document that can be accessed and modified as a whole. For instance, in a simple task-tracking system you could have the following read models (which map to the screens in different UIs): TodoListView - a single document listing all tasks that are yet to be done TaskDetailView - a document per task with all the details TasksPerDayView - a document per day, listing all tasks scheduled for a given day TasksCompletedLastMonth - a document of tasks that were completed last month. The list can go on. It's actually not that relevant, since we can always add new projections and reconstruct read models by running the event history through them. Here's how the projection class might look: TODO : Projection class from FarleyFile If you have been working with service buses, you will notice the familiar IConsume interface. It is a marker interface that is recognized by the infrastructure, which is responsible for calling the appropriate methods whenever the corresponding event comes in. Obviously, the entire auto-wiring could be replaced by a little bit of manual code as well. The only new interface here is IAtomicWriter, which can be defined as: public interface IAtomicWriter<in TKey, TEntity> { TEntity AddOrUpdate(TKey key, Func<TEntity> addFactory, Func<TEntity, TEntity> update); bool TryDelete(TKey key); } As long as your persistence supports these two methods, you can use it to store read models. The only subtle moment here is the atomicity of the updates. We should not blindly overwrite a view that somebody else has already modified in parallel. Fortunately, this is supported by the majority of persistence providers. Relational databases provide transactions, while no-sql systems feature either conditional operations (as in \"update if version is still X\") or provide direct support for simple atomic changes. Rebuilding Quite often you would want to change your projections or add completely new ones to the system. Obviously we would want to go back in time and make everything look like these changes were there since the beginning of time. This is where replaying events comes into play. In order to be able to do that, we should set up our system to record all passing events into a separate event log for the domain. This event log is completely separate from aggregate event streams (should these be used in the system). Its sole purpose is to simplify event replays for projections. To do that we simply add an additional subscriber that would listen to all events and append them to that log: if ( ! log . TryAppend ( streamer . 
SaveEnvelopeData ( message ))) throw new Exception ( \"Failed to save to domain event log\" ) ; Whenever the time comes to rebuild a certain projection, you could stop the system (this could be done without stopping, but then the procedure would be a bit more complex), drop all view models and then recreate them by pushing events through the projections once more. If you are using some simple persistence for your views (i.e.: documents, files or blobs), then the procedure could be simplified a bit. Download the latest domain event log (or simply sync the local copy with the remote version) Manually wire up projections to use a local store (files or even memory) for persisting view models. Normally this is done in a helper tool without any bus or messaging infrastructure. We just sequentially read messages from the file and pass them to the manually constructed instances of projection classes. Once the rebuild is complete, upload views to the new container in production, overwriting the old views. Obviously this works for simple systems with low traffic, ones that can be stopped while we perform upgrades. However, if needed, an upgrade can be performed with zero (or little) downtime. It involves having parallel deployments for the client and projection hosts. We will have: ProjectionHost_v1 - a subsystem that projects incoming events to a store Views-v1 . There will be a Client-v1 (i.e. UI) configured to use Views-v1 . ProjectionHost_v2 - a subsystem that uses store Views-v2 but is currently empty. Client-v2 will be configured to use that store. The load balancer will be configured to redirect all web requests to Client-v1 . We perform the upgrade by: Bringing ProjectionHost-v2 online and subscribing it to the events queue (but not consuming yet, just keeping messages in memory). Telling ProjectionHost-v2 to run all events from the domain event log, and afterwards start consuming all events from the events queue (there is a subtle chance of message duplication here, which will be handled by the sliding cache). Once the projection host catches up with all the events, we simply tell the LoadBalancer to switch from Client-v1 to Client-v2 . Afterwards all v1 elements can be dropped. TODO: mention role of upgraders TODO: separated from bounded contexts Technological Options TODO: Cloud Flavor, Local persistence, in-memory analysis and rebuilding, NoSQL and SQL TODO: use http://abdullin.com/journal/2011/1/19/scalable-and-simple-cqrs-views-in-the-cloud.html"}
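Since the projection class itself is left as a TODO in the post above, here is a sketch of how the pieces could fit together: an in-memory implementation of the IAtomicWriter contract plus a small projection. The IConsume marker interface, the TaskCreated event and the TaskDetailView read model are my reconstructions based on the text, not the original FarleyFile code:

using System;
using System.Collections.Concurrent;

public interface IEvent {}
public interface IConsume<in T> where T : IEvent { void Consume(T message); } // assumed marker interface

// In-memory implementation of the IAtomicWriter contract defined in the post
public sealed class MemoryAtomicWriter<TKey, TEntity> : IAtomicWriter<TKey, TEntity>
{
    readonly ConcurrentDictionary<TKey, TEntity> _store =
        new ConcurrentDictionary<TKey, TEntity>();

    public TEntity AddOrUpdate(TKey key, Func<TEntity> addFactory, Func<TEntity, TEntity> update)
    {
        // ConcurrentDictionary retries the delegates on contention,
        // giving us the atomic add-or-update semantics the post asks for
        return _store.AddOrUpdate(key, k => addFactory(), (k, existing) => update(existing));
    }

    public bool TryDelete(TKey key)
    {
        TEntity removed;
        return _store.TryRemove(key, out removed);
    }
}

// Illustrative event and read model
public sealed class TaskCreated : IEvent { public int TaskId; public string Title; }
public sealed class TaskDetailView { public int TaskId; public string Title; }

// One projection class per read model, grouping its event handlers
public sealed class TaskDetailProjection : IConsume<TaskCreated>
{
    readonly IAtomicWriter<int, TaskDetailView> _writer;

    public TaskDetailProjection(IAtomicWriter<int, TaskDetailView> writer) { _writer = writer; }

    public void Consume(TaskCreated e)
    {
        _writer.AddOrUpdate(e.TaskId,
            () => new TaskDetailView { TaskId = e.TaskId, Title = e.Title },
            view => { view.Title = e.Title; return view; });
    }
}

Swapping MemoryAtomicWriter for a SQL- or blob-backed implementation changes nothing in the projection class, which is exactly the storage ignorance the post describes.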
{"url":"/post/event-sourcing-specifications/","title":"Event Sourcing - Specifications","text":"Event Sourcing - Specifications Event Sourcing: Specifications When you hear about Event Sourcing , one of the mentioned advantages is ability to write expressive and non-fragile tests. This is done via specifications. Specification is just a certain scenario that says: Given some preconditions, When something happens, Expect this. Specifications are a rather nice and flexible way to express and verify certain behaviors. If you are developing with Microsoft .NET, check out a sample of generic Specifications framework see SimpleTesting by Greg Young . Obviously, specifications can be bound and executed via any unit testing engine (or using stand-alone runners). For NUnit bindings - see snippet by Yevhen Bobrov // TODO: update with my own snippet, which reduces friction a little bit more. Coded specifications for ES might look like this: public Specification Creating_customer_opens_bill = new CustomerSpec { Before = () => Current . DateIs ( 2011 , 10 , 6 ), When = new CreateCustomer ( cust , \"Northwind\" ), Expect = { new CustomerCreated ( cust , \"Northwind\" ), new CustomerBillOpened ( cust , 1 , Date ( 2011 , 10 , 6 )) }, Finally = Current . Reset }; Testing Aggregates Specifications act as a natural way to test complex business logic, which is implemented with CQRS+ES way . This happens because behaviors in aggregates are coded like: Given a certain events in the past (used to rebuild the state) When a command is passed Generate certain events. Hence, we can easily test any single command in various scenarios, by providing preconditions (events that happened before) and comparing actually produced results with the expected ones. // TODO: repeat spec snippet and include explanation In certain cases we might also need to provide test setups and tear-downs within the specification. For instance, if aggregate uses some sort of MyOverridableTime static class to retrieve time in a testable way (and record it in an event), we could handle the situation like this: public Specification Creating_customer_opens_bill = new CustomerSpec { Before = () => MyOverridableTime . DateIs ( 2011 , 10 , 6 ), When = new CreateCustomer ( cust , \"Northwind\" ), Expect = { new CustomerCreated ( cust , \"Northwind\" ), new CustomerBillOpened ( cust , 1 , Date ( 2011 , 10 , 6 )) }, Finally = MyOverridableTime . Reset }; More complex scenarios Sometimes, simple field-based specification syntax is not flexible enough for our needs. Or it can create a messy code. In this case you can actually define specifications in methods that return either an instance of Specification or IEnumerable<Specification> // TODO: include sample snippet. You can also use builders (either a stand-alone class or a method within the specification suite) to reduce repetitive code in your unit tests. // TODO: include snippet from testing bill-2-invoice conversion If you want to push the scenario even further ( at the cost of higher fragility ), you can actually record Given-When-Then using some textual serialization. This would create a text file per specification. Then, while running unit tests, simply enumerate all files in a directory (or resources in an assembly), building a specification for each one. 
Organizing specifications Low Fragility The lower fragility of specifications (for instance, compared to unit testing business logic bound to the database) comes from the fact that we are actually using the Ubiquitous Language to express behaviors of the subject under test. While doing that, we don't couple our tests with the actual implementations of these behaviors. For all we know, the inner code can be wildly refactored, completely changing everything. Yet, as long as the behaviors stay the same - we don't care. Another reason for lower unit test fragility - behaviors are expressed as messages which are structured around the real-world language of the problem domain. This language does not change often. In fact, it requires a small revolution in order for this to happen. Events would still change, though. Although the language doesn't really change, our understanding of it can evolve. I wrote an article recently about versioning event contracts. Specifications as a Living Documentation If you have a specification, you can easily and automatically convert it into a readable scenario document (SimpleTesting has a snippet showing how to achieve that). Creating customer opens bill - Passed Date Is ( 2011 , 10 , 6 ) When : Create customer 'Northwind' with id 7 Expectations : [ Passed ] Created customer 7 'Northwind' [ Passed ] Opened bill 7 / 1 from 20111006 The simple trick that actually creates all these readable lines is using the .ToString() of each message (or any other equivalent). So you can actually print out your entire unit test suite and then verify it with the business team. If you are delivering a project, the acceptance team can sign off on these scenario specifications (effectively saying that they agree with how the unit tests are run). What's more interesting, these specifications (in their textual form) can be written by the business team. Then you can have a junior developer convert them into code. Another developer can then start working on implementing the business logic in an aggregate. You could always measure the progress of this developer by calculating implementedSpecifications / totalSpecifications . And while the developer is coding in the behaviors (which can be quite complex), another team can be working on the UI and projections. Reusing specifications to test contracts. Message contracts (both commands and events) are not fragile. Yet serializing them can yield unexpected results, especially if you are switching between different serializers. For instance, the ServiceStack serializer has issues with serializing almost all structs (except the primitive ones). Another example is the limited support of DateTime.Kind by ProtoBuf and ServiceStack . I used to verify (occasionally) contract serialization stability by creating unit tests that fill some objects with data, round-trip them via the serializer and then compare the results. This approach had some issues: not all messages are covered; it is tedious to create every single case of valid data. However, if we are using CQRS+ES and specifications for testing, there is a better way. We just need to scan the unit test assembly for all specifications available. Grab them, but do not execute. Instead, just aggregate all the messages that were ever passed into Given , When or Expect . Voila, you already have a few populated message objects for each contract. Just round-trip them through your favorite serializers to see compatibility issues in advance. 
If you are using NUnit, then a little bit of magic with TestCaseData can wire these message collections into test cases, and the comparison code can be as simple as: [TestCaseSource(\"ListMessages\")] public void GoogleProtoBuf ( IEnumerable < object > msgs ) { foreach ( var exp in msgs ) { var actual = Serializer . DeepClone ( exp ); var compare = CompareObjects . FindDifferences ( exp , actual ); if ( ! string . IsNullOrWhiteSpace ( compare )) { Assert . Fail ( compare ); } } } Where CompareObjects is just a plug into CompareObjects.NET (many thanks to Yves Reynhout for pointing out this project). Low-Friction Event Comparison There are multiple approaches to comparing events while validating specifications. I usually perform a deep comparison of event objects by using CompareObjects.NET. If the events are different, then the actual member-by-member diff result might look too complicated for a human being to understand with a single glance: // TODO: put compare output snippet. What we can do is to replace this output with ToString representations, which usually have a human-readable string.Format in them. // TODO: sample output of ToString compare However, it could happen that both string representations are equivalent (i.e. there are actually no ToString overloads, or the difference is detected in a field that is not present within the string). Then we need to fall back to the usual diff produced by CompareObjects . If you find yourself seeing a lot of member-by-member diffs, then this might be a symptom that your text representations of events could be refactored to reflect the events better. Redirecting Context.Explain In the document on writing plain AR+ES classes we've mentioned the use of Context.Explain to capture and annotate non-trivial business decisions. We can actually capture that information while running our specifications. This will make our specifications more understandable for developers (while debugging and fixing tests) and for business people (while reviewing specifications). // TODO: include snippet of the same test with explain output Obviously, if you are using some logger (as a framework or an abstraction), you can achieve the same with slightly more complicated code. Just make sure that running tests in parallel will not intermix their outputs."}
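The TestCaseData wiring itself is not shown in the post. Here is one hedged way the ListMessages source could be implemented, assuming specifications are public fields shaped like the Specification sketch earlier and that fixture classes have parameterless constructors:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using NUnit.Framework;

public static class SpecificationScanner
{
    // Collect every message mentioned in Given, When or Expect across
    // all specifications in the test assembly - without executing them -
    // and hand them to NUnit grouped by contract type.
    public static IEnumerable<TestCaseData> ListMessages()
    {
        var messages = new List<object>();
        foreach (var type in Assembly.GetExecutingAssembly().GetTypes())
        {
            var fields = type.GetFields()
                .Where(f => f.FieldType == typeof(Specification))
                .ToArray();
            if (fields.Length == 0) continue;

            var fixture = Activator.CreateInstance(type);
            foreach (var field in fields)
            {
                var spec = (Specification)field.GetValue(fixture);
                messages.AddRange(spec.Given);
                if (spec.When != null) messages.Add(spec.When);
                messages.AddRange(spec.Expect);
            }
        }
        foreach (var group in messages.GroupBy(m => m.GetType()))
            yield return new TestCaseData((object)group.ToArray()).SetName(group.Key.Name);
    }
}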
{"url":"/post/event-sourcing-versioning/","title":"Event Sourcing - Versioning","text":"Event Sourcing - Versioning A while ago we've started talking about the basics of event sourcing and reasons when to use it (or not to use it). Let's proceed to the most frequent question related to this subject: How exactly do we handle changes in event contracts: renaming fields, adding new members or even performing a complete refactoring. Contracts Event contract is simply a definition of event, that could explain how to serialize some object into a transportable data structure and then rehydrate the object back. Usually contracts are defined as POCO classes. For example, here's one option: [DataContract] public partial class RequestPaypalPayment : ICommand < PaymentId > { [DataMember] public PaymentId Id { get ; set ; } [DataMember] public InvoiceDefinition Definition { get ; set ; } [DataMember] public Email Billing { get ; set ; } } Logical meaning behind the contract (how the name is chosen, what fields are used etc) comes from the world of Domain-Driven Design, applied to CQRS and Event sourcing. The problem comes from the fact, that it's nearly impossible to get these contracts right from the first attempt in building a system. And even if you manage to, there is a significant chance that our own perception of domain model will change, creating the need to adjust the contract as well. So contracts change . These changes can break our ability to read previously saved events (and previous history can easily go years back), or cause logical data corruption. Let's go over various types of changes in contracts and ways to handle them. Minor Corrections The easiest kind of change to event contracts deals with correcting typos and changing naming of class members. All these are completely painless and transparent, if your contract serializer is evolution-friendly. For instance, in you can use Google ProtoBuf Serializer, which does not care about names, relying on integral tags instead. In .NET world POCO contract would look like: [ProtoContract] public partial class RequestPaypalPayment : ICommand<PaymentId> { [ProtoMember(1)] public PaymentId Id { get; set; } [ProtoMember(2)] public InvoiceDefinition Definition { get; set; } [ProtoMember(3)] public Email Billing { get; set; } } By the way, there is an interesting trick with this serializer. Instead of using attributes from ProtoBuf.dll , you can leverage attributes of DataContractSerializer , which is a part of .NET BCL. This will make your contract libraries decoupled from non-BCL code (reducing change of dependency hell). [DataContract] public partial class RequestPaypalPayment : ICommand<PaymentId> { [DataMember(Order = 1)] public PaymentId Id { get ; set; } [DataMember(Order = 2)] public InvoiceDefinition Definition { get ; set; } [DataMember(Order = 3)] public Email Billing { get ; set; } } From now on, when we will be talking about event contracts, simplified contract definition syntax will be used. For instance the above contract could be represented as: PaypalPaymentReceived(PaymentId id, string paymentId, invoice, account, amount) This explanatory syntax is not only more compact, but it could also be used to build actual contract classes for any platform, given a DSL parser. Alternatively, you can just pass them to a Junior Developer and he'll be able to code them in. So the first type of changes to event contract deals with minor corrections (renaming members). Good serializer will handle this natively. 
If you don't have such a serializer, or the change is deeper - you can use an in-memory upgrader, discussed below. Enriching Events A more common versioning scenario with event contracts involves enriching events by adding new fields. Consider the following event: UserDisabled(Guid id, string reason) Imagine we have a system in production that uses this event. However, at some point a request comes from the business to create a web view that would list the names of disabled users for each account. There are multiple approaches to solve this problem. Some of them involve writing complex read-models (capturing a wide variety of events in order to persist some additional info). However, it would be much easier if we were to publish this event: UserDisabled(Guid id, string reason, string name) Given this definition, writing a view handler would be much easier. It is also quite easy to start publishing these events from the domain model. However, if we were to replay history to rebuild views, we would have events with a null name. That's where domain in-memory upgraders come to the rescue. They are responsible for encapsulating this upgrade logic. Upgraders are defined as implementations of this interface: public interface IUpgradeDomainEvents { IEnumerable < IDomainEvent > Upgrade ( IDomainEvent @event , string messageId ); DateTime DefinedOn { get ; } } Where DefinedOn is the hard-coded date on which this upgrader was implemented (this is used to order upgraders within the chain). The actual upgrade method returns an IEnumerable to handle cases when a single event is split into multiple events, or when we are compressing multiple events into one. Implementations of these upgrade methods are usually located and distributed within the contracts library (so they are part of the schema). Any code (or a node in a distributed system) that stores history events and reprocesses them later could easily locate and load these upgraders via simple reflection. A potential implementation might look like: public class EnrichUserDisabled : IUpgradeDomainEvents { readonly IDictionary < Guid , string > _userNames = … public DateTime DefinedOn { get { return new DateTime ( 2011 , 10 , 4 ); } } public IEnumerable < IDomainEvent > Upgrade ( IDomainEvent e , string id ) { // if this is an event with the info , capture it . var named = e as UserNamed ; if ( null != named ) { _userNames [ named . Id ] = named . Name ; yield return e ; yield break ; } // if this is the target event - enrich it var d = e as UserDisabled ; if ( null != d ) { if ( string . IsNullOrEmpty ( d . Name )) { yield return new UserDisabled ( d . Id , d . Reason , _userNames [ d . Id ]); yield break ; } } // otherwise just pass it through yield return e ; } } Please keep in mind that upgraders are just a convenience shortcut that prevents views (or any other event consumers) from being \"contaminated\" by logic to handle versioning of event contracts and their upgrades. Since upgraders are a shortcut, we can discard them altogether. This is done by taking the existing event streams, running them through the upgraders and overwriting the source streams with the results. Deep refactoring There are scenarios when an event has to be replaced completely or even split into multiple events. For example, after evolving our system, we might find out that AccountRegisteredEvent now becomes three distinct events: AccountCreatedEvent; UserAddedEvent; UserActivatedEvent. IUpgradeDomainEvents can handle such cases as well. 
You just return three different events in this case: yield return new AccountCreatedEvent ( e . AccountId , ... yield return new UserAddedEvent ( userId , e . AccountId , ... yield return new UserActivatedEvent ( userId , e . AccountId , ... The upgrader approach can also be used to merge events , if really necessary. You just need to push the first event onto a stack when it comes in, and then replace the second event with the merged result."}
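To show how the DefinedOn ordering and the IEnumerable return type play together, here is a sketch of pushing a stream through a chain of upgraders. The chain-building helper is my illustration (the interfaces come from the post); a real system would also thread the message id through instead of passing null:

using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class UpgraderChain
{
    // Locate all upgraders in the contracts assembly, order them by the
    // date they were defined, then run the stream through each in turn.
    public static IEnumerable<IDomainEvent> UpgradeStream(
        Assembly contracts, IEnumerable<IDomainEvent> stream)
    {
        var upgraders = contracts.GetTypes()
            .Where(t => typeof(IUpgradeDomainEvents).IsAssignableFrom(t)
                        && !t.IsAbstract && !t.IsInterface)
            .Select(t => (IUpgradeDomainEvents)System.Activator.CreateInstance(t))
            .OrderBy(u => u.DefinedOn);

        var events = stream;
        foreach (var upgrader in upgraders)
        {
            var u = upgrader; // copy to avoid capturing the loop variable
            events = events.SelectMany(e => u.Upgrade(e, null)); // message id omitted in this sketch
        }
        return events;
    }
}

Because each upgrader maps one event to zero, one or many events, splitting, merging and enriching all fall out of the same SelectMany composition.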
{"url":"/post/event-sourcing-why/","title":"Why Event Sourcing?","text":"Why Event Sourcing? Event Sourcing is a concept that becomes increasingly popular day by day. Even ThoughtWorks has brought it into it's latest Technology Radar . Let's do a quick overview of ES one more time. Why Event Sourcing? In essence event sourcing is about persisting data in a way that preserves every single bit of information . It's about representing objects as a sequence of events that took place through the time and led them to the current state. For instance, if I were to persist information about my pocket money (i.e.: 67 EUR), I could simply save the latest state somewhere in a variable or database: Balance : 100 EUR Now, whenever there is a change, we would overwrite this value with the new value (discarding the previous one). Then at some point in time we will have something like this: Balance : 67 EUR Simple and elegant (and works perfectly in a large number of scenarios). However, we are performing a logical compression here (lossy one) and discarding some information. Let's see what would happen if we were to preserve all the changes : Got from ATM: 100 EUR Bought metro tickets: -12 EUR Grabbed a lunch: -8 EUR Found a coin: 1 EUR Took taxi: -14 EUR Obviously, if we have such a sequence of events, we can always \"reconstruct\" the current balance, by doing a total: Balance : 100 - 12 - 8 + 1 - 14 = 67 EUR In essence, the final state (Balance) is a left-fold function of the previous states (equivalent of IEnumerable.Aggregate in .NET, std::accumulate in C++ or array.reduce in JavaScript). Now, you might ask yourself a question - What's the point in storing all these intermediate steps, when you can just save the final balance? This way of persistence via event sourcing possesses some really interesting features. Below are some of these. By the way, if you are asking yourself about the performance of such an approach to storage, don't worry. It can easily beat relational databases in both scalability and throughput (ceteris paribus). Storage Ignorance It is really easy to save data as a stream of events. All we need is to define POCO ( Plain Old CLR Object ) classes (one for each event) and then serialize them to one of many formats available: Google ProtoBuf, JSON, Binary, XML etc. Now, before you say that it takes to much code to define events: GotMoneyFromAtm! (amount, transaction, time) BoughtMetroTickets! (count, amount, machine, time) GrabbedALunch! (amount, cost, time, menu, place) FoundACoin! (amount, gps, time) TookTaxi! (amount, rideDuration, taxiCompany, route, time) That's how the definitions could look like for C# language, if you used some T4 of Visual Studio (see code contracts explanation). Flexibility Given the sequence of events, we can project them to any desired structural representation. This is an extremely important feature. For instance, we could write a projection that would summarize all our expenses and produce the latest balance. However, we can do much more: Generate list of cities, in which coins can be found most often. Get a list of taxi companies that are the cheapest ones or fastest ones. List TOP 5 favorite places for having a lunch on Monday. What's more interesting, we don't need any really complex queries in order to do that. Writing event projections (at least in C#) is something that is quire boring. Try doing that if you have just a single field of Balance, or even if you had a list of changes (credit/debit). 
However, as long as you have a stream of events, you can project it to any form, even a conventional SQL database. Messaging Capabilities Events are serializable and immutable data structures that are appended to an append-only stream. As such, they share all the capabilities of messages. So we can: Multiplex them to multiple subscribers (i.e.: have a list of the 5 last restaurants visited by Lady Gaga, while keeping that list up-to-date and continuously replicated around the globe to load balance the pressure of fans). Implement redundancy and reliability (i.e.: write events to 3 different locations at once, where we have 1 master and 2 slaves). Support load-balancing (competing consumers) and parallel processing. Support incremental sync that fits in a few lines of code. Here are these lines of code from a production system (we are replicating from remote to cache ): var next = _cache . GetCurrentVersion (); while ( true ) { var items = _remote . ReadRecords ( next , BatchSize ); if ( items . Length == 0 ) break ; next = items . Max ( m => m . Version ); _cache . AppendNonAtomic ( items ); logger ( string . Format ( \"Loaded {0} records\" , items . Length )); } Of course, in a more conventional system (that does not employ event sourcing) you can leverage something like SQL Replication or the Microsoft Sync Framework . Improve performance Truth be told, the performance and scalability aspects are by-products of the inherent capabilities offered by the event sourcing approach. In essence, we can get almost-infinite scalability on reads with blazing throughput and no deadlocks. All this is attributed to the following facts: events can be published as soon as they happen; it is dead-easy to precompute read models from the events (do a projection to some desired state); events can be processed in parallel; events can be multiplexed to multiple subscribers, where each one could maintain its own copy of some read model. What is a read model? \"Balance\" is one sample of a read model; \"List of Gaga's favorite restaurants\" is another one. Essentially a read model is some view (a precomputed result of an SQL query in the SQL world). Since we have more flexibility with projecting events and passing them around, we can easily do more interesting things, reaching up to the speeds of LMAX (which was described by Martin Fowler ): Keep read models (precomputed query results) directly in memory in some MemCached instances (if a server goes down, we can always repopulate the cache from the history of events). Run multiple processing units in parallel (if any one goes down - switch to the backup). Benefit from the real-time signal processing knowledge of robotics and critical systems (imagine what an FPGA / ASIC optimized for parallel processing of events could do). Simplify Developer's Life There are a few more interesting aspects of event sourcing: Simplifies deployment and maintenance (less SQL, fewer upgrade scripts and less versioning). Reduces expenses on both hardware and software (no need to have extremely powerful and redundant servers or commercial databases). Integration between systems is more straightforward (all Enterprise Integration Patterns apply here directly). Since no data is ever lost, we gain a full audit trail (plus the ability to go back to any point in time) and excellent debugging capabilities. 
Event Sourcing is a natural fit for software developers that want to capture the essence of business domains (especially the most complex ones) while staying decoupled from platform complexity or hard performance issues. ES approaches help to provide clear answers to some of the new challenges brought to us by the market and technology: Cloud Computing, Big Data processing, Mobile and occasionally connected systems, real-time business intelligence. There are also some financial and political benefits for project stakeholders to be interested in. They all revolve around the ability to have better flexibility in project delivery, managing resources and risks. The ability to keep things simple, defer important decisions and adapt business solutions can be a powerful enabler in large conservative organizations. Smaller companies (such as lean startups) can also gain more competitive advantage and reduce time-to-market with such approaches. However, here we are already getting into the area of synergy effects with CQRS/DDD methodologies and their practical application to distributed environments (esp. clouds). This is a topic for a different blog post or a talk. Downsides Obviously, Event Sourcing is not a silver bullet ; it is just a different way to think about and represent changes and data. If you are a C# or C++ developer, then this feels like going back to assembler. If you are a project manager - it's like consciously going back from Microsoft Project Server to task lists and custom budgeting software. This explains why there are quite a few problems with this approach. Defining these events is a complex art of its own, which requires skills in domain modeling (hint: if you have a lot of events with the following words in their names, then you are doing something wrong: Create, Insert, Update, Delete, Set, Change, Add). Domain-Driven Design (as both a book and a body of knowledge) is an entry point into this skill. There is little software and hardware to support event sourcing . Luckily, we need much less of that (as compared to SQL/NoSQL), but still. In the next few years we will see interesting solutions in this field. For the time being, there is an even bigger lack of information and guidance on this body of knowledge (to be fixed within the next year). Since we have limited information, acceptance and software, naturally there is a limited number of experienced developers with true DDD/ES skills. All these downsides are quite surprising, since the actual principles behind event sourcing are extremely old; they have been discovered and applied in multiple areas over and over again. Even SQL replication (transaction logs) uses similar principles. Concerns There are a few additional concerns that might look like downsides of the approach, but in fact are not that important. Extra storage costs - usually negligible when compared to the business value that might be created. For instance, the cost of storing 200k events in the cloud is roughly 10 cents per month. Oh, and I've counted this one 10 times over, just for the sake of having 10 replicas in different data centers for redundancy. If this negligible cost saved me even a few days of development, it would be a bargain. However, event sourcing saved much more than that. Slower performance is not an issue, since we can optimize IO via snapshotting and persistent read models. And leveraging the push-based nature of events, we can get caches that are invalidated immediately. 
In short, there are multiple technical solutions that could be plugged in later , if such a need arises. Fragility (losing an event in the past causes the entire stream to be corrupt) is not an issue, since you can determine yourself the levels of SLAs to go for (via replication and redundancy). Corruption in any single replica can be reliably detected using git's approach: each event includes a SHA1 signature computed against its contents and the signature of the previous event. Versioning is sometimes perceived as a problem, since our systems tend to grow and event contracts (schemas) can gradually evolve to new formats that are incompatible with the old saved events. Yet, if approached consciously, this can be solved (and the solution is more elegant and simple than SQL migration scripts). I use a combination of 3 elements here: Using a serializer that is evolution-friendly (e.g.: Google ProtocolBuffers is not only among the fastest serializers, it also allows you to rename class members without breaking anything). Using in-memory upgraders, which are simple classes stacked on top of the event streams for replays and accessing history. They can split, merge, convert events or even fill in missing fields (if this data is available in the other events). Upgraders are just a lazy and quick replacement for rewriting the entire event stream history completely. Using careful event modeling (which is usually based on the Ubiquitous Language coming from DDD) to define events in a way that they will rarely change a lot (they will be based on concepts that do not change, no matter how their relations evolve). Further Reading If I haven't scared you enough with the downsides of Event Sourcing, here is some further reading. CQRS Info by Greg Young : Events as a storage mechanism Building Event Storage CQRS and Event Sourcing A few of my own articles : Event Sourcing a la Lokad Scenario-based Testing for Event Sourcing (an old article, to be replaced by a proper one on BDD testing). Some more materials : Event Sourcing and CQRS, Let's use it by Jérémie Chassaing (recommended read for C# guys) Really nice EventStore framework for .NET by Jonathan Oliver. Listen to the distributed podcast . Stories As nicely said by Mike Nichols in the DDD/CQRS mailing list: My experience has been that ES and its promotion of business semantics over technological terms has a way of bending my mind toward modeling behaviors rather than essence. It also lets me avoid the ceremony of modeling state that doesn't contribute to behaviors. I see this as a good thing and as a side effect I probably use language the business person understands more. I just can't find a reason to use ORM in my domain anymore. ES seems to let me more rapidly model ... having a change log/audit trail is about the furthest thing from my mind when I reach for it."}
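To make the in-memory upgrader idea concrete, here is a minimal sketch under stated assumptions: the event types (CustomerCreatedV1/V2) and the interface are illustrative, not the actual classes from any of my projects.

using System.Collections.Generic;

public sealed class CustomerCreatedV1 { public long Id; public string Name; }
public sealed class CustomerCreatedV2 { public long Id; public string Name; public string Region; }

public interface IEventUpgrader
{
    IEnumerable<object> Upgrade(object evt);
}

public sealed class CustomerCreatedUpgrader : IEventUpgrader
{
    public IEnumerable<object> Upgrade(object evt)
    {
        var old = evt as CustomerCreatedV1;
        if (old == null)
        {
            yield return evt; // pass all other events through unchanged
        }
        else
        {
            // fill in the field that did not exist in the old contract
            yield return new CustomerCreatedV2 { Id = old.Id, Name = old.Name, Region = "EU" };
        }
    }
}

During a replay, the stream is simply piped through the stack of upgraders (e.g. events.SelectMany(e => upgrader.Upgrade(e))), so the persisted history never has to be rewritten.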
{"url":"/post/example-of-self-documenting-unit-test-with-event-sourcing/","title":"Example of Self-documenting Unit Test with Event Sourcing","text":"Example of Self-documenting Unit Test with Event Sourcing One of the biggest advantages of event sourcing approach is it's inherent capability to turn unit tests into a living documentation . Below is an example of specification that I've worked my way through today (that's how NUnit prints it out). registrations: duplicate email fails - Passed Environment: index includes email(\"contact@lokad.com\") When: Register 'd6e64e': Customer Name: Lokad Contact Email: contact@lokad.com Date: 2011-12-27 Expectations: [Passed] Registration 'd6e64e': Customer Name: Lokad Contact Email: contact@lokad.com Date: 2011-12-27 [Passed] Registration 'd6e64e' failed: Email 'contact@lokad.com' is already taken. And here's how actual NUnit code looks like: public Specification duplicate_email_fails() { var info = new RegistrationInfoBuilder(\"contact@lokad.com\", \"Lokad\").Build(); var index = new MockUniquenessService(); return new RegistrationSpec(index) { Before = {() => index.includes_email(\"contact@lokad.com\")}, When = new CreateRegistration(reg, info), Expect = { new RegistrationCreated(reg, info), new RegistrationFailed(reg, new[] { \"Email 'contact@lokad.com' is already taken.\" }) }, Finally = index.Clear }; } All was achieved without any special magic or even fancy tools. I've just pulled over sources of SimpleTesting and CompareObjects for additional readability. For those who are interested in RegistrationSpec class, it is just a simple snippet wiring together dependencies of aggregate root to a strongly-typed specification deriving from TypedSpecification in SimpleTesting : public sealed class RegistrationSpec : AggregateSpecification<RegistrationId> { public RegistrationSpec(IRegistrationUniquenessService service) { Factory = (events, observer) => { var state = new RegistrationAggregateState(events); return new RegistrationAggregate(state, observer, service, new TestPassword(), new TestIdentity()); }; } } Explicit strong-typing of aggregates (as described in bliki ) works all the way back in unit test specification by allowing to benefit from compiler-time checking and IntelliSense support. In other words: you don't need to navigate through hundreds of messages to figure out which ones are actually applicable in test."}
{"url":"/post/experience-with-cqrsesddd-in-current-project/","title":"Experience with CQRS+ES/DDD in Current Project","text":"Experience with CQRS+ES/DDD in Current Project Just a quick summary of tech I'm using in the current project. So far it works out rather nicely. Project targets Windows Azure . However, just like our other recent projects at Lokad, it is developed to be cloud-ignorant (can be deployed anywhere where Microsoft .NET can be run). Development machine : MacBook Air 2011 13' with intel i5 CPU (1.7Mhz), 4GB RAM and 128 GB SSD. OSX (Lion) with Windows 7 running in Parallels. I used to boot it in BootCamp (for native performance) a few times and then switched to pure VM mode. Development IDE : Visual Studio 2010 with ReSharper 6 (solution-wide analysis is OFF). Windows Azure SDK is installed (with Emulator and Storage) However it's not used at this project at all, due to being rather inefficient with resources of my MBA. File system is used for both queues and storage instead. In production they will be reconfigured to use Azure-specific adapters (abstractions provided by Lokad.Cqrs). Architecture : Distributed system with \"CQRS approach\" where aggregates are implemented using Event Sourcing . Business processes are stateless routing rules (as opposed to more classical sagas), with more complex workflow logic and state being pushed to aggregates or dedicated entities. The latter makes perfect sense with event sourcing and deployments in cloud environments. Read-models are implemented using document model for now (each view is a single document being updated atomically). Still no SQL. Huge lists will be handled on per-platform basis (if they ever become a problem worth solving). Frameworks : Domain logic (the most complex part) does not really depend on anything except Base Class library (System. namespace). Infrastructure (essentially, configuration code) relies on ProtoBuf/ServiceStack for serialization and Lokad.CQRS with Autofac for hosting everything server-side. Web UI - ASP.NET. Testing : NUnit used to wire custom specifications (code copied and adjusted from SimpleTesting). Custom tools : using side helper to generate command/event contract classes (got lots of them) out of compact DSL. C# code is generated whenever I do Ctrl-S and is immediately picked up by R# (this is slightly improved version if T4-driven CodeDSL. Improvement - it no longer depends on T4 and hence is much easier to change. Actual DSL generation is completely within the project now. details are in gist . This custom tool can be dropped at any point (completely along with the DSL files), contract files will stay. What makes development complex : it's hard to get domain models and bounded contexts right (however coding or testing itself is no longer an issue. Scalability and performance - even more so); hard to communicate finer-grain detail to people used to SQL/ORM; I still haven't figured how to explain (graph) business processes. So far sequence diagrams work the best; Event sourcing plumbing had to be implemented manually (EventStore was too much of a dependency, something more simple was needed). For the record, this project is a replacement of the project that was using following stack: NHibernate + FluentNHibernate Newtonsoft JSON Autofac Windows Azure Microsoft SQL Server Protobuf Lokad Shared Libraries Migrating existing system to ES requires an additional custom \"Reverse Engineering\" app, which scans database and generates events for the new deployment (roughly 300000 events). 
Best lessons learned : NHibernate and ORMs in general are a pain (especially in systems that have to live through more than 1 deployment). Although we are so used to that pain that it does not register. Well, until you have to go back and work on projects that have not been migrated yet. Specification tests (and the ability to use them as living documentation readable by non-dev people) create some interesting additional opportunities. The simpler the code is, the better. Abstraction from the persistence not only reduces friction, but also allows us to have multiple deployment options (in the cloud, on-premises, in memory etc). If the domain model is defined (i.e.: there is a BC and specifications for an aggregate), then actually coding an aggregate root with event sourcing (AR+ES) feels like monkey coding - you just need to make all tests pass. Helland's principles (of building distributed systems in an eventually consistent world) can be enforced in the CQRS/DDD world, if we get rid of sagas in the classical interpretation (something that has state, subscribes to events and publishes commands), replacing them with stateless routing rules and pushing state to entities (aggregates). The underlying framework becomes much simpler (no need to explicitly manage and correlate saga state). Dynamic scalability - too. I'm using Event Sourcing even for simple things (like replicated user logins), simply because introducing SQL (or any other storage) would add too much complexity and friction without any visible gains. Using specifications for testing AR+ES has another side advantage - I added a single unit test that gathers pre-populated commands and events from these specifications. This data is round-tripped through a few serializers and compared (a sketch of this test follows below). This ensures that message contracts (and their use) stay compatible with all primary serialization options (ProtoBuf, JSON). It helps to catch DateTimeKind issues (DateTimeKind is generally not supported by JSON+PB). Obviously, this is a highly subjective experience (and a really rewarding one). It might change as we are pushing the project to the first release (and further). Your mileage may vary. If you are really interested in long-term production experience - please ask me about that in a few years . PS: This is not the first project where ES has been used. But it combines the best lessons learned in lean development at Lokad ."}
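Here is a rough sketch of that round-trip test, assuming NUnit, Json.NET and protobuf-net; the helper that collects the sample messages from the specifications (SpecSamples.CollectMessages) is hypothetical and named here only for illustration, and the contracts are assumed to carry the ProtoBuf attributes.

using System.IO;
using NUnit.Framework;

[TestFixture]
public sealed class ContractRoundTripTests
{
    [Test]
    public void Contracts_survive_all_serializers()
    {
        // hypothetical helper: gathers commands/events pre-populated in specifications
        foreach (var message in SpecSamples.CollectMessages())
        {
            var json = Newtonsoft.Json.JsonConvert.SerializeObject(message);
            var viaJson = Newtonsoft.Json.JsonConvert.DeserializeObject(json, message.GetType());
            Assert.AreEqual(message, viaJson, "JSON round-trip broke " + message.GetType());

            using (var stream = new MemoryStream())
            {
                ProtoBuf.Serializer.NonGeneric.Serialize(stream, message);
                stream.Position = 0;
                var viaProto = ProtoBuf.Serializer.NonGeneric.Deserialize(message.GetType(), stream);
                Assert.AreEqual(message, viaProto, "ProtoBuf round-trip broke " + message.GetType());
            }
        }
    }
}

Note that Assert.AreEqual relies on the messages having value equality; the original setup used CompareObjects for the deep comparison instead.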
{"url":"/post/from-net-to-erlang-starting-work-with-large-free-dating-webs/","title":"From .NET to Erlang - Starting Work with Large Free Dating Website","text":"From .NET to Erlang - Starting Work with Large Free Dating Website 10 years spent in .NET were fun, but it is time to try something radically different . As you probably already know, I'm moving to a new project at HappyPancake - large free online dating website (largest one in the North Europe, if I'm not mistaken). I'll be working with outstanding people there: Tomas Roos and Tom Janssens . I'm a completely newbie in this development stack (summarised by Tom): Migrating a large free dating website from the Microsoft stack to Erlang/Linux... Some keywords: - Programmer anarchy!!! - Erlang/Python/C/ - FoundationDB / Phonegap / flightjs / responsive design - Chatroom devops - Remoting - CQRS / DDD / Eventstorming - Wercker / DigitalOcean / Github / BrowserStack / PhantomJs / Behave - Campfire / Mindmup / Google docs / Dropbox / Google Hangout - And many more! I hope to be learning fast enough to keep up with these guys, so that they will not kick me out of the team due to the sheer stupidity, of which I'm guilty on more than one occasion. So far the experience has been truly amazing. We're going to have RnD blog for this project . I'm still going to stay with Lokad in a support role, however my time spent there is going to be limited. The company needs somebody to take on some of my responsibilities. If you are an experienced .NET developer with desire to get immersed into Windows Azure devops, Big Data, Business Analytics and event sourcing, there is an opportunity for you in Paris (I'll try to help as much as I can in this process). Please get in touch via contact@lokad.com ."}
{"url":"/post/get-better-understanding/","title":"Get Better Understanding","text":"Get Better Understanding I know BizTalk and would not hesistate to use it. Enterprise Architect As it seems, a lot of problems in software domain stem from a single cause - we don't truly understand a lot of things and fail to acknowledge that. This leads to complex designs and half-baked solutions. How many times have you encountered a marvelous new technology that has promised you better life and all the riches in this world? How many times, have you discarded this technology after a few days or years of trying? What about Aspect Oriented Programming, Hibernate or tools that calculate code metrics? I believe, this highlights one of the most essential problems we have in the development today: we try to substitute true understanding of something with smart technologies and tools that promise to take care of us and our tasks at hand. In fact, they can, but their applicability is more limited than it is advertised. And they can lead you into really dark valleys, while trying to hide some of their inherent issues. Do these sound familiar to you: SQL upgrade scripts, WCF configuration code, NServiceBus endpoint management or making your favorite IoC container work with some framework. If you want decent results a tool can help. However, if you want to push your limits beyond the limitations of your tools and technologies - go get your hands dirty. What I'm trying to encourage you to do - to dare to try and get rid of some tech or tool in your next project (whether it is work or a hobby). It could be something like IoC Container, service bus framework, ORM tool or entire storage engine stack. It certainly would hurt, just like any exercise does. But you might also find some deeper understanding there. The one that would let you push your limits beyond the limitations of these tools and technologies."}
{"url":"/post/get-real-and-move-forward/","title":"Get Real and Move Forward","text":"Get Real and Move Forward These are a few interesting facts about Lokad, that you might find entertaining. In last 10 days team of 3 devs at Lokad delivered 9 official releases. This covers 4 different projects developed in parallel. Tenth release goes live tomorrow: an upgrade of Salescast. It brings online final elements of infrastructure needed to deliver quantile forecasts that represent major breakthrough in inventory optimization done by our amazing analytics R&D team. As you know, one of our customers managed to save million of euro at monthly cost of 150 EUR, and this was before quantiles surfaced. All this business intelligence is cloud-capable from the start and keeps on getting further reductions in development friction, risks and wasted efforts. We just keep on moving forward faster, adapting to changing environment on the go. This acceleration happens not via the increase of development teams but by working smarter. While doing that we don't hide everything behind the curtains but rather share as much secrets as possible with the community (check our open source projects ). This is rewarding in many ways. When I hear about development teams and companies that spend time moving at slower pace than that - it does not feel right. We are living in amazing times, where speed of change keeps on accelerating, as new disruptive technologies come into play. Cloud computing, mobile devices, personalized large-scale data processing, real-time intelligence and global socialization were unheard of by our parents and now they are a reality. Can you guess what things your kids would play with? I can't even imagine, but I can try to help building that different future. It will come fast. So get out of greenhouses with slow monthly releases and reality distortion field of ivory-tower architects and unlimited budgets. These walls are not going to hold on for much longer; things like \"economical crisis\", \"technology changes\", \"political reforms\" and \"new market players\" are probably already breaking through into previous safe heavens. Get out of the house and go play outside. Find things that you enjoy and push them to their limits. You can work from the ocean beach at Bali, travel around the cities for beers with friends or just explore the world along the way. This would also help to fuel inspiration and motivation on the go. Get real. Challenge authorities and people who dare to tell you what is wrong and what is right without proving their words by practice. Even if they prove - still challenge them, you might be able to do better. You should be able to do better - that's how humanity pushes forward state of the art. It's also fun."}
{"url":"/post/getting-rid-of-cqrs-view-rebuilds/","title":"Getting Rid of CQRS View Rebuilds","text":"Getting Rid of CQRS View Rebuilds We all know that one of the coolest reasons of event-centric architectures is: you can always discard your persistent read models and rebuild them from scratch , optimizing for the specific UI. For instance you can replace that paginated list of users in your system with a list of the most active users along with their latest activities. You can also leverage this capability to change underlying technologies, introduce replication or in-memory caching. We've been using this approach a lot recently, even created an in-house tool for that. It discovers projections in the code (projection is a class that is responsible for transforming an event stream to a specific type of view) and runs an event stream through them. While doing that, first we save all views in memory (just makes things faster, especially when you have hundreds of thousands of events). When the entire rebuild is complete, we \"dump\" in-memory views to the actual persistent storage, overwriting the old version of views. For .NET developers we've published a sample source code for that in Lokad.CQRS Sample Project (you are interested in 'Audit'). Let's say, I have an event stream from production and want to see how my updated views will look like in local environment with the brand new UI: Start Audit tool. Point it to the event stream for the specific bounded context. Open views tab and pick the views I want to rebuild (more views to rebuild - longer it takes). Wait for the rebuild to finish copy materialized views from temp folder to the storage Start system. This sounds good and useful, isn't it? Now you can let your UI developers go wild, developing various denormalized view models, that could be used for really non-conventional views, while staying extremely responsive. However, there is a catch. Such functionality is a great enabler (it shortens UI-focused iterations), and its get's pushed to the extreme. Extreme is: you iterate UIs and and underlying views so often, that manually rebuilding views become tedious and time consuming. If I were to rate ORM+SQL experience on that tediousness scale, I would need a logarithmic one. So a few days ago I went ahead and pushed further the idea of self managing projection servers. I discarded rebuild functionality completely in one of our projects. Instead, server is tasked with duty of tracking his own projections and rebuilding only the ones that have changed. In normal production this would happen only once after a deployment that touches projections. In rapid development this happens whenever I edit some projections and restart the server. It will: Check if views exist. And if they don't - rebuild in memory and update before starting to process messages. If views exist - check if projection code have changed since views were touched last time. If they have - rebuild corresponding views in memory and replace them in the stores. Obviously, server does not process any new events, while rebuilding views. So I no longer have to worry about updating views either in production or in development - server tracks this for me. This is a major simplification. And since server rebuild only projections that have changed - it is always as fast as possible. How do we detect if projection code has changed? 
The answer is simple: we cheat by teaching the server how to decompile itself with help from the awesome Mono.Cecil (this is a trick I've learned from Greg and Svein and their awesome work on MightyMoose ). While doing that, to keep everything simple, we assume that no messy IoC containers are used and we explicitly wire and instantiate our projections the stupid way . Given that, we just need to generate a decompiled snapshot of a projection from the very code that is being executed (while discarding IL offsets) and store it whenever we generate views. A decompiled snapshot is just a list of instructions: Void DomainIndexProjection::.ctor(IAtomicWriter`2<unit,DomainIdentityVector>) IL_0000: ldarg.0 IL_0000: call Void Object::.ctor() IL_0000: nop IL_0000: nop IL_0000: ldarg.0 IL_0000: ldarg.1 IL_0000: stfld IAtomicWriter`2<unit,DomainIdentityVector> DomainIndexProjection::_writer IL_0000: nop IL_0000: ret Developers that use the Lokad.CQRS Sample project for their dark production purposes would immediately notice familiar class names. Actually, the simplest code to generate something like this can start as: var builder = new StringBuilder(); foreach (var @class in projections.OrderBy(p => p.GetType().Name)) { var type = @class.GetType(); builder.AppendLine(type.ToString()); var typeDefinition = mod.GetType(type.FullName); foreach (var md in typeDefinition.Methods.OrderBy(m => m.ToString())) { builder.AppendLine(\" \" + md); foreach (var instruction in md.Body.Instructions) { // we don't care about offsets instruction.Offset = 0; builder.AppendLine(\" \" + instruction); } } } return builder.ToString(); So we just generate our representation of the executable code of the projections and compare it with the existing one. If some projections don't match - they need to be upgraded. In practice it's a bit more complicated, since I have to walk down the referenced classes in order to grab things expressed inside referenced anonymous methods. Once the approach is stable, I'll push it to the Lokad.CQRS Sample. Once you have this thing delegated to the server, life suddenly becomes more simple and practical. Both the development and deployment experience are improved. There is no need even to track which projections have changed within the iteration, since the server can do this better. This becomes especially useful when you need to deploy to the cloud. [0001070]: Engine ready. Running rebuild [0001071]: Projections in hub-domain-view are up-to-date [0001109]: Projections in hub-client-view are up-to-date [0001112]: Projections in hub-apiops-view are up-to-date [0006684]: Started Version - 2572 (Hub.Engine) Now if I could only figure out a dead-simple way to automate VIP swaps between projection versions, so that they could be upgraded with no downtime at all… PS: Yes, I did use the decompiled snapshot of a projection .ctor in this listing. Real methods are slightly longer and wouldn't make a big difference."}
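To make the snapshot comparison concrete, here is a sketch of the server-side check under stated assumptions: GenerateDecompiledSnapshot wraps the Mono.Cecil walk shown above, and viewStore is a hypothetical helper for reading and writing small text blobs next to the views.

// generated from the currently executing assemblies via Mono.Cecil
var currentSnapshot = GenerateDecompiledSnapshot(projections);
// the snapshot that was persisted when the views were last rebuilt
var storedSnapshot = viewStore.TryReadText("projections.snapshot") ?? "";

if (storedSnapshot != currentSnapshot)
{
    // replay the event stream through the projections in memory,
    // then swap the freshly built views into the persistent store
    RebuildViewsAndSwap(projections);
    viewStore.WriteText("projections.snapshot", currentSnapshot);
}
// otherwise the views are up-to-date and message processing starts right away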
{"url":"/post/golden-rule-of-technology/","title":"Golden Rule of Technology","text":"Golden Rule of Technology The rule says: 1 new technology goes into codebase only after 1 old technology goes out . This is a simple empirical rule that we've coined with Joannes , while talking about business and managing technology. As it turns out, this rule makes a really nice guideline for planning future development. The rule can apply to all sorts of things: new technological concepts, frameworks, ideas, tools and software. Basically, a technology in this list is anything that a new Junior Developer will have to learn about the project, before he can productively work the assignments . This approach encourages technology pruning and explains how to structure technological evolution of your company, almost turning this process into a game. This is quite important, since otherwise it will be hard to \"embrace the change\" and leverage new technologies and tools, as opposed to being hit hard by every new \"revolution\". Essentially it helps to define your own Technology Radar at the project level. For instance, we have the following evolution path for the User Interfaces in the world of Microsoft .NET Framework: Windows.Forms Silverlight/WPF Windows 8/HTML5/some Metro UI If you treat these UI technologies as something to focus completely on, then every new change might be quite expensive (i.e. from Silverlight to HTML5), because of the way the problem is being structured: We are going to develop new eCommerce suite with Silverlight. However, treating technologies as something transient and replaceable can reduce costs and even improve architecture (making it more resilient to risks and changes). Consider this way to structure a problem: We are going to deliver eCommerce suite, while using Silverlight for UI for the time being. Obviously, this approach requires an architecture that favors decoupling and simplicity (i.e.: CQRS/DDD/ES methodology ). However, this is not limited by the software architecture alone. Here are some samples: Frameworks Version Control Systems Testing Approaches Project Management Concepts Integration Servers Deployment Strategies Serialization Formats Another interesting side effect of this Golder Rule of Technology is that it encourages developers to simplify and refactor in order to be able to try some new cool things in the project. And sometimes, when you get rid of an old technology, there already is so much improvement, that you don't even want to add anything instead. One example of such replacement in place is gradual migration from traditional SQL-driven persistence to Event Sourcing at our place . It allows us to discard quite a few items from the current stack: NHibernate with all the related binary dependencies SQL Servers along with the entire concept of relational database and object-relational impedance mismatch Evolutionary databases and managing them (including upgrade scripts and unit tests working against mock databases to verify behaviors on continuous integration servers) Another practical sample would be - planned replacement of API implemented as REST via WCF (with SOAP and all sorts of weird configuration problems) towards a dead-simple implementation on HttpListener. An interesting thing is that I never really considered complexity of technologies related to a feature till starting to think about that in terms of \"Return on Technology Introduced\". 
As it turns out: Usage of technologies is extremely overrated (sometimes attributed to the marketing hype related to certain buzz-words). Usage of additional technologies tends to increase both the burden and the complexity of the project at hand. Empirical evidence suggests that linear increases in these lead to an exponential rise in project costs. The process of complexity increase can continue until the project becomes so expensive that it stalls. By reducing the number of non-essential technologies used, we can manage overall complexity better (fighting delays, quality problems and costs). In essence, simplicity is a technology of its own . But it does not count as one."}
{"url":"/post/gregs-event-store-and-windows-azure/","title":"Greg's Event Store and Windows Azure","text":"Greg's Event Store and Windows Azure About Event Store As you probably already know, Greg Young built a file-based Event Store with a team in Ukraine. The product was publicly announced last September and now they are moving towards the second version. Single-node is open source and free to use (and it's a wonderful learning material for C# developers). If you want to have more failure-tolerance, then multi-node version is available. It uses negotiation protocols to ensure that cluster of Event Store nodes will stay up and keep all the information as long as half of the machines in it are alive (if I recall it properly, they actually use chaos monkey testing to ensure that, among other things). This store can get you thousands of persisted transactions per second on rather slow notebook. There is a javascript projection host built into this event store (currently it's in beta and hence has to be enabled manually). Alternatively you can define your own chasing projections and connect them to the store. ES offers HTTP REST API (designed to be scalable for reads using commodity building blocks like reverse proxies) and high-performance TCP API. With Windows Azure Here's what I currently know about behaviour of Greg's Event Store on Windows Azure (this question seems to be the recurring theme). Event Store works really closely file system, where the data is kept broken into chunks. Whenever possible these chunks are loaded into unmanaged memory to speed up read operations (which are really cheap). If you want to ensure certain degree of write durability, then these writes will be almost as fast as you can flush to your disk controller. This setup makes ES really performant, and also prevents it from getting the same level of performance in single-node configurations on Windows Azure. Single Event Store node can run on Azure in 2 major configurations: Data is stored on a local hard-drive . In this case, writes will be fast (thousands of tx) but there are no guarantees, that data on local drive will not disappear at some points. Local drives can serve only as temporary caches. This scenario is fast but less reliable . Data is stored on page blob mounted as VHD drive . In this case writes will be passed to the operating system and then flushed to Windows Azure Blob storage (where they will be replicated between multiple machines). Azure VHDs are guaranteed to stay there, however this comes at a cost ( reliable but slow ): you can expect roughly 125 writes per second to page blob on a single thread (afaik, ES currently needs multiple writes to commit a single transaction, which will yield even less than 125 transactions per seconds). when EventStore starts up (e.g. after a node is restarted by Fabric Controller), it needs to read from VHD, which can take minutes, during which the store will be unavailable for writes or reads. Luckily, for small scenarios: Windows Azure does not restart machines that often (how often do you need to roll out Windows update which needs a machine restart?) You can run Event Store on a Linux machine, which requires even less restarts. Depending on your business case, resulting performance might be enough for your scenario, or more tweaking would be needed. Only measurements will tell. Getting Better Performance You can theoretically get higher tx write throughput with some crash guarantees by: running event store with a local drive. 
adding a small background process which will incrementally copy ES data to the blob. on startup, before even starting ES, if the local drive is empty - grabbing the latest backup of the event store from Azure. (A sketch of this idea follows below.) Alternatively, in a single-node scenario, you can try to optimise the performance of the event store on Azure by enabling host caching for your VHD data disk (it's enabled for durable OS drives, but disabled for additional data drives). If the business grows, then it also might eventually be beneficial to sign up for the paid version of the event store (putting event store data on transient disks in different fault domains). In this case, replication will be managed and guaranteed by the Event Store engine. And Windows Azure will try to guarantee that nodes in different fault domains will not go down together."}
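A minimal sketch of that background copying idea, assuming completed chunk files are immutable and match a "chunk-*" file name pattern, and using a hypothetical IBlobDirectory abstraction in place of the real Azure storage client:

using System.IO;

public interface IBlobDirectory
{
    bool Exists(string name);
    void Upload(string name, string localPath);
    void Download(string name, string localPath);
}

public static class ChunkBackup
{
    // run periodically in the background while the node is alive
    public static void CopyNewChunks(string dataDir, IBlobDirectory blobs)
    {
        foreach (var file in Directory.GetFiles(dataDir, "chunk-*"))
        {
            var name = Path.GetFileName(file);
            if (!blobs.Exists(name))      // completed chunks never change,
                blobs.Upload(name, file); // so upload-once is enough
        }
    }
}

On startup, the inverse loop would download any chunks missing from the empty local drive before the Event Store process is launched. The freshest (still mutable) chunk needs special care, so treat this as an outline rather than a complete recipe.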
{"url":"/post/handling-big-data-in-cloud/","title":"Handling Big Data in Cloud","text":"Handling Big Data in Cloud Big Data is one of the new hype terms that is gradually gaining popularity. No big surprise - amount of data around us is gradually growing and by properly mining it we can get competitive advantage and eventually make more money. Money attracts money. Usually by term big data we mean a collection of datasets that are so large, that they become too hard to be processed on a single machine . This data can even be so big, that it takes hours or days to process it in a data warehouse. Fortunately, cloud computing comes to the rescue. It provides following resources: nearly unlimited scalable storage capacity for data (i.e. Azure Blob storage or Amazon S3) elastic compute capacity to process this data (i.e. Azure Worker Roles or Amazon EC) network capacities and services to transfer data and coordinate processing (queues and actual networks). These resources are elastic (you can get as much as you need) and paid-on-demand. Latter is really important, since you pay only for what you use. I'm assuming here, that in your business model, more data processed means more money made. There are three distinct approaches in handling big data, based on the specific challenges. Batch Processing First one is about batch processing , where you do not need extremely fast computation results but have terabytes and petabytes of data. This is essentially about implementing MapReduce functionality in your system, often resorting to hacky but extremely practical tricks. Below I'll talk about some of the lessons learned in this direction in Lokad (we provide forecasts as a service without any hard limit on amount of data to process). Obviously, storage of this data becomes a primary concern with this approach. Fortunately starting companies can leverage already existing cloud computing resources, starting from Google's GFS and up to Azure Store . Or if you are big enough (and this is worth it), you can roll a data center of your own. Idem for computing and network capacities. Stream Processing Second approach deals with cases, where you need high-throughput and consistently low latency on gigabytes of data. High-frequency trading is one of the most known domains here (various real-world telemetry systems being the second one). Abusing event streams and ring buffers (like in LMAX ) along with partitioning - helps to deal with the challenge. Due to latency requirements, cloud computing might be not the best choice for latency-sensitive solutions. It is more efficient to roll out fine-tuned infrastructure of your own including specialized hardware and soft (things like InfiniBand and ASIC or even Mixed-signal chips). However if you are OK with a bit of latency and are more interested in high-throughput continuous computation, then the flexible network and CPU resources provided by cloud are a good match. Realtime Batch Processing Third approach to Big Data involves dealing with more complex requirement - providing real-time capabilities over vast amounts of data (so we have to be both fast and capable of handling petabytes of data). Twitter is a vivid example here - it needs to provide (almost) real-time analysis over billions of tweets around the world (although fortunately it does not to go at microsecond level). 
This challenge can be solved by actually mixing the first two approaches: we use slower batch processing for dealing with the actual bulk of data (tricks like preprocessing, append-only storage and MapReduce work here), while the latest data is handled through real-time streams (stream processing and continuous computation). Results are merged as they come out of these two data pipes (see the sketch below). Over the course of time (e.g. daily) the latest data is pushed from the \"hot\" zone into the bulk data. More often than not, real-time data processing is done by simplified approximating algorithms that deal with a subset of data. Batch processing is more thorough and precise (at the cost of being slower). By pushing real-time data back into the bulk data, the results of the computations are actually corrected, as they are recomputed by the batch algorithms. Twitter's Storm project features a nice overview of this approach."}
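To illustrate the merge step, here is a toy sketch (all names are assumptions): the batch layer owns precomputed counts up to the last cutoff, the streaming layer owns everything after it, and a query simply adds the two.

using System.Collections.Generic;

public static class LambdaQuery
{
    public static long CountFor(string key,
                                IDictionary<string, long> batchCounts,
                                IDictionary<string, long> realtimeCounts)
    {
        long batch, realtime;
        batchCounts.TryGetValue(key, out batch);       // precise, recomputed daily
        realtimeCounts.TryGetValue(key, out realtime); // approximate, covers the "hot" zone
        return batch + realtime;
    }
}

When the daily batch run completes, the "hot" counters are reset for the data that has just been absorbed into the bulk, which is also the moment when approximation errors get corrected.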
{"url":"/post/handling-concurrency-in-cloud-views/","title":"Handling Concurrency in Cloud Views","text":"Handling Concurrency in Cloud Views Let's expand on the topic of views in the Cloud Architectures (CQRS) by introducing concurrency handling. This is an update to the original article on Views (it was changed accordingly) that comes as a response to Olav's question (thanks!) How do we handle concurrency conflicts with the views: when the same view being accessed by more than thread simultaneously ? Actually we don't care a lot about cases with multiple readers and a single writer at once, since cloud storages providers generally ensure atomicity at this level. There are two major approaches for handling concurrency while updating views by multiple writers: simplistic and logical. The simplest approach is too keep updates of a single view entity limited to a single thread. For example, you can start with a single worker thread processing all view updates. As your application grows, increasing load and availability requirements, you can split updates of the different view types and entities between different threads and workers. In other words, you will partition views by type and/or view ID. Note, that we don't need to scale out actual view persistence, since it is generally handled by the cloud storage provider, to start with. However, such scaling tends to be limited by the world region (i.e.: North Europe) and we still might need to enable CDN or manually replicate data between multiple cloud centers. This is relatively easy to do just by streaming domain events to these data centers. As long as you pass entity ID in message headers (recommended in distributed systems), it will be easy to route domain event messages between different queues/workers. Just a quick clarification of terms. View Type is a message contract or the POCO class, while view entity is a single instance of this type, as identified and tracked by its unique identity. This identity serves as a primary key used to store and retrieve the actual view data. In the case of singleton views we have a single entity per type. Eventually you might encounter the need to allow multiple concurrent threads (whether hosted in a single worker or different ones) to be able to update the same view entity at once. This is where optimistic concurrency control comes in. We just need to modify our view writer implementations to keep track of blob's ETag (analogue of version ID), while downloading it. Then, after executing local update, we upload it back, while passing last known ETag back to the server. Azure Blob Storage (just like any other cloud storage) is aware of this feature and will update view data only if the ETags match. If they don't (somebody else managed to concurrently update our view) - update will fail and we'll get an exception. This exception will captured by the service bus, which will retry the actual update operation later. If such an update fails more than 4 times at once, this would mean some heavy concurrency issues probably coming from really complex update operations. This article is a part of R&D series on Practical Distributed Architectures and Clouds (aka CQRS in Cloud). You can subscribe to the news feed to stay tuned for more updates."}
{"url":"/post/handling-multiple-event-streams-in-event-store/","title":"Handling Multiple Event Streams in Event Store","text":"Handling Multiple Event Streams in Event Store Writing and reading 300000 events is fast even in VM running on MacBookAir (as fast as appending ProtoBuf serialized data with SHA1 hash on top). However the problems start, when you get thousands of individual aggregate streams to be written along the way. This keeps me thinking about the potential design options for an event store that could get decent performance even on low-end machines (i.e. these ARM-powered platforms that Kelly Sommers had been playing recently with). The culprit is that we want to have data reliably on disk as fast as possible. At the same time, we might be pushing thousands of event-sourced Aggregate Roots (AR+ES) to a single event store partition (served by one thread). And we want to keep these AR+ES instances separate. As you probably know, switching between files can reduce performance a bit (unless you are writing directly to disk, although even then would be seek penalty). The simplest solution is actually dead-simple: \"What if we introduce a slight delay between the moment data is persisted on disk and moment it actually is persisted in a separate file\"? The tricky part might be in reading events back per stream (since we would need to replay them for AR rehydration). One of the solutions is to route them to servicing threads that read individual entity streams. We can afford blocking read queries a little bit, if individual stream has not been updated to reflect the latest data. We can easily figure this out by checking up on absolutely consistent in-memory pointer which is updated by the ring writer (essentially just a dictionary of persisted streams and their latest versions). Interesting culprit is that we can actually speed up the performance of writer by configuring it (and all the slaves) to persist the main buffer in the memory. In this case, if master fails or crashes - we can always fail-over to one of the slaves (while letting the former master to catch up)"}
{"url":"/post/helping-non-technical-people-understand-the-data/","title":"Helping non technical people understand the data","text":"Helping non technical people understand the data I discovered this interesting question in Enterprise Architecture group in LinkedIn: I'm curious what some of the best practices are to share what information is stored in our database. I am aware of an ERD. I am looking for a business friendly solution that I can show to non technical employees to make them aware what data is available. Assuming that we are talking about relational databases, database representation is just a historical way to store data in a way that is optimised to save expensive disk and RAM space . Unfortunately this optimisation makes data schema overly complicated by trying to reduce information duplication and used space (we normalize, overwrite and fragment data). Curiously enough, prices on disk and RAM have dropped significantly during the last years. For example, Rackspace charges 0.15 USD for storing 1 Gigabyte of Data for 1 month on a hard drive (for that amount Windows Azure would even keep multiple replicas for you). Yes, we still try to store our data in databases, while accepting all the imposed limitations and cryptic storage schema . Slightly better way of storing business data (and sharing it) is to capture it as a sequence of documents which capture important business events. For example: NewCustomerCreated { Id : 37659, SSN : 0274178654, FirstName : \"Rinat\", LastName : \"Abdullin\", ... Manager : \"John Big\", ManagerId : 707, CreatedOn : 2013-06-12 16:34:19, CreatedAtLocation : \"Paris-345 SE HQ\", CreatedAtLocationId : 921 } AccountOpenedForCustomer { CustomerId : 37659, AccountId : 227461382600000045, Currency : EUR, Type : Deposit ... ApprovedByManager : \"Merry Shea\", ApprovedByManagerId : 802, ... } You can put as much information to these documents, as you want without worrying about the schema, as long as you keep them documented. Storage is cheap, so we can even add some extra data that might be useful later. This data is already more understandable by non-technical people than a schema in third normal form. People are used to documents and can read them. A sequence of such business documents can be automatically represented as a readable activity stream for a specific entity (e.g: \"facebook wall of a customer\"). Human understanding story does not end here, though. When you need to help another team to use your data, all you need is to: Provide them with the dictionary of these change documents, where each document has a short summary describing its purpose and a detailed explanation of all the fields. Provide team with the web service endpoint, from which they can retrieve a sequence of all events (for which they have access to) from any point in time and then stay updated in real time. If the team in question needs a database to work with that data, then they can easily transform these business change documents (or business events) into SQL schema that is populated with the data and then kept up-to-date. In essence they would have a persistent and up-to-date SQL cache of customer data on their side. Obviously, a team might be interested in doing something more interesting like cross-referencing or scanning data for new fraud detection patterns. Then they could transform these business events into batches for Map Reduce jobs in Hadoop or into star schema for OLAP analysis. In my experience, writing such transformations is an extremely simple task. 
Such a team might eventually discover some new fraud patterns and want to provide real-time alerts about them. They would just need to consume all new events in real time, sending a notification when a match is detected. I won't go into detail on the performance forces behind such an approach. It's enough to say that LinkedIn was forced to switch some systems from Oracle to streams of change documents just for the sake of performance (and lower license costs). Needless to say, Twitter uses streams of events as one of the cornerstones of its architecture. And very few systems can drink from its firehose without drowning. At Lokad we had related experience through the use of the Lokad Data Platform - a cloud-hosted server for storing and querying gigabytes of business events. Exposing business data as sequences of documented events (with business semantics) seems to work much better in the enterprise world than trying to share a database schema. It definitely helps to unlock \"data silos\" and enable inter-team collaboration within the enterprise."}
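As an illustration of how simple such a transformation can be, here is a sketch that keeps an SQL cache of customers up to date; the feed source and the Customers table are assumptions, while the NewCustomerCreated shape follows the example event above.

using System.Collections.Generic;
using System.Data.SqlClient;

public sealed class NewCustomerCreated
{
    public long Id;
    public string FirstName;
    public string LastName;
}

public static class CustomerCacheUpdater
{
    // 'feed' yields deserialized business events since the last processed position;
    // 'connection' is assumed to be open
    public static void Process(IEnumerable<object> feed, SqlConnection connection)
    {
        foreach (var evt in feed)
        {
            var created = evt as NewCustomerCreated;
            if (created == null) continue; // this cache only cares about one event type

            using (var cmd = connection.CreateCommand())
            {
                cmd.CommandText =
                    "INSERT INTO Customers (Id, FirstName, LastName) VALUES (@id, @first, @last)";
                cmd.Parameters.AddWithValue("@id", created.Id);
                cmd.Parameters.AddWithValue("@first", created.FirstName);
                cmd.Parameters.AddWithValue("@last", created.LastName);
                cmd.ExecuteNonQuery();
            }
        }
    }
}

A production version would also remember the last processed event position and handle replays idempotently, but the core of the transformation really is this small.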
{"url":"/post/how-do-you-decompose-complicated-system-into-components/","title":"How do you decompose complicated system into components?","text":"How do you decompose complicated system into components? The answer to the problem of finding a proper way to decompose existing complicated system into smaller components is two-fold: Use known methodologies for finding potential boundaries; Iterate a lot to see if system system could be divided across these boundaries. Let's start with examples of methods for finding potential boundaries : Look at the domain model from the perspective of Domain-Driven Design , seeking out bounded contexts and aggregate boundaries there (although classical DDDish BCs and aggregates could often be split in smaller focused parts as well). Separate domain-side problems (need domain expert) from implementation problems like scalability (in some projects performance might be part of the domain, too); Use domain events to capture state changes in some parts of your system; then identify related events and group them - they could be a core of a new component; non-related events could probably be put into separate components; Look from the perspective of separating project between multiple teams and individual developers ; Look at transactional boundaries in your system. If there are that changes must absolutely happen together (or fail together) then they might be candidates for being put into a single component. If some changes don't really affect the others, then you could probably consider keeping them separately. Search for processes which must be coupled tightly from the temporal perspective - they might belong to one component. At the same time, presence of queues might indicate potential component boundary. And, above all, simply keep on asking yourself one simple question - what things can be taken apart ? How can we separate our software into atomic things that can be composed to solve the problem . Decomposition is not simply about breaking some project into a bunch of tightly coupled modules. The idea is to identify the boundaries and invest effort into evolving them . Methods for finding potential boundaries (like the ones listed above) only provide you with some hints and ideas. Your objective (as a designer) would be to play with these ideas, iterate and evolve system, while finding options that truly make it more robust, evolution-friendly and simple (put your priorities here)."}
{"url":"/post/how-i-stopped-worrying-and-learned-to-love-the-www-and-unix/","title":"How I Stopped Worrying and Learned to Love the WWW and UNIX Way","text":"How I Stopped Worrying and Learned to Love the WWW and UNIX Way Lokad provides big data analytics for retail as a service . For a long time, in order to deliver this to our customers, we used to invent new technologies and frameworks to fit our \"custom needs\". More than 20 public github repositories are still out there, including custom Azure execution framework, ORM and message bus. This was a good journey with a lot of learning about distributed systems, event centric designs and big data processing. There were many challenges, too. However, eventually we got tired of unnecessary challenges and became lazy (\"we\" as in \"I\"). This lead to one simple realisation: why do we even need to invent so much, if there already exists largest distributed system that we can learn from and reuse what others built? It is called World Wide Web. Surely, underlying principles might be not as sexy as some brand new \"Enterprise stuff\" (like AMQP or Azure Service Bus), yet they seem to work. Besides, WWW: has a huge amount of documented experience (more than any \"Enterprise\" software); wide variety of tooling; is frustratingly simple. In my current and limited experience, more we shift our design towards underlying principles of WWW (and away from latest sexy tech), more it feels like a huge relief and falling into the pit of success. Current project that we are working on in Ufa office (rewrite of business and SaaS backend of Lokad) is nothing like the previous systems. It's composed of relatively small and stand-alone applications which communicate over simple protocols (JSON over HTTP) using constructs aligned with the established domain model. A lot of complex technology is gone. For example, we ditched use of Azure Queues for communication between various components, replacing that with one-way RPC calls via JSON over HTTP (queueing can be plugged internally). All of a sudden, this: reduces software complexity (e.g.: your backend server is only accepts one-way commands in JSON over HTTP and publishes events as JSON entities in ATOM feeds); provides much better debugging and development experience (all of a sudden you can use tools like Fiddler or curl to interact and play with your backend server); allows scaling writes without complicated message topologies (just queue up all your one-way HTTP PUT/POST/DELETE requests); allows scaling reads by using dead simple force multipliers like reverse proxies; actually allows to use more efficiently services provided by Windows Azure while reducing vendor lock-in. In fact, in this project we stopped using all of Azure storage and messaging capabilities, since they are no longer needed. The only thing still used is Azure hosting model: instances of Worker and Web Roles which are managed by Azure fabric and run behind load balancers. However, should there be need it would be easy to move to a different cloud provider or managed hardware. 
Here are a few other cool new possibilities that opened up: now it's possible to create integration tests for the backend API by recording HTTP traffic with Fiddler and then replaying that; the backend API just became self-documented and accessible via XML, JSON, SOAP, thanks to ServiceStack; it's possible to rewrite some component using a completely new technology and nobody would ever notice; the design is suddenly more friendly to things like reactive programming and single-page applications; it's much easier to deliver features by first implementing a contract using a hacky approach and then swapping that implementation for proper code whenever necessary. Things work out in such a way, as if we were trying to steer closer to the UNIX Philosophy : This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. In the Unix world, small programs can be composed together to perform more complex tasks, communicating via pipes or text files. Here's a beautiful example of installing the Ruby Version Manager from a terminal: curl -L https://get.rvm.io | bash In the case of distributed systems, we can think of small and focused applications which communicate over a simple and human-friendly protocol (e.g. one-way JSON over HTTP in a RESTful way). If you align the messages of this protocol with the domain model , like they do in the Domain-Driven Design methodology, you'll have something that can withstand change pretty well . Keeping these applications really small and focused (like UNIX programs) would reduce the cost and friction of changing implementations so much that they could be easily thrown away and rewritten from scratch. Add immutability to the list of underlying design principles of these components and suddenly you get nice and predictable scalability and fault tolerance . All of a sudden, my head hurts less."}
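To show just how little is needed for this "JSON over HTTP" style, here is a minimal one-way command endpoint built on nothing but the BCL's HttpListener; the URL and the hand-off are illustrative, and a real backend would validate the command and enqueue it instead of printing it.

using System;
using System.IO;
using System.Net;

class CommandEndpoint
{
    static void Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8081/commands/");
        listener.Start();
        Console.WriteLine("Accepting one-way commands...");
        while (true)
        {
            var context = listener.GetContext();
            string json;
            using (var reader = new StreamReader(context.Request.InputStream))
                json = reader.ReadToEnd();
            Console.WriteLine("Got command: " + json); // hand off to a queue here
            context.Response.StatusCode = 200;         // one-way: ack immediately
            context.Response.Close();
        }
    }
}

Such an endpoint can be poked with curl or Fiddler directly, which is precisely the debugging experience described above.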
{"url":"/post/how-micro-services-approach-worked-out-in-production/","title":"How Micro-Services approach worked out in production","text":"How Micro-Services approach worked out in production Last week we released a new version of business backend at Lokad. Previous version was based on message-driven design with heavy use of CQRS principle, event sourcing and queues. There is a case study on that at Microsoft CQRS Journey . As I've discovered soon after, this architecture was a big pain in the ass due to being too monolithic and dogmatic. There was a lot of unnecessary ceremony involved in the simplest actions. Back then it didn't feel as such, since literally everything in .NET world involves a lot of ceremony. Yet, after gaining initial exposure to the world outside of .NET, my perspective has changed. The concept of decomposing a complicated system into a bunch of tiny and simple components that work together and yet evolve separately - was refreshing and liberating. Current version of business backend at Lokad grew upon these ideas, while adapting them to the .NET reality (making things slightly more monolithic, where it was absolutely necessary). Still the result is quite nice: 13 backend components which have on average 4-5 classes and the same number of public methods; 2 thin UI components (simple ASP.NET MVC web sites); public API : JSON over HTTP; ability to subscribe to events from components; no overreaching architecture in each component, although event sourcing with in-memory views shows up quite often; all components are designed in a way which allows to have multiple instances running behind load balancer. almost no unit tests, since the components are extremely simple. If I were to draw a map of all backend components, it will look nothing like my previous monolithic designs: arrows point in direction from upstream component (provides services/data) to downstream component (uses services/data). Many of these components were rewritten multiple times from scratch, as the system was gradually evolving from legacy monolithic design towards deeper insight. This often happened in situations when we felt that there was a better way to decompose business capabilities into components. Sometimes \"better\" meant \"simpler\". The fact that previous version of the system was running event sourcing allowed to migrate functionality to new design in small steps. For example, at some points in time events were flowing from the old version to the new one. At other points in time, legacy code in the old system was calling directly newly established components that were already running within the boundaries of the new system. Basically, effort to decompose existing business capabilities into small chunks started paint off immediately : it became easier to think and reason about evolution of the design. It also became possible to break down work in really small steps (which minimised risks), while still maintaining reasonable development throughput (because multiple components were developed in parallel). I quite like the final result - so far the system is extremely simple and there were surprisingly few problems with the migration (due to the fact that the system is quite simple). Performance is not an issue . Current configuration could easily handle the load even if number of Lokad users increases a few hundred times (if it goes beyond that - we'll need to ask Azure to deploy one more server instance). This happened because performance of each component is measured along with its usage. 
In cases where components were frequently used and reasonably slow (as proven by stats from production), they were tuned for better performance. The ultimate performance tweak for reads was serving them from an in-memory cache which is kept up-to-date via a persistent TCP subscription to the event stream with changes (just an event projection that subscribes to the event store and keeps projected results in memory). The ultimate performance tweak for writes was putting command messages into a queue and processing them in the background (work is shared across all nodes). If you remember my previous posts about Lokad.CQRS building blocks, you will recognise some of the patterns. The biggest change from \"Lokad.CQRS architecture\" is that there is no longer a uniform architecture in this new design. There are a few design constraints (e.g.: how components communicate, or the max size of a component), yet the internal implementation details are private to each component. Such an approach leads to a situation where each component can be tailored specifically for the job at hand. It does not need to follow a unified way to access data or handle failures. This might lead to greater diversity. Yet, the lack of unnecessary ceremony lets you get right to the core of the problem in the simplest and most efficient way possible. Hence components can be diverse and simple at the same time, just like cells in the human body. Probably, the system could be made even simpler if it were taken outside of the common .NET stack. What infrastructure do we use? There is no real infrastructure, we mostly use some external libraries for our needs: ServiceStack for hosting app services (web services made easy) Metrics.NET for performance measurement Greg Young's Event Store to store events and to push them to all subscribers Windows Azure to host backend workers and frontend ASP.NET MVC 4 for Web UI"}
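Editor's addendum: to illustrate the read-side tweak described in the post above, here is a minimal sketch of an in-memory projection cache. The IEventSubscription interface, the CustomerRenamed event and the dictionary-backed view are hypothetical stand-ins, not the actual Lokad code; the point is only the shape of the pattern - subscribe to the event stream, fold events into memory, serve reads from that memory.

using System;
using System.Collections.Concurrent;

// Hypothetical event contract - not the actual Lokad message definitions.
public sealed class CustomerRenamed
{
    public Guid CustomerId { get; set; }
    public string NewName { get; set; }
}

// Assumed abstraction over a persistent TCP event store subscription:
// replays history from the beginning, then keeps pushing live events.
public interface IEventSubscription
{
    void Subscribe(Action<object> handler);
}

public sealed class CustomerNameCache
{
    readonly ConcurrentDictionary<Guid, string> _names =
        new ConcurrentDictionary<Guid, string>();

    public CustomerNameCache(IEventSubscription subscription)
    {
        subscription.Subscribe(e =>
        {
            var renamed = e as CustomerRenamed;
            if (renamed != null)
                _names[renamed.CustomerId] = renamed.NewName;
        });
    }

    // Reads never touch the event store - they are served from memory.
    public string TryGetName(Guid customerId)
    {
        string name;
        return _names.TryGetValue(customerId, out name) ? name : null;
    }
}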
{"url":"/post/how-sticking-to-http-and-rest-makes-your-life-simpler/","title":"How sticking to HTTP and REST makes your life simpler","text":"How sticking to HTTP and REST makes your life simpler As I discovered recently, stepping back from unnecessary technologies can lead you to a more happy path. For example, consider a simple system running business systems for a small startup. Previously I'd try to design it as a composite backend server and a bunch of front end UI applications. They would send commands via some messaging middleware, while retrieving read models from some cloud storage. This was pretty scalable and seemed rather simple. Still, there were a few problems: Special tools are needed to diagnose and inspect the system (custom test clients, cloud storage explorers etc) Special tools are needed to interact with these systems and script them. As Greg Young once said, the need of special tooling might indicate a bigger problem . Sometimes you don't even realise that there is a problem, till you try an approach that avoids it . Taking a business backend and switching from some middleware (which was not justified in that case) to use JSON over HTTP for communications suddenly gave much deeper insight into the system. For example, we could use Fiddler to see what's happening inside the backend: Automagically you get a plenty of existing tooling for logging requests, performance tracking, analysing sessions etc. I'm not even talking about ability to leverage existing internet infrastructure with all the intermediaries. It is a huge enabling factor to have an ability to take any given request, edit and replay it at will. Of course, we can do all that with special middle-ware tools (or write out our own). However, they will probably be less polished and numerous than open source tools for debugging HTTP requests."}
{"url":"/post/how-to-deal-with-unexpected-problems/","title":"How to Deal with Unexpected Problems","text":"How to Deal with Unexpected Problems We live in a cruel world, where a lot of unexpected things could happen. So people that build reliable software tend to be really pessimistic about it and plan for the worst. That costs a lot (think of all the atomic bunkers built during the cold war). However, real life is much worse than any plans. It always proves our assumptions to be wrong. Expected bad things rarely happen, but unexpected weird things do happen. Here are some samples from software world: database connections timing out; Virtual Machines going corrupt; customer bringing in 100x of load that you planned capacities for; 3rd party services replying at the speed of 1 byte per second; somebody pressing the wrong button. SLAs, probabilities and estimates always lie. They are nothing more than expectations based on the assumption, that worst case scenario would be as bad as the one you've seen earlier (or just little bit worse). It's impossible to plan in advance. Get used to it. However, here's the approach that you can bet on (in fact it has been proven to be successful by thousands of years of evolution that brought fourth Homo Sapiens specie). Stop spending time trying to prepare for everything in advance. This will be waste of time and resources (think of all the dinosaurs that didn't survive Ice Age, although some were quite packed with muscles and claws). Instead, learn to evolve and adapt fast: reduce development and delivery friction and ensure that it is easy and cheap to make small changes with confidence . This will give you ability to react fast when life throws something unexpected at you. Make sure that you know your world well - deploy early and have means to keep a steady eye on your deployments, to know what is happening inside . Evolution is about survival of the fittest. So ditch ivory-tower dreams of being prepared for everything and just stay real and fit. Who knows, you might even buy yourself some time to enjoy the sight of competition dying out due to being unprepared for the unexpected."}
{"url":"/post/how-to-produce-a-superb-software-design/","title":"How to produce a superb software design?","text":"How to produce a superb software design? It's impossible to arrive at a perfect design right from the start. There are far too many uncertainties and unknowns involved in the process. Hence, you can safely assume that some sort of iterative learning process would be required. Software development is a learning process. Working code is a side effect. © Alberto Brandolini, Model Storming Presentation Lessons learned could come in various forms: discovered requirements and constraints; implemented features or fixed bugs; written documentation (wiki articles, email threads, diagrams, napkin drawings, whiteboard photos, software guides etc). One of the most important goals of this learning process is to keep iteratively integrating newly discovered knowledge into already captured body of knowledge, while enriching it. This task might be more challenging than it seems at a first glance. Let's step away from software for a moment. If you think of it, continuously adding stuff to any existing storage container could eventually lead to a situation when we not only ran out of the free space, but it becomes really challenging to do anything with things that are already stored there. Consider filling cabinet with papers or adding more tools into a small closet. Unless we somehow reorganize the space (add more filing cabinets, introduce more shelves in a closet), it will be hard to work with all the accumulated stuff . Another option would be to come with smarter way to organize papers (e.g.: alphabetically) or tools (e.g. put the ones that are used most frequently - upfront). Same principle can apply to software design. Even though software size is rarely a limitation these days, our mind is quite limited in the number of things it can handle at once without the support of some sane mental model. …number of objects an average human can hold in working memory is 7 ± 2. © Wikipedia, The Magical Number Seven That's what happened in far too many projects that I've seen : new features were continuously added to the project without adjusting software design to handle increased complexity. Such situations lead to the point where software got so fragile that developers would spend more time fixing bugs than adding new features. Quite often resolving one bug would release a horde of new ones. One of the most frequent solutions to this problem sounds like this: \"There is too much legacy. We can't add any more features to our software till we rewrite everything from the scratch in version two.\" In reality, approach of rewriting complex software from the scratch can be more problematic than it appears: rewrite costs time and money; new bugs can be introduced in the process, since we are changing existing and working software for something new; some really important requirements and features can be lost. Ideal solution to the problem of increasing complexity would be to avoid such bottlenecks in the first place. This can be done by: keeping software complexity as low as possible; continuously evaluating fitness of design as more requirements and features are introduced; continously adjusting design to stay meaningful and simple despite new features being added. In this case we can talk about iteratively integrating new insights (features, requirements etc) into the software, while also adjusting the design to handle increased complexity. 
At each step the software would stay healthy enough to keep pushing its evolution forward without much friction. I believe this constitutes a superb software design. Hence: Arrive at a healthy design by evolving it through a series of enriching transformations. This might sound like a lot of work. However, if we design for such continuous evolution upfront, then at each step we would need to deal only with a limited scope of change. This would reduce the risk of introducing unexpected bugs or losing an important requirement. Developers would stay sane, too. Note: This blog post is a draft of one story for an ebook on healthy design and patterns. Stay tuned, if interested. All comments are welcome."}
{"url":"/post/hunger-for-the-integration-builds/","title":"Hunger for the integration builds","text":"Hunger for the integration builds I want more different types of integration builds with every passing day. Every integration build checks for something, that could not be checked within the IDE or another build. Integration builds are especially good in detecting dormant issues that hide between the different sub-solutions of a development project, between internal and external code-bases or in some other integration joint. Obviously, increasing number of builds will increase the probability that something will break. And this is good, since that means more consistency and stability in the codebase and less dormant issues. The are constraints, though: Adding too many integration builds, will make the development extremely stable and extremely slow (good for mature projects with the established architecture and logic) Complex integration builds tend to have higher hardware requirements (exponential growth, as usual)"}
{"url":"/post/implementing-c-projections-for-event-store/","title":"Implementing C# projections for Event Store","text":"Implementing C# projections for Event Store In one of my previous posts I mentioned migration of some Lokad systems to dedicated Event Store . One of the steps in this migration process involve switch from legacy Lokad.CQRS View projections to new C# projections. New version of view projections does not have smart in-memory replay managed by the system, but they are inherently faster due to batch processing nature. Design is really simple: we subscribe to events from the storage (from last known checkpoint) and pass to projection, batching events together for better performance. After each batch, we update checkpoint. Here's how one projection implementation (maintaining list of comments per account) might look like: public sealed class AccountViewProjection : BatchProjectionFor<AccountView> { public AccountViewProjection(IKeyValueStore store) : base(store) { RequestEventsInBatchesOf(100); } public override void HandleEventsIdempotently(ICollection<object> events) { var comments = events .OfType<AccountCommentAdded>() .ToLookup(c => c.AccountId); if (!comments.Any()) return; // This batch change is fast, but it must be idempotent // for this specific projection type, since underlying // storage does not support transactions spanning multiple keys // Probably we could make this async... comments .AsParallel() .ForAll(g => Store.UpdateEnforcingNew(g.Key, view => { foreach (var added in g) { view.AddComment(added.Comment, added.Manager.FullName); } })); } } This design is shaped by the constraint that we need to work efficiently with dead-simple key-value storage like Azure blob storage (but support simple migration to any database engine). This is caused by the fact that existing Lokad.CQRS projections run use this storage. What we actually do here - for each projection: Start a projection manager as separate runtime task (can be implemented as Task that is retried on failure). Calculate projection version from the codebase. If projection code has changed, then kill all the cached views and reset checkpoint. Subscribe to specific event stream starting from the last known checkpoint. Feed retrieved events in batches to the projection. After processing batch, update the checkpoint. Please, keep in mind: this is the very first version that is not optimised. I'm following mantra: first, make it work, then make it beautiful, then make it fast . Additional notes: Projection can actually specify batch size it's willing to accept. Detection of changes in projection's codebase is done using the same code as in projection rebuilder of Lokad.CQRS (via Mono.Cecil). Since each projection subscribes to event store individually , they are completely independent. However, this could mean a lot of traffic going through. Simple solution would be to have an event store cache per node , which is shared by all projections on the same node. This cache could be either in memory or on disk. See mantra, though. In case of projection failure, it will blow up it's manager. Runtime will restart this task and projection will resume from the last known checkpoint. If we have too many repetitive failures - circuit breaker will kick in to let the system cool down. What do you think? How do you implement your view projections in C#?"}
{"url":"/post/importance-of-execution/","title":"Importance of Execution","text":"Importance of Execution Ideas are worth nothing , no matter how smart they might sound like. They are just theory (even if they are based on real-world practice). And we all know, that in theory there is no difference between theory and practice . While in practice such difference exists . The crucial bridge between theory and practice is called execution . Good execution is the key element that can take a simple idea and turn it into profitable and inspiring business. Bad execution can easily ruin a brilliant idea and turn it into a pile of debts and bunch of burned out people. Ideas expressed in this and following texts could be dangerous or reckless, if considered in isolation, without taking a closer look at the execution. For instance, let's talk about rapid delivery at Lokad , where multiple releases to cloud per day are not an outstanding exception. For us it is a way to outrun competition, rapidly respond changing market or solve some unexpected problems . It is one of the reasons customers give us testimonials like this one: Lokad improved the accuracy of our planning process significantly. The immediate impact was a stock reduction of almost 1 million EUR at a monthly cost of 150 EUR. Thomas Bremont , Head of Supply Chain Bizline This might sound like an impossible idea for people coming from enterprise background with highly formalized environment, predictable monthly milestones and precise technical specifications. In fact, this approach is barely applicable in their case (which strengthens our competitive advantage even further). However, while thinking about reliability of these ideas, please consider their execution aspects at Lokad: we test complex behavioral code thoroughly with things like specifications that also serve as living documentation ; even more complex forecasting code has a dedicated multi-machine cloud deployment and benchmarking infrastructure that continuously cross-tests changes in the code against large data library; this tracks performance of forecasting models and allows our brilliant analytics team to push state of the art in forecasting ; they even have their own stack for Windows Azure ; our master branch (in git source control repository) of integration systems is always close to being stable (large changes happen in separate branches); releases are tagged, backed up and deployed according to deployment protocol ; core data is immutable and append-only (persisted as event sourcing streams for behavior-based entities) and hence it is inherently easy to back up or revert any changes; a lot of systems include sanity checks and self-diagnostic routines that help DevOps to detect any potential problems or edge cases; some even have self-recovery logic; most frequent changes in customer-facing systems deal with user experience, and UI is simple in DDD/CQRS systems , especially when changes to data structure behind UI are managed automatically by cloud servers; all of our new systems inherently support hybrid cloud deployments. Newest designs even support real-time replication of data based on event sourcing, since we must become more reliable and secure than any single cloud . As you can see, we simply took brilliant ideas from Greg Young and other members of CQRS community and try to diligently execute them by applying lessons from companies like 37Signals, github and Twitter. In the following articles I will try to address both aspects in parallel: ideas and execution ."}
{"url":"/post/importance-of-good-domain-models/","title":"Importance of Good Domain Models","text":"Importance of Good Domain Models You might find that I will be writing less about the technical side of event centric systems (with all the CQRS, ES and cloud aspects) along with the problems of scaling and improving their performance. This is because all that is a solved problem in the industry (the only question is about costs and risks of any given solutions). There is a more important problem at hand - coming up with proper domain models that actually represent useful abstraction of the real world and allow event centric systems to be delivered. Or, in other words, coming up with a description of problem domain, that is structured in such a way, that you could easily build a system with that. DDD might sound like less exciting topic for discussion as compared to elastic scalability, clouds or why LMAX can get millions of messages. However the former is just the implementation detail, if you have adequate model. Unless you are doing something really specific with the technology, all problems you would face - have already been solved by many people. That's pure engineering (albeit not the one with a lot of documentation on top of that). You can have more-or-less predictable results, given the initial design in almost any field (aside from areas where technology is so important that it starts melding into the domain field - there really cool things start showing up). Design of domain models is something more of an art or a craft. Given the domain field, you are not guaranteed to arrive at satisfactory model. You are not guaranteed anything at all. Hence it is more of a problem that needs solution. However, if you arrive at something satisfactory, you would get domain model that could easily outlast any technological changes, while even making them less relevant (less expensive and less risky)"}
{"url":"/post/importance-of-tooling-and-statistics-in-cqrs-world/","title":"Importance of Tooling and Statistics in CQRS World","text":"Importance of Tooling and Statistics in CQRS World Tooling is extremely important in debugging and managing application. This is especially true for solutions based on Command-Query Responsibility Segregation ( CQRS ) or any other message-based enterprise architecture. That's how, for example, we approach the problem in Lokad: Or, if we take a closer look at the event message (rendered from the binary Lokad Message Format into the human-readable text): Topic: ReportCreatedEvent ContractName: ReportCreatedEvent Sender: https://salescast.queue.core.windows.net/salescast-log Identity: 2c2e7f77-a62d-4540-8079-9db000a2bc47 CreatedUtc: 5245830405928397257 { \"ReportId\": \"646211a6-fe78-4ccc-9536-9db000a2bb9e\", \"SolutionId\": 251, \"ReportType\": 3, \"Name\": \"Excel Report\", \"StorageContainer\": \"https://salescast.blob.core.windows.net/salescast-solution-000251/\", \"StorageReference\": \"report-646211a6-fe78-4ccc-9536-9db000a2bb9e.xlsx\", \"ContentType\": \"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet\", \"FileName\": \"SalescastReport.xlsx\", \"ReportSize\": 8746, \"Stats\": [ { \"Key\": 71, \"Value\": 475 }, { \"Key\": 72, \"Value\": 8746 } ] } Tooling helps us here to: better understand and visualize dependencies and interactions within the solution; derive custom reports and run queries against the events that happened during the application lifetime; debug and troubleshoot potential issues that happened in the past; replay commands, script environment and automate certain tasks; capture information needed for discovering and eliminating performance bottlenecks (Stats structure holds primary execution statistics associated with the processing of the command message). The last bullet-point actually replies to the question I've promised to answer a long time ago - on tracking statistics related to the message lifecycle as it passes through the system. So we manually capture the necessary statistics in the context of the message and append them to the appropriate events. Statistics could include things like: Number of records processed; CPU resources used; Upload/download speed; Merge-diff statistics. Etc These statistics might not make a lot of sense on their own. For example, download speed means different things, when we are retrieving information from SQL Azure in North-Europe or mySQL database on shared hosting in Moscow. That's why we are capturing and persisting them along with the domain events, which are bound to the time-line and provide context of the situation. When the event is consumed by some component (i.e.: to update denormalized UI View) we would probably ignore these statistics - they will play out later.. Events are stored in the domain log available for the queries a la Time Machine . Then, when we need to actually figure out some performance bottle-neck or understand specifics of some situation, it'll be just a matter of doing event stream analysis (there's plenty of literature on that one already) and writing proper queries. Once we have statistics in the context of actual domain events, nothing prevents us from getting answers to the questions like: What's the average item retrieval speed from mySQL databases? How often do we encounter timeouts and deadlocks? How many seconds does it take to sync 100k products from SQL Azure database in the same data-center? 
What's the average upload speed to the Lokad Forecasting API for datasets larger than 10k series, after that API upgrade in the last iteration? Theoretically, if we combine these statistics with Time Machine Queries (or continuous queries) and real-time logging - we should be able to do some nice things with CQRS solutions: Monitor health state and attach notifications to the key performance indicators of our distributed solutions. Detect potential problems in real time or even ahead of time (if coupled with forecasting). Analyze the real impact of performance optimizations. All in all, this helps to better understand the realm of the solution as it evolves in an ever-changing reality. This allows us to reduce the resources (developers, time and budget) required for delivery, while keeping the quality of project deliverables high. This article is a part of the xLim 4 Body of Knowledge . You are welcome to subscribe to the updates and leave any feedback!"}
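Editor's addendum: to make "writing proper queries" concrete, here is a sketch of the kind of LINQ one could run over a domain log once stats travel with the events. The ReportCreatedEvent shape and stat key 72 come from the sample message in the post above (where the value under key 72 matches the ReportSize field); LoadDomainLog is an assumed helper that deserializes the log into plain event objects.

// Assumed helper: materializes the domain log into event objects.
IEnumerable<object> events = LoadDomainLog();

// Average report size across all ReportCreatedEvent entries,
// using stat key 72 (which mirrors ReportSize in the sample above).
var avgReportSize = events
    .OfType<ReportCreatedEvent>()
    .SelectMany(e => e.Stats)
    .Where(s => s.Key == 72)
    .Average(s => (double)s.Value);

Console.WriteLine("Average report size: {0:F0} bytes", avgReportSize);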
{"url":"/post/keeping-mind-of-net-developer-open/","title":"Keeping Mind of .NET Developer Open","text":"Keeping Mind of .NET Developer Open Yesterday I've bought a new laptop, replacing the old one (video output to LCD got fried). I used to stick with DELLs (robust and can take almost any beating), yet this time, there wasn't much time or pocket cash to spend on the upgrade. So, without even thinking a lot, I went for Lenovo laptop with the slowest i7 processor, 4 GB of RAM and a home upgrade to 120GB Corsair SSD (Force) that came a week earlier. This upgrade (and OS reinstall) was an opportunity for me to start rethinking my development approaches and the whole CPU MHz race (in which any dev with VS 2010 and R# eventually ends up). I used to be jealous of blog posts and videos of developers working on blazingly fast Linux IDE environments and still staying really productive. At the same time, development story at Lokad has been going in a slightly opposite direction. Our build, integration and deployment scenarios are becoming so complex, that the big picture and evolution path tend to blur. Even usual everyday handling of dev and integration tasks on a dev box adds a bit of a mental burden and friction. Windows Azure with it's \"lightweight\" build and deployment process does not make things easier. My gut feeling, based on the limited development experience, came up with the rule of the thumb: If you need to keep solving same problem (in slight variations) over and over again, then you are going in the wrong direction. Think outside the box. Look on the other side of the fence, if you need any hints. Here's an example - optimizing database calls was rather a common routine for me a year ago. I used to spend days on tasks like writing smart SQL bulk inserts with high performance, adding small caching layers and optimizing queries. Needless to say, that all these \"features\" took a lot of time, increased complexity and were eventually thrown out of the codebase later. Perf problems were solved at the conceptual level by a simple paradigm shift from coupled data-driven models to message-based cloud architectures. I believe, same could be achieved with the distributed development targeting cloud architectures as well. Classical approach of having high-end dev machine, build server with regular deployments and central version control system - is too coupled and, well, centralized. We should be able to improve the big picture both in simplicity and friction. The latter is extremely important to me for two reasons: Reduced friction means faster reaction time (and ability to get rich feedback loop). For the business, having low-friction development means ability to execute faster, eventually outperforming competitors and getting the most out of the opportunities showing up. Reduced complexity leads to better decisions . Less moving parts you need to keep in mind, easier it is to come up with solutions that consider all primary factors and future probabilities (risks and opportunities alike). This starts from the development, but, for a software-oriented company, goes all the way up to the organizational and policy levels. In a sense, switch to the Cloud Computing and Cloud Architectures (CQRS + DDD + etc), was a big step for Lokad in the last few years. Lot's of lessons learned. Yet there are more potential reductions to be gained in order to keep progress steady and stress levels - reasonable. 
In order to keep my mind open to such opportunities, while learning as much as possible, I've decided to switch to Linux as my primary day-to-day OS (currently - Ubuntu 10.10), while keeping .NET Microsoft development isolated within VMs. Integration dev will still stay where it is now - remote and cloud servers. A few insights from less than a day of using Ubuntu with VirtualBox-ed Win7: It takes a few minutes to install a Linux distro to HD and Win7 in a VM, provided you use an SSD. Linux forces you to be ready to learn new things (acceptance). The seamless mode of VirtualBox is an interesting experience, never thought this to be possible. I tend to use the mouse less, which is a bit frustrating (all the keystrokes to learn), but it speeds things up in a way similar to R# jedi tricks. Mercurial is really painful and slow for cloning large repositories. I was thinking about upgrading to 8GB of RAM. I'm not sure any more if it's even worth the effort of going to the shop. The Linux host stays as responsive as ever (meaning, I can work on emails, documents, code) even when Win7 in the VM tries to stress the machine by doing a large compile or NGen-ing .NET 4.0. Needless to say, I gave away 75% of CPU and RAM to Win7. Thinking \"How can I keep evolving projects to simplify development without VS10?\" already brings up ideas on organizing, testing and integrating large multi-project solutions in a simple and straightforward way within the VS10 stack (before that, there was no coherent picture). Nothing really new - just the common principles of proper component development, applied at the project and solution levels. It feels good to have a plan, though. Organizational structures and processes layer on top of that in a rather straightforward manner as well. PS: The discussion continues (with some nice comments) in the following posts: Linux Setup Tweaks of .NET Developer with SSD Why Even Bother Trying Linux"}
{"url":"/post/key-cqrs-ingredient/","title":"Key CQRS Ingredient","text":"Key CQRS Ingredient Recently there was a bit of feedback around CQRS (mostly with the regard to Lokad.CQRS for Windows Azure ), where developer gets along with the architecture, design concepts and all the other exciting little things of CQRS, but then he just gets confused. There's too much tech information at once and too little ideas of where to go next. Questions tend to be like this: How do I design these commands and events? Why can't I just have one command handler that processes all requests? I have a problem I want to solve with Windows Azure and your framework. I did the tutorials. What do I do next? I recall being in the same place as well: elements of CQRS architectures make sense, but aside from that it's all too overwhelming and confusing. You don't understand why is everybody so excited with all this mess. You don't see the clear learning path forward . Let's see what we can do to simplify further learning and avoid possible disappointment. First, we'll take a little step back and walk over the basic concepts again. We'll start with the purely technical ones (excerpt from here ): Command-Query Responsibility Segregation Principle is about separation of reads from writes. Circular Architecture (another term that comes up in the groups; don't confuse it with the onion architecture ) is about structuring these reads and writes in a decoupled way with the help of asynchronous messages (namely commands and events) and synchronous view queries. Event Sourcing is another idea - persisting objects as a stream events that happened to them. These are just technical concepts that might look cool but aren't extremely exciting on their own. These are actually the things that have been used over and over in various systems for ages with various degree of success. Understanding these tech ideas alone is not enough for efficiently building valuable and scalable systems in the cloud and on-premises. There is the key ingredient missing from this picture: Domain-Driven Design . DDD is about understanding, explaining and evolving your domain model (that abstract thing that we are trying to express with the code) in a way that: model would focus on the most important characteristics of the problem at hand (while putting less important things aside, for the sake of preserving the sanity of everybody); the model could evolve and still stay in sync with reality; model would help different people with various backgrounds to work together (i.e.: users, sales people and hard-core developers); model would let you avoid costly development mistakes (it could even help to deliver new exciting features as a simple logical extension of what has already been implemented). Isn't that a lot for \"model\": something vague that's not even a code or final architecture document of 200+ pages? Well, DDD is not just \"model\" , but the methodology and principles to create it, distill and keep evolving afterwards. It's way of thinking, learning and talking about the business problem in a manner that implementing everything would be rather simple, despite the initial complexity of the actual problem. CQRS architecture approaches build upon this way of thinking and created models: here we extend the DDD methodology and add new exciting ways to implement everything as a scalable and cost-effective software. 
We make heavy use of terms like ubiquitous language , bounded contexts , context maps , linking these directly to the technological and architectural constraints of CQRS Architectures. That's actually what Distributed Domain-Driven Design is - DDD principles enhanced with specific CQRS-based patterns-n-practices and a clear way of migrating existing systems towards better complexity management, scalability and integration capabilities. The same goes for the Cloud CQRS Architectures (or xLim 4 ) - Domain-Driven Design and understanding of your domain model are the important milestone to go for. Once you have it figured out - things like messages, routing, partitions and elastic scalability will just unfold on top of that model in a rather straightforward fashion without the risk of creating a non-scalable, overengineered or expensive cloud computing system. So as you can see, Lokad.CQRS for Azure is not extremely important or outstanding on its own - it is just a framework to help apply DDD/CQRS principles for building systems powered by the Windows Azure Cloud Computing Platform. So if you got caught by this \"CQRS hype\" with all its cool technological promises and then got stuck, I recommend proceeding as follows: Read Domain-Driven Design by Eric Evans . Go through the materials on the Cqrsinfo.com site (to cover the transition from DDD to DDDD). Start a small learning project using DDD and the most familiar technologies and architectures. Gradually walk through migrating this project towards the CQRS architecture (either a cloud implementation or not). These references are already available in the CQRS Starting point , but I wanted to highlight the importance of DDD one more time."}
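Editor's addendum: since the most common question in the post above is "how do I design these commands and events?", here is a tiny hedged sketch of what the artifacts themselves tend to look like once a model is in place. The names come from an imaginary account-management context, not from Lokad.CQRS itself: commands are imperative requests that may be rejected, events are immutable facts in the past tense, and both are simple data carriers shaped by the ubiquitous language.

// A command: an imperative request; the domain may still reject it.
public sealed class RenameAccount
{
    public readonly long AccountId;
    public readonly string NewName;

    public RenameAccount(long accountId, string newName)
    {
        AccountId = accountId;
        NewName = newName;
    }
}

// An event: a fact that already happened; past tense, never rejected.
public sealed class AccountRenamed
{
    public readonly long AccountId;
    public readonly string OldName;
    public readonly string NewName;

    public AccountRenamed(long accountId, string oldName, string newName)
    {
        AccountId = accountId;
        OldName = oldName;
        NewName = newName;
    }
}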
{"url":"/post/legacy-code-should-still-be-agile/","title":"Legacy Code should still be Agile","text":"Legacy Code should still be Agile We all have faced numerous situations with the legacy code and brownfield applications. General feeling towards these is \"there is too much development friction, it would be easier to scrap the code and start a new version\". For a long time I've been a believer in this approach as well. This \"long time\" actually goes towards the beginning of my development career. Yet. despite the urge to rewrite everything, most of the times the situation went to restraining myself and gradually evolving the codebase. This path initially feels to be much harder than a simple rewrite, simply because there are more variables software architect has to consider (as we all know a person can keep in mind only 5 +/- 2 distinct entities at a time). There are a few solid advantages, though: Code stays in production-ready state. This keeps the business value at a steady level and does not break the communication loop. Evolving code guarantees that there will not be any missing features or other small-but-important details that tend to go missing in the first rewrite version. Evolving code will stay within the feedback loop of quality assurance teams and customers. This would make it much harder for the codebase to go astray (which tends to happen when we, developers, close up in our dens to write the next version from scratch). Pushing existing codebase is quite hard, due to all the momentum complex systems accrue. However, it creates a much better understanding of domain and applied technology at hand. So the resume is \"dropping legacy code is bad, evolving is better\". How do we do it ? . Here's a simple approach I've came to: Isolate logical portion of the code that has to be evolved in the iteration. Ensure that the scope of iteration is limited and understandable. You might want to use Project Management templates while documenting the scope. Documentation part might be extremely important (even if it is merely about a few paragraphs), since it helps to clarify and settle down vision of the iteration. Written document would also serve as a nice restraining factor preventing the scope creep . Ensure that it is covered with unit tests and other types of tests, if applicable. Implement the prototype This might take some limited time (and a few rewrites), since we will be seeking for the right ideas and implementation concepts Component-driven development will help. Merge the prototype Write down for yourself somewhere: Lessons learned Perspective Basically, code itself does not really matter. Ideas and logic behind it are the real value. I'm not talking about the \"business value\" at this point (which is the development driver and focal point), but rather about architecture, patterns and composition. They are unique for every single domain."}
{"url":"/post/less-is-more/","title":"Less is More","text":"Less is More One of the best lessons I've learned recently is about creating value by removing things that you don't need . This might sound like a controversy in our modern world that is obsessed with hoarding things: features in products, possessions, friends in social networks or frameworks and tools. However, consider this: Apple built an iPhone by removing all buttons except one and putting really restrictive constraints on applications. 37Signals have created quite a reputation by building products that focus on simplicity. Japanese people created beautiful poetry in form of Haiku, that is extremely limited in it's form. Twitter has created wildly popular service by limiting people to 140 characters. In each of these examples, nothing prevented people from avoiding these limitations and going wild. However they choose something non-intuitive - they discarded everything that is not important and focused on the very core. And then a small wonder happened - they created something outstanding that would be remembered by the others. This is an extremely valuable ability and skill: to be able to focus on less in order to gain much more . It is about self-control and self-discipline. This is a way of planning the future that looks a bit deeper than short-term hoarding of features, ideas or things. I'm not asking you to reconsider your current points of view upon the world. However, if you ever come to a crossing point, just ask yourself one of these questions: What if we discarded all frameworks that are not utterly essential for the system? How much would we gain or lose if we discarded half of the features in our system? What is the smallest step needed to solve the problem? What if we try to work smarter instead of hiring more people as the company grows? You can find more interesting questions in Greg's post . By continuously asking yourself these questions, you might be able to stumble upon precious insights and revolutionary discoveries that otherwise would be hidden by clutter and blurred by complexity. Simplicity is the most valuable feature. And it is the hardest one to achieve. It's much easier to add code, functionality or technology, than remove. However as we add them, we lose focus, freedom and flexibility. We might be even so encumbered that we would not be able to avoid problem or jump after new opportunity. More is less. Less is more."}
{"url":"/post/lessons-learned-development-methodology/","title":"Lessons Learned - Development Methodology","text":"Lessons Learned - Development Methodology Software development is a really curious thing to study. It is all about rapidly changing environment, endless risks and people (the most important thing here) that try to hit moving targets in that mess. There are numerous thoughts on the web about the things that should be done in order to succeed in this game. Here's the list that I currently agree with: Adopt efficient development methodology. And keep doing that. There is no best methodology; only continuous improvement will give the development (and the business) a chance of staying efficient. Communicate efficiently . Use meetings, teleconferences, newsgroups, flash movies in emails - anything that's needed to find problems and make decisions quickly. Efficient and stable communication channel saves resources. Additionally there is the problem of communication between the knowledge domain boundaries (i.e. business to development). Well, in this case either get better communication channel and media, or eliminate the problem by putting both ends into the head of a single person. Know your field . The majority of wrong decisions were made because there is not enough information about the problem and the alternatives. Wrong or suboptimal decisions lead to wasted resources. Spending some resources to prevent chance of wasting a lot of them, could be a good deal. This applies to everything: picking development tools, technologies, platforms, market niches etc. Be flexible and open to changes . Statistically it is more efficient to operate with risks, chances and probabilities, rather than to try to nail the Plan that was carved in stone. Minimize the waste . That is it. Saved resource unit does not differ from the resource unit that was earned. It is all about the people . Right people in right positions can move mountains. Basically, if the rules of the game change frequently and you neither can predict those changes, nor influence them, then the only way to succeed is to adapt to those changes. Evolve. Faster - the better."}
{"url":"/post/lessons-learned-diagnostics/","title":"Lessons Learned - Diagnostics","text":"Lessons Learned - Diagnostics Bigger the projects are is, harder it is to make sure that everything is running smoothly and there are no unknown bugs. This becomes even harder if the software is highly flexible and configurable. One way of dealing with this challenge (in addition to the UnitTests and Continuous Integration) is adding some more run-time diagnostics to software. For example you could ship some Unit Tests with your product (Kudos to SSW for the idea) or run built-in diagnostics. One of the easy ways for implementing Diagnostics - just sprinkle self-checking code in critical places of the system, and then run every validation (check out the Validation Application Block, for example) method while building the visualization of your system."}
{"url":"/post/lifehack-query-multiple-aggregates-from-event-stream/","title":"Query Multiple Aggregates from Event Stream","text":"Query Multiple Aggregates from Event Stream Given a domain log (containing all events and commands of some context) it's really easy to load any aggregate (or any group of them) and perform some query, given that you are using event sourcing . Last night I've got a problem with one of the systems running on staging grounds in Rackspace cloud. This system performs some big data processing and analytics for OSA analysis . One of the required steps is to assemble data and then run massive Map Reduce operation on top of it, involving some 3rd party services. Roughly 8 reduce batches (out of 300) failed with the timeouts caused by that external service. This morning I have to figure out the exact data problem and find ways to 'scavenge' the results without rerunning long operation again. Luckily I'm using good old CQRS/ES to manage all behaviors and integrations (actual data is never referenced or managed by AR+ES entities). Also, to facilitate debugging (and view rebuilds), systems record all passing messages into a separate event stream called domain log . So the task is reduced to getting the latest version of this stream (which is just an append-only file) and loading it in a snippet: var dataSerializer = ProtoBufDataSerializer.LoadAndScanDomains(); var envelopeSerializer = new ProtoBufEnvelopeSerializer(); var envelopeStreamer = new EnvelopeStreamer(envelopeSerializer, dataSerializer); // load all messages var stores = new FileTapeStream(@\"C:\\temp\\domain.tmd\") .ReadRecords(0, int.MaxValue) .Select(b => envelopeStreamer.ReadAsEnvelopeData(b.Data)) .SelectMany(b => b.Items.Select(i => i.Content)) // pick only events for an aggregate bound to StoreId .OfType<IEvent<StoreId>>() // group them and load into the state objects .GroupBy(b => b.Id.Id) .Select(events => new StoreAggregateState().With(events)); foreach (var store in stores) { var missingForecasts = store.Forecasts.Values.Where(f => !f.Delivered); foreach (var forecast in missingForecasts) { Console.WriteLine(\"Missing {0} with key {1}\", forecast.Dataset, forecast.ApiKey); } } That gives me enough data to check out the missing batches and deal with the situation. Important: obviously the purpose of this snippet is to facilitate the debugging. In production you would almost never use such queries across the entire domain log . Projections are a better fit here. This approach relies on strongly-typed identities and their binding to aggregate roots in order to simplify querying of event stream. However this is not essential."}
{"url":"/post/lokad-topologies/","title":"Lokad Topologies","text":"Lokad Topologies Let's talk about system topologies of Lokad.CQRS application. System topology describes flow of messages between elements of a single application or distributed cluster. If you remember classical CQRS triangle (client-application server-projection host), this is an example of topology. Normally this should be an episode in Being the worst podcast. However this is a complicated topic for first episodes, and I don't want to make some of the advanced Lokad.CQRS users wait for a few months. There are multiple ways to organize topology, depending on your scenario and technology. I'm going to present you one of such ways, that is partially based on CQRS principle with some support for multiple BCs and scalability. While reading that please keep in mind, that the core focus of this topology is to keep things as simple and uniform as possible. That approach would work with minor modifications for basic projects, however more complicated scenarios might require stepping back from the starting template. We'll start with revisiting basic elements. Here are some assumptions and terminology: Currently our building blocks are: Application Services, Ports, Projections and Tasks. Application Service is just a class that hosts multiple command handlers (methods that are called when command message or remote procedure call arrives to server). Application services contain core business logic and can either be stateless (functions that take command, do some action and publish result as an event) or stateful. Stateful application services usually deal with entities which are uniquely idenfitied by an identity. In Lokad.CQRS projects the majority of stateful application services host aggregates with event sourcing (or A+ES). Currently in Lokad.CQRS the only way to call application service is by sending a command, which will be put to a persistent queue. Message dispatcher on the server will pick this command and dispatch it to the corresponding command handling method on an application service. Application services hosting A+ES do not publish any events directly; instead, they append events to the event store. Message publisher will send them out later. This functionality is being introduced in IDDD branch of Lokad.CQRS and helps to solve the problem of two-phased-commit (if you don't know what this is - don't worry, that's legacy from SQL times). Application Service with A+ES Application service with A+ES and an event store looks like the bit below. We can scale this out between multiple threads on the same system by paritioning. Our partitioning logic will assume that entities with the same ID will always be handled on the same machine. For instance, we can use such router code: ICommand<IIdentity> cmd; IQueueWriter[] queues; queues[cmd.Id.GetStableHash % queues.Length].Put(cmd); In this case, if we have 2 queues, then commands to aggregates with IDs like 0, 2, 4 will always go to the first queue, while IDs 1,3,5 etc will be handled on the second. These application service instances can be located on different threads within the same machine or located on other machines. This approach allows us to parallelize execution, while still ensuring, that all commands of a single aggregate instance will always be handled within the same thread. Such thread affinity allows to simplify a lot of otherwise complicated synrhonization and concurrency scenarios at the cost of some idle threads (or even machines). 
Fortunately, hardware evolution favors low-power multi-core systems (think about all these energy-efficient ARM chips), which will benefit from this architecture even more. Functional Service With functional services everything is a bit simpler. They handle commands in an idempotent fashion (preferable) and produce events, but without any state affinity. This makes them an easy candidate for hosting some resource-intensive logic (like number crunching). Since we don't have any state to synchronize access to, we can scale such tasks by distributing commands around in any way (e.g. having 8 functional threads on 4 different virtual machines). Round-robin and random distribution are simple but rather efficient approaches to balance the load more-or-less evenly. Besides, you can always add multiple functional command handlers to the same queue, making them compete for work. You might go even further and have these functional command handlers on a separate VM, increasing or decreasing the number of VM instances according to the load (e.g. the number of unprocessed messages in the queue). You can have these functional services publish events by writing them to the same event store, or you can throttle the load on that event store by publishing messages to an \"event recorder queue\", the handler of which will write events to the store. Client interactions Command routers and functional recorders come in handy when you need to bring client applications into the picture (e.g.: a web client or smartphone client). These applications would need a way to send commands (rename user) or report events (user logged into the web UI). Obviously, these would go to the same \"command router\" or \"event recorder\" queues. Note: you can easily have one single \"router\" queue, which will either route commands or record events, but I found this to complicate the implementations. It's easier to be more explicit. Bringing this all together Having said all that, let's have a look at how all these elements come together for one possible configuration of a topology. This picture might look complex (it is actually the topology for the Lokad.Salescast2 product) if you try to \"swallow\" it all at once. If you break it down into individual components - things should get easier. Please note that implementing this full topology is overkill for a lot of scenarios . Often you can live with just a few queues and handlers. However, sometimes it helps to know in which direction you can evolve your topology, should requirements become more demanding."}
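Editor's addendum: here is a slightly fuller sketch of the routing logic from the fragment in the post above. The IQueueWriter interface (with a Put method), the stable string hash and the split between partitioned and functional queues are all assumptions for illustration: aggregate commands are partitioned by identity for thread affinity, while stateless functional commands are spread round-robin.

using System;
using System.Threading;

public interface IQueueWriter
{
    void Put(byte[] message);
}

public sealed class CommandRouter
{
    readonly IQueueWriter[] _partitions;  // one queue per A+ES partition
    readonly IQueueWriter[] _functional;  // competing consumers, no affinity
    int _roundRobin;

    public CommandRouter(IQueueWriter[] partitions, IQueueWriter[] functional)
    {
        _partitions = partitions;
        _functional = functional;
    }

    // Commands for the same aggregate always land in the same queue,
    // so they are handled on the same thread (thread affinity).
    public void RouteAggregateCommand(string aggregateId, byte[] message)
    {
        _partitions[StableHash(aggregateId) % _partitions.Length].Put(message);
    }

    // Stateless commands have no affinity - spread them evenly.
    public void RouteFunctionalCommand(byte[] message)
    {
        var next = Interlocked.Increment(ref _roundRobin);
        _functional[(next & int.MaxValue) % _functional.Length].Put(message);
    }

    // string.GetHashCode is not stable across processes, hence a manual hash.
    static int StableHash(string value)
    {
        unchecked
        {
            var hash = 23;
            foreach (var c in value) hash = hash * 31 + c;
            return hash & int.MaxValue;
        }
    }
}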
{"url":"/post/long-story-made-short-project-report-for-the-xlim-1/","title":"Long story made short - Project Report for the xLim 1","text":"Long story made short - Project Report for the xLim 1 xLim 2 is a set of principles/guidelines that come from some experience and serve specific purpose of efficiently building flexible and light distributed information management systems. Long before the xLim 2 there was xLim 1 (or simply \"Lim\" at that time). It was purely a research/study project undertaken in the spring of 2007 to explore efficient ways of building distributed Smart Client applications for information management purposes. It took 52 days to get from the project initialization phase to the proper close-out. As you may know, one of the PM principles is to capture all the \"lessons learned\" on the close-out before moving to another project. That's how it was done with the Lim 1 as well. If you are interested in the history, then you can download xLim 1 Project Report . It is a short 15 page summary that highlights major areas that constitute the body of knowledge captured in the course of xLim 1 project. Additionally there is an archive of screenshots (referenced as Finals.zip in the report document). Here are some pictures from it to give you the idea: Right now this old report seems to be quite naive and incomplete (it actually is), since there has been so much learned with xLim 2 (it looks like it would be extremely hard to keep the xLim 2 report under 50 pages). Yet, it was an important step. Without it xLim 2 could not have started and branched off several commercial implementations. Any comments, feedback or questions are welcome. PS: I definitely wish that other development teams (esp. DevExpress XAF team) published reports like this one (or in any other manner that attempts to capture information essential to the efficient development and delivery of frameworks/platforms and their implementations)."}
{"url":"/post/maintenance-and-monitoring-infrastructure-for-the-enterprise/","title":"Maintenance and Monitoring Infrastructure for the Enterprise Software Systems","text":"Maintenance and Monitoring Infrastructure for the Enterprise Software Systems Enterprise application software systems are about the software that organizations buy or develop to solve their business specific problems . Usually these systems start rather small. However, over the time they tend to grow more and more complex, increasing the number of subsystems involved and developing numerous interactions and synergy effects between them. For example, a company SmartForce might get started with a simple intranet web site and a database behind it. That's the most common scenario of getting started with an in-house enterprise system, that I've seen. That's the easiest one, too. As the company evolves, so does the software solving its problems. So over the time complex business rules and long-running workflows might be added to our system of SmartForce . These workflows might involve integrating with the external information systems or scheduling document processing for the nights. End-users might get interested in receiving custom notifications or scheduled tasks along the way. One of the most common implementations for this scenario (I'm generalizing here) is the introduction of automation server(s) to our web site deployment with a database backend. Then, the complexity of the enterprise software system might increase even further, as project stake-holders decide to introduce rich business functionality, integration with office tools or offline capabilities to the external client applications. This might lead us to Smart Client architecture implementations (whether they are done as desktop or browser-hosted applications). This kind of architectural change, generally requires some sort of the public API to be deployed. External integration scenarios also require such an API. The overall setup looks slightly more complex already. As if it were not enough, we might eventually get to the point of having multiple API versions to maintain and support (that's what usually happens to any API that is exposed to public). So the overall enterprise software architecture at this point might look like the one displayed on the picture below. Obviously, things might get a lot more complicated, but we are not showing them on the picture in order to keep it small and clear for the next part of the article. Yet, one can clearly imagine that the system might expand in size and complexity due to the following requirements: introducing multiple file storage servers integrated into the system; handling the traffic spikes by setting up a few API endpoints and web servers; integrating automated issue reporting and deployment updates into the system; distributing CPU-intensive tasks between multiple automation servers (or even delegating these tasks to the cloud); provide additional setups of the entire infrastructure for people from development, QA and marketing to play with. Such an enterprise software system definitely requires some efficient maintenance and health monitoring in order to function properly despite all the problems. By saying efficient I mean maintenance that does not require having a dedicated team in order detect problems before they happen (that's when fixing them is most cheap), simplifies and speeds up the resolution of issues encountered. 
Information is critical for this kind of maintenance (you have to know all about your patient in order to keep him healthy). We need a deep insight into the system, information on its behavior and its deviations under different conditions (as we all know, production reality may be quite different from development, despite all attempts to reproduce real-life stress and behavior profiles in testing scenarios). For example, we might need to know such basic things as: uptime and failures of subsystems; CPU and memory consumption on the machines hosting subsystems; utilization levels of the services and resources that our infrastructure provides; repository (database) statistics and performance; network latency and response times of the external services; various exceptions captured and handled (or unhandled, which should be a rather exceptional scenario) within the code, failed sanity checks and warnings from the internal subsystems; low-level performance indicators in the code-base (e.g. execution time, counts or loads of various methods, number of database deadlocks encountered and resolved by the reliability layer, custom counters). It is not enough just to capture these indicators. If we want to be efficient we could also: persist indicators in the operations database in order to see how system performance evolves over time; present the data in the form of real-time reports, making it easier to spot problems or check the state; create and receive custom notifications when certain indicators or their aggregations go outside their thresholds. For example, the system going offline should result in an immediate notification to all involved parties, while a slight performance degradation is just worth mentioning in the daily report. Obviously this entire setup has to be implemented in an efficient manner that allows us to: change any indicator threshold in minutes; modify or add a report in minutes; automatically version and deploy all changes. Note that when providing time estimates, I'm taking into consideration just the human labor required for the change. Automated deployment processes on their own might take anywhere from 5 minutes up to 24 hours. This depends on the change propagation policies and processes employed by the specific development infrastructure. Let's take our previous picture of the enterprise system composition and add maintenance-related subsystems and interactions to the mix. One of the possible results might look like this: Basically, at this point we are talking about the infrastructure for performing system-wide maintenance and monitoring related tasks . This sample design works like this. Every primary subsystem exposes a standard monitoring API that could be used by the monitoring server to capture information. Internally every subsystem is instrumented with simple performance counters, capturing low-level execution (and failure) information and making it available via the monitoring API. Additionally, you may have your external client applications report exceptions directly to your issue management and monitoring systems. The maintenance monitor then aggregates all the information, applies any notification and warning rules available (some of these may result in notifications being sent out), runs custom reports and publishes them to the dashboard. That's not that hard to implement with existing open source tools. 
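To make the instrumentation part more concrete, here is a minimal sketch (hypothetical names, not taken from any specific product) of the kind of in-process counter registry a subsystem could update from its code and expose as a snapshot through its monitoring API:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;

public static class Counters
{
    static readonly ConcurrentDictionary<string, long> Values =
        new ConcurrentDictionary<string, long>();

    // e.g. Counters.Increment(\"Sql.Deadlocks.Resolved\");
    public static void Increment(string name)
    {
        Values.AddOrUpdate(name, 1, (key, current) => current + 1);
    }

    // e.g. using (Counters.Time(\"Repository.ComplexCall.Ms\")) { /* call */ }
    public static IDisposable Time(string name)
    {
        return new TimerScope(name);
    }

    // snapshot handed out via the subsystem's monitoring API
    public static IDictionary<string, long> Snapshot()
    {
        return new Dictionary<string, long>(Values);
    }

    sealed class TimerScope : IDisposable
    {
        readonly string _name;
        readonly Stopwatch _watch = Stopwatch.StartNew();

        public TimerScope(string name)
        {
            _name = name;
        }

        public void Dispose()
        {
            Values.AddOrUpdate(_name, _watch.ElapsedMilliseconds,
                (key, current) => current + _watch.ElapsedMilliseconds);
        }
    }
}

A real setup would persist such snapshots to the operations database and build the reports and notification rules on top; the capture side can really stay this boring. 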
And being able to check, in real time, how many seconds that complex repository call takes on average against the production DB makes you wonder how you could ever live without such functionality before . It is like the ReSharper addiction . If you couple this kind of statistics with the domain events from the CQRS world, you will be able to answer tricky questions like: What's the average item retrieval speed from mySQL databases? How often do we encounter timeouts and deadlocks? How many seconds does it take to sync 100k products from an SQL Azure database in the same data-center? What's the average upload speed to the Lokad Forecasting API for datasets larger than 10k series, after that API upgrade in the last iteration? More on this topic... By the way, it may seem logical to extend the picture above by adding a maintenance server that would: aggregate information from the monitoring subjects across the infrastructure; match that information with the development and deployment reports in order to provide better OLAP analysis of the infrastructure performance across: resource consumption growth : How do our subsystems handle spikes? Do we degrade gracefully as resource consumption grows (or is it an exponential curve indicating some critical bottleneck that should be handled soon)? development process : How do new versions affect the overall performance? Do they improve or degrade it? host and run SLA rules to watch resource consumption and provision/surrender resources depending on their consumption levels. But that's another story, one that gets closer to the topic of cloud bursting (aka using a cloud computing provider to handle spikes in resource consumption). By the way, there is a post showing how an enterprise overview report might look on an iPad dashboard . This document is an essential part of the xLim 3 body of knowledge . What do you think about this post? Do you have monitoring infrastructures in your enterprise application system? If yes , is your implementation different from the provided simple architecture? What would you like to add to it, if you were able to? If not , do you think it is worth having such an infrastructure?"}
{"url":"/post/make-code-explicit-and-stupid/","title":"Make Code Explicit and Stupid","text":"Make Code Explicit and Stupid There is one thing that surprises me in software development. We, as developers, go at great lengths, explaining how event-centric architectures are so better than CRUD-based ones: you don't need to reverse engineer behaviors from state, since you express them explicitly (as events) and can project to any structural form. At Lokad we leverage this heavily for web UIs, that tend to be slightly more intuitive because of that. However, when things get to the code, we suddenly change our minds to the opposite and vastly favor implicit conventions and smart heuristics to discover and wire components in an application. Maybe this is because, we are so good with solving complex problems, that we see them in every single problem ? (this is only a part of the picture, component instantiation is in another class). We say, that such approach is smart, generic and reduces friction (when such approach is pushed to extreme, it is called Inversion of Control Container ). That it generally is a better alternative than this: Seriously, is it? An important remark : Both pieces of code were written by me. At the moment of writing of each one, I was extremely proud of the approach being used :)"}
{"url":"/post/making-sense-of-insane-software-requirements/","title":"Making Sense of Insane Software Requirements","text":"Making Sense of Insane Software Requirements Writing is a thought you can analyze (c) Tim Ferris Have you ever found yourself looking at a pile of software requirements without any clue what to do with them? Some of them could make sense, while others - little sense or even contradict to each other. All in all this would leave this nagging feeling of \"big ball of mud\", that you have to sort out. This happens quite often to me. I used to hate such situations before. However these days such problems are simply tasks that usually have step-by-step solution. Let's see why these big piles of requirements (or specifications) even show up. They are probably caused by the fact that it's rare for a single person to have a clear vision of all requirements in his head. Usually these come from different people at different points in time: Different people specify different \"I want to have this awesome feature\" ideas for the same project. Somebody might provide one suggestion (\"Make it green, it'll look awesome!\"), forget about it, then provide completely contradicting solution a week later (\"Make it red, it'll look awesome!\"). This happens to busy people all the time. People might provide feature requests without full knowledge about how things are already implemented (\"Please add this helicopter landing pad on top of our building\", while the building is simply a bungalo on the beach) The most interesting part is that it's duty of software developer to accomplish all that in the best possible way, while turning down things that are too expensive or risky (\"You know, adding underground parking garage to this existing building would probably make it collapse\"). One approach to accomplish that is to simply start organizing these requirements, giving them some sort of structure. You would try to bring closely related concepts together and look at them in isolation from the others (hey, these 2 requirements try to give button \"X\" different colors). You might need to try different approaches to structure and represent requirements, before you find the one that fits your sutiation (put them on cards and shuffle around, add as lines to Excel, draw UI mockups etc). The big goal here is not to try build links and relations in your head - it'll take a lot of energy just to remember them. You can instead try to bring them some place out (even if the first attempt is a mess) and then improve on that structure on a step-by-step basis. At some point in the process you might even discover contradicting requirements (e.g.: making the button red and green at the same time), and this time you will be able to reason about them and explain \"why\" it is so. So next time you encounter a bunch of complex and contradicting requirements - just try putting them on a sheet of paper and shuffling around. Ultimately, in the world of complex business relations and behaviors, you can go further and try to code your domain model while exploring numerous specifications. Code can provide much stronger logical and reasoning support than simple writing. You'll be able to clarify your understanding by encapsulating some complexity inside value objects, hiding business behaviors inside aggregates and isolating UI-related details inside projected views and client-side interactions. This will give you rather structured high-level representation with the ability to dive into any single detail. 
Better still, you could send it to somebody to review and improve. If the coded domain model happens to capture everything perfectly - it can serve as a basis for production software that will do something useful with it. Otherwise, it's just a written model that you can improve on. Tim Ferriss once said \"Writing is a thought that you can analyze\". In the world of software design we can push it even further and say that \"Code is a representation of the problem that you can analyze\". If you happen to encounter a complex business problem that needs solving, then maybe capturing it in code would help you to understand it and reason about it. Just give it a try, even if the code itself would never run in production.
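As a tiny hypothetical illustration of the idea (not from any real project): once \"button color\" becomes an explicit concept in code, two contradicting requirements can no longer coexist silently in two separate documents - they collide in one place, where you can reason about them:

using System;

public sealed class ButtonColor
{
    public readonly string Value;

    public ButtonColor(string value)
    {
        Value = value;
    }

    // Merging two requirements forces any contradiction into the open.
    public ButtonColor MergeWith(ButtonColor other)
    {
        if (Value != other.Value)
            throw new InvalidOperationException(string.Format(
                \"Requirements contradict: '{0}' vs '{1}'.\", Value, other.Value));
        return this;
    }
}

// new ButtonColor(\"green\").MergeWith(new ButtonColor(\"red\"));
// -> throws, surfacing the contradiction instead of hiding it
"}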
{"url":"/post/management-lifehacks/","title":"Management Lifehacks","text":"Management Lifehacks First of all, I would like to give my apologies for being sometimes slow on responses to emails and other sorts of communication. Last weeks were throwing new challenges, dealing with which takes a little bit of time. I'll catch up. In case you are interested, here are a few random ideas that have proven to be helpful in handling such situations. These ideas were reiterated so many times, that it's no longer possible to point to the original author . Don't try to be perfect in your decisions . You are human and small errors are inevitable. Besides, every single decision is worthless by itself (just like any \"brilliant\" idea). Only through the continuous and careful application of effort, something worthy can be achieved. Continuous chain of good decisions will beat perfect plan any day (simply because you can adapt and keep on going). Likewise, good execution can be more important than any great idea alone . Delegate . No matter how smart and talented we consider ourself to be, alone we are not able to handle and achieve as much as a team. Hence it is our duty to ensure that incoming challenges are balanced against the entire team. Our purpose is not to keep everybody under 100% load (e.g. by assigning tasks to people who are less efficient in handling them), but to ensure maximum efficiency of the entire unit. Focus . Keeping multiple projects in your head at once is likely to drive you insane, cause insomnia, burnout or do worse. So try breaking entire problem field of your division or company into separate contexts. They can be really diverse: starting from accounting, HR management and up to long-term tech RnD. More often than not, you will find that tackling a specific problem involves just a single context. So you can keep only one context in your head most of the time, switching between them, as your day moves forward. Arrange all tasks within this context in a queue, putting most painful ones upfront. Don't be afraid to drop tasks or change their priorities, if environment changes. If you are tired but don't have time to rest - switch contexts. If you are exhausted but don't have time to rest - stop complaining and find some time. Full personal burn-out is far more expensive than a little bit of rest. Adapt to changes on the battlefield. You would be surprised by the amount of resources wasted on projects even after they are doomed. Wars are not won by the sheer force, unless you are USA (and even States encountered some issues while bringing democracy to countries which didn't have serious fire-power but had population willing to take benefit of every hill, forest and trick in the book). By being willing to accept that our initial plan is imperfect and adapt, we can reduce risks, save resources and potentially leverage new opportunities. Think \"Instagram\" or \"Apple\". People are the only reason things happen around , don't forget about it. Universe was empty without purpose till humanity showed up bringing purpose along (with a noticeable degree of chaos). It might be tempting to forget about people and focus on a single technology, idea, concept or code. However, by doing that we risk missing the whole point and actually undermining our own efforts. For example, writing complex code without unit tests or documentation is often perceived as a sign of outstanding hacker or even a guru. Sometimes this even is worth it. 
However, more often than not, other people will have to maintain such code for years. If that is the case, then it is egoistic not to think about them. Code can have a stronger positive impact if we put additional effort into making it helpful and friendly. It is much harder to do that than to deliver egoistic \"easy to write and hard to read\" code. The result will have a greater long-term impact, though. The same principle applies to all the other things we do in our everyday lives. This is not even altruism, but merely common sense and adequate long-term thinking. And the last one - Don't stop and keep pushing . PS: I don't follow these rules every single day (not strong enough for that). However, continuously trying to do so helps to keep moving. Update : Daniel has mentioned that this post can also be summarised with the Viking laws"}
{"url":"/post/many-ways-for-an-emergent-design-in-a-component/","title":"Many ways for an Emergent Design in a component","text":"Many ways for an Emergent Design in a component Recently I mentioned 6 steps of an evolutionary design in software development . These steps describe iterative process aiming at continuous improvement. Such an improvement process can happen at two distinct levels: High-level view of the entire system involving components and their interactions; implementation details of a component. While high-level system evolution is covered pretty well in methodologies like domain-driven modelling (strategic design), implementation level can be more project-specific and hard to explain in uniform fashion. May be that's because there is no generic approach to describe evolution of components or services in a real-world system. Each element might need to evolve in a unique way to order to reach the best balance between complexity, performance and capabilities. For example, let's consider an evolution path that a single component can go through in a startup team focused on emergent design, rapid iterations and .NET stack: Team dynamics, past experience and current political situation might lead to a design approach, where each component starts as a simple console app and then evolves towards more complicated design in order to fulfil specific requirements. We try to keep things as simple as possible, but no simpler. If a component is kept simple and focused (which is a task of a strategic design), then at any point in time it could be rewritten from scratch. Evolution tree below is merely a visualisation of existing design approach inside a given team, serving as a way to make design options more explicit and allow better communication. Any change in team, business priorities or design methodologies could affect this evolution tree. Here is a bigger version of this image . Please note, that at any evolution along such design tree is a specific optimisation that comes at the cost of complexity. Sometimes it is better to delay paying that complexity cost and keep your options open ."}
{"url":"/post/messages-and-distributed-systems/","title":"Messages and Distributed Systems","text":"Messages and Distributed Systems Messages are essential to building robust and distributed systems, so let’s talk a bit about them. Message is a named data structure, which can be sent from one distributed component to another. Components can be located on the same machine or on different sides of the Earth. The basic real-world example of message is an email. It has a sender, subject and one or more recipients. This email might take some time to reach the designation. Then it gets to the inbox, where it could spend even more time, before recipient finally gets time to read it and may be even reply. Messages, just like emails, might take some time to reach recipient (it could be really fast but it is not instantaneous), and they could spend some time in the message queues (analogue of inbox), before the receiving side finally manages to get to work on this message. The primary disadvantage of messages and emails is their asynchronous nature. When we send email or message, we expect the answer some time later, but we can never be sure that we will get it right away. Direct phone calls (or direct method calls) are much better here – once you get the person on the phone, you can talk to him in real time and get results back almost immediately. Despite all these disadvantages, messages could be better than calls for building distributed and scalable systems. With phone calls and method calls you: Can get response immediately, once your call is picked up. Must be calling, while the other side is available (greater distance you have, harder it is to negotiate the call). More stressed the other side is, more time it will take before your call will be picked up. And this does not guarantee, that you will get the answer (when the other side is really stressed you are likely to get: we are busy right now, call next millennia). With messages you: Can send a message and then get back to your work immediately. Must organize your work in such a way, that you will not just sit idle waiting for the response. Can send a message any time, the other side will receive and respond, as soon as it gets to the job. More stressed the other side is, more time it takes to receive the answer. No matter what the level of stress is, the other side will still be processing messages at its own pace without any real stress. Since we are mostly interested in building distributed and scalable systems (which can handle stress and delays) messages are a better fit for us, than the direct method calls in the majority of the cases. They allow decoupling systems and evenly distributing the load. Besides, it is easy to do with messages such things like: replaying failing messages, saving them for audit purposes, redirecting or balancing between multiple recipients. Note, that there are cases, where direct calls work better than messaging. For example, querying in-memory cache does not make sense with messaging. Cache is fast and you want to have the response immediately. For an overview of how messaging works together with a scalable distributed system, check out Decide-Act-Report model."}
{"url":"/post/microsoft-lokad-is-windows-azure-platform-partner-of-the-yea/","title":"Microsoft&#58; Lokad is Windows Azure Platform Partner of the Year","text":"Microsoft&#58; Lokad is Windows Azure Platform Partner of the Year I just wanted to share some exciting news with you. Microsoft has just issued a press release for its upcoming WPC 2010 . Lokad is referenced as the winner in Windows Azure Platform Partner of the Year category. “Lokad’s Lokad.Cloud solution earned them the Microsoft Windows Azure Platform Partner of the Year Award,” said Allison Watson, corporate vice president, Worldwide Partner Group, Microsoft Corp. “Lokad.Cloud facilitates the development of enterprise apps on Windows Azure to bring substantial benefits to the growing community of ISVs who target Windows Azure. Lokad’s early adoption of these technologies and impressive solution has made them a leader in the space.” I'd like to congratulate Lokad Team working together with Microsoft and Windows Azure teams to achieve this significant milestone. Stay tuned for more exciting news from Lokad.Cloud and Lokad.CQRS Open Source projects, as we continue to share our passion and experience with the community (while still focusing on delivering outstanding analytics to your business)."}
{"url":"/post/migrating-legacy-systems-to-event-sourcing/","title":"Migrating Legacy Systems to Event Sourcing","text":"Migrating Legacy Systems to Event Sourcing These days I'm working on migrating really legacy system towards the simplified CQRS/DDD design with event sourcing for the cloud. As part of the migration process, I'm reverse engineering legacy SQL database into a stream of events. These events are not precise representation of what has happened in the past (this exact information is irreversibly lost, as in almost any data-driven system), but rather a pretty good estimate that could be used to prepopulate the new version. Essentially, reverse engineering events is about writing a throw-away utility that will scan database tables (MS Access files or punch-cards) and spit out events that could be used to reproduce that state. For instance, consider this customer record in DB table: Customer { Name : \"GoDaddy\", Id : SomeGuid, Created : 2008-13-12, Status : Deleted, Phone : \"111-22-22\", Reason : \"Supporting SOPA was poor PR move\" } This record could be reversed into the following events CustomerCreated!( Id: SomeGuid, Name: \"GoDaddy\", Created: 2008-13-12 ) CustomerPhoneSpecifid!( Id: SomeGuid, Phone: \"111-22-22\" ) CustomerDeleted!( Id: SomeGuid, Reason: \"Supporting SOPA was poor PR move\", Deleted: 2011-12-24 ) Note, that we actually had to improvise while coming up with this event stream: date of deletion was not stored in the original database (we were losing this information). So we are just substituting some predefined date here (i.e. date of upgrade to CQRS/DDD+ES). When you have a system with a few years of history, quite a few events are generated. The system that I'm currently migrating has data that dates back to the early dates of Lokad , hence 300-400 thousand events is something expected. As part of development process, these events are run through the aggregate state objects and also through the projections . The goal here is to pass all possible sanity checks and get read models that match exactly to the UI currently visible in the old system. If new system looks and behaves exactly like the old one (even if the guts are completely simplified), then we are moving in the right direction. Obviously, during this process, a lot of problems show up, especially with logically inconsistent or corrupt data (i.e. accounting inconsistencies caused by race conditions and dead locks in the legacy database). These things are generally to be resolved manually - there is no magical silver bullet."}
{"url":"/post/migrating-to-eventstore-in-windows-azure/","title":"Migrating to EventStore in Windows Azure","text":"Migrating to EventStore in Windows Azure We have a number of systems at Lokad, which use various versions of Azure EventStore from Lokad.CQRS. Our in-house store is nice, but has performance limitations and some administrative burden. Currently I'm working on gradual migration of our systems towards centralised EventStore on Windows Azure. Some community members expressed an interest in the process, so I'm sharing my current thoughts . Administration Here are a few considerations and assumptions that affect our requirements for Event Store server setup: Uptime with many nines is not critical for at the moment. Built-in Azure replication is more than enough for us at the moment. We'd prefer to use single-node deployment of Event Store at the time being (with built-in Azure replication). We're hosting event store in Widows Azure, exposing it to the entire world. There are two main deployment options that fit: Option 1: Install Event Store as Windows Azure Worker Role. This setup requires a little bit of coding but is more native to Azure. However, single-node configuration would require use of CloudDrive (XDrive) to mount Azure Blob storage as Windows Disk visible to Event Store. I've heard people were having problems with this setup in production (data corruption). If we were to switch to clustered version of Event Store, then worker role setup (with local transient disks) would be the preferred approach, without any doubt. Much better performance, too. Option 2: Install Event Store as windows service on Windows Azure Virtual Machine . This option uses host-managed disks to mount the same Azure Blob storage as Windows Disks. These are the same disks that are used to host operating system itself, so they are more reliable. Option 2 is our current path at Lokad right now. Additional setup tweaks include: Host Event Store db on a separate VHD drive , making sure that it has read-caching turned on. Logging is better to go to a separate transient disk (so that NLog would not compete for IO with event storage). Write-caching is better to be turned off. Expose port 1113 to internet as public endpoint . This port hosts TCP protocol of Event Store, which has just got per-stream security available. keep port 2113 as private . This port serves Web UI and projections via REST API. Migrating systems Theoretically, switching systems to new event storage would be rather simple: Replace current event store implementation in code with adapter that uses .NET client to Greg's Event Store; Migrate all data to a new server. In theory this should work. In practice, blindly jumping to a new setup would be reckless. I expect a number of edge cases that might surface simply from the fact that such Windows Azure setup is not native for EventStore : Event Store unavailability due to VM restarts by Azure. This is expected but has to be dealt with. Transient network issues while communicating with EventStore server via .NET client. Changes in network topology by Windows Azure (e.g. EventStore server is no longer located on the same IP). Technically, EventStore deals with many problems via graceful degradation (as opposed to simply giving up). E.g. if it can't connect to EventStore, then it will retry connecting internally. However, in practice, components using EventStore must be designed with this degradation kept in mind, they should degrade gracefully as well (as opposed to simply giving up). 
So the migration process would be: Keep existing production systems on Lokad's Event Store, which is battle-tested on Azure. Set up replication to Greg's Event Store and start moving non-critical projections and components there. Let them run there for a while. If the setup holds - gradually move the existing production systems to the new Event Store. Long-term Strategy If things hold well on Azure, I expect to have a number of design simplifications in Lokad systems. Some examples are: Use the event store instead of the Lokad.CQRS messaging adapter (Files / Azure queues) and avoid Azure message throttling. Switch to catch-up projections, gaining the ability to store projected views on any storage (not just Azure-hosted blobs). This would massively improve performance and simplify client applications. Interestingly enough, the same design improvements make the system less coupled to the constraints of Windows Azure. In my experience, this is a really good thing. Feedback? I know that some of my readers are already using EventStore in production on Windows Azure. Would you kindly share your setup and the problems encountered? Update 1: using RAID for better speed This is an update by Ryan A. You might want to take a look at creating a RAID 0 of 4 disks for Medium (8 disks for Large, 16 for XL) within the VM. This will allow you to utilize all of the IOs allocated to your VM. There is a MAX of 500 IOPS per disk. Virtual Machine and Cloud Service Sizes for Windows Azure In some initial tests, I was able to get WRFL 10 1000000 --> 5576/s running the test client from an external machine in the same datacenter. Albeit, the host was an Extra Large (8 cores, 14 GB memory) with 16 5 GB disks in RAID 0 Update 2: Doubtful performance of Cloud Drive This is an update by Alexey Zimarev CloudDrive is still based on SDK 1.8 and according to Msft people \"the future of CloudDrive is now being discussed\". I was busy deploying RavenDb in a worker role and it works with the StorageClient assembly, which now seems to be a wrapper around the new Storage 2.0 client. However, we can never be sure what Msft will decide to do with CloudDrive. Also, there were some reports about the performance of CloudDrive versus VHD; it ought to be the same, but still. One of the issues with data stored on a CloudDrive in worker roles is a scaling-up problem: when you increase the number of instances for your worker role, new instances are unable to get write access to the database file, since it is under an exclusive lock. In this respect VMs are much more controllable, since you instantiate each machine separately with separate disks, and you set up replication between the VMs the way you want it."}
{"url":"/post/mindsets-in-development-enterprise-vs-startup/","title":"Mindsets in development - Enterprise vs Startup","text":"Mindsets in development - Enterprise vs Startup A while ago we were recording a distributed podcast with Udi Dahan (not yet published, I believe). In this podcast I quite enthusiastically opposed everything Udi said. The entire beginning of the podcast (one or two dozens of minutes) had to be cut out and redone because of that. I personally believe in a good argument (and enjoy one), but we didn't want to make the podcast look non-constructive or scare all the other potential guests away :) This experience got me thinking for a while, making more aware of the arguments and collisions that happen in the software and business worlds these days; especially in parts of it, where new technologies and approaches meet something that has been working for decades. Opening dev center in Ufa provided some additional food for thoughts as well. As it turns out, I was not conflicting with Udi's approaches directly (this also applies to my other ongoing critics of Microsoft or any other enterprise-level entity or solution). I still extremely value their experience and ability to solve problems I would've given up on. Root of the conflict is a bit deeper - in the business environment and the way the problems are structured in it . Let us consider two extremes (exaggerating the slightly just for the sake of making the point). Corporate enterprise is on one of the extremes. It tends to have an extremely formal and regulated environment (it is the only way to keep big system from falling apart) with large budgets, slow processes and established teams. Such an organization usually wins its day if it keeps status quo. All software projects will be affected by such environment. Developers are generally restricted to time-proven technologies, formal architectures and long development cycles. That is exactly the environment, where complex N-Tier designs, GUI tools and multi-page specifications come from, backed up by established stacks like MS SQL + MSMQ + BizTalk/NSB and armies of paid consultants to help whenever there is a problem. Product managers in such conditions are restricted to time-proven technologies and approaches backed up by the books and solid authorities (nobody is fired for using Microsoft). Given the armies of stake holders and complex political games involved around any single project (they would make any Bene Gesserit jealous), it is safer just to do everything in a slow and controlled manner, making the efficient use of the resources available (and there are a lot of these). Small start-up environment is on the other extreme. Such companies do not have a long history, stability of established routines and big budgets; but they possess founders who were crazy enough to favor unpredictability of a new business to a soothing comfort of established company. Start-ups are usually small companies below the profitability state, so they must fight every day just to have resources to move on. Staying in the office at evenings or pulling weekends is not unusual either, just like nervous break-downs. Lack of resources and time are usually compensated by enthusiasm and a lot of personal collaboration. Such an environment has its own drastic effects on the IT projects that happen within. Business people tend to care less about all things that are \"time-proven\" and \"as explained on the training by X\". Keeping the company afloat and moving forward is what matters more. 
Hence IT departments (usually consisting of a single dev, or even a founder himself) are allowed and encouraged to do whatever it takes to solve the problems at hand as efficiently as possible. Technology and methodology do not matter as long as they work. It is even considered acceptable to sign a contract with the devil or dump the entire stack into the cloud, if it would help to cut down the cash burn rate or achieve a milestone. We did that, by the way. Ok, this was just a brief overview of the differences between these extremes. There are many more complexities and details beneath the surface. Books are written about them. The point that I'm trying to make is: these environments have completely different rules of play . Things that get projects delivered in one of these extremes are not guaranteed to work in the other . For example, a start-up might give up on using some complex clustering or replication, because it is too complex and expensive for the job at hand, while an enterprise might take extreme care before rushing into all these cloud things or hand-made linear scalability. The same applies to all other things associated with a successful delivery of products, starting from the management of development projects and up to selecting deployment and evolution strategies. So when we are talking about any complex buzz-word and practical approaches to it (CQRS and DDD being an example), it is recommended to keep this point in mind. It will help both to avoid confusion while listening and to deliver the message while explaining. If some explanation or approach does not highlight these specifics, then we can try to clarify the situation in our own minds by answering a few simple questions: what environment does the speaker come from, what are his financial interests, affiliations and expertise? what context does the methodology or approach target? What constraints does it have in mind (resources, budgets, time, risks, skills, regulations etc)? if some specific technologies are recommended, for which context were they developed? Which companies have an interest in them? What are the lock-in risks and costs? Just to make it clear, when I am personally talking about development efficiency and architectural approaches (for both cloud and on-premise environments), I'm doing that from the perspective of a \"start-up mindset\" where: budgets and time are constrained; really constrained; business does not care about the specific tech or regulate development, as long as the job is done; it is not that hard to find and hire good developers (they tend to find you); motivation and rich inter-personal collaboration tend to replace formal planning and various management practices (as written in the books); there are not enough resources to allocate large teams to a project, one or two devs is all you can get (sometimes even less than one); development can affect the business in order to structure the problem in a way that will be solved more efficiently (by breaking it into smaller steps or realigning business priorities); there are no real requirements up-front; the ever-changing business environment dictates these as the company moves forward; it is more efficient to \"reinvent the wheel\" internally than to ask for paid support. I personally believe such an environment is much more efficient and capable of delivering interesting and exciting solutions, if played out properly. It has the advantage of using the full potential of recent massively enabling technologies. 
High pressure, scarce resources and lack of formal regulation are among the obvious downsides (if you look at them this way). A quick example might be helpful here to illustrate potential differences between the environments and mindsets. A corporate developer might consider data replication, backups and synchronization to be something really complex and extremely expensive (bringing to mind SQL licenses, clusters and various sync frameworks). Another developer, used to event-centric architectures, would consider implementing data replication and streaming to be a boring task that requires just a few lines of code and picking a properly certified cloud storage provider. Both viewpoints are valid, because they are based on the constraints and assumptions present in their respective environments. More than that, each of the developers has a decent chance of failing horribly in the other environment (unless they are flexible enough). My personal \"environmental\" beliefs obviously affect the way I structure problems or choose from various options and technology stacks, when given the luxury of choice. The same applies to all people who write, blog and preach these days. Please keep contextual and mindset differences in mind while listening to and learning from anybody."}
{"url":"/post/move-forward-by-discarding-complex-tech/","title":"Move Forward by Discarding Complex Tech","text":"Move Forward by Discarding Complex Tech Good things are either well-forgotten past or a complete rip-off from the nature. It seems that at Lokad we are going all the way back in time ourselves as well. Over the course of the last few days we had really interesting times at Ufa office, while migrating entire event replication infrastructure to a new model . If you wish, you can call this infrastructure as bounded context of digital nervous system that is represented by green arrows in our context maps . This is a really interesting place for us, since it \"touches\" multiple other bounded contexts and actually crosses 2 clouds and 1 additional datacenter deployment-wise. Change shocks are mesmerizing to observe. Now, instead of a mixture of Azure queue delivery and ZeroMQ streaming, our applications just push large event streams over hand-made HTTP replication protocol . This effectively uses HttpListener and WebRequests, which are: rather performant; dead-simple and well understood; have minimal friction of introducing replication to new projects (ZeroMQ is pretty invasive here, if you go for Azure); can be debugged with a lot of HTTP-based tools. The design is rather simple, practical and works well for streams of half a million of events (albeit performance could be improved a lot). This was really important, since we have now a number of bounded contexts to integrate together and the volume of event streams just keeps on growing. It is curious, how our movement forward towards better and simpler designs happens concurrently with stepping back from complex technologies to much simpler ones . In other words, we gain by discarding things . Another example of such behavior is related to our recent decision to discard ProtoBuf as the storage format for large data objects , while replacing ProtoBuf+Gzip with TSV+Gzip . This applies specifically to bounded contexts that deal with big data . Reasons for that being: ProtoBuf by default loads all objects directly into the memory at once (imagine a dataset of 1 GB), while the default behavior of text files is streaming; For numerical data TSV+Gzip compresses better than ProtoBuf+Gzip, since archivers were initially designed and optimized specifically for handling text data; You can read and parse TSV dataset with tools on any platform, including scripts and Excel. While with protobuf, some intermediate dancing would be required. So, if I can reduce a number of technologies in a given bounded context, while making it more practical and performant, then that's a clear choice. As you can see, in certain scenarios , we are stepping back from cool and smart tech towards something more practical and simple. This \"stepping back\" actually enables us to solve certain problems that exist in this specific scenario. Surprisingly enough, this brings us closer to the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. I certainly didn't expect to see this happening before, not even theoretically. However, in practice, there is a big difference between theory and practice . Caveats Please, keep in mind, that: we are aware of ProtoBuf capability to read items sequentially. 
Caveats Please keep in mind that: we are aware of ProtoBuf's capability to read items sequentially; we will still be using ProtoBuf for serializing messages, including events that are used for our event sourcing scenarios (leveraging for .NET development a wonderful library by Marc Gravell ); these examples just serve to illustrate the possibility of cases where you can move forward by discarding a technology. Specific decisions might not be applicable directly to your case."}
{"url":"/post/my-current-development-environment/","title":"My Current Development Environment","text":"My Current Development Environment A few years ago I wrote a blog post about the development environment for building xLim 2 applications. Ben recently uncovered it and asked a question if anything has changed since then. Yes, it did. 2 years are a huge time interval, especially when you try to learn as fast as you can. So let's talk about my current software environment around the PC workstation which is mainly used for the development (there also is a notebook for develop-while-you-travel and an iPad for learning and planning, but that's a different story which I'll tell later, if you're interested). Side story - what is this xLim? For those of you who are still confused with the whole xLim concept , here's how it showed up. 3.5 years ago, while working remotely as a Developer and Team Lead in an American company I got fed up with the absolute lack of progress in the project (it hit hard the complexity barrier) and decided that it should be possible to do better with much fewer resources, than 5-15 developers. Much fewer meant - just one developer or slightly more. When you have idea, passion and a bravery to follow it (which is sometimes referred as \"childish stupidity\" by more conservative part of the society), then two things are certain to follow: cool name (xLim standing for eXtensible Lightweight Information Management system); a bunch of hard lessons and disappointments. So they did show up on the horizon rather soon and were put to a good use. Practical experience was turned into the first \"lessons learned\" report, which actually opened me the way to the first remote freelance project and even more experience. You can still download the report and associated screenshots from the xLim page (just scroll to the very bottom). Lessons were continuously learned (or learnt in GB English) since then and recorded in form of the reports, articles and teaching materials for the Russian teams I happened to manage and train from time to time. Most part of them was republished as articles within the xLim series. Every time there was a decent paradigm shift in the perception of the problem at hand (and the simplest possible solutions that work in practice), xLim version number was incremented. Every next version incorporated the best of the experience from the previous steps, but reshuffled the whole vision and opened room for new paradigms and further movement forward. Current version is 4 - \"xLim 4: CQRS in the Cloud\" and I still maintain that you can build and run incredibly rich and beneficial applications without spending a lot of resources or time)) OK, let's get back to the point - how does my development environment look right now, in October 2010, while it matches xLim 4. IDE, Required Tools and Libraries Visual Studio 2010 (mine is Ultimate, but Professional is enough) Microsoft .NET Framework 4.0 (TPL and Rx for .NET are a must) ReSharper 5.1 (some things never change, just get better) NUnit for Unit Testing Mercurial (TortoiseHg) or Git (TortoiseGit) for version control (as long as this is not CVS, SVN, TFS or SourceSafe). Autofac IoC Container Protobuf-net for fast and efficient serialization, implementing custom storage formats. ASP.NET MVC with MvcContrib and a dash of jQuery (completely replaces DXperience in web). SQLite for file-based databases with incredible performance, SQL syntax and schema-tolerance (i.e.: you can put 4MB Blob into INT field). 
Reflector for .NET Putty SSH client . Gmail. RabbitMQ - preferred message server implementing AMQP (in cases when I don't have to use Azure Queues) As you can see, SQLite is the only persistence-focused item in this list. There is no SQL Express, mySQL, Oracle, or NHibernate+FluentNHibernate+LINQ with all the libraries it uses. I'm still using them on a daily basis because of existing projects (or databases we need to integrate with). Plus, NHibernate is featured in Lokad.CQRS (as a single library) because that's what people are mostly used to. However, in my personal mental model (and while evolving projects) ORMs are being rotated out along with the relational databases - with great success, more flexibility and simplicity. Check out the article on CQRS, Reliability and Redundancy . It may not seem particularly simple in theory, yet when it gets to the code (backed up by the reasoning), things do get simpler and more efficient. In addition to a few things being removed or replaced, there is a new and critically important piece: cloud computing . Virtual Machines from Rackspace Cloud are used frequently. For example, when I need to test against some version of a DB engine which I don't want to install on my development machine. That's 1-3 minutes to provision Ubuntu and SSH into it, and a few more to apt-get and configure a mySQL instance. That's a bit faster than Windows Azure (which takes just half an hour to update a Worker Role), which I also happen to use at Lokad. Obviously, cloud computing, as perceived from the standpoint of efficient development, is not just a quick way to get VMs to play with and pay pennies for that. Cloud storage and on-demand computing are actually affecting the whole concept of xLim 4 and allow an individual to build systems that could compete with corporations in scalability and flexibility (and still fit within a limited budget). The concept of almost-infinitely scalable systems (as applied to CQRS) would be much more complex to achieve without cloud computing. Just check out the CQRS Roadmap for a better high-level picture. Recommended Tools Launchy (keystroke launcher that I also configured to open ticket shortcuts in various trackers). FSCapture - for making screenshots. Fiddler2 - the best tool to debug AJAX calls or REST APIs (esp. when you need to prove to the Azure team that their APIs do not behave according to the documentation) JetBrains dotTrace profiler KeePass Password Safe - for keeping my passwords, logins and sensitive configuration settings. JungleDisk (Desktop Edition) - for backing up my files into the cloud (in an encrypted way) and transparently syncing them between the machines. FoxitReader - because Adobe PDF reader is overkill. Evernote - to gather small notes, articles and reference materials. 
Open source projects benefit significantly from offers like Google Projects and Google Sites . At Lokad we are currently using CodebaseHQ (Mercurial hosting + issue tracker), Google Sites and self-hosted CruiseControl.NET (yes, this one is antique, but it still does the job and even handles a bit of misuse by Lokad R&D). In short, you can get yourself a rather flexible development environment for distributed teams either for free or at the cost of a few dollars per month. Just add Skype to it. Reference Materials The CQRS section aggregates various references specific to the architectural approaches of the current xLim version (and all the solutions I'm building). As for the pre-architecture requirements of xLim 4 (development practices, recommended reading, design guidelines, low-level building blocks, producing decent code etc.) - that would be a rather long story to tell within this post (or probably even a series). For references I just recommend looking through the previous versions of xLim, discarding everything that does not fit into the CQRS approaches or does not scale in the cloud. NB: The composite application approach still holds, but all these Enterprise Application Blocks or universal frameworks just do not work out. You can rapidly develop various client applications on top of CQRS without any need for that complexity. Hardware Actual hardware is nothing special these days and hasn't been upgraded for a year. Yet it works for me (especially when compared with more painful configs). Let's start with the most important things: important! Microsoft Natural Ergonomic Keyboard 4000 (after more than 10 years of coding my wrists are quite touchy about keyboards and eventually start burning on any other keyboard) important! Microsoft Natural Wireless Laser Mouse 6000 (same reasons as the keyboard, although I would try to get one of Logitech's trackballs these days). important! Ergonomic chair designed for workers-of-the-keyboard (I picked one at the local store) important! Two decent wide-screen LCD monitors. And they had better be connected to DVI/HDMI ports (ATI Radeon HD 4800 in my case), unless you want your eyes to hurt by the end of the day. Less important stuff: Intel Core 2 Quad CPU Q6700 @ 2.67 (would use an i7 these days) 8 GB of RAM RAID 0 HD setup for faster disk IO for the OS and project files (would use an SSD these days) Decent power box to supply the tower with all this hardware. Note that I am not doing any over-clocking to \"get more bang for the buck\". It is just not worth it. Tools For the Project Manager Microsoft Office 2010 (Word, Excel, Powerpoint) Microsoft Visio 2010 - for creating all these nice diagrams you see in my blog. As you can see, we've lost MindManager, MS Project, NDepend and Source Monitor from our list. Complex management software is no longer needed (at least personally for me) due to working in an environment that rapidly changes and sometimes requires a new feature to be designed, implemented, deployed and stabilized within 1-5 days. At such a pace well-thought-out project plans become obsolete really fast (within the next few days) and become just a waste of time. Instead, the Amazon.com development style starts working better and better. The Getting Things Done methodology, priority lists and OmniFocus (which unfortunately runs only on iPad or Mac) help to stay sane and keep all things under control even when there are multiple projects running concurrently with multiple tasks that have volatile priorities. 
Word, Visio and Moleskine help to do the actual planning, brainstorming and architecture design. Summary Now, looking back at the changes within these couple of years, I can clearly see a reduced dependency on libraries, frameworks and specific tools (aside from Visual Studio + ReSharper). There's a clear tendency towards getting rid of relational databases and layers of persistence, and towards using more of the capabilities offered by the cloud. Projects get smarter, more reliable and more efficient not because of the building blocks used, but because of a lot of theory being put into them and simplicity being enforced at all levels. Actually, I believe simplicity to be the hardest target to arrive at. It is easy to develop a complex solution. It's much harder to develop a solution that achieves the same in a simpler and more straightforward way. This at least requires some experience, and I feel like I'm standing at the beginning of the road towards this experience. Despite these changes I still maintain that a single developer with limited resources can build and manage rather interesting solutions; especially if he uses these resources efficiently, keeps everything simple and thinks a lot before doing anything (obviously, the term \"interesting solution\" now means a bit more than it did 3 years ago)."}
{"url":"/post/new-case-study-on-cqrsguidecom/","title":"New Case Study on CQRSGuide.com","text":"New Case Study on CQRSGuide.com I've published final version of a new case study about one of latest Lokad's projects. This project had to go live on February 29th. This was caused by an unfortunate sequence of events with our cloud provider and immediate need to fail over to a different provider in order to keep some of the services running. Case study provides rough overview of the most important CQRS/DDD concepts that made their way into this project. Some of them are already presented in Lokad.CQRS Sample Project. Some other are yet to be ported into the open source. CQRSGuide.com Case Study: Lokad Hub These are kinds of things that enable us to solve everyday problems, help customers and try to push the state of the art. BTW, if you happen to be a fan of Lokad forecasting technologies - stay tuned. There is a really big thing being prepared by our awesome analytics team. Announcement will come soon."}
{"url":"/post/one-example-of-using-message-driven-design-at-lokad/","title":"One Example of Using Message-Driven Design at Lokad","text":"One Example of Using Message-Driven Design at Lokad Lokad Salescast is an inventory optimisation platform for retail , capable of dealing with big datasets. It gets inventory and sales information and does some number crunching . Produced reports tell when you need to reorder your products (and how much) in order to serve forecasted demand and avoid overstocking. One of the objectives of Salescast is to make it available and affordable for small customers. Hence we introduced \"Express Plan\", which is free for small customers, but comes without any support. Making software free is easy. Making software usable without support is much harder. So Lokad developers had to create complicated heuristics to help customers deal with the problems. TSV parsing is one of problematic regions. Even though the major scenario for big data transfer at Lokad is \"upload TSV-formatted text files to FTP\", there are multiple things that can go wrong with this simple setup. No matter how precise is tech documentation, people can always miss seemingly unimportant things that are critical for computers. Here are some examples: text encoding of files; culture-specific format of dates; culture-specific format of numbers; optional columns in invalid format; required columns missing; missing files; non-standard separators. Yet, we are trying to provide the best experience out-of-the-box even with improperly formatted data. This would require doing a lot of smart TSV analysis in code. Here's how an output of one analysis process would look like (latest log entries at the top): Message-driven design patterns help to develop and maintain such logic . Public contract of it in the code might look like a simple function (with complicated heuristic inside): static IMessage[] AnalyseInput(SomeInput input) { .. } Here messages are strongly-typed classes that explain return results of that function (unlike event sourcing, they are not used for persistence). For example: public class UsedNonstandardExtension : ITsvFolderScanMessage { public readonly string Extension; public UsedNonstandardExtension(string extension) { Extension = extension; } public virtual AdapterTweet ToHumanReadableTweet() { return new AdapterTweet { Severity = AdapterTweetSeverity.Hint, Tweet = String.Format(\"Salescast found Lokad TSV files using\" + \" non-standard extension {0}.\", Extension), }; } } Function would return one or more event messages . Various input scenarios might be unit-tested using given-when-expect approach , where we express test case as: given certain inputs ; when we invoke function; expect certain outcomes and assert them (e.g. verify that we get expected messages). 
Or in code: public sealed class given_compressed_files_in_txt_format : tsv_folder_analysis_fixture { public given_compressed_files_in_txt_format() { // setup all expectations in constructor, using helper methods // from the base class given_files( \"Lokad_Items.txt.gzip\", \"Lokad_Orders.TXT.gzip\" ); } [Test] public void expect_detection_with_extension_warning_and_compression_hint() { // assert expectations, using helper methods from the base class expect( new TsvFolderScanMessages.UsedNonstandardExtension(\"TXT\"), new TsvFolderScanMessages.CompressedFilesDetected(), new TsvFolderScanMessages.StorageDetectionSucceeded( TsvInputFile.Item(\"Lokad_Items.txt.gzip\").WithGzip(), TsvInputFile.Order(\"Lokad_Orders.TXT.gzip\").WithGzip() )); } } This is an example of a single test scenario. There could be many others for a single function, reflecting the complexity of the heuristics in it. Each of these test scenarios shares the same \"when\" method and the helpers to set up \"given\" and \"expect\", so they are pushed to the base fixture class, which can be as simple as: public abstract class tsv_folder_analysis_fixture { readonly List<string> _folder = new List<string>(); ITsvFolderScanMessage[] _messages = new ITsvFolderScanMessage[0]; protected void given_files(params string[] files) { _folder.AddRange(files); } [TestFixtureSetUp] public void when_run_analysis() { // this is our \"When\" method. It will be executed once per scenario. _messages = TsvFolderScan.RunTestable(_folder); } static string TweetToString(ITsvFolderScanMessage message) { var tweet = message.ToHumanReadableTweet(); var builder = new StringBuilder(); builder.AppendFormat(\"{0} {1}\", tweet.Tweet, tweet.Severity); if (!string.IsNullOrEmpty(tweet.OptionalDetails)) { builder.AppendLine().Append(tweet.OptionalDetails); } return builder.ToString(); } protected void expect(params ITsvFolderScanMessage[] msg) { CollectionAssert.AreEquivalent(msg.Select(TweetToString).ToArray(), _messages.Select(TweetToString).ToArray()); } } If you look closely, you'll find a lot of resemblance to specification testing for event sourcing. This is intentional. We already know that such tests based on event messages are non-fragile as long as the events are designed properly. This additional design effort pays for itself really quickly when we deal with complicated heuristics. It makes the development process incremental and iterative, without fear of breaking any existing logic. Step by step, one can walk around the world. In essence, we go through all the hoops of expressing behaviours via messages just to: express diverse outcomes of a single function; provide a simple functional contract for this function; make this function easily testable in isolation; ensure that tests are easily maintainable and atomic. Downstream code (code which will use components like this one) might need to transform a bunch of event messages into some value object before further use, but that is a rather straightforward operation. Interested in diving deeper into Lokad development approaches? We are looking for developers in Paris and Ufa. You can also learn some things by subscribing to the BeingTheWorst podcast, which explains the development approaches of Lokad."}
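For illustration, here is a minimal sketch of what the body of such an analysis function might look like - hypothetical helper names rather than the actual Lokad code, with usings omitted as elsewhere in this post:

static ITsvFolderScanMessage[] AnalyseFolder(IEnumerable<string> files)
{
    var messages = new List<ITsvFolderScanMessage>();
    foreach (var file in files)
    {
        // non-standard extensions are reported as hints, not as failures
        if (file.IndexOf(".txt", StringComparison.OrdinalIgnoreCase) >= 0)
            messages.Add(new UsedNonstandardExtension("TXT"));
    }
    // ... more heuristics would go here: encodings, separators,
    // missing columns, date and number formats, and so on
    return messages.ToArray();
}

The important property is that the function is pure: given the same folder contents it returns the same messages, which is exactly what keeps the given-when-expect fixtures above stable.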
{"url":"/post/our-ultimate-technological-challenge/","title":"Our Ultimate Technological Challenge","text":"Our Ultimate Technological Challenge // beware: this article is really important for me, but might bore you to death. To make it worse, I will mention Justin Bieber down there. Our world seems to be in a complete mess these days . A few dozen of grown-up people chasing a ball on a grass field get much more attention than news from war zones on East. Students know more about personal life of Rihanna than about genocide in Rwanda in 1994. And amount of money wasted by Facebook on Instagram is equivalent to total funding of SpaceX company since it's creation till May 2012 (world's first privately held company to send a cargo to the International Space Station). Yes, (mass murder < pop singer) && (cute photos == space flight) . Given ever accelerating speed of changes, such disparities and problems are only to going to get stronger . Personally I don't give a damn about musical taste of rich countries, but when kids are dying from starvation and malaria in Africa - this is just not right. This is not the world I was promised back in soviet school and definitely not the world I would want to pass to my own children. Version of Matt Harding is better. When something does not go as it supposed, you give it your best shot to straighten things out. So, what can we change? Most importantly, how it can be changed? Let's start with the simple assumption. In order to make things right, we need to change how entire nations think and act : both rich nations with abundance of resources and poor ones. That's the very thing that Gustave Le Bon called \"soul of the nation\". It was supposed to be extremely hard to alter, nearly impossible. Fortunately, things improved a little bit since then. As it turns out, it can take as little as a clever excuse and a staged act to change one nation . For example, this was vividly demonstrated on Easter Sunday Parade of 1929 in New York. Edward Louis Bernays was doing a mission for tabacco companies which were upset by loosing huge potential profits to a superstition that women should not smoke cigarettes. Edward (who later became known as \"father of public relations\") simply combined ideas of Wilfred Trotter and Gustave Le Bon (with a few other sources of inspiration, including his famous uncle) and simply hired a few girls to start smoking torches of freedom as a protest against sex taboos. Of course, he also tipped local reporters about upcoming feminist protest and hired photographers to make sure, that good and plausible pictures are available. Faster than you know it, the entire USA fell for this act and started discussing. Needless to say, that tobacco companies were happy with the sales in the next years. \"Tools\" used in this change only improved since then: we've got internet, social networks and massive spread of smartphones and plain cellphones. These are the things that actually made it possible for the recent revolts in North Africa to take place (if you know Pieter Hintjens, the guy behind ZeroMQ project, check out his guide to Digital revolution ), and they keep improving over the exponential curve. If you should know one thing about exponential curves - that's the kind of thing that helped Chernobyl Disaster to happen (in addition to flaw in the design of control rods). 
So these days civilization has an enormous potential available at the fingertips of anybody with access to the internet: from the limitless power of cloud computing to the worldwide penetration of mass media and cell phones (and their smarter brothers). Couple this with the herd instincts baked into our DNA by thousands of years of evolution, and you have outstanding things taking place: the good, the bad and Justin Bieber. Combining digital and social resources can have an outstanding effect even in constrained situations. A couple of years ago, I played a small role in helping a few kids to start a social project to support orphans. It wasn't that easy, but MyDreamCity went worldwide and is still running. At the heart of the process was simple enthusiasm supported by some basic IT infrastructure. Take this to a higher scale, and you can probably have a shot at improving a nation. I'm thinking about two directions that can be exploited: education and efficient use of resources. The former - to make a positive change at the most vulnerable point of any society - kids; the latter - to actually provide some real foundation for these changes to stand upon (money is the blood of our society, as we've learned well by living through the crumbling Soviet Union). Fortunately, I'm working at a place where we study optimization of resource consumption at all levels - from high-level organizations down to individual households. Plus, there is some limited experience teaching at my own university, coupled with a little bit of community work in the development field. This is not enough to make a real change right now, but enough to give the unsettling feeling that something can really be done to improve the situation. So, imagine for a second that you have the ability to reach every human being in this world (including those kids in Somalia and Zimbabwe), provide inspiration, support, answers to any common question and access to all the knowledge of humanity. Don't worry about computing resources and don't worry that much about money. Technology can easily provide all that, especially today and tomorrow. The question and ultimate technological challenge is: What could be done with all this in order to improve our world within the next 10-20 years?"}
{"url":"/post/parameter-vs-argument/","title":"Parameter vs. Argument","text":"Parameter vs. Argument Parameter is variable in the declaration of function. Argument is the actual value of this variable that gets passed to function."}
{"url":"/post/people-dont-think-in-tables/","title":"People Don't Think in Tables","text":"People Don't Think in Tables Within the last few years I've seen major change in UI approaches in the systems I build. The change is caused by a shift from relational databases as persistence to something that does not require an SQL server (or DB server at all). Take a look at this old UI of mine. It's pretty powerful (huge amount of buttons proves that), is intuitive for geeks and scary for normal people. What's more curios, this UI is shaped by the relational persistence model that was created in mainframe days, when storage was expensive, memory even more so, and data had to be heavily normalized. However, technology has advanced slightly since then, and we no longer need to fit our UIs to something that is so spreadsheet-like. We can prevent user from overexposure to underlying data complexity, especially when data itself is simple, but just happens to be stored in a computer-optimized way. Think of it again. We turn UI experience of users in sudoku game just because we know how to store data in a way that was good 20 years ago . No wonder Apple is making so much money - at least they don't try to turn every UI into spaceship control panel: People don't think with tables . They would prefer something that is more simple, even if it does not benefit from a fancy styled UI framework. Something that does not involve solving sudoku puzzle every time you need to find a simple answer to your question or do your job. or where we have the most simple way to search complex structured info, as patented and proven to be useful by google: It is surprising, how underlying technologies affect what we build with them , is not it? As you can guess, first 2 screenshots depict UI that runs on some sort of SQL, while the other UIs were influenced by a simple NoSQL storage model (to be precise: CQRS/DDD+ES model that uses for UI persistence blobs for cloud persistence and plain files for on-premises deployments). BTW, I know that quite a lot of people are using Lucene to provide full text search capabilities for their read models. However in the simplest case (esp. with cloud deployments), it's easier to write a few lines of code as opposed to dealing with one more dependency. Another side effect is that with the change of underlying tech from SQL (and more importantly - thinking model), there is less need to resort to data-aware UI frameworks (mostly used on conjunction with desktop apps) in order to provide rich user experience. This allows to have web-based applications that don't force framework installations (good luck with getting .NET or Java on all machines in some large organization with a lot of momentum) or complex upgrade routines (i.e. ClickOnce auto-update or automatic MSI updates). I'm not talking about development simplicity and cloud-scalability - these come for free, and we don't really value things unless we struggle for them. Obviously, as you can see from screenshots above, my UI skills are still below that of a kid with a box of crayons but lately Bootstrup and jquery have provided a really nice experience, especially, when you can reuse some great ideas and frameworks. Wait, till we start getting into mobile experience with HTML5. It actually comes almost for free as well, since latest web layout frameworks provide a lot of mobile experience out-of-box. NB: In no way I'm implying that this second approach is anywhere near close to being reasonably simple. 
And it's obviously not pretty (especially the UIs that I hack together), but I think this is going in the right direction. At least less time is wasted on writing UI docs and providing support."}
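To make the "few lines of code" remark about full text search concrete, here is a rough sketch of the kind of naive scan that is often enough for small read models. The view type and its SearchText field are hypothetical, and the assumption is that the views comfortably fit in memory:

static IEnumerable<CustomerView> Search(IEnumerable<CustomerView> views, string query)
{
    // split the query into terms and keep only views that contain all of them
    var terms = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    return views.Where(v => terms.All(t =>
        v.SearchText.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0));
}

Once the dataset outgrows a linear scan, that's the point where Lucene (or a similar index) starts paying for its dependency cost.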
{"url":"/post/practical-cqrscloud-samples-and-an-ebook-does-this-interest/","title":"Practical CQRS/Cloud Samples and an eBook&#58; Does This Interest You?","text":"Practical CQRS/Cloud Samples and an eBook&#58; Does This Interest You? One of the things I've learned from the production experience was the extreme value of getting feedback for your project as soon as possible. This might save you wasting a lot of effort, time and resources later on (or even save the project). There is thing that I wanted to share for the same reasons - to get an early feedback. As probably some of you know, in addition to my Lokad work (focusing on Windows Azure, Lokad.CQRS, high scalability and all sorts of fun stuff) I'm currently playing with various learning CQRS+Cloud projects in my free time (or of what remains of it). The goal is to learn continuously and keep personal development ahead of what is required by Lokad projects (this allows to work around potential problems and capture some nice opportunities). Besides, real-world experience also helps to straighten the theoretical conclusions by polishing them with the practice. Some things might look great on paper and diagrams but be completely useless in the production. The only way to find them is by trying. Preferably the first practice should not happen in high-importance production projects, so this is another reason to try and keep learning separate from production. Each logical step or concept that is figured out, helps either to make the next step in scalability, complexity and flexibility of your projects. Or you can just make your existing projects simpler and less expensive. Salescast project, that I've been mentioning a couple of times in this blog, benefits from both. Just like a few others. For example, compare concepts from xLim 1 of year 2007 (at the bottom of the xLim section ) and the latest articles ( xLim 4: CQRS in the Cloud ). I think there was some subtle progress made by figuring out the theory, field-testing it in production and learning from it. Right now I'm thinking about developing a few quick projects , that target some real-world business scenarios in a quick, dirty, but cheap and reliable way. Something like \"Practical Cloud with a dash of CQRS\". No complex ivory-tower theoretical constructs but rather implementations on top of this logic with hacks, where this makes sense. Eventually I'm planning to release these projects as open source . In parallel I'm also thinking about composing an eBook on the subject with detailed dive into these projects , explanations of how, why, theoretical references and recommendations on handling extreme scalability and complexity cases (backed by the actual Lokad production experience). Basically, a quick learning course about practical Cloud/CQRS the way I would've given it if I were back at the university. For the projects I'm deliberately trying to pick some niche where it's easy to get an application running and there is potential for lots of visitors coming for some free service (hence, creating real random stress, testing and verifying the underlying foundations of CQRS and Cloud development). It'll be probably non-Azure implementation as well (Azure is currently a perfect fit for Lokad, but for an ISV with extreme constraints on resources and friction it currently would not work out that well due to a number of reasons). 
The eBook itself will come for a price and will be a more convenient, detailed and guided alternative to finding your way through the xLim/CQRS articles as they jump across various topics (the original articles and materials will still be published and shared in the usual manner). Again, just like a learning course. BTW, for those who want to start learning CQRS/Cloud right now - you can just check out the CQRS references section (links to various videos, articles and blogs that I've gathered), dive into the first samples and tutorials from Lokad.CQRS for Azure and work your way through the xLim series (from bottom to top). Greg's documents, video and project are probably the most valuable reference of them all. Now, these are just thoughts. Nothing is in motion (except for the first learning project, which was an inspiration and foundation for the latest articles on CQRS and the xLim series). I wonder: What do you think? (even a Yay or Nay would be helpful) What questions about practical CQRS/Cloud are the most pressing and important for you now? What do you think about the projects? Would such samples (published as open source) help you out in your endeavors, and how? Do you have any ideas about areas where a simple CQRS application would help to solve some problem and would face a bit of scalability stress while doing so (preferably dynamic stress)? There is quite a bit of material on CQRS already. Would you still be interested in an eBook that explains and walks through the theory, practice and experience in a structured and organized way fit for learning (focusing on practice and implementations)?"}
{"url":"/post/processing-big-data-in-cloud-la-lokad/","title":"Processing Big Data in Cloud à la Lokad","text":"Processing Big Data in Cloud à la Lokad Let's talk about a simple approach to visualise, model and deliver complex large-scale data processing tasks. Such tasks would deal with datasets that are so large, that they don't fit into the memory of a single machine and would also take ages to compute on a single machine. These datasets can often be referred to as \"BigData\". Such tasks would benefit from distributing out the work and storage load between relatively cheap machine instances that are made available in the cloud (either public or \"private\"). We would also want to optimize our consumption costs by get these resources only when they are needed and releasing afterwards. Let's also assume that such processing task, requires complex sequence of steps in order to complete (more complex than a mere MapReduce). Some steps must be processed before others can start, while others can work in parallel batches. Actual steps of the job are idempotent, messages can be delivered more than once or simply fail. Here is an example of how such processing graph could look like: That's how I would approach such problem in the situation, when: development resources are limited; data processing model is not formally established and is likely to evolve; team (or a single developer) is familiar with event sourcing. I would split the problem domain into two separate bounded contexts: Orchestration and Data Processing. Bounded Contexts Orchestration Bounded Context will be responsible for navigating data processing graph and orchestrating the individual jobs. Behaviors for that will be captured inside an aggregate root that uses event sourcing for persistence (AR+ES) for better testing and getting persistence mismatch troubles out of the way. Deployment-wise, this aggregate can live in a separate machine and would be configured in such a way, that all commands to this aggregate are synchronized and executed on a single thread (just a routing rule for messages). AR+ES just schedules batches of tasks that can be executed an parallel, and issues second batch only when the first one is complete. Should there be any message duplication (always a possibility in the cloud environments), AR+ES can easily track and drop duplicates by keeping hashes of already completed task batch identifiers. Data Processing Bounded Context will be implemented using a set of command handlers that consume work commands from an input queue and process them. These handlers would operate upon data that is stored somewhere in the cloud and is considered to be immutable for the duration of the specific big data process. Commands and events can contain meta-data, parameters and references to this immutable data. In essence, command handler is a function that will take as input a command (which could contain a reference to larde immutable data blob), perform certain operations and publish an event (optionally saving some large processing data into another data blob). Multiple instance of command handlers would be picking commands from the input queue in this bounded context. In essence, they would be competing for the jobs, just like clercs in the bank \"compete\" for customers standing in line (customer is handled by only one clerk). However, we would be better than a bank, since if we can always massively increase the number of command handlers handling the load, by instructing cloud fabric to provision more machines. 
Both bounded contexts subscribe to all important domain events of each other. Important: we should differentiate between the actual data (which is so large that it does not fit into a single machine/process) and the behavioral metadata. The former is accessed only by the data processing bounded context, while the latter is passed within the messages between both bounded contexts. For example, the number of time series in a dataset is metadata, while the values within these time series are the actual 'raw' data. The Orchestration Aggregate uses that metadata to make decisions that 'drive' the process through the graph. Flow of work Let's say we have a ProcessAggregate that contains the orchestration logic for our complicated MapReduce process. When this aggregate starts, it simply publishes X events that say something like TaskAScheduledEvent(processId = 1, taskId = guid) Note: there is a more elegant way to do this, but that would require going deeper into DDD. These events are received by a Receptor (or Port) in the second bounded context, which translates them into instances of ProcessTaskACommand . These command messages are passed into the queue from which multiple worker machines pick their jobs. When a command handler finishes processing the task, it sends a TaskAProcessedEvent , which gets routed back to the ProcessAggregate as ConfirmTaskAResults(taskId = guid) Within the aggregate we: mark the task as confirmed (unless it has already been reported due to message duplication); if this task completes some batch and enables further processing, schedule more tasks for cloud execution (a sketch of this confirmation logic appears later in this post). We can also define a timeout view that simply lists all tasks that are currently running. A timeout manager (a simple process) regularly checks this view and sends \"TryTimeoutTaskX\" to the aggregate. The aggregate checks its internal state, and if the task indeed has not been processed, decides either to reissue the task or to terminate the whole process (yes, we essentially implement our timeout tracking as a business process within the Lokad.CQRS architecture style). Gotchas Advantages of this approach (esp. if aligned with the Lokad.CQRS architecture style): no need to worry about persistence of the complex object that represents our graph decision logic; orchestration logic can be explicitly tested with specifications (and documented as such); we can easily migrate between multiple versions of the data process without downtime or stopping processes; the process can easily be developed and debugged on a local machine, while being deployed to any cloud afterwards; we use the same approaches and ideas that are used within the Lokad.CQRS architecture style for modeling more conventional business concepts (this lowers the learning barrier and allows reusing answers to some common problems). Drawbacks of this approach: it requires certain development discipline (and familiarity with cloud computing and AR+ES); at the moment of writing, there is no prepackaged infrastructure for event sourcing that works out-of-the-box; performance of this approach would be somewhat inferior to a finely tuned functional-style map-reduce implementation; this is a batch-processing approach, which is not fit for real-time processing (yet). In short, with this approach we trade some performance for development and deployment flexibility. This enables us to rapidly model and implement a big data process (especially when requirements are still changing). After the process is formalized, we can always fine-tune and optimize bottlenecks. 
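To make the orchestration side more concrete, here is a minimal sketch of the confirmation logic described in the flow of work above - names simplified, with state rebuilt from events as usual with AR+ES:

public sealed class ProcessAggregate
{
    readonly HashSet<Guid> _confirmed = new HashSet<Guid>();
    readonly List<IEvent> _changes = new List<IEvent>();

    public void ConfirmTaskAResults(Guid taskId)
    {
        if (_confirmed.Contains(taskId))
            return; // duplicate delivery from the cloud queue - just drop it

        Apply(new TaskAConfirmed(taskId));

        if (CurrentBatchIsComplete())
            ScheduleNextBatch(); // emits the next batch of scheduled events
    }

    void Apply(IEvent e)
    {
        // mutate in-memory state and record the event for persistence
        var confirmed = e as TaskAConfirmed;
        if (confirmed != null) _confirmed.Add(confirmed.TaskId);
        _changes.Add(e);
    }

    bool CurrentBatchIsComplete() { /* compare against scheduled tasks */ return false; }
    void ScheduleNextBatch() { /* walk the processing graph */ }
}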
Frequently, though, you will find that it is cheaper to add another server (worth 100 EUR per month) than to waste multiple development days of brilliant developers on performance optimizations. Heads up: the entire infrastructure does not need to be really performant, with one exception - if you are pushing hundreds of thousands of messages within a single process, then it's worth investing effort in the messaging infrastructure (e.g. direct communication with ZeroMQ), otherwise latency will kill everything. The event stream for the actual process aggregate can simply be cached in memory. Deployment Options Below are some deployment variations that could be used within this approach. We can implement our core processing logic without any coupling to a specific deployment environment and then deploy it in various configurations. The latter would require just re-configuration and optionally providing some specific adapter implementations (for messaging, event sourcing and large BLOB streaming). Alternatively you can have the same project prepared for multiple deployment options from the start. Local development machine: the orchestration bounded context runs as one thread; multiple data processing command handlers run either as parallel threads or as multiple instances of a single console app; the file system is used for message queueing, persistence of large binary files and event streams for aggregates. Windows Azure Cloud: the orchestration bounded context runs in a single worker role (e.g. an instance of a Lokad.CQRS-based engine); data processing handlers run as additional Windows Azure worker roles (you can configure them to run on X different threads within Y worker role instances); large data is streamed to Azure Blob Storage, just like the event streams for AR+ES entities; Azure queues are used for messaging. Amazon Elastic Compute: the orchestration bounded context is a single VM, while data processing command handlers run within multiple replicas of another VM. We scale by adding or dropping instances of that second VM. Amazon S3 storage is used for persisting large binary data, while a local instance of RabbitMQ is used for messaging; event streams could be persisted locally within the orchestration VM. Obviously, these are just some of the options. You can have a completely different scenario, based on the specific resources, requirements, risks and constraints within your project. In each of these cases, elastic scaling can be done by implementing a simple task that watches the number of messages waiting in the command queue of the data processing bounded context and adjusts the number of command handler instances accordingly (a sketch appears at the end of this post). Final Words This approach is not a silver bullet. It just summarizes some limited experience gained while developing and maintaining non-realtime big data processes that could be hosted both in the cloud and on-premises. As such, it can have numerous applicability limitations (especially if you are working within a constrained enterprise environment). Some alternative approaches and references worth mentioning are available in the reading list on Big Data. However, if you need to quickly deliver some scalable multi-step data process with one person, no money for expensive software licenses and just a few weeks of time, then this approach might give you some ideas. 
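For completeness, here is a rough sketch of the scaling watchdog mentioned in the deployment section. The _queue and _fabric members stand in for whatever queue and provisioning APIs your cloud exposes, so treat all names here as hypothetical:

void WatchQueueAndScale(CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        // the backlog size is the only signal needed for this kind of scaling
        var pending = _queue.GetApproximateMessageCount();
        var desired = Math.Max(1, Math.Min(MaxWorkers, pending / TasksPerWorker));
        if (desired != _fabric.CurrentWorkerCount)
            _fabric.SetWorkerCount(desired); // provision or release instances
        token.WaitHandle.WaitOne(TimeSpan.FromMinutes(1));
    }
}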
If you want to read more along these lines, here are a few more relevant posts: Anatomy of Distributed System à la Lokad Bird's-eye view of a Distributed System - Context Map Software War Starts with a Map, Context Map DDD: From Reality to Implementation DDD: Evolving Business Processes a la Lokad"}
{"url":"/post/reading-list-on-big-data/","title":"Reading List on Big Data","text":"Reading List on Big Data This is purely theoretical blog post to summarize my last few days of studies into the big data (which were triggered by one homeless guy and sequence of highly unprobable events that actually took place). No fancy intro, just assuming that dealing with big data processing is really cool or at least has an outstanding financial reward potential (given the trajectory taken by modern IT and economics). Current Approach at Lokad Or at least - simple part of it, that is not touched yet by secret fairy dust of our outstanding analytics team. In short, big data is bulky, complex, requires a lot of CPU and does not fit in RAM. So we've got to break calculations into smaller batches . We want to process everything as fast as possible, so we push these batches to separate machines . Concurrent processing can cause a lot of problems with race conditions, so we cheat : Batches of big data are immutable once written (that should scare anybody who is coming from the old world where storage was expensive and SQL ruled the world). So we can share them. We keep bulk processing as dead-simple immutable functions, that are woken up by a message, consume some immutable data, do heavy CPU job and then produce output message (which could include reference to a newly created immutable block of data) These immutable functions construct a processing graph of (extremely simplified) map reduce implementation, that can actually do a lot of things. When graph elements need synchronization, we do that via messages that flow to aggregate roots controlling the process. They don't do any heavy-lifting (they don't even touch actual data, but just metadata level information), but encapsulate some complex behaviors to navigate execution through the computational graph. We don't have problems with implementing this seemingly complex part (or testing it), since Greg Young and Eric Evans provided us with CQRS/DDD toolset and some event sourcing . This is a poor-man's architecture that can be implemented by a single developer from scratch in a month. It will work and have decent elastic scaling capacities, provided that you have abundance of CPU, Network and Storage capacities (which are provided by cloud environments these days). This approach is explained in a lot more detail in Processing Big Data in Cloud à la Lokad Potential caveats: Domain modeling should be done carefully from the start. Lay out the computations and think through the algorithms. Cloud queue latency and complexity would be your bottlenecks in the cloud (all the other limitations are solved by adding more VMs) This is a batch-processing approach, which is not fit for real-time processing. Yes, this is a hand-made implementation of MapReduce. How can we improve that? So what are the limitations of the previous approach? Complexity Messaging latency Absence of real-time processing First complexity limitation can be handled by separating the system into separate elements. No, I'm not talking about layers (these will do more harm than good), but rather separate bounded contexts, that have clear: boundaries; language; contracts for exchanging data and communicating with other bounded contexts; whatever choice of technology that is fit. Second limitation of messaging latency can be worked by saying au revoir to any solution with man-in-the-middle architecture (in cloud these implementations are called \"cloud queues\" or \"cloud service buses\"). 
Broker-based architectures are a logical dead-end for large-scale distributed computations (just like relational databases are for persistence). They limit our scaling capabilities. So we need to rewire our brains just a little bit and leave the world of ACID, transactions and tables (wake up, Neo): eventual consistency; brokerless architectures; gossip protocols. Or, if you want a simpler path, just use Hadoop. The third limitation is the lack of real-time processing. You see, this approach to big data is still good old batch processing. It grabs a chunk of data and then takes some time to process it. If we are working with real-life information, then by the moment we finish processing history, we'll have some fresh data that requires recomputation. If this were the end of the road, Twitter would never exist. But they cheat. They provide real-time capabilities by incrementally processing new data using crude and rough algorithms. These are not as precise as batch processing, can make mistakes and can handle only a little bit of data, but they are fast. So whenever a user loads up his Twitter profile, he gets a result composed from thoroughly processed batch data plus whatever changes have happened since this slow process started. The latest bits might not be precise (i.e. they could miss a few mentions from fellows in the other hemisphere and also include a few spam bots), but they are almost real-time, which is what matters. And as time goes by, the results are magically corrected, because a little later the batch map-reduce algorithms get there and replace fast approximations with slow but thorough results. Twitter Storm incorporates and explains this dual approach to big data in greater detail. Edge Cases People say that there are no silver bullets (which is actually wrong, you can buy that stuff on the internet) and hence one approach to big data will not fit all cases. Let's ignore the total absence of logic in this statement and focus on a specific edge case of big data that might benefit from a separate approach. I'm talking about event streaming. Enterprise companies consider event streams to be a rather complex scenario, where \"a lot of events come in a short amount of time\". Such complexity even created a niche called Complex Event Processing, which sounds like a really complicated and expensive field. Partially this is justified, because events often require a proactive reaction to something that happens in real time. Yet this reaction could depend upon events that happened millions of events back (which means \"a lot of\" data has to be stored). From now on, please replace \"a lot of\" with the term that you will hear in the 12th episode of the Distributed Podcast, which was recorded this weekend. Let's see if there is a simple poor man's approach to handling \"a lot of events\". We will need the following ingredients. The first ingredient is fast storage that is dead-simple and provides immediate access to any data by key (at most two disk seeks for cases where the entire key index does not fit in memory). We don't ask for ACID or a query language here, but ease of replication is a definite plus. Basho's Bitcask and Google's SSTable provide extensive guidance here. By the way, IndexedDB in WebKit uses LevelDB, which is based on SSTable. The second ingredient is a distributed architecture design that supports an arbitrarily large number of nodes operating in a single prepartitioned ring-shaped hash space, where nodes can come, fail and go as they wish. 
Data should survive no matter what (and be easily available), while scaling stays elastic. If this sounds a bit scary to you, don't worry - there is plenty of material (and source code) from: Basho Riak Core; Project Voldemort; the father of them all - Amazon Dynamo. The third ingredient is cloud computing, which provides the CPU, network and storage resources needed to power that architecture, while not requiring any upfront investment. I've been blogging about this aspect for quite a few years now. At this point in time we already have a number of cloud computing providers with some competition between them. For example: Windows Azure Cloud; Amazon AWS; Rackspace Cloud. Competition in this area already drives prices down and rewards efficient use of resources. The fourth ingredient is the lower-level implementation principles that fit distributed environments, append-only persistence and an eventually consistent world that even tolerates conflicts. A great head start in this area is provided by the ZeroMQ tutorial and filled in by the mess of patterns and practices united under the name of CQRS architectures. Since these CQRS-based principles actually provide a uniform mental model for dealing with events and event streams in a distributed world (unless you are playing it old-school with SQL), along with techniques for designing and modeling such abstractions (known as DDD), they already give solutions to the most common problems that you might face while bringing together in practice distributed architectures, fast streaming key-value stores and cloud computing. The fifth ingredient is the Force. May it be with you in this new but certainly exciting field :) [2012021402321337237189] Memory dump complete..."}
{"url":"/post/recent-lessons-learned-in-lokadcqrs/","title":"Recent Lessons Learned in Lokad.CQRS","text":"Recent Lessons Learned in Lokad.CQRS As you probably already know, Gabriel Schenker has started a series of blog posts on How we got rid of the database . One or two concepts in Gabriel's approach originate from Lokad.CQRS Sample Project . It was really inspiring to see these ideas put to some practical use and validated outside of Lokad. It also provided a nice retrospective which helped to see, what changed in Lokad.CQRS recently. Given that, I want to share some of the changes that actually happened to Lokad.CQRS approach (mostly based on DDDSummit 2012, work with Vaughn Vernon and also recent experience in new projects). Automated projection management is a huge life-saver (rebuilding views that have a projection code changed or creating new views). It vastly simplifies life both in development and production, enabling new scenarios. For example, it becomes possible to get immediate feedback for your UI tweaks in development. Joining command handlers into Application services (similar to how we join multiple event handlers into a projection class) helps to make them more consistent. This makes solution more structured, when you get dozens of commands. Development workflow noticeably speeds up, when you auto-generate (and auto-update) handler interfaces for your aggregates and their states, based on some contracts DSL (provided you use event sourcing there). DSL sample in Lokad.CQRS already does that. As your projects grow (in complexity and load) it becomes apparent, that handling event streams as individual files/blobs might not be efficient. For instance, try copying 100000 individual event stream files from the cloud for some debugging (hint: latency will be the pain). So at Lokad we gradually shift towards operating with event streams per bounded context . These are equivalent to domain logs in older Lokad.CQRS or to transaction logs in other database solutions. In essence, bounded context stream is an aggregation of all event streams within a given bounded context, where individual event streams are identified by a name but are still stored together. This makes it extremely simple to replicate or to write (we can also preallocate some disk space to reduce enormous fragmentation). Reading individual streams can be problematic, unless you store in memory either full caches or at least indexes. In other words, we just slightly change storage approach to keep all separate event streams together. First draft implementations of this approach are covered in stream implementations provided in IDDD sample . They are based on Riak Bitcask. Since publishing them, we have slightly improved on that in out internal code. I hope to bring all these improvements back into Lokad.CQRS soon (when starting to implement proper DDD sample). These changes are are well aligned with both Greg's future event store (he's making awesome product, by the way) and future book by Vaughn. Having said that, per-file event stores are still a valid and nice approach for local development purposes and low-volume production scenarios. Domain Services become really nice tool to exchange information between different aggregates in eventually consistent way. Before, I would try to somehow wire all information from one aggregate to another via event-command chain (making them really bulky and sending around a lot of useless info). Now, I tend to define view projection that is used within core bounded context. 
This projection is auto-managed (meaning I can change it in any way, and the server will rebuild it as necessary) and is populated by events coming from some aggregates. Another aggregate accesses it via a domain service that knows how to query this view. Value objects become really important in expressing domain logic. It is amazing how simply writing a descriptive concept class (e.g. email, forecast settings, service level variable) makes the rest of the code more compact and robust (see the sketch at the end of this post). As you can probably see, the majority of changes affect either DDDesign principles on top of Lokad.CQRS or some subtle implementation details underneath. Core abstractions and ideas (i.e. document storage, Aggregates with event sourcing) seem to stay relatively stable and practical."}
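As an illustration of the value object point, here is a minimal sketch of such a descriptive concept class - an example shape, not taken from the Lokad code base:

public sealed class Email
{
    public readonly string Value;

    public Email(string value)
    {
        // validate once at construction; the rest of the code can then trust it
        if (string.IsNullOrEmpty(value) || !value.Contains("@"))
            throw new ArgumentException("Not a valid email address", "value");
        Value = value.ToLowerInvariant();
    }

    public override bool Equals(object obj)
    {
        var other = obj as Email;
        return other != null && other.Value == Value; // equality by value
    }

    public override int GetHashCode() { return Value.GetHashCode(); }
}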
{"url":"/post/recommended-reading-on-big-systems/","title":"Recommended Reading on Big Systems","text":"Recommended Reading on Big Systems Here's a list of videos and resources I've been studying from recently (more valuable - on top): Immutability Changes Everything by Pat Helland at Ricon 2012 Real-Time delivery architecture at Twitter by Raffi Krikorian at Strange Loop 2012 Runaway complexity in Big Data, and plan to stop it by Nathan Marz at Strange Loop 2012 Condos and Clouds - patterns in SaaS applications by Pat Helland at ACM Webinar Advancing Distributed Systems by Eric Brewer at Ricon 2012 Other Ricon 2012 talks Big Data architecture at LinkedIn - an interview by Sid Anand On the overall, I'm really impressed by the amount of innovation and sharing that happens in the industry."}
{"url":"/post/redirect-tcp-connections-in-windows-azure/","title":"Redirect Tcp Connections in Windows Azure","text":"Redirect Tcp Connections in Windows Azure I've just published a quick and extremely simple open source project that shows how to redirect TCP connections from one IP address/port combination to another in Windows Azure. It is sometimes helpful, when dealing with SQL Azure, cloud workers, firewalls and the like. Lokad Tcp Tunnel for Windows Azure | Download Usage is extremely simple: Get the package. Configure ServiceConfiguration to point to the target IP address/port you want to connect to (you can do this later in Azure Developer's Portal). Upload the Deployment.cspkg with the config to the Azure and start them. Connect to deployment.cloudapp.net:1001 as if it was IP:Port from the config. If you are connecting to SQL Server this way (hosted in Azure or somewhere else), then the address have to specified like this in Sql Server Management Console (note the comma): deployment.cloudapp.net,1001 Actual Azure Worker config settings should look similar to the ones below, when configuring TCP Routing towards SQL Server (note the 1433 port, that is the default one for SQL): <ConfigurationSettings> <Setting name=\"Host\" value=\"ip-of-your-SQL-server\" /> <Setting name=\"Port\" value=\"1433\" /> </ConfigurationSettings> The project relies on rinetd to do the actual routing and demonstrates how to: Bundle non .NET executable in Windows Azure Worker and run it. Deal with service endpoints and pass them to the processes. Use Cloud settings to configure the internal process. Since core source code is extremely simple, I'll list it here: var point = RoleEnvironment.CurrentRoleInstance.InstanceEndpoints[\"Incoming\"]; var host = RoleEnvironment.GetConfigurationSettingValue(\"Host\"); var port = RoleEnvironment.GetConfigurationSettingValue(\"Port\"); var tempFileName = Path.GetTempFileName(); var args = string.Format(\"0.0.0.0 {0} {1} {2}\", point.IPEndpoint.Port, host, port) File.WriteAllText(tempFileName, args); var process = new Process { StartInfo = { UseShellExecute = false, RedirectStandardOutput = true, RedirectStandardError = true, CreateNoWindow = true, ErrorDialog = false, FileName = \"rinetd.exe\", WindowStyle = ProcessWindowStyle.Hidden, Arguments = \"-c \\\"\" + tempFileName + \"\\\"\", }, EnableRaisingEvents = false }; process.Start(); process.BeginOutputReadLine(); process.BeginErrorReadLine(); process.WaitForExit(); Tcp Tunnel for Azure is shared by Lokad in hopes that it will save a few hours or a day to somebody."}
{"url":"/post/replicating-events-to-remote-eventstore/","title":"Replicating events to remote EventStore","text":"Replicating events to remote EventStore At the moment of writing majority of event-sourced systems at Lokad run at simplified event store from Lokad.CQRS library. We are migrating to Greg's Event Store . In our setup it is hosted on Azure . Migration process must ensure that we don't break any existing systems. This means that we can't simply jump and switch all code to use new storage. Instead, the migration process has to be incremental , like navigating deep waters. After all, running single node on Azure-backed VM, is not supported scenario (although it works). So we implement change in the least important part of the system, deploy it and observe. There will be some failures and lessons learned. Then, we'd incorporate these into the code and make the next step. So far sequence looks like this: Replicate events from production to dedicated event store server on Azure; switch non-critical projections to chase streams on dedicated event store server; migrate aggregates to use dedicated event store server; where possible, discard server-side command handlers and host aggregate logic directly in Web UI. One of these initial steps involves replication of events from production event store to new server . This can be achieved by continuously polling local event store for changes and pushing any new message batches as individual events to remote event stream. For the time being to keep things simple I'll be pushing all events to the single stream. Later on we could relink them to individual event streams. Actual replication code runs in a separate thread. It is just a \"while(true)\" loop: using (var conn = Connect(_config)) { conn.Connect(); connectFailure = 0; var version = GetLastReplicatedEventFromRemoteStore(conn, stream); var lastReplicatedEvent = version.LocalVersion; var remoteVersion = version.RemoteVersion; while (!token.IsCancellationRequested) { if (lastReplicatedEvent == _store.GetCurrentVersion()) { // no work to do, so sleep and continue token.WaitHandle.WaitOne(500); continue; } var keys = _store.ReadRecords(lastReplicatedEvent, 10000).ToList(); var remoteEvents = keys.SelectMany(ConvertToNewEventFormat).ToList(); conn.AppendToStream(stream, remoteVersion, remoteEvents); lastReplicatedEvent = keys.Last().StoreVersion; remoteVersion = remoteVersion + remoteEvents.Count; } } In order to deal with possible transient failures of the remote event store (this is possible, since I'm running it as single-node for now), we put wrap entire replication process into another loop: void RunReplicationProcess(CancellationToken token) { int connectFailure = 0; while (!token.IsCancellationRequested) { try { var localStoreIsEmpty = _store.GetCurrentVersion() == 0; if (localStoreIsEmpty) { token.WaitHandle.WaitOne(TimeSpan.FromSeconds(30)); continue; } using (var conn = Connect(_config)) { conn.Connect(); connectFailure = 0; /* replication from snippet above */ } } catch (Exception ex) { if (connectFailure == 0) { SystemObserver.Notify(\"Event store connection failure {0}\", ex); } connectFailure += 1; token.WaitHandle.WaitOne(TimeSpan.FromMinutes(1)); } } } Once this code is in production and production event streams are replicated, we could migrate less important projections (like the ones backing up Admin Web Apps used by Lokad team members) to new event storage."}
{"url":"/post/roadtrip-2011-lessons-learned-at-eventcentric-in-kiev/","title":"Roadtrip-2011 Lessons Learned at","text":"Roadtrip-2011 Lessons Learned at In short - this spontaneous Event centric weekend in Kiev was totally awesome and rewarding . Check out tweets , if you are interested in what others feel (people are talking about sparkles in the eyes and #eventcentric cult :) This event was organized as a part of Road-Trip 2011 with Greg Young. There were quite a few other adventures that happened along the way earlier - too much to write about. So I'll just focus on the last weekend and lessons learned. Event Centric weekend in Kiev was a two day class held for free in Kiev. Greg Young was doing his well-known CQRS/DDD course, I've got to share some of my practical experience with such systems \"in the wild\" (including various cloud options). Room was packed with 90+ people from multiple Ukrainian cities and also from a few cities in Russia. Luxoft Ukraine helped us immensely by providing with the place to do that. Irina Odina of Luxoft deserves a medal (or two raises in salary) for staying with us through the weekends and helping with the coffee breaks. Mike Chaliy and really friendly Kiev Alt.NET community were helping with afterparties, logistics and coping with unexpected. Here's what I've learned during this event: One of the best ways to strengthen your understanding in a field is to speak and share . For instance, I've learned a lot simply by explaining aggregates and sagas to the audience and then seeing that not everybody gets a clear and coherent picture. Or, when you get an question from an audience, and you know an answer to that question, but can't explain this immediately in clear and concise language immediately. Then Greg comes in with his explanation and reveals missing pieces and steps in my own explanation and understanding. Showing my own hacks to the theory and practice (and sharing how everything works really nicely in distributed teams and cloud environments) is always a pleasure of its own. There were a few smaller realizations of its own along the way. The most important one of them was about more clear and focused picture of bringing multiple complex systems together in a way to limit complexity and facilitate development (obviously, with limited time and resources). Basically that's the a-ha understanding of how to separate elements of a distributed systems (bounded contexts, sagas, services, cloud scalability points), while keeping each element highly specialized (and relatively simple). In essence, this is a direct analogy with human body, where: Bounded contexts (matching aggregate roots in extreme) == neurons that are completely isolated from each other, but can exhibit complex behaviors and have a short-term memory; Sagas (stateful or stateless) == glial cells that protect and insulate neurons from each other, while connecting them; Stateless services (integration, number crunching, file processing etc) == specialized organs ) in a body that can get some job done, but tend not to have any memory on their own. They react to impulses and send back information about the events that happened to them. This analogy (coupled with the basic principles of biology) actually explained me better how to build and evolve really interesting IT systems, while keeping core cells simple. In short - that's just like how evolution does this. 
That was a really important realization, since before that I simply could not understand how to bring together behavioristic CQRS-driven systems with elastically scalable functional elements of big data crunching or messy technology-dependent integrations. This understanding comes just at the right time - we are starting yet another project at Lokad these days. I'll definitely blog more about that later. A few smaller realizations concerned sagas (explaining them better, to be precise) and the transition from domain models to Aggregate Roots with Event Sourcing via TryDo-Do logic (nothing essential to production, but it still helps to see and explain things better). By the way, the road-trip is not over yet. There is still a lot to learn and do. So far, thanks to everybody for helping me learn and have some wonderful time!"}
{"url":"/post/rule-of-time-limiting/","title":"Rule of Time Limiting","text":"Rule of Time Limiting I don't know who has passed this feedback, but I'm extremely grateful to this person. It's relatively easy to get a good feedback, but constructive ways of improving yourself are priceless. The cardinal rule of any opportunity to present is this: Thou shall stick to your time limit. If it is being pointed out that you are overtime, then you will stop immediately. Anything else shows disrespect and has the aura of \"I am more important than you, just suck it up\" Thanks, man. This will also help to stay focused with BTW Podcast."}
{"url":"/post/rx-works-nicely-with-ddd-and-event-sourcing/","title":"Rx works nicely with DDD and Event Sourcing","text":"Rx works nicely with DDD and Event Sourcing Yesterday I finally got to the point of trying to build Command-Query Responsibility Segregation solution for Cloud Computing environment that uses Domain Driven Design and Event Sourcing ( quick overview of these buzz words ). The whole purpose of the project is to learn, so there's a freedom to experiment upon the accepted patterns and practices. Mark Nijhof put together really thorough article on the Domain Events in CQRS. It has a lot of code and information packed in. Let's take it from there, but with a slightly different and highly experimental route. Usually Aggregate root implementations have all sorts of interfaces and methods helping the surrounding infrastructure to pick them up and handle. This makes the domain code look a bit complex for me. However it seems that these AggregateRoots are a native candidate for leveraging Reactive Extensions for .NET . AR might need to inherit and implement just a single interface (snapshots might require the second one, though): public class ProjectAggregateRoot : ISubject<IEvent,Change> Where ISubject is an interface from System.Reactive that merely says: public interface ISubject<in T1, out T2> : IObserver<T1>, IObservable<T2> How does it affect our aggregate design? Not much, we just slightly extend the \"Apply Event\" method, naming it \"OnNext\" and adding the ability to accept IEvent and publish Change : Subject<Change> _subject = new Subject<Change>(); public void OnNext(IEvent value) { EventInvoker.Apply(this, value); Interlocked.Increment(ref _version); _subject.OnNext(new Change(_version, value)); } public IDisposable Subscribe(IObserver<Change> observer) { var subscribe = _subject.Subscribe(observer); return subscribe; } Note: As you know, Reactive Extensions were designed for the asynchronous operations (i.e.: cloud computing interactions or UI events). Thus they work tightly with PFX. However, since I don't know anything about this integration and side effects, I'm dispatching and executing everything in sync so far. Change is just an event that was applied and thus has a version number: public sealed class Change { public readonly long Version; public readonly IEvent Event; public Change(long version, IEvent @event) { Version = version; Event = @event; } public override string ToString() { return string.Format(\"r{0:####}: {1}\", Version, Event.ToString()); } } What does this give us? We can drop all event subscription and management interfaces and functionality, letting Linq-2-Events to do all the handling in rich and tested way: LoadFromHistory methods are equivalents of IObserver[TEvent] GetChanges functionality is fulfilled by subscribing to IObservable[Change] What's more important, our aggregate can avoid referencing any custom interfaces, since both IObservable and IObserver are in .NET 4.0 BCL. This also provides a wide range of extension methods available. 
Given that, it becomes a rather simple task to write a store that works like this: // somewhere in IoC init var store = new HybridStore(serializer, storage); store.RegisterEventSource<ProjectAggregateRoot>(); // somewhere in the handler we perform atomic update // if there is a concurrency problem, service bus will be // responsible for reapplying changes later store.Update<ProjectAggregateRoot>(\"project123\", e => { e.AddTask(\"t1\", \"Satori Project\"); e.AddTask(\"t2\", \"Build Simple Domain\"); e.AddTask(\"t3\", \"Build Simple Sync reader\"); e.CreateTaskReference(\"t2\", \"t1\"); }); For the sake of consistency, here's what the actual persistence (highly prototypical) looks like: public void Write(Type type, object key, AddEntityDelegate addEntityDelegate, UpdateEntityDelegate updateEntityDelegate) { var item = MapTypeAndIdentity(type, key); Func<object, ISubject<IEvent, Change>> factory; if (_factories.TryGetValue(type, out factory)) { var condition = StorageCondition.None; var domain = factory(key); var changes = new List<Change>(); using (domain.Subscribe(changes.Add)) { try { item.ReadInto((props, stream) => { var source = (Change[])_serializer .Deserialize(stream, typeof(Change[])); foreach (var change in source) { domain.OnNext(change.Event); } condition = StorageCondition.IfMatch(props.ETag); }); } catch (StorageItemNotFoundException) { } var version = changes.Count; using (domain.Subscribe(_subject)) { updateEntityDelegate(key, domain); } if (version == changes.Count) return; } try { item.Write(stream => _serializer.Serialize(changes.ToArray(), stream), condition); } catch (StorageConditionFailedException ex) { var msg = string.Format( \"Record was modified concurrently: '{0}'; Id: '{1}'. Please, retry.\", type, key); throw new OptimisticConcurrencyException(msg, ex); } } } Since we deal with event streams anyway, it's easy to subscribe and do things like: using (store.Subscribe(Console.WriteLine)) { store.Update<ProjectAggregateRoot>(\"project123\", e => { e.CreateTaskReference(\"t2\", \"t3\"); }); } The listener subscription above reveals that creating a task reference in this case actually results in two events: r5: Domain.TaskRemovedFromParentTask r6: Domain.TaskAddedToParentTask Since Reactive Extensions for .NET are built upon event streams, I think all sorts of related operations (writing event denormalizers, merging streams, writing behavioral unit tests) might be simplified in Domain-Driven Design with Event Sourcing. Such an abstraction allows separating different concerns rather clearly. For example, the actual implementation of the underlying storage passed to HybridStore could be FileStorageContainer or BlobStorageContainer for Windows Azure from Lokad.CQRS, since both allow atomic updates and optimistic concurrency locking (or any reasonable RDB or NoSQL). It's interesting to note that these storage container implementations were actually developed for the project without any hint of DDD or ES. Yet, since we have such nice persistence ignorance in CQRS/ES architectures, they could be plugged in without problems. Also, it is rather simple to implement partitioning in this scenario, since all access to the storage goes through the entity key (identity). If you add into the mix cloud infrastructure capable of doing dynamic repartitioning (e.g. 
Windows Azure storage) and providing computing capacities on demand - you'll get a foundation for building an almost infinitely scalable solution (although message queues still need proper support for transactional sending and ACKs). CQRS provides high-level approaches for shaping the architecture and evolving it in a scalable and cost-effective manner, while DDD holds the methodology for designing and managing the business core of the solution. All in all, the future looks quite exciting , doesn't it?"}
{"url":"/post/salescast-scalable-business-intelligence-on-windows-azure/","title":"Salescast - Scalable Business Intelligence on Windows Azure","text":"Salescast - Scalable Business Intelligence on Windows Azure Yesterday we finally released first version of Salescast . There is an official announcement from Lokad. In this article we'll talk a bit about how it was built , focusing on the technology, Windows Azure and what this means to customers in terms of business value . What does Salescast do? This web application offers smart integration between various inventory and sales management solutions and Lokad Forecasting Services . Basically, if you have some eCommerce shop that you want to run through some analytics and get forecasts on the future sales, Salescast can help out and handle the routine. It will detect your solution, retrieve the data, process and upload it to Lokad Forecasting Services and assemble the results into nice reports. This decision support comes with some extra integration experience on top of that to automate and streamline the process further: Reduce inventory costs and over-stocks. Improve customer satisfaction. Increase overall sales. Ease relationships with suppliers. How was it Built? Salescast is running on Windows Azure platform and was architected to take full advantage of the cloud computing features it provides. For Lokad this meant coming up with the architecture principles, development approaches and frameworks that could allow to leverage all the potential efficiently. Principles were based on the adaptation of CQRS and DDD principles for the Windows Azure, great tools and frameworks that Microsoft provides with it. In order to fill some (temporary) gaps in this ecosystem, custom Enterprise Service Bus for Azure was created. We are planning to share experience, technological principles and frameworks with the community, just like it has been done with Lokad Shared Libraries , Lokad.Cloud and the other projects. For those, who are following xLim line of research and development , this will match with version 4: CQRS in the Cloud . At the moment, let's focus briefly on the business value all this technology creates for the customers. Handling any Scale Salescast has implicit capabilities of scaling out . It does not matter, how large is the inventory of customer or history of the sales. Starting from a few hundred products, up to hundreds of thousands and above - Salescast can handle it. So if you are a large retailer, you don't need to sign a large contract in order to just try how the solution works for your network of warehouses. Neither you need to wait for development teams to scale architecture and procure the hardware resources. It's all already there, powered by the scalability of CQRS architecture, ESB for Azure and virtually unlimited cloud computing resources of Windows Azure. Anticipating the Unexpected Salescast can work reliably and incrementally with any data sources. If sales and inventory management solution is persisted within SQL Azure - great. If it is an eCommerce shop running on mySQL in shared environment that tends to timeout from time to time - we are designed to handle it without stressing the other end. We understand that unexpected things can happen in the real world . More than 95% of possible problems will be handled automatically and gracefully. 
If something goes really wrong (e.g. the eCommerce server is replaced, or firewall rules are changed), then we'll know about the problem, will be able to handle it on a case-by-case basis and then continue the process from where it stopped. Principles of enterprise integration and reliable messaging with an ESB on top of Azure Queues help to achieve this. Designed for Evolution Salescast is designed for evolution. We understand that customers might use rare systems, custom setups or unique in-house solutions. They could need some specific integration functionality in order to handle specific situations. In order to provide a successful integration experience in this context, Salescast will have to evolve, adapt and get smarter. In fact, future evolution of this solution is already built into the architecture and the implemented workflows. For example, if there is some popular eCommerce solution that we didn't think of integrating with, we'll teach Salescast how to handle it, for free. The next customer that attempts to optimize sales managed by a similar solution will get it auto-detected and integrated instantly. This applies to new versions and flavors of these systems as well. Basic principles of efficient development, Inversion of Control , pluggable architecture and some schema-less persistence helped to achieve this. Domain Driven Design played a significant role here as well. Cost-Effective Salescast is designed to be cost-effective . In fact, it's effective to the point of being provided for free. This comes from the savings that are passed on to the customers. They are based upon: an environment allowing efficient and smart development that is fast and does not require large teams; efficient maintenance that is automated and requires human intervention only in exceptional cases; elastic scaling out that uses only the resources that are needed; the pricing of the Windows Azure Platform itself. Obviously you still need to pay for the consumption of Lokad Forecasting Services. But their pricing is cost-effective as well (to the point of being 10x cheaper than any competitor). So there are some tangible benefits for the money being spent. Secure and Reliable Salescast, as a solution, is based on the features provided by the Microsoft Windows Azure Platform. This includes: Service Level Agreements for the computing, storage and network capacities. Hardware reliability and geographic distribution of Microsoft data centers. Regular software upgrades and prompt security enhancements. Lokad pushes this further: Secure HTTPS connections and industry-grade data encryption. Redundant data persistence. Regular backups. Reliable OpenID authentication. Summary This was a quick technological overview of the Salescast solution from Lokad, along with the features and benefits it is capable of providing just because it is standing on the shoulders of giants . Some of these giants are: the Windows Azure Platform for Cloud Computing and the ecosystem behind it. Time-proven principles of development and scalable architecture. Various open source projects and the other knowledge shared by the development community. Lokad will continue sharing and contributing back to help make this environment even better. From this point you can: Check out the official public announcement of Salescast and subscribe to the Lokad company blog to stay tuned for company updates. 
Subscribe to the updates of this Journal on Efficient Development Check out xLim materials to see what is already shared within this body of knowledge on efficient development . I'd also love to hear any comments, thoughts or questions you've got!"}
{"url":"/post/scalable-and-simple-cqrs-views-in-the-cloud/","title":"Scalable and Simple CQRS Views in the Cloud","text":"Scalable and Simple CQRS Views in the Cloud In this article we'll talk briefly about the CQRS views (read model) implementation for the Cloud Computing environment. This material is based on a research and development at Lokad (which took more than a month for this case), followed by the actual production usage of the concepts. So this is something validated by real project. The article continues Cloud CQRS (xLim 4) R&D articles. Here are some of the previous highlights (see xLim page for a lot more references): CQRS Architecture and Definitions Command Handlers with Various Levels of Reliability Theory of CQRS Command Handlers: Sagas, ARs and Event Subscriptions CQRS Lifehacks from Lokad Production - Part 1 CQRS Lifehacks from Lokad Production - Part 2 CQRS Lifehacks from Lokad Production - Part 3 Let's get started with the basics. Basics CQRS Views are also sometimes called \"denormalized views\" or \"read models\" in the world of distributed systems. They usually represent data that is: stored somewhere (anywhere, starting from SQL tables to cloud-based key-value storage); optimized for reads from the perspective of simplicity, scalability, or both ; is populated by event handlers that subscribe to events coming from the domain (and hence is eventually consistent in most of the cases); is a logical dual of data caching mechanism (in classical layered architectures), with the primary difference being: here we don't retrieve information on first request, but rather pre-populate view store immediately after the corresponding event is published; this potentially leads to better performance and less staleness, as compared to caching; optimized for the reads and hence reduces performance and complexity stress on the write side . It is important to note, that in order to get these benefits, you don't necessarily need to employ messaging infrastructure or have full-blown CQRS architecture. For example, even in a classical 3-tier architecture you can do something like: have a separate set of SQL tables optimized for reading, that are populated along with saving changes to the data tables (implementation will be somewhat tangled and complex, but it might be worth to improve performance of a few select tight spots); push JSON files to CDN for direct consumption by the browser client; have a distributed cache that is kept up-to-date by subscribing to data change notifications being published from the write side. However, only by explicitly decoupling our business logic from the read models by domain events, can we achieve overall simplicity, while not sacrificing the scalability opportunities. Views with Cloud Flavor Cloud computing environment creates additional challenges and benefits for the distributed systems. We gain immense benefits of: development flexibility (ability to procure and deploy any combinations of systems); cost efficiency (paying only for what you use); elastic scalability (storage, queues and virtual machines can automatically scale); reduced operational burden (things like scaling, backups, configuration - are managed by the cloud service provider). This, for example, allows a small company like Lokad to do really interesting things with a tiny development team. The primary limitation of the approach - it requires a shift from the DB-oriented mindset towards the mindset of distributeds system in eventually consistent world . 
Fortunately, the CQRS architecture approach creates a simple foundation for this way of thinking. So, when we combine the Cloud Computing environment with CQRS Views (read models), one implementation option could be: Cloud Views are stored in Windows Azure Blob Storage (Amazon S3, or any storage with key-value capabilities) as simple serialized files with arbitrary schema. View Handlers (domain event consumers responsible for keeping views up-to-date) are just server processes subscribed to certain domain event messages. They are hosted within Azure Worker Roles, Amazon EC2 instances, Rackspace VMs or whatever. The client side, in order to display cloud views, just needs to know the view contracts and be able to read them from the storage (usually available via REST in cloud environments). Technology: Azure, JSON, Simplicity In Lokad's production scenario the implementation details are: Azure blob storage, populated by Lokad.CQRS event handlers running in Azure Worker Roles. Clients consuming cloud CQRS views (Web and desktop) perform user authentication on their own and directly access Azure Blob storage (more elaborate security models could easily be added as needed). Serialization format - JSON (just for the sake of readability and simpler debugging); it could easily be swapped for ProtoBuf for more compact encoding. Logically, views are cheap and disposable. The development infrastructure automates complete and partial view regeneration (performance is not an issue here). Whenever a view structure is changed or a new one is added - just repopulate the corresponding data. The code implementation makes view handling logic part of the development language; it was designed to simplify automated view repopulation, maintenance and operations (just an attempt at proper OO programming that worked surprisingly well). Since cloud views are based on the key-value abstraction (i.e. given a key, you get a view, but you can't list or query them), some concerns had to be handled explicitly. So secondary indexes and lists are just eventually consistent views as well. From the mental perspective, the transition from NHibernate-hosted views resulted in: there is more code, but it is simple, clean and completely decoupled; development friction was reduced significantly: no need to handle SQL update scripts for the read side in production/staging/development, simpler deployments, more freedom to refactor and experiment for simplicity; since views are disposable, and it is extremely easy to rebuild them for production (a few mouse clicks), the client UI is easier to evolve towards more usable implementations; fewer worries about operational costs (Azure blob storage is roughly 100x cheaper than SQL Azure for this kind of scenario); absolutely no worries about scalability - Azure blob storage scales automatically and is optimized for reads (plus we can always plug in a CDN around the globe or replicate views across datacenters, if needed); reduced complexity on the clients (no need to use some weird API or bring in NHibernate, it's just true POCO) and on the server. The primary problem with the new approach: since I now have reduced development friction, I want to roll out production upgrades more often. The 15 minute upgrade delay for an Azure Worker Role becomes a frustrating experience. Implementation Details The actual implementation is dead-simple. It's not even fair to call it a framework - just a few interfaces and simple wiring. Design and Windows Azure do the actual heavy-lifting. 
Although eventually this \"framework\" will be included into the Lokad.CQRS for Azure open source project, here are some details to give you a better perspective. Views are defined as simple POCO classes that can have any structure as long as they are serializable: public class UserDetailView : IViewEntity<long> { public virtual long UserId { get; set; } public virtual long AccountId { get; set; } public virtual string Username { get; set; } public virtual string Email { get; set; } public virtual string RegistrationToken { get; set; } } Base view interfaces are just a convenience for stronger typing and automated regeneration: /// <summary> /// View entity that has an identity (there can be many views /// of this type) /// </summary> public interface IViewEntity<TKey> : IViewBase {} public interface IViewBase{} /// <summary> /// View singleton (there can be only one document). /// </summary> public interface IViewSingleton{} A serialized view will look just like a file \"view-userdetailsview/150.json\": { \"UserId\": 150, \"AccountId\": 74, \"Username\": \"abdullin\", \"Email\": \"some email\", \"RegistrationToken\": \"some identity\" } View entities are stored in a folder named after the type of the view (derived automatically); singleton views (there is only one instance per view type) are grouped together in a single folder as well. Secondary indexes are just serialized dictionaries (the simplest thing that works) saved as view singletons. If there are ever performance problems with index sizes, there are multiple dead-simple improvements to make. So far we stick with: public sealed class UserByRequestIndex : IViewSingleton { public readonly IDictionary<Guid, long> Index = new Dictionary<Guid, long>(); } Event handlers responsible for populating views are dead-simple and auto-wired: public sealed class UserDetailViewHandler : ConsumerOf<UserCreatedEvent>, ConsumerOf<UserActivatedEvent>, ConsumerOf<UserDeletedEvent> { readonly IViewWriter<long,UserDetailView> _operations; public void Consume(UserCreatedEvent message) { var view = new UserDetailView { AccountId = message.AccountId, UserId = message.UserId, Username = message.Username, Email = message.Email, RegistrationToken = message.RegistrationToken, }; _operations.AddOrUpdate(message.UserId, view, x => { }); } Accessing views is dead-simple as well. Say, in a Web client powered by ASP.NET MVC 2, you have this hook into the IoC Container and IViewReader: public static class AzureViews { static Maybe<TView> Get<TKey,TView>(TKey key) where TView : IViewEntity<TKey> { return GlobalSetup.BusInstance.Resolve<IViewReader<TKey, TView>>().Get(key); } static TSingleton GetOrNew<TSingleton>() where TSingleton : IViewSingleton, new() { return GlobalSetup.BusInstance.Resolve<IViewSingletonReader<TSingleton>>().GetOrNew(); } Then querying a view by a primary key just becomes a matter of: public static Maybe<UserDetailView> GetUser(long userId) { return Get<long, UserDetailView>(userId); } Or, if you need to access it via an eventually consistent index: public static Maybe<UserDetailView> GetUserByRequest(Guid requestId) { return GetOrNew<UserByRequestIndex>() .Index.GetValue(requestId) .Combine(GetUser); } If the syntax looks a bit weird - just check out the Lokad Maybe helpers . 
For the sake of completeness, here's the design of the view reader and writer interfaces (they are simple, but took the most time to get right, and will likely evolve further): public interface IViewWriter<TKey, TView> where TView : IViewEntity<TKey> { void AddOrUpdate(TKey key, Func<TView> addFactory, Action<TView> update); void AddOrUpdate(TKey key, TView newView, Action<TView> update); void UpdateOrThrow(TKey key, Action<TView> change); bool TryUpdate(TKey key, Action<TView> change); void Delete(TKey key); } public interface IViewReader<in TKey, TView> where TView : IViewEntity<TKey> { Maybe<TView> Get(TKey key); TView Load(TKey key); } public interface IViewSingletonWriter<TView> where TView : IViewSingleton { void AddOrUpdate(Func<TView> addFactory, Action<TView> updateFactory); void Delete(); } public interface IViewSingletonReader<TView> where TView : IViewSingleton { Maybe<TView> Get(); } These interfaces make it extremely simple to implement automatic view discovery and wiring for the purposes of infrastructure and management. Here's how the tree was built: Inspect the assemblies for all event handler types. Since ConsumerOf[T] inherits from IConsume, it is just a matter of selecting non-abstract types deriving from the consumption interface. Select only the view handler types (they have a constructor with a single argument: a strongly-typed IViewWriter or IViewSingletonWriter). Get the generic type argument out of these argument types - this will be the type of the view handled by the argument class. The actual interface implementations are simple. The Inversion-of-Control container (Autofac) configuration just exploits generic registrations. Server-side module: public sealed class ViewWriterModule : Module { protected override void Load(ContainerBuilder builder) { builder .RegisterGeneric(typeof (AzureViewContainer<,>)) .As(typeof (IViewWriter<,>)) .SingleInstance(); builder .RegisterGeneric(typeof (AzureViewSingletonContainer<>)) .As(typeof (IViewSingletonWriter<>)) .SingleInstance(); builder .RegisterType(typeof (ViewContainerInitialization)) .As<IEngineProcess>() .SingleInstance(); } } Where ViewContainerInitialization is just a start-up task for the Lokad.CQRS engine that creates view folders, if needed. Client-side module: public sealed class ViewReaderModule : Module { protected override void Load(ContainerBuilder builder) { builder.RegisterGeneric(typeof(AzureViewContainer<,>)) .As(typeof(IViewReader<,>)) .SingleInstance(); builder.RegisterGeneric(typeof(AzureViewSingletonContainer<>)) .As(typeof(IViewSingletonReader<>)) .SingleInstance(); } } Concurrency How do we handle concurrency conflicts with the views, when the same view is being accessed by more than one thread simultaneously? Actually, we don't care much about reading and writing at the same time, since cloud storage providers generally ensure atomicity at this level. There are two major approaches for handling concurrency while updating views: simplistic and logical. The simplest approach is to keep updates of a single view entity limited to a single thread. For example, you can start with a single worker thread processing all view updates. As your application grows, increasing load and availability requirements, you can split updates of different view types and entities between different threads and workers. In other words, you will partition views by type and/or view ID (see the sketch below). Note that we don't need to scale out the actual view persistence, since it is generally handled by the cloud storage provider to start with. 
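A minimal sketch of that partitioning idea (the names here are hypothetical, this is not the Lokad.CQRS code): a stable hash of the view identity picks the single worker that is allowed to update that view entity.

using System;

public static class ViewPartitioner
{
    public static int WorkerFor(string viewId, int workerCount)
    {
        // stable, non-negative hash of the identity; note that
        // string.GetHashCode() is only stable within one process,
        // a custom hash would be needed across machines
        var hash = viewId.GetHashCode() & 0x7FFFFFFF;
        return hash % workerCount;
    }
}

// usage: route each domain event to the queue of the owning worker
// queues[ViewPartitioner.WorkerFor(evt.EntityId, queues.Length)].Enqueue(evt);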
However, such storage scaling tends to be limited to one world region (e.g. North Europe), and we still might need to enable a CDN or manually replicate data between multiple cloud centers. This is relatively easy to do just by streaming domain events to these data centers. As long as you pass the entity ID in message headers (recommended in distributed systems), it will be easy to route domain event messages between different queues/workers. Just a quick clarification of terms: a view type is a message contract or the POCO class, while a view entity is a single instance of this type, identified and tracked by its unique identity. This identity serves as the primary key used to store and retrieve the actual view data. In the case of singleton views we have a single entity per type. Eventually you might encounter the need to allow multiple concurrent threads (whether hosted in a single worker or different ones) to update the same view entity at once. This is where optimistic concurrency control comes in. We just need to modify our view writer implementations to keep track of the blob's ETag (an analogue of a version ID) while downloading it. Then, after executing the local update, we upload the view back, passing the last known ETag to the server. Azure Blob Storage (just like any other cloud storage) is aware of this feature and will update the view data only if the ETags match. If they don't (somebody else managed to concurrently update our view) - the update will fail and we'll get an exception. This exception will be captured by the service bus, which will retry the actual update operation later. If such an update fails more than 4 times in a row, this would mean some heavy concurrency issues, probably coming from really complex update operations. Summary In this article we've briefly covered the concepts of CQRS Views (also known as eventually consistent read models or denormalized views ) as applied to the Cloud Computing environment, for the additional benefits of simplicity, scalability and cost-efficiency. There were also some high-level details describing the experience of the actual implementation in the Windows Azure environment. This article is a part of the CQRS in Cloud (xLim4) Research and Development series. You can subscribe to the news feed to stay tuned for more updates. All comments and questions are welcome and appreciated!"}
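A minimal sketch of that ETag-based optimistic update, written against a hypothetical blob abstraction (the real implementation would go through the Azure blob client and its If-Match conditions):

using System;

public interface IBlob
{
    (byte[] data, string etag) Download();
    // throws EtagMismatchException if the blob changed since we saw it
    void Upload(byte[] data, string expectedEtag);
}

public sealed class EtagMismatchException : Exception { }

public static class OptimisticView
{
    public static void Update(IBlob blob, Func<byte[], byte[]> change)
    {
        var (data, etag) = blob.Download(); // remember the version we saw
        var updated = change(data);         // execute the local update
        blob.Upload(updated, etag);         // fails on concurrent change;
                                            // the service bus retries later
    }
}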
{"url":"/post/scenario-based-unit-tests-for-ddd-with-event-sourcing/","title":"Scenario-based Unit Tests for DDD with Event Sourcing","text":"Scenario-based Unit Tests for DDD with Event Sourcing I'm still in the process of learning Domain-Driven Design for CQRS coupled with the Event Sourcing. One of the things I really like about this approach is the ability to thoroughly and reliably unit test even complex domains. As I've tried yesterday in a learning project, a unit test is basically a simple scenario (txt) file that looks like this: Given .ContextCreated { \"ContextId\": \"10000000-0000-0000-0000-000000000000\", \"Name\": \"Name\", \"Rank\": 12 } When .RenameContext { \"ContextId\": \"10000000-0000-0000-0000-000000000000\", \"Name\": \"New Name\" } Expect .ContextRenamed { \"ContextId\": \"10000000-0000-0000-0000-000000000000\", \"Name\": \"New Name\" } Or if we are expecting an exception: Given When .RenameContext { \"ContextId\": \"10000000-0000-0000-0000-000000000000\", \"Name\": \"New Name\" } Expect .InvalidOperation { } Basically each unit test ensures that: Given certain events in the past (they determine state of the Aggregate Root) When we call a single command (that's the behavior we are testing) Expected outcome is expressed as either 0..N events or an exception. These events are determine both the state of the AR and what is published to the bus (for further consumption by CQRS architecture). These tests are generated by hand (or from the UI or by recording sessions). You just drop them into the folder of your liking inside the test project (this feels like a reduced dev friction as compared to .NET unit tests). There is even no need to launch Visual Studio and add item references. Build will pick scenarios up automagically via the means of of slightly modified project file: <Target Name=\"BeforeBuild\"> <CreateItem Include=\"Scenarios\\**\\*.txt\"> <Output ItemName=\"EmbeddedResource\" TaskParameter=\"Include\" /> </CreateItem> </Target> The output will be a nice Project.Tests.Scenarios.dll that contains all the scenarios as embedded resources . Afterwards, you just need to generate tests (one test for each resource). In NUnit you can do something like this (note how Rx makes things so elegantly simple): [Test, TestCaseSource(\"LoadScenarios\")] public void Test(Scenario scenario) { if (null != scenario.LoadingFailure) throw scenario.LoadingFailure; var observer = new Subject<Change>(); var root = new SolutionAggregateRoot(observer.OnNext); foreach (var @event in scenario.Given) { root.Apply(@event); } var interesting = new List<Change>(); using (observer.Subscribe(interesting.Add)) { DomainInvoker.RouteCommandsToDo(root, scenario.When); } var actual = interesting.Select(i => i.Event); ScenarioManager.AssertAreEqual(scenario.Expect, actual, scenario.Description); } TestCaseSource is NUnit native attribute that generates a unit test for each argument that is passed by the referenced collection factory. 
The latter could look like: public static IEnumerable LoadScenarios() { var assembly = Assembly.GetExecutingAssembly(); var clean = \"Kensho.Domain.Scenarios.Scenarios.\"; foreach (var name in assembly.GetManifestResourceNames().OrderBy(n => n)) { using (var stream = assembly.GetManifestResourceStream(name)) using (var reader = new StreamReader(stream)) { Scenario scenario; var testName = name.Replace(clean, \"\").Replace(\".txt\",\"\"); try { scenario = ScenarioManager.Parse(reader); } catch (Exception ex) { scenario = new Scenario {LoadingFailure = ex}; } var data = new TestCaseData(scenario) .SetName(testName) .SetDescription(scenario.Description); var failure = scenario.Expect.OfType<IFailure>().FirstOrDefault(); if (failure != null) { data.Throws(failure.GetExceptionType()); } yield return data; } } } Given all this, a folder like that: would be translated into an NUnit test suite like this: I'm still not sure if it is possible to affect the NUnit tree structure via the attributes, making it more sensible and clean. A few notes: The latest Resharper does not work properly with these attributes (they probably haven't upgraded to the latest NUnit engine yet). In this case the NewtonSoft JSON serializer is used to persist messages for the scenario purposes. Adding new tests is a really low-friction task. It's easy to have lots of them or delete unneeded ones. Running the unit tests is extremely fast, and it does not require anything other than the AR assembly and the message contracts assembly. Persistence ignorance is the rule here. Essentially, the unit tests form specifications for the domain. Theoretically they could be captured with the help of the UI (when some undesired behavior occurs) by people who are not familiar with the specific programming language. In essence, we might be separating domain knowledge from the platform/language used to actually code the AR. This should provide some development parallelization and outsourcing opportunities that complement native CQRS features in this area. AR unit tests in this case are not really fragile (they test only explicit behaviors) and should protect against regressions really well, while allowing us to move forward incrementally and reliably with developing AR implementations for complex scenarios (something that I've been stuck with so far). All in all, this experience is just based on my attempt to learn CQRS/DDD/ES by means of running a learning project. You need to stay ahead of the current development requirements in a fast-paced environment like Lokad . Yet so far this scenario-based approach (inspired by using Fiddler recording sessions to unit test REST APIs) seems to look quite good, despite the fact that it was formalized and implemented just last night. Yet I'm really interested if there are any ways to improve the experience (esp. reducing friction per test). So here are a few questions for the readers : Does anybody know any simpler serialization format for messages than JSON? How do you test your domains? Are there simpler ways to organize, manage and run tests? What naming and organizing conventions do you use? How do you organize your tests (both code-based and scenario-based) and keep them in sync with the big-picture requirement descriptions? Any hints on improving this testing experience? PS: You can check out the xLim 4: CQRS in Cloud series for the latest materials on this topic."}
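For readers who want to try the format, here is a minimal sketch of what a parser behind something like ScenarioManager.Parse could look like. This is hypothetical, not the actual code: it assumes each message sits on one line as ".TypeName { ...json... }" and that message types can be resolved by name.

using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public sealed class ParsedScenario
{
    public List<object> Given = new List<object>();
    public List<object> When = new List<object>();
    public List<object> Expect = new List<object>();
}

public static class ScenarioParser
{
    public static ParsedScenario Parse(TextReader reader, Func<string, Type> resolveType)
    {
        var scenario = new ParsedScenario();
        List<object> current = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            // section keywords switch the target list
            if (line.StartsWith("Given")) { current = scenario.Given; line = line.Substring(5).Trim(); }
            else if (line.StartsWith("When")) { current = scenario.When; line = line.Substring(4).Trim(); }
            else if (line.StartsWith("Expect")) { current = scenario.Expect; line = line.Substring(6).Trim(); }
            if (string.IsNullOrEmpty(line) || current == null) continue;

            // ".ContextCreated { ... }" -> type name plus JSON body
            var brace = line.IndexOf('{');
            if (brace < 0) continue;
            var typeName = line.Substring(0, brace).Trim().TrimStart('.');
            var json = line.Substring(brace);
            current.Add(JsonConvert.DeserializeObject(json, resolveType(typeName)));
        }
        return scenario;
    }
}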
{"url":"/post/self-improvement-process/","title":"Self-Improvement Process","text":"Self-Improvement Process Lately I've been doing a little bit of new and well-forgotten old things, while trying to become a better me. They also help me to reduce work pressure by offering something new for the brain to chew on. I just wanted to share these things (at least for the sake of writing down the links). We'll talk a little bit about: git; vim; LMAX and Martin Fowler; VerseQ; Lokad.CQRS and functional approach. Git I've started using git on a number of new learning projects. As I've mentioned it on a couple of occasions already, I start liking git more than mercurial, despite its steeper learning curve. By the way, Google Code has finally added git to their project support. Although they are way too much behind github both in usability and commits. Vim Subj, the good old editor coming from the terminal era. Exactly the one, where \"the best way to generate a random string is to ask a student exit vim\" (by the way, it is \":x\" to exit with save-if-modified). I find using it more and more for common text editing tasks. That's where I'm writing articles and blog posts these days. And I'm getting really tempted to start replacing IDEs with vim in non-.NET environments. Developer's mind is better than any IDE anyway, especially when he or she deliberately keeps projects dead-simple. Now, if you got interested in vim, then that's what your future will look like . If this didn't scare you, here's a quick intro from the eyes of guy from VS environment. LMAX and Martin Fowler It seems that all this CQRS/ES stuff is getting mainstream, since Martin Fowler wrote a bliki post on CQRS and really good overview of LMAX architecture: the latter is essentially an in-memory circular buffer that also acts as a queue for the event sourcing messages; processed by a single writer and multiple readers. All workers have access to each other location pointers and this avoid any locks altogether. Predefined buffer size and one-writer-only approach allows to achieve high hardware affinity (in other words, CPU likes it a lot). The study is highly recommended for a read . By the way: bliki is term for versioned blog/wiki coined by Martin Fowler a while ago. These days people just use github with markdown (for example, this Lokad.CQRS Study isn't a page, but just a markdown file in a github repository). Don't read Martin's pounding on branch-per-feature. I believe he got it wrong (and got quite a bit of pounding from the community afterwards). VerseQ To keep up with the learning process, I've stepped back on improving my keyboard typing skills. I believe it would be a wise investment of my time to improve on something that is so tightly related to the job and thinking as the typing is. This (when coupled with vim) feels like reducing VS development friction by introducing ReSharper. For this task I'm currently sticking to VerseQ , which was recommended to me by Vsevolod . VerseQ tutor simply keeps on generating pseudo-random strings for you to type. However these strings are not entirely random: they include chords common to the language you practice (i.e.: \"jg\" is not common to English, while \"li\" is more common); next random string is based on the statistics captured so far. It will force you to practice your slowest keys and key combinations (or most erroneous ones). Recommended to give it a try, if you are typing more than 50 lines of text per day (emails or code alike). 
Functional Stuff In the previous post I mentioned that it is possible to get a whopping number of messages per second on a single thread in Lokad.CQRS using lambda dispatchers . This comes from a simple realization that everything is either an aggregate/saga or a function. For example, handlers are just functions partially resolved from the container and executed against the message. Now, to be fair, with this kind of development you can easily achieve much higher throughput if you switch to some other platform. I'm currently looking at akka for the small specific cases where extremely high performance would be needed (e.g. message routing for Lokad.CQRS or doing some high-frequency data transfer). Fortunately enough, the envelope and data formats of Lokad.CQRS are relatively cross-platform (protobuf to the rescue). With akka you can get something like a few million messages processed per second . The latter might be another life-saving option for devs and start-ups (constrained on resources) who are using Lokad.CQRS and hitting throughput limitations on elements of a decoupled/cloud system that are a bit hard/expensive to scale (load balancers and routers, for instance). Obviously I'm not planning to switch anything in Lokad.CQRS to akka and dead-simple functions editable in vim (no matter how tempting this might look, there always are real-world costs associated with such dreams). Yet, where it is possible to simplify things (e.g. by throwing another framework out of the window), the experience and ideas will be reused in the approach. Especially if they provide a clear and simple path to massively boosting the performance of certain nodes in a decoupled app by switching them to non-.NET stacks. PS: if you post a comment and it does not show up within a day, please drop me a line . I get the feeling Squarespace has messed up its spam protection badly again."}
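To make the "everything is a function" point concrete, here is a minimal sketch of a lambda-based dispatcher (illustrative only, not the actual Lokad.CQRS code): a handler is just a function registered per message type.

using System;
using System.Collections.Generic;

public sealed class LambdaDispatcher
{
    readonly Dictionary<Type, Action<object>> _handlers =
        new Dictionary<Type, Action<object>>();

    // a handler is just a function; no marker interfaces needed
    public void Register<TMessage>(Action<TMessage> handler)
    {
        _handlers[typeof(TMessage)] = m => handler((TMessage)m);
    }

    public void Dispatch(object message)
    {
        Action<object> handler;
        if (!_handlers.TryGetValue(message.GetType(), out handler))
            throw new InvalidOperationException("No handler for " + message.GetType());
        handler(message);
    }
}

// usage:
// var d = new LambdaDispatcher();
// d.Register<string>(s => Console.WriteLine("got: " + s));
// d.Dispatch("hello"); // single thread, no reflection on the hot path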
{"url":"/post/separation-of-contract-and-implementation-in-software/","title":"Separation of contract and implementation in software","text":"Separation of contract and implementation in software A bit earlier I posted a sample drawing with evolution options for a component in a big software application . This diagram is merely a convenience snapshot of performance optimisations and tech options available at any given point in time. Knowing about them in advance helps in planning future development. Each node on the tree of component evolution represents a design pattern that has its own advantages and costs. Costs almost always include increased complexity. Sometimes such design pattern also has certain affinity with other patterns, making it simpler to evolve component to the next level. For example, consider a component that happened to arrive at the following design during its evolution (after multiple iterations): We can say that this component has a public contract and an internal implementation. Public contract could say: Component implements a certain documented API with JSON, XML and ProtoBuf formats; these API interactions can be scripted or tested using tools on a variety of platforms; we can expert 99.9% uptime of the component; query response times under 50ms in 99% of the cases; commands are acknowledged synchronously in 500ms in 99% of the cases; we expect API to have throughput of 1500 transactions per second, if deployed in single-node configuration (all transactions extra would be rejected with 503 Retry Later ); Queries have eventual consistency of less than 1000ms in 99% of the cases. All of the above is easily achievable, for example, using .NET on Windows Azure with multi-worker deployment configuration and an efficient Event Store. Yet, please note, that the public contract does not say anything about the implementation details . It's normally up to the team to decide what these should be (better if that team is also follows the mantra \"you build it, you run it\"). This means, that at any given point in time, internal implementation might change in order to accommodate new requirements. Implementation might also change if requirements get relaxed and we can actually degrade the performance and get rid of some complexity in exchange. If component boundaries are defined well (as driven by strategic design and its evolution), then public contract will not change often. In such case development challenges are merely constrained to deal with the implementation details, shifting it along a well-known evolution path in order to achieve well-known benefits. We could actually fine-tune specific components to meet certain requirements . We might even say that with this natural approach, large scale software design emerges as a by-product of design process driven by two distinct feedback loops : evolving strategic vision which deals with business capabilities and how they are implemented by composing together components, defined by their contracts. fine-tuning component implementations to fulfil their contracts. However, while doing all this evolution, it is really important to break such process into small steps, which can be handled separately. Doing work in tiny bites provides you with opportunity to step back, acknowledge feedback, reflect upon the design at strategic and implementation levels. 
I think that one would find more opportunities for such separation if we: decompose software into small focused components; handle component contracts separately from their implementations; plan the evolution process in advance, where possible."}
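A minimal sketch of the idea in code (all names and types here are illustrative, not from any real project): the interface below is the only thing consumers depend on, while the event-sourced, or any other, implementation behind it stays free to evolve along the tree.

using System.Threading.Tasks;

public sealed class TaskView { public string Id; public string Title; }
public sealed class AddTaskCommand { public string ProjectId; public string Title; }

public interface IProjectComponent
{
    // query side: eventually consistent, target response under 50ms (99%)
    Task<TaskView> GetTask(string taskId);

    // command side: acknowledged synchronously, target under 500ms (99%)
    Task Execute(AddTaskCommand command);
}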
{"url":"/post/shelfcheck-cloud-scalability-challenge-for-cqrs-from-lokad/","title":"Shelfcheck - Cloud Scalability Challenge for CQRS from Lokad","text":"Shelfcheck - Cloud Scalability Challenge for CQRS from Lokad If you have been following Lokad.CQRS docs , you've seen that we openly publish use cases of internal projects that are based on this architectural approach. Another massive project is going to be added to this case library in due time. Shelfcheck, on-shelf availability optimizer has been announced by Lokad today. You can read the official news blog entry, or just bear with me for more technical details. Retail industry (all kinds of shops, stores, hypermarkets etc) has one long problem that has plagued it for years: out-of-shelf problems . You as a customer can frequently discover them while doing grocery shopping and discovering that store is out of your favorite brand of beer, preferred snack or some specific type of diapers for the newborn. This small frustration can cause you to get a substitute product, delay the purchase or even go to the competing store. That's a small problem that roughly accounts for 100 billion USD losses for the retail industry worldwide . Solving that problem, among all other options, requires tracking and analyzing data at every point-of-sale (shop, hypermarket, etc) to detect out-of-stock issues as soon as possible. Previously this seemed to be impossible due to the sheer amount of number crunching involved . In order to run almost real-time analysis for thousands of stores in a retail network, where each store can have inventory of 10k-100k products, you either need a huge data-center... or a cloud. Now, that's where Lokad comes in with Shelfcheck , on-shelf availability optimizer . We plan to deliver highly affordable service that will be provided as a subscription with pay-as-you-go pricing. Naturally, the project will be deployed into the cloud, in order to handle all the load; with elastically scalable architecture. This would allow really efficient use of resources to keep the operational costs and prices extremely low. Technologically the project does not seem to be extremely challenging, given Lokad.CQRS and Lokad.Cloud to support this endeavor. However it should provide better insight into designing, building and running systems of such scale in the cloud. I'm personally really excited about the scope of the problem we are trying to solve for the industry with Shelfcheck . Stay tuned for more details."}
{"url":"/post/sim/","title":"💊 Simulation","text":"💊 Simulation Once upon a time I got fascinated with an underrated piece of software called FoundationDB. This was before it was acquired by Apple, taken offline and then made Open Source again. You can read more about FoundationDB is cool in FoundationDB is Back! . One of the amazing things about FoundationDB was how well it was tested. The team was running the entire distributed system (cluster) inside a virtual environment. They could simulate network failures, power outages and other problems that are hard to reproduce. The entire test suite could run years of simulated runtime in minutes, while still allowing to deterministically reproduce that specific fault condition. I strongly recommend watching this 40 minute video by Will Willson. Needless to say that I got fascinated by the idea and wanted to learn and re-implement it. This section is a collection of exercises and materials that helped me to make some progress in this area."}
{"url":"/post/sim/actors/","title":"Simulate Ring of Actors","text":"Simulate Ring of Actors This is a naive implementation that explores the idea of simulating a ring of N actors that send the messages to each over M times. Concept of the ring benchmark and simulated actors is taken from the gist by Preetam Jinka. The implementation is quite simple and allows us to focus on core concepts: messages used for the communication; actors representing individual processes; simulation environment simulation loop . Implementation Let's start by defining a message and a simulation environment to hold a global inbox. public struct Message { public readonly int Recipient ; public readonly object Body ; public Message ( int recipient , object body ) { Recipient = recipient ; Body = body ; } } public sealed class Env { readonly Queue < Message > _messages = new Queue < Message > (); public void Send ( int recipient , object message ) { _messages . Enqueue ( new Message ( recipient , message )); } public bool GetNextMessage ( out Message msg ) { return _messages . TryDequeue ( out msg ); } } Actor is a class that has a recipient and can send a message to it: public class Actor { readonly int NextActor ; int _counter ; readonly Env _env ; readonly int _m ; public Actor ( int nextActor , Env env , int m ) { NextActor = nextActor ; _env = env ; _m = m ; } public void HandleMessage ( object message ) { _counter ++ ; if ( _counter <= _m ) { _env . Send ( NextActor , message ); } } } We can join everything together by the actual simulation loop. It sets up a ring of actors and then sends the first message. class Program { static void Main ( string [] args ) { const int n = 1000 ; const int m = 1000 ; var actors = new List < Actor > (); var env = new Env (); for ( int i = 0 ; i < n ; i ++ ) { var next = ( i + 1 ) % n ; actors . Add ( new Actor ( next , sim , m )); } env . Send ( 0 , new { hello = \"world\" }); var watch = Stopwatch . StartNew (); while ( env . GetNextMessage ( out var msg )) { actors [ msg . Recipient ]. HandleMessage ( msg . Body ); } Console . WriteLine ( watch . Elapsed ); } } Given N=1000 and M=1000 , the result on my machine is: dotnet SimRing.dll 00:00:00.0260680 Questions How would you implement a ring benchmark in golang? This .NET Core implementation is way faster than Erlang. What does Erlang spend time on? Why is this implementation faster than FoundationDB Flow? What extra work could the Flow do?"}
{"url":"/post/sim/async/","title":"Simulate with Async","text":"Simulate with Async This project demonstrates plugging into async / await capabilities of .NET Framework to run an application in a simulated mode. The simulation will inject random faults and fast-forward time. Source code is available on github . About SimAsync The purpose of the simulation is to automatically find expensive bugs that are otherwise hard to catch or reproduce. In order to fulfill this promise a simulation needs to exhibit a few specific properties. In the previous projects we've already explored Determinism and Time acceleration . This sample project introduces two more capabilities: Fault injections and Simulation of the parallel processes . For the purpose of this exercise we focus only on a single-node deployment without any complex storage or configuration. Determinism This simulation is deterministic . Two runs with the same Seed parameter will yield identical results. You can use random Seed to discover new failure scenario. You can also re-use Seed from the past to reproduce a simulation run. Time Acceleration The simulation fast-forwards time . Although the logic uses random delays to represent disk delays, freezes or network outages, the CPU doesn't actually need to sit idle during this delay. We control the scheduler and we could fast-forward to the next interesting moment in time. Years of the simulated time could pass in hours of real-time. Fault Injections Random faults are injected along the way to represent some of the bad things that could happen in reality to our code. After all, we want to capture, debug and fix some of these issues before the code is deployed to the production. Parallel processes Parallel execution is simulated (even though the simulation itself is single-threaded). Thanks to the state machines generated by await / async we could actually suspend execution of one execution path for some time, while other paths will continue running. For example, Actor 1 could experience a storage freeze for 10 ms while reading from the database. Scheduler will switch to the other pending tasks (e.g. message handling by Actor 2 ) before coming back to Actor 1 to continue execution. Assignments Questions This simulation reveals a problem with the application. What is the root cause and how would you fix it? This simulation uses an in-memory dictionary as a DB. What would it take to implement something real? What kinds of failures could you inject into a simulated storage? Tricky Questions How would you simulate: your favorite database? your favorite commit log or message bus? a load balancer? faulty disk controller? fail-over between two data centers with eventual replication? Bonus After fixing the first problem discovered by the simulation you are likely to hit one more (it might require running longer simulations). How would you handle it? Remember, that the any simulated logic will eventually be replaced by the production code. They have to match."}
{"url":"/post/sim/cluster/","title":"Simulate CQRS/ES Cluster","text":"Simulate CQRS/ES Cluster This is work-in-progress on running event-driven distributed systems inside discrete event simulation . The purpose of this simulation research - to be able to run a distributed application inside a deterministic simulation while bombarding it with various faults that are hard to reproduce in the real world (but are still disruptive for the production systems). This project builds upon the Simulate with Async project, extending it with more features (including a simplified networking stack). Try it out The project is available on github . This is a .NET Core 2.0. You should be able to open it in a IDE (e.g. in JetBrains Rider) and run Runtime/SimMach.csproj project. The output should be something like this: Alternatively, you could try launching everything from the CLI with something like: $ dotnet run --project Runtime Details This project builds up on the previous steps and introduces: Simplified simulation of TCP/IP . This includes connection handshake, SEQ/ACK numbers and reorder buffers. There is now proper shutdown sequence and no packet re-transmissions. Durable node storage in form of per-machine folders used by the LMDB database. Configurable system topology - machines, services and network connections. Simulation plans that specify how we want to run the simulated topology. This includes a graceful chaos monkey. Simulating power outages by erasing future for the affected systems. Network profiles - ability to configure latency, packet loss ratio and logging per network connection. Dive in To dive in take a look at the Program.cs . It generates a simulation scenario that is then executed. A scenario could look like this: public static ScenarioDef InventoryMoverBotOver3GConnection () { var test = new ScenarioDef (); // define network connections and provide network profiles for them test . Connect ( \"botnet\" , \"public\" , NetworkProfile . Mobile3G ); test . Connect ( \"public\" , \"internal\" , NetworkProfile . AzureIntranet ); // install services on the machines test . AddService ( \"cl.internal\" , InstallCommitLog ); test . AddService ( \"api1.public\" , InstallBackend ( \"cl.internal\" )); test . AddService ( \"api2.public\" , InstallBackend ( \"cl.internal\" )); // configure a bot that will create workload and verify results var mover = new InventoryMoverBot { Servers = new []{ \"api1.public\" , \"api2.public\" }, RingSize = 7 , Iterations = 30 , Delay = 4 . Sec (), HaltOnCompletion = true }; test . AddBot ( mover ); // define a plan for the simulation (who will control the machines) // this is optional, but a chaos monkey is cute... var monkey = new GracefulChaosMonkey { ApplyToMachines = s => s . StartsWith ( \"api\" ), DelayBetweenStrikes = r => r . Next ( 5 , 10 ). Sec () }; test . Plan = monkey . Run ; return test ; } Installer functions bring together the necessary dependencies and return an instance of IEngine : static Func < IEnv , IEngine > InstallBackend ( string cl ) { return env => { var client = new CommitLogClient ( env , cl + \":443\" ); return new BackendServer ( env , 443 , client ); }; } static IEngine InstallCommitLog ( IEnv env ) { return new CommitLogServer ( env , 443 ); } BackendServer is a simplistic event-driven server that has its own projection thread and a (command) request handler. It commits data to the CommitLog from which other server instances could get the same data. 
In theory, the same business logic should be able to run in a real-world environment as well."}
{"url":"/post/sim/cpu/","title":"Simulate CPU Scheduler","text":"Simulate CPU Scheduler This is a reference project for simulating event-driven systems and understanding the passing of time. It is a part of the materials for two R&D projects: Deterministic simulation of event-driven systems (see High Availability and Performance ). Event-driven view engine (or \"event-driven mainframe\"). Source code is available on github . Domain Domain is based on the assignment: CS452 Project Job Scheduling Simulation . The task is to create a Round-Robin job scheduler for the system with one CPU and one IO-device. This isn't going to be a real job scheduler, but rather a simulation. processes are scheduled to arrive at a specific time based on the spec; each process has a sequence of CPU and IO tasks with varying duration; process always ends with a CPU task; the system has only one CPU and one IO device, for which the processes compete; if a process needs more CPU time than TimeQuantum , then the scheduler can interrupt its execution and put back into the queue, allowing the other processes to move forward; if multiple commands are to happen at the same point in time, then the order is: (1) handle new process; (2) handle process coming back from IO queue and (3) handle preempted (interrupted) process. Implementation In this implementation we explicitly model (spell out) following concepts: future and the passing of time (see Simulation.cs ) future requests as commands (see Commands.cs ) job scheduling algorithm running in a simulation (see Aggregate.cs ) A mixed workload with the TimeQuantum of 2 will be executed when you compile and run the project: 3 3 2 5 8 7 4 4 1 4 6 3 2 5 2 7 4 8 4 8 2 10 2 7 5 6 10 2 1 10 2 13 4 1 15 1 12 4 8 6 Each line defines a single process. For instance, first line is: 3 - process arrival time (always ascending); 3 - number of CPU bursts; 2 5 8 7 4 - CPU for 2 time units, IO for 5... CPU for 4 time units. Pay attention Time is an illusion , modeled by running the simulation towards the discrete commands we have planned in the future. Future is represented as a list of scheduled commands sorted by their planned execution time. Multiple commands could be scheduled at a single point in time. Domain logic ( Aggregate.CommandPriority in this case) determines the order of command execution. While handling a command, the domain logic could schedule another command to happen in the future. The simulation ends when it has no future left. Questions What is a fuzz testing? How could it be applied to this project? How would you reproduce and debug issues discovered by the fuzz testing? Bonus Points The implementation almost works. For some reason, while running the mixed job workload with TimeQuantum=2 : the expected result (according to the CS452) is: P0 ( TAT = 60 , ReadyWait = 18 , I / O - wait = 16 ) P1 ( TAT = 7 , ReadyWait = 3 , I / O - wait = 0 ) P2 ( TAT = 47 , ReadyWait = 7 , I / O - wait = 20 ) P3 ( TAT = 86 , ReadyWait = 24 , I / O - wait = 22 ) P4 ( TAT = 19 , ReadyWait = 5 , I / O - wait = 1 ) P5 ( TAT = 75 , ReadyWait = 7 , I / O - wait = 21 ) while the simulation actually returns : P0 ( TAT = 60 , ReadyWait = 18 , I / O - wait = 16 ) P1 ( TAT = 7 , ReadyWait = 3 , I / O - wait = 0 ) P2 ( TAT = 47 , ReadyWait = 7 , I / O - wait = 20 ) P3 ( TAT = 86 , ReadyWait = 24 , I / O - wait = 22 ) P4 ( TAT = 19 , ReadyWait = 5 , I / O - wait = 1 ) P5 ( TAT = 75 , ReadyWait = 5 , I / O - wait = 23 ) Note the different ReadyWait and I/O-wait values for P5 ."}
{"url":"/post/sim/logistic/","title":"Logistic Simulation","text":"Logistic Simulation The concept of deterministic discrete event simulations could be extended to model more interesting scenarios. Take a look at this imaginary map of locations (cities) and roads connecting them: Aside from demonstrating my slight fascination with steampunk and games, we could use this map to describe a domain: there is a world with locations that are connected by the roads ; we have trucks that travel on these roads, travel takes time . How can we find the fastest (not shortest!) path between any two locations: A and B? One way to approach that is to run a simulation. We will launching truck instances from the location A in all directions at once. At every intersection we will fork and create more instances in all unvisited directions. Time will move forward. As soon as the first truck instance arrives to location B, we can abort the simulation. The route that this winning truck took would be our fastest route. Another term for this algorithm is \"A* Search\". It is described well on wikipedia . Here is how we could implement the algorithm in Python: You can read more about this venue in the following articles: Implementing Basic Logistics Simulator Add speed model to logistic simulation And in following Transport Tycoon Software Katas (I'm linking only the relevant catas): S01E01: Write a program to print time it would take to deliver cargo S02E01: Find shortest path for travel on a map S02E02: Compute ETA with fixed speed S02E03: Mine historical data for travel speed S02E04: Connect speed model to simulation"}
{"url":"/post/software-war-starts-with-a-map-context-map/","title":"Software War Starts with a Map, Context Map","text":"Software War Starts with a Map, Context Map Let us continue revisiting the big picture overview of all DDD and CQRS things, based on the things I've learned recently in collaboration with Gregory Young and Vaughn Vernon . We will start with the most important things. Let's imagine that you are a software manager challenged with a new project to deliver. The project is really interesting and challenging, customer can talk about it for hours. Sometimes there will be references to similar projects to copy and cool technologies that should help. You and your team are already itching to start flush out some specs and start developing. Is it really where you start? Wrong. Complex software project is a war - it is unpredictable and brutal fight to hit moving target, while always running short on time and resources. NB: I'm currently talking about \"startup\" environments. Enterprise situations with available resources and preallocated budgets are a different story that we are not interested right now. You don't enter software war with low-level details. You don't start with detailed tech specifications or strategies (unless you want to have perfectly planned and failed blitzkrieg on your hands). First you need a strategic map with bird's-eye view of the current situation at hand. Maybe this war is not even worth entering? One of the approaches of getting such map is to make your way through \"Domain-Driven Design: Tackling Complexity in the Heart of Software\" by Eric Evans (just make sure to read it this way: chapter 11 - END; Beginning - END). Domain Driven Design (DDD) is a special way to look at core business concepts that exist in a real world and deeply connect them to evolving domain model, that could be easily implemented in software to solve specific problems. In other words, it can help to look at real world, understand it and divide into a set of separate territories that can be conquered separately, step by step. “One step at a time, ... I can walk around the world. Watch me.” (c) L.M.Bujold Separate territories are called \" bounded contexts \", while the big map is called \" context map \". We use DDD toolset (set of rules, hints and guidelines) to look at the complex mess of important business concepts and identify natural boundaries around some of them. These natural boundaries would be discovered in the real world and then transferred to our \"context map\". Area within the boundary will be called \"bounded context\" This process is equivalent to geography or geology, where explorers map the terrain and draw borders around areas that look the same according to some arbitrary criteria. Criteria for finding such natural boundaries can be extremely different and will depend on your situation (synonym to \"situation\" is \"context\"). As it has always been: drawing maps is a challenge, especially when half of the territory is terra incognita (or covered by the \"fog of war\"). Please keep in mind, that Context Map is not a plan of some distant future. It is merely a reflection of current situation that we have in our software project or a company. It is a map of terrain that has to stay updated and real, in order to be of any use. Let me be clear: you don't necessarily need to map all the terrain of software project on a context map, before starting to work on it (sometimes research and development is the only way to move forward). 
However, staying aware of the business situation (context), uncharted areas and potential unknown risks will make you more prepared for unexpected changes in the future. Being able to divide a complex problem area into a set of separate bounded contexts makes it easier to think through the situation and approach it. The biggest advantage of separate bounded contexts is their explicit separation from each other. Once we have clear boundaries, we can consciously focus on the situation inside in relative isolation. For each specific context within a boundary, we can pick the most efficient combination of technologies, teams and development approaches. For example, given a SaaS battlefield and a new secret \"Product XYZ\", the latter could allow division into two separate bounded contexts. One specific bounded context (highly complicated and requiring a special approach) can be handled by a dedicated research and development team, while the neighboring BC could be delegated to a team specializing in solving diverse and messy situations. As you can already see, our context map not only allows us to start tackling the problem space, but also enables us to consider the allocation of limited resources over a seemingly endless battlefield of possible projects. The same works for risk mitigation, time and long-term campaign planning. At certain moments in the course of a software war, you can even identify situations where certain projects could be worth a sacrifice, a risky spike, a massive refactoring (or all of these at once). It might be really boring to read all this text without any mention of cool technologies (e.g. a self-rebuilding CQRS view projection host with unlimited scalability for reads), however technology choices are mostly irrelevant at this higher level . Technology or an architecture style can be an enabling factor or even a strategic asset, but they are never the ones that drive the campaign. One of the best places to learn such things in practice is by working in a startup. Such places tend to have a rapidly changing battlefield and rather low tolerance for stupid decisions. In the war of startups you win or you die."}
{"url":"/post/some-side-thoughts-on-the-development/","title":"Some side thoughts on the development","text":"Some side thoughts on the development Empirical observation from the Project Management field of knowledge tells us that planning ahead of time saves a lot of resources down the road (entire PMBoK thing is basically just about this and the common dictionary thing). As the level of complexity increases, projects require more planning in order to be efficient and succeed. Large-scale construction projects is one of the most vivid examples of that. So, why do not people use similar principles in the development widely? Then, we could improve our success rate and save the resources by implementing certain functionality just with the component interactions (vertical solution architecture) instead of hard coding it. And I'm not even talking about constant quality, efficiency and risk management. Reasons like \"you just do not have the crystal ball to know about all the requirements\" or \"development is an art, and you can neither plan or control this\" just do not seem to be valid. A. You can know about all the requirements that matter or that could matter (you just need to concentrate on these). All the information is available. You just have to work hard with it. B. Development is purely about logics (although efficient logical compositions are beautiful, just like it is with the math equations). If the same unit-test being run three times could fail once due to some unpredictable magic, then this would be art to develop something stable in such environment. PS: the answer to the \"Why\" question is simple - the development looks too easy to start these days, while planning requires some effort. However, as the project goes - they swap places. PPS: There actually are development situations where non-deterministic test results are possible (i.e.: sampling performance of neural network \"learning\" speed or the capability of evolution algorithms to avoid local extrema). Dealing with these could be considered an art. Fortunately, every-day development projects rarely have to deal with these."}
{"url":"/post/some-tips-on-writing-event-handling-code-in-c-net/","title":"Some tips on writing event handling code in C# .NET","text":"Some tips on writing event handling code in C# .NET Shahar has posted on using EventArgs.Empty . Let's be a bit more efficient with the events than that. How often do you write repetitive code like this? public delegate void MessageEventHandler(object sender, MessageEventArgs e); [Serializable] public sealed class MessageEventArgs : EventArgs { // some class contents } // ... event MessageEventHandler OnMessageArrived; private void RaiseOnMessageArrived(string message) { // BTW, what about thread-safety of this call? if (OnMessageArrived != null) { OnMessageArrived(this, new MessageEventArgs(message)); } } Let's drop the entire RaiseOnMessageArrived method, MessageEventArgs class, the delegate and replace them by: event EventHandlerFor<string> OnMessageArrived = (sender,e) => {}; OnMessageArrived.Raise(this, message) In order to do that we have to follow simple rules. 1. Apply good citizenship rules to events, too and always initialize them with empty handler. This rule is optional, but new syntax makes this a breeze with (sender, args) = {}; 2. Leverage generics were it is appropriate (i.e. see EventHandlerFor{T} and EventArgs{T} from the Lokad.Shared library within Photon.NET ): /// <summary> /// Represents the method that will handle a typed event. /// Associated data is read-only /// </summary> public delegate void EventHandlerFor<T>(object sender, EventArgs<T> args); [NoCodeCoverage] [Serializable] public class EventArgs<T> : EventArgs { /// <summary> Read-only data associated with the event </summary> public T Data { get; private set; } /// <summary> /// Initializes a new instance of the <see cref=\"EventArgs{T}\"/> class. /// </summary> /// <param name=\"data\">The data.</param> public EventArgs(T data) { Data = data; } } 3. Use extension methods to streamline your code. One of these extensions might look like this one (there are more extensions in Lokad.Shared): public static void Raise<T>(this EventHandler<EventArgs<T>> handler, object sender, T data) { Enforce.ArgumentNotNull(handler, \"handler\"); handler(sender, new EventArgs<T>(data)); } Note: if you do not initialize your events with non-null empty block, the Enforce call has to be replaced with null check. Update: here's one thing reminded by Denis. If you follow the good citizenship principle and initialize your event with non-null empty block (and do not do null assignments), then you are pretty much thread-safe. Enforce statement can be ignored from the thread-safety point of view, since it is just an sanity check to help you enforce good citizenship code. It even could be dropped from the production code by simply adding conditional attribute on the method. But if you allow your event statements to be null and use the event field directly, then you have to take possible threading issues into account. By the way, there is an open source Lokad Shared Libraries project that contains these helper routines (as well as lots of other things that help me in my everyday C# development)"}
{"url":"/post/space-travel-and-infinitely-scalable-solutions/","title":"Space Travel and Infinitely Scalable Solutions","text":"Space Travel and Infinitely Scalable Solutions Recently, I’ve been re-reading Pat Helland’s paper on infinitely scalable solutions in the same time interval with finishing A second chance at Eden follow-up to Night’s Dawn trilogy by Peter F. Hamilton . Something clicked in my mind and Sci-Fi concepts of space travel got strongly locked with the scalable architectures. It became much simpler to understand cloud enterprise development and challenges of infinitely scalable solutions . I strongly recommend at least glancing through Helland ’s paper (just 10 pages of extremely useful information and thoughts) before proceeding with this journal post. And if you are a big fan of Science Fiction, then Night’s Dawn is another recommendation (thanks, Joannes!). Basically the concept is simple: while building our scalable systems we treat aggregate roots as space ships (ARs are similar to entities in Helland’s paper) . In the context of the repository pattern: aggregate roots are the only objects your client code loads from the repository . In Domain-Driven Design: the root is the only member of the AGGREGATE that outside objects are allowed to hold references to . There are a few rules to start with: Human race always starts with a Solar system and begins expanding outwards. Universe might have multiple star systems. Space ship is obviously smaller than a star system and always fits in there. As human race expands across the galaxy, it builds more and more ships. Although a single ship can always fit and be sustained by star system (even if the ship is as big as Death Star), the entire human fleet might not fit or be sustained. Now same rules apply to the aggregates or entities as well: Scalable application usually starts with a single machine. Cloud fabric or data center might have multiple machines available for the app. Aggregate root (entity) can always fit into a single machine (or a small cluster). As application grows it gets to handle more and more aggregates; they are re-distributed towards the new machines as needed. Although a single AR fits on the machine (or a small cluster), entire application might not (hence the need to expand). So far - so good. Let’s explore the universe. Ships are small and relatively safe. When you need to talk to the crew member - you call. It takes milliseconds to for the connection to be made. Space is a large and unpredictable place. While calling from the ship to the ship, you never know how far your recipient is going to be. It can be a few light seconds away, a few minutes away or it could have traveled to the other side of the galaxy. It is usually possible to send a message from ship to ship via the hyper-space relays, but you never know when you’ll get the response. Message might even need to chase the ship for a little bit. Since space is a large and unpredictable space, hyper-space relays need to be redundant, sending the message via a few routes. This guarantee that it will eventually get through. Although a ship might get a few copies, this is not a big deal, since it’s trivial to look up the correspondence with the sender and discard the duplicates. There are always exceptions. When it is really needed and one has credits at hand, it is possible to buy quantum entanglement channel for the ships. It will be fast and reliable but extremely expensive. 
Besides, it locks the ships together (it’s hard to have a reasonable talk when one ship is in normal space and the other accelerates towards the speed of light). Let’s see how these future principles apply to the modern world of distributed systems: An aggregate is a native consistency boundary. Since it fits into the memory of a single machine, you can always ensure that everything is rolled into the transaction. Events can be propagated within the aggregate root instantly and reliably. Each event that goes outside will take an unknown amount of time. Queues usually do not guarantee that a message will be delivered exactly once and in order (although they try their best at it). That’s common to Azure Queues, for example. It is the responsibility of the recipient entity to track and handle duplicates, restoring proper ordering where this is important (a minimal sketch of such duplicate handling follows after this post). Activities (sagas), managing entity-to-entity partnerships, are usually responsible for such operations. There are always exceptions. If it is really needed and one has development resources at hand, some sort of direct messaging could be established (i.e.: based on TCP abstractions), but it is rather unusual and expensive. The same goes for transactions - although they are usually better kept within the AR, it is possible to bend the rules if it is worth the effort and increased complexity. We usually can't be sure about the exact delivery time of a message or the state of its recipient until a reply or some notification comes in (which might take some time). During this interval the state of the recipient is uncertain to us, and we need to design for this fact. Lessons learned for me: build your space ships small and ready to travel around the known space as it expands. In other words: Infrastructure should be capable of evolving in order to handle scaling and repartitioning. Business logic and entities should avoid doing anything that anchors them to each other or to a specific partition. Otherwise, when the time comes to move and scale, it will hurt. Message-based architectures allow building scalable and decoupled systems. Yet they bring a degree of uncertainty and eventual consistency into the solution. We need to start learning from the real world and our past. Our ancestors were building long-running interactions, transactions and vast organizations with snail mail, telegraph and various analogues of the Pony Express. Message delivery was indeed slow and unreliable back then, taking months instead of fractions of a second. This article builds upon and continues the xLim 4 (CQRS in the Cloud) research and development series, attributing to the CQRS body of knowledge being gathered within this Journal."}
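As a minimal sketch of the duplicate handling mentioned above - assuming each incoming message carries a unique id (the contract below is hypothetical, not a Lokad.CQRS type):

using System;
using System.Collections.Generic;

// Minimal sketch of an idempotent recipient; assumes every incoming message
// carries a unique MessageId (a hypothetical contract, not from the post).
public sealed class IdempotentRecipient
{
    // in a real entity this set would be part of its persisted state
    readonly HashSet<Guid> _seenMessages = new HashSet<Guid>();

    public void Handle(Guid messageId, Action apply)
    {
        // redundant delivery routes may produce copies; apply each message once,
        // like the ship discarding repeated mail from the same sender
        if (!_seenMessages.Add(messageId))
            return; // duplicate - discard
        apply();
    }
}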
{"url":"/post/specification-testing-for-event-sourcing/","title":"Specification Testing For Event Sourcing","text":"Specification Testing For Event Sourcing When you no longer need to worry about persistence of A+ES entities, their captured behaviours tend to get more complex and intricate. In order to deliver software reliably in such conditions we need non-fragile and expressive way to capture and verify these behaviours with tests, while avoiding any regressions. A+ES stands for Aggregates with Event Sourcing . This topic is covered in great detail in episodes of BeingTheWorst podcast . In other words we need to ensure that: tests will not break as we change internal structure of aggregates; test should be expressive to capture easily any complex behavior; they should match mental model of aggregate design and be understandable even by junior developers. One solution is to focus on specific use cases using “specifications” or “given-when-then” tests. Within such tests we establish that: given certain events; when a command is executed (our case); then we expect some specific events to happen. Primary difference between specification and normal unit test is that the former explicitly define and describe a use case in a structured manner, while the latter just executes code. Each A+ES specification can be executed as a unit test, while the reverse is not necessarily true. Due to strong synergy with DDD and no coupling to internal structural representation of A+ES entity, these tests capture intent and are not affected by internal refactorings (something common to CRUD-based Aggregate implementations) In C# you can express such test case as: [Test] public void with_multiple_entries_and_previous_balance() { Given( Price.SetPrice(\"salescast\", 50m.Eur()), Price.SetPrice(\"forecast\", 2m.Eur()), new CustomerCreated(id, \"Landor\", CurrencyType.Eur, guid, Date(2001)), new CustomerPaymentAdded(id, 1, 30m.Eur(), 30m.Eur(), \"Prepaid\", \"magic\", Date(2001)), ClockWasSet(2011, 3, 2) ); When( new AddCustomerBill(id, bill, Date(2011, 2), Date(2011, 3), new[] { new CustomerBillEntry(\"salescast\", 1), new CustomerBillEntry(\"forecast\", 2), new CustomerBillEntry(\"forecast\", 8) }) ); Expect( new CustomerBillChargeAdded(id, bill, Date(2011, 2), Date(2011, 3), new[] { new CustomerBillLine(\"salescast\", \"Test Product 'salescast'\", 1, 50m.Eur()), new CustomerBillLine(\"forecast\", \"Test Product 'forecast'\", 10, 20m.Eur()), }, 2, 70m.Eur(), -40m.Eur(), Date(2011, 3, 2)) ); } Test above is based on Lokad's version of A+ES Testing syntax, which was pushed to the master branch of Lokad.CQRS Sample Project . Look for spec_syntax class there. Please note, that these specifications test A+ES entities at the level of application services (they accept command messages instead of method calls). This means that any Domain Services (helper classes that are passed by application service down to aggregate method call) are handled by the application service as well. In this case we can use test implementations of domain services, configuring them via special events. Such events would be generated by helper methods (e.g.: Price.SetPrice(\"salescast\", 50m.Eur()) or ClockWasSet(2011, 3, 2) ). This allows us to reduce test fragility and also gain implicit documentation capabilities. Specifications as Living Documentation There are a few more side benefits of using specifications for testing business behaviours. First of all, specifications can act as a living documentation, which is always up-to-date. 
For instance, the rendered documentation for the specification above would look like: Test: add customer bill Specification: with multiple entries and previous balance GIVEN: 1. Set price of salescast to 50 EUR 2. Set price of forecast to 2 EUR 3. Created customer Customer-7 Eur 'Landor' with key 29c516fb-bdaf-48f5-a83d-d1dca263fdb6... 4. Tx 1: payment 30 EUR 'Prepaid' (magic) 5. Test clock set to 2011-03-02 WHEN: Add bill 1 from 2011-02-01 to 2011-03-01 salescast : 1 forecast : 2 forecast : 8 THEN: 1. Tx 2: charge for bill 1 from 2011-02-01 to 2011-03-01 Test Product 'salescast' (1 salescast): 50 EUR Test Product 'forecast' (10 forecast): 20 EUR Results: [Passed] This can be achieved by merely overriding the ToString() methods of event and command contract classes. The open source SimpleTesting sample provides more details. Detailed documentation of A+ES behaviours, defined in the form of specifications, always stays up-to-date and in sync with the code changes. Specifications as Design Tool If we push this concept of living documentation further down the road, specifications can be used to communicate with business experts about the use cases, using Ubiquitous Language and domain models. For example, you can express use cases in text as “Given-When-Then”, have a junior developer code them as unit tests, and then review the expected outcomes with domain experts before implementing the functionality. Additional practical usage scenarios for specifications include: You can print out all specifications as a really thorough list of use cases for sign-off by project stakeholders. Specifications can easily be visualized as diagrams and graphs. They can help in better understanding your domain, finding untested or complicated spots and driving development in general. For instance, such a diagram could look like: Hope this helps. I plan to cover this topic in greater detail in upcoming episodes of the BeingTheWorst podcast ."}
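To make the mechanics concrete, here is a minimal illustrative skeleton of such a given-when-then harness, simplified to direct method calls instead of command messages. It assumes an IAggregate contract and NUnit's CollectionAssert, and is not the actual Lokad.CQRS spec_syntax.

using System;
using System.Collections.Generic;
using NUnit.Framework;

public interface IAggregate
{
    void Apply(object @event);                   // replay a historical event
    IEnumerable<object> GetUncommittedChanges(); // events produced by the command
    void MarkChangesAsCommitted();
}

public abstract class Specification<TAggregate> where TAggregate : IAggregate, new()
{
    protected TAggregate Aggregate = new TAggregate();

    // GIVEN: rebuild aggregate state purely from past events.
    protected void Given(params object[] events)
    {
        foreach (var e in events) Aggregate.Apply(e);
    }

    // WHEN: execute the command under test, capturing only newly produced events.
    protected void When(Action<TAggregate> command)
    {
        Aggregate.MarkChangesAsCommitted();
        command(Aggregate);
    }

    // THEN: structural comparison against expected events keeps the test
    // independent of the aggregate's internal state representation. Assumes
    // event contracts implement value equality (or compare their ToString()
    // renderings, as the post suggests for documentation output).
    protected void Expect(params object[] expectedEvents)
    {
        CollectionAssert.AreEqual(expectedEvents, Aggregate.GetUncommittedChanges());
    }
}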
{"url":"/post/stop-overdoing-things/","title":"Stop overdoing things","text":"Stop overdoing things During the course of the last years, software development somehow took a wrong turn and got vastly overcomplicated. These days we just tend to over-design and overdo things . Consider projects that never get shipped because stakeholders keep on drawing UML diagrams for the requirements. Consider web sites that never surface the public web because of ever changing UI designs. Consider implementations so convoluted that development teams spend their entire lives playing \"hot potato\" with JIRA bugs that keep on showing their ugly heads up. Or consider software architectures so entangled that even ivory tower architects have their eyes bleed when they see implementations of them in the real world. Things don't need to get complicated simply because everybody else seems to do that. Break down complex into simple, release early, iterate quickly and learn from feedback. Here's an example of a web site that just works : A blog theme that just works : Here's an example of IDE that just works : There is no point in trying to be perfect from the very start. You can get a better chance of succeeding if you just get the smallest possible thing done, make it real and then iterate from there. Sometimes you might even find out that good enough software works good enough and does not need to be any more perfect."}
{"url":"/post/store-and-forward-for-event-streams/","title":"Store and forward for Event Streams","text":"Store and forward for Event Streams As we are getting deeper into event centric architectures at Lokad , more and more event stores show up in our systems. Most of them sit around in Azure Cloud store, while some are replicated to different cloud and across local machines. See BlockBlobTapeStream in Lokad.CQRS code for latest implementation details of event streams for Azure. This is different from previous version of BlobTapeStream but generates a format that matches the one of FileTapeStream. Essentially event stream file (or blob) is just an append-only binary structure that contains envelopes with some properties and binary payload. Envelopes include signature bytes, hashes and length specifiers by default. Simple stuff that fits in a few hundred lines of code per storage type. It also works. There are certain quirks but the job gets done. Problem shows up at different level - replication of streams stored in Azure Blob Storage. Don't get me wrong, simple Azure Blob scenario would work for 90% of our scenarios out-of-the-box, and it is extremely reliable (as reliable as Blob Storage). Yet, when you are pushing certain primary nodes to their limit, they become the bottleneck in the system because of the latency (caused by multiple REST requests to Azure cloud storage just to perform an append). Since event store is the essential building block of event centric systems, I would like to have something that could be relied on and dead-simple. Just store-and-forward service for event streams that can push through at least 1000 events per second for a single stream (I know, that's not too much to ask). Ideally, it would be platform agnostic (meaning that it works same way on local machine and remote server). I'm willing to relax reliability constraints (of an individual node) a bit in favor of performance (it's my job to make sure that overall system is cloud-proof and can withstand the test of \" Azure VM got currupted \" or \" sudden restart came in \"). When I'm saying \"building block\", it actually means: a set of polished classes that are copied from project to project (binary references only, if really needed); no flexible configuration syntax; no general purpose functionality; no complex threading logic. no server (only embeddable block). Essentially, a dead-simple stuff that is plugged (in the code!) into the topology and gets the job done. Just like in nature we have simple cells acting as building blocks for complex creatures (even politics). Currently I'm considering (as a mental exercise) following design of that store-and-forward block. This replay service is just a TPL-driven daemon (engine process in Lokad.CQRS) that stores and forwards binary buffers in completely decentralized manner. These will be plugged in the code to other elements, generating redundancy, replication and decent throughput (even for low performance of our messages). So any client of that service will just: come online; subscribe to real-time notifications (put them to in memory queue); request history from the last known version; start consuming notifications; By client I would mean \"projection host\" (the one that handles read models) or local audit tool, or interested and occasionally connected sub-system. A few additional constraint relaxations: I'm willing to accept occasional message duplication. 
I'm willing to accept the risk that some information might be completely lost within a node (if the surrounding environment goes corrupt). If really needed, I could probably tweak the same code to feature any degree of immediate redundancy (as in \"a write to the store is considered complete only after it successfully persists data to itself and X slave stores down the stream\"). I don't worry about the performance, since it will be more than adequate here even with my brute-force implementation (I've been saying these words far too often lately). The primary concern is the logical approach. Am I on the right track with it? Any thoughts or experience with similar architectures? by Jeremie : If you have several streams appending concurrently on the same disk, pre-allocate the files... or you'll achieve maximal fragmentation. In the end, it can really really hurt your perf. I don't say you'll need it, but keep it in mind, so that you don't wonder why your server is reading/writing so slowly when the file size grows. I usually split streams into 10 - 100MB files. Update : I've tried a quick prototype that pushes messages through in a really naive way. I'm getting a throughput of 1000 persisted messages per second (each message is 15 bytes) on my small VM running inside a MacBook Air (the disk is flushed after each write). Update 2 : this article continues in Thoughts on Event Streams in Elastic Environments (and in a few articles from now I'll have 20000 messages per second in durable mode :)"}
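For illustration, here is a minimal sketch of the client loop described above (subscribe first, then replay, then drain the buffer). The contracts are hypothetical, not the actual Lokad.CQRS types.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical contracts; not the actual Lokad.CQRS types.
public sealed class Recorded { public long Version; public byte[] Payload; }

public interface IEventStoreClient
{
    IDisposable Subscribe(Action<Recorded> onAppended);      // real-time notifications
    IEnumerable<Recorded> ReadAfter(long lastKnownVersion);  // replay history
}

public static class ProjectionHost
{
    public static void CatchUpAndConsume(IEventStoreClient client, long lastKnownVersion, Action<Recorded> project)
    {
        var buffer = new ConcurrentQueue<Recorded>();
        // 1. Come online: subscribe first, so nothing appended during replay is missed.
        using (client.Subscribe(buffer.Enqueue))
        {
            // 2. Request history from the last known version.
            foreach (var e in client.ReadAfter(lastKnownVersion))
            {
                project(e);
                lastKnownVersion = e.Version;
            }
            // 3. Consume notifications, skipping whatever the replay already covered.
            Recorded next;
            while (buffer.TryDequeue(out next))
            {
                if (next.Version > lastKnownVersion)
                {
                    project(next);
                    lastKnownVersion = next.Version;
                }
            }
            // ... in a real host this draining loop runs for as long as the process lives.
        }
    }
}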
{"url":"/post/stress-testing-the-stability/","title":"Stress testing the stability","text":"Stress testing the stability You think that your distributed system is stable and ready for the production, do not you? So did I, before trying out this simple \"How to break your distributed system\" recipe: Get fresh dataset for your database (it should have the size comparable to the production data, or even larger) Prepare simple command-line agents that emulate user activity (CRUD actions against different entities). Take 10-100 of these agents and let them boil in stress mode (1-5 sec. or no delay between actions) Fire up all distributed automation/processing services that you have in the picture (obviously, in the stress mode, as well) Optional: continuously stir connectivity to the Database and Application Virtual Machines Let everything cook for some time My first unhandled exception (it was a deadlock) bubbled up within 30 seconds after firing this whole thing up. And it is really to reproduce this one - you just have to restart everything and wait for a minute or so. The system would be called relatively stable if it can survive 24h in the stress mode (and validation proves that all the scheduled tasks have been properly completed)."}
{"url":"/post/structure-of-dddes-solution-in-visual-studio/","title":"Structure of DDD+ES Solution in Visual Studio","text":"Structure of DDD+ES Solution in Visual Studio Kerry Street has recently raised really interesting question: Speaking of vocabulary, how do you view and speak about the first \"D\" in DDD? \"Domain\" itself can be overloaded. Yes, indeed, I tend to use domain interchangeably. This can lead to potential confusion. The primary meaning of \"domain\" is just \"some problem space that has to be addressed via modeling and then expression in software\". When I start a project, it usually has a single \"Bounded Context\", which matches the problem space. Hence, it seems natural to call this BC \"Domain\" simply because they match in this specific situation. However, as solution grows, new BCs are discovered and added. In the end, this leads to rather ambiguous solution structure: Now, that I think about it, \"Domain.BC\" in this case would be better named \"Orchestration BC\" or at least \"Tenants BC\". Sorry for the confusion and thanks for brining this issue up. I'll need to correct samples and my own projects to clear this up. While we are at this topic, image above, displays Visual Studio solution structure for the second version of Lokad Salescast (big data inventory optimization platform). As you can see, there is nothing really peculiar there and it matches pretty closely structure in Lokad.CQRS Sample. The only non-obvious tidbit is that \"Worker\" project is both an executable console (using file system for local runs) and a WorkerRole implementation (used for Azure deployments). Like-wise, both web projects would feel natural running locally (using file-based event streams and persistence) and in Windows Azure."}
{"url":"/post/tech-layer-of-cqrs-systems-pushing-it-further/","title":"Tech Layer of CQRS Systems&#58; Pushing it Further","text":"Tech Layer of CQRS Systems&#58; Pushing it Further Let's see how we can extend previously mentioned model of simple event store to support partitioning of event stores, along with re-partitioning and fail-over scenarios. We bring in the Service-Oriented Reliable queuing ( Majordomo Pattern with heartbeat support) as the middle-man between clients (aggregate roots) and actual event stores. This router will know about available event store partitions (even if there is only one) and their fail-over replicas. Actual stores could be either straight append-only files that are preallocated in large chunks (with one writer and multiple readers). Alternatively we could use circular buffer (resembling LMAX approach ) for short-term memory, that is eventually replicated to something more durable. Note, that here we assume that we could host multiple readers on the same event stream: Publishers (to replicate events downstream). Replayers (to provide random access to persisted events, including full replays). Saga and projections hosts. Each of these readers has benefit of being able to process pending events in batches, simplifying \"we need to catch up on a lot of work\" scenarios. Once again, just like it is in LMAX. For the purity approach we can just keep publisher and replayer running on that even stream, while pushing the rest of the readers further down-stream. In order for some things to happen properly, event store must add it's own envelope (or a frame) to recorded messages, including sequence number , that is incrementing and unique within the store. Fortunately it is easy to do, since we have only one writing thread (the one, that supports more than 20000 messages per second on commodity hardware). Why would we need sequence number? Imagine a case, where a few projection hosts are subscribed to event stream in real time. Let's also assume that we have so many projection hosts over the network, that UDP-like broadcasts become beneficial for us. There is a slight problem though - we can't guarantee reliable delivery of messages out-of-box with UDP. This is where sequence numbers come in - they can be used to detect missing, duplicate or out-of-order messages. When this happens, we will ask for replay or just throw the exception to support. How do we handle fail-over of event stores? High-availability pair could be used here ( Binary star ). We will start another store replica in a different availability zone, connected via the dedicated network connection to the master. We'll configure store clients (our majordomo broker in this case) to fall back to the slave store if the master is not available (this can be detected via the heart-beats). Slave will take on as master, if it starts getting requests AND master is not available. Tricky part here is to avoid network configuration that would increase chance of \"split brain\". How do we handle failures of central broker? We don't need to. We can have multiple instances running on the network and have clients configured to pick the one available. How do we handle repartitioning of stores? That's an interesting one. Articles are written about repartitioning (i.e. when applied to search indexes). Brute-force approach is to create a replica on a new machine and let it catch up with the primary store. Then, shut down writing thread of the master and switch broker(s) to the slave. 
We'll configure stream readers to start working on the new partition as soon as they fully catch up on the former master. An interesting possibility is that a similar setup could also be used to create persistent queues for \"command\" messages. The same message store could also be used to implement Disconnected Reliability (the Titanic Pattern ) on top of the Service-Oriented reliable queuing provided by Majordomo . How much sense does this make? NB: Naturally, there are a few notes: The presented setup does not need to fit 100% of event-centric systems. It's just something that an architecture could be based upon. Specific sections and subsystems could be tuned from there (especially as we go into custom setups and various outliers). The setup provides enough decoupling and flexibility to work both in intranet scenarios (or even on a single machine) and in flexible cloud scenarios. This does not affect the architecture, just a few specific choices around topology and infrastructure. The ZMQ guide provides an extensive overview of the reliability and scalability patterns mentioned above. If something does not make sense, please try to read the guide first. I don't plan to bring these approaches into production any time soon. At the moment of writing, this is just a theoretical exercise that should serve as a backup plan, in case I ever need to increase the throughput of our message queues by a factor of 1000 or more per partition. Given the specifics of our work at Lokad , this possibility should be handled upfront - hence this research."}
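As an illustration of the sequence-number idea from the post, here is a minimal subscriber-side sketch. The contract is an assumption for illustration, not code from the post; a real subscriber would also buffer the current message until the replayed range has been processed.

using System;

// Sketch of a subscriber-side check driven by the store's sequence numbers.
public sealed class SequenceChecker
{
    long _lastSeen;
    public SequenceChecker(long lastSeen) { _lastSeen = lastSeen; }

    // Returns true if the message should be processed now.
    public bool Accept(long sequence, Action<long, long> requestReplay)
    {
        if (sequence <= _lastSeen)
            return false; // duplicate or out-of-order - discard
        if (sequence > _lastSeen + 1)
            requestReplay(_lastSeen + 1, sequence - 1); // gap detected - ask the replayer for the missing range
        _lastSeen = sequence;
        return true;
    }
}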
{"url":"/post/technology-demons/","title":"Technology Demons","text":"Technology Demons Manuel posted an interesting question to post on Design Observations on Big Data for Retail : Ok, but if you delete all of these technologies from your design, what technologies you'll use ? and how you substitute them? The answer is two-fold. First of all, below a quick list of technologies that I try to avoid at all costs in my projects lately. Only when there are strong external forces, I agree to resort to these demons: SQL Databases (instead: plain files and noDB) NoSQL Databases (instead: plain files and noDB) DTC and anything that requires it (instead: design eventually consistent systems) SOAP and XML (instead: binary formats, JSON and text) Windows Communication Foundation (instead: messaging, HttpListener or sockets via ZeroMQ) IoC Containers (instead: design systems to avoid all need in IoC Containers) WPF and desktop apps in general (instead: HTML5 + CSS + javascript) Windows Workflows Foundation (instead: proper domain-driven design) anything non-Git for distributed version control (instead: git) Aspect Oriented Programming with code weaving (instead: design the software properly) Mocking frameworks (instead: use simple strongly-typed code; Jeremie wrote post ) N-tier architectures (instead: shallow systems partitioned along boundaries of bounded contexts ) frameworks like log4net, AutoMapper, ELMAH etc (instead: write a few lines of code tailored for your situation). Second , I don't hold anything against these technologies (except for the cases where tech is being marketed as silver bullet, but that's what demons in all religions are expected to do anyway). I just happen to believe in value that is gained by designing my systems to be independent of these them. After all, technology should be relevant to the design only when the core problem absolutely necessitates going into this detail. For example, reducing transfer and storage costs via extreme compression of big data or enabling new business scenarios via elastic scalability in cloud). If however, we are doing something that is not particularly peculiar, then bringing technology to the table (context map) would just complicate everything. I consider to be non-peculiar cases to be, for example, when you have under 100 transactions per second in a single partition, under a few GB of total data for random reads and a few hundred GBs on top for BLOBs - essentially things that you can have deployed at the cost of a few hundred USD per month (including replication and load-balancing). I believe, vast majority of the business scenarios fit this description pretty well. Yes, this means that vast majority of businesses can easily run on a smartphone (or a cluster of smartphones, if you need continuous replication off-phone) So, in cases, where tech is not important, why should we couple our designs tightly to the most expensive and complex options among the available ones?"}
{"url":"/post/the-best-way-to-learn-cqrs-ddd-and-event-sourcing/","title":"The Best Way to Learn CQRS, DDD and Event Sourcing","text":"The Best Way to Learn CQRS, DDD and Event Sourcing The best way to learn something by heart is to try to reinvent it yourself. You'll either give up or will learn all the problems and trade-offs really well. That's what I'm planning to do, while sharing my thoughts and practical experience with you. To catch up with Command-Query Responsibility Segregation you can check out the Getting Started with CQRS (includes videos, sample project references and articles by brilliant people like Greg Young and Udi Dahan). Ultimately my own articles, thoughts and notes from the production will go into the xLim 4: CQRS in Cloud series. Ok, let's get started. So far the learn by doing approach had been working extremely well for me. At the moment I'm learning how to make CQRS solutions leverage their potential by going the path of the Domain-Driven Design and Event Sourcing and while keeping in mind constraints of the infinitely scalable systems by Pat Helland . I'm also trying CQRS outsourcing potential (or rather potential for efficiently distributing development effort in parallel) and attempt to figure out methodology of continuously integrating DDD/ES solutions with the legacy CRUD systems. Latter is really important since, that's what I've been doing a lot recently in Salescast . All this goodness just requires getting at least one more developer, Mercurial repository and spending a few evenings building your spike project and evolving it. This teaches a lot. Yet sometimes development just happens to deviate away from the common design patterns (for \"common\" - see Reference Implementations of Greg and Mark in the CQRS section ). In a strange way. That's how my command handlers look like: public sealed class ProjectHandler : IHandleCommands { readonly EventStorePartition _store; public ProjectHandler(EventStorePartition store) { _store = store; } public IDisposable Subscribe(IQbservable<DomainCommand> observable) { return observable .WherePartitionIs(_store.Partition) .Subscribe(Handle); } void Handle(DomainCommand d) { _store.Update<ProjectAggregateRoot>( d.AggregateId, r => DomainInvoker.RouteToDo(r, d.Command), d.Command is CreateSolution); } } Or if we need to ensure for this AR that commands should not be based on stale data, then Handle transforms to: void HandleWithConcurrencyCheck(DomainCommand d) { _store.Update<ProjectAggregateRoot>( d.AggregateId, r => { if (d.Version != r.Version) throw new InvalidOperationException( \"Uhm. Root was changed since client last saw it\"); DomainInvoker.RouteToDo(r, d.Command); }, d.Command is CreateSolution); } Event handlers suffer in the same way. Yet they potentially benefit from IQbservable even more, since we could theoretically filter interesting events at the server , by using .NET Observable Query Provider capable of instructing AMQP server to send us (this specific event handler running within this partition) only specific events for this partition. Another interesting thing is how event store looks like, when developed in .NET 4.0 with the observable goodness: // read events from file within partition // ... skipped... 
var subject = new Subject<Change>(); var aggregateRoot = (TAggregateRoot) factory(subject); // apply events to domain foreach (var change in history) { aggregateRoot.Apply(change.Event); } // subscribe to any changes produced by our actions var newChanges = new List<Change>(); using (subject.Subscribe(newChanges.Add)) { update(aggregateRoot); } // nothing changed if (newChanges.Count == 0) return; try { // turn changes into domain events capable of crossing partition boundaries var events = newChanges .Select(c => new DomainEvent(id, c.Event, c.Version, DateTimeOffset.Now)) .ToArray(); // naive approach to persisting event history for now. Easy to improve var allEvents = history.Concat(events).ToArray(); storageItem.Write(stream => _formatter.Serialize(stream, allEvents), condition); } catch (StorageConditionFailedException ex) { // normally this should never happen, since commands are processed by a single // processor per partition, but just in case... var msg = string.Format(\"Record was modified: '{0}'; Id: '{1}'\", type, id); throw new OptimisticConcurrencyException(msg, ex); } Yet one more interesting thing is how easy it turns out to be to turn a CQRS solution with a message server (the Erlang-powered RabbitMQ, for example) into a desktop application with an in-memory event bus. You just need to swap the messaging libraries with ConcurrentQueue from .NET 4.0, swap partitionable event and view storage for file-based ones and ask the .NET 4.0 TPL to keep pumping events and commands while the application is running: var commands = new ConcurrentQueue<DomainCommand>(); var events = new ConcurrentQueue<DomainEvent>(); var data = new FileStorageContainer(\"data\").Create(); var views = new FileStorageContainer(\"views\").Create(); var cts = new CancellationTokenSource(); var sender = new DelegateSender(commands.Enqueue); var viewStore = new ViewStore(views); using (var form = new Form1(sender, viewStore)) { var tasks = new List<Task> { Task.Factory.StartNew(() => ProcessCommands(commands, events, data, cts.Token, form.Replay)), Task.Factory.StartNew(() => ProcessEvents(events, views, cts.Token)) }; Application.Run(form); cts.Cancel(); Task.WaitAll(tasks.ToArray()); } Essentially this proved to me that a CQRS architecture (or some portion of it) can be bundled into a single Windows Forms application (or whatever desktop UI you are using), scaling everything down to a single process while keeping the event bus, event and command handlers inside. What's more important, all these processes still run in different threads (effectively leveraging multi-core capabilities to pre-render views), yet they are simple to understand and isolate (I didn't have a single UI threading issue, and those tend to show up frequently whenever I start developing any async UI). Theoretically, CQRS for the desktop is the same MVC pattern with additional explicit constraints on organizing controllers and distributing their logic in an async manner. PS: for the latest articles in the CQRS/DDD series see xLim 4: CQRS in Cloud ."}
{"url":"/post/the-power-of-simplicity/","title":"The power of simplicity","text":"The power of simplicity Answer to a comment by Jeremy Gray has started getting really long, so I'm putting the primary idea in a separate post. As we all know, NUnit is one of the first unit testing frameworks for .NET. It is old, stable and quite conservative. At the same time, it's feature set is inferior to the functionality provided by some newer frameworks (or even frameworks for handling unit testing frameworks, like Gallio). However, I'm still trying to stick to NUnit in my projects. Reasons for that are: It does not make reading tests challenging for new devs (unit tests are normally point of entry into the code, if they are present) Unit tests are often considered to be usage samples for the code they test. We'd prefer to keep these simple, would not we? The most important reason for me: simple unit testing framework forces developer to write unit tests that are easily testable (just like TDD forces some good logical separation between the components) Imagine some really complex method that accepts quite a number of arguments and performs some black-box magic inside. If you have a data-driven unit testing framework at hand, you can simply create CSV file with valid entries and test this method against that. But if you do not have this functionality at hand, then you'd need to refactor the method to make it more simple and testable. I like the last scenario. Even if it is not fun at the very start, it does pay off in the long term."}
{"url":"/post/top-10-reasons-to-do-cqrs-in-a-pdf/","title":"Top 10 Reasons to do CQRS (in a PDF)","text":"Top 10 Reasons to do CQRS (in a PDF) Here's PDF that maps my top 10 reasons to use Command-Query Responsibility Segregation in your development. These reasons are about benefits of CQRS and things that it enables to do : Domain-Driven Design, Event Sourcing, Cloud Computing etc. This map is made in a form of research/skill tree (just like the one in Master of Orion 2 or Diablo 2 ) and shows my take on the dependencies between these different architectural elements along with the benefits they provide. In essence these are potential paths of evolution that your system might go through as it matures and faces new scalability, complexity or business challenges. Download PDF | Download Hi-Res PNG | Permalink This CQRS roadmap was inspired by the question of Samuel Jack on CQRS applicability in small systems that don't require massive scalability. There were a few other similar questions as well. Apparently, by pushing CQRS to the theory of almost-infinitely scalable systems , I've made an impression that scalability is all that is out there. I think, large scalability is not the only reason to try CQRS architectures (and any of the other features down the \"research tree\"). However if you discover that you need to reduce complexity, bring up the scalability or add smarter business intelligence - these paths will still be open for you in a rather straightforward way. In fact, another inspiration for the outline was the current process of jump-starting yet another Lokad project on top of Lokad.CQRS for Windows Azure. This project is bound to be simple, robust and flexible enough to handle new business requirements as they come - a perfect fit for CQRS . You are welcome to download this CQRS \"research tree\", share it, print out as a reference (it should scale to 2 A4 sheets by default), use to persuade your boss or colleagues about some long-term refactoring investment or do pretty much what you like. Do you like it? PS: this post is a proud member of xLim 4: CQRS in Cloud series, but it's applicability is not limited by the cloud."}
{"url":"/post/troubleshooting-azure-deployments/","title":"Troubleshooting Azure Deployments","text":"Troubleshooting Azure Deployments Let's compile a list of common Windows Azure deployment problems . I will include my personal favorites in addition to the troubleshooting tips from MSDN (with some additional explanations). Missing Runtime Dependencies Windows Azure Guest OS is just a Virtual Machine running within Hyper-V. It has a set of preinstalled components required for running common .NET applications. If you need something more (i.e.: assemblies), make sure to include these extra dlls and resources ! Set Copy Local to True for any non-common assemblies in \"References\". This will force them to be deployed. If assembly is referenced indirectly and does not load - add it to the Worker/Web role and set Copy Local to True. Web.config can reference assemblies outside of the project references list. CSPack will not be aware of them. These need to be included as well. If you use some assemblies linking to the native code, make sure that native code is x64 bit and is included into the deployment as well. For example this was needed for running Oracle Native Client or SQlite on Azure Worker Role. It's 64 Bit Again, Windows Azure Guest OS is 64bit . Make sure that everything you deploy will run there. You can reference 32 bit assemblies in your code, but they will not run on the cloud. You might encounter case that Visual Studio IntelliSense starts arguing badly while you edit ASP.NET files referencing these 64bit-only assemblies. This is understandable since devenv is still 32 bit process. Well, I just live with that. Web Role Never Starts If your web role never starts and does not even have a chance to attach IntelliTrace, then you could have a problem in your web.config. Everything would still work perfectly locally. This could be caused by config sections that are known on your machines, but are not registered within Windows Azure Guest OS VM. In our case this was coming from uri section required by DotNetOpenAuth: <uri> <idn enabled=\"All\"/> <iriParsing enabled=\"true\"/> </uri> This fixed the problem: <configuration> <configSections> <section name=\"uri\" type=\"System.Configuration.UriSection, System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\"/> Windows Azure Limits Transaction Duration If your transactions require more than 10 minutes to finish, then they will fail no matter what settings you have in the code. 10min threshold is located in machine.config and can't be overridden from the code. More details This is a protective measure (protecting developers from deadlocking databases) coming from the mindset of a tightly coupled systems. I wish Microsoft folks were more aware of the architecture design principles that are frequently associated with CQRS these days. In that world deadlocks, scalability, complexity and tight coupling are not an issue. Temporary Files can'be larger than 100MB If your code relies on temporary files that can be larger than 100Mb, then it would fail with \"Disk full\" sort of exception. You will need to use Local Resources. If you launch a library or process that rely on temporary files, then they could fail, too. This did hit me, when SQLite was failing to compact 2GB database file located within 512GB empty disk. As it turns out, the process used TEMP environment variable and needed ability to write to a large file. More details are in another blog post. 
Recycling Forever The cloud fabric assumes that OnStart, OnStop and Run from \"RoleEntryPoint\" will never throw exceptions under normal conditions. If they do throw, the exceptions are not handled and will force the role to recycle. If your application always throws an exception on start-up (i.e.: wrong configuration or a missing resource), then it will be recycling forever. Additionally, the Run method of a role is supposed to run forever (when it returns, the role recycles). If your code overrides this method, it should sleep indefinitely (a minimal skeleton follows after this post). BTW, if you consider putting Thread.Sleep there, then I strongly encourage you to check out the Task Parallel Library (aka TPL) in .NET 4.0 instead. Coupled with PLinq and the new concurrent and synchronization primitives, it nearly obsoletes any thread operations, in my personal opinion. The Lokad Analytics R&D team might not agree, but they have really specific reasons for reinventing PLinq and TPL on their own. Role Requires Admin Privileges I personally never hit this one. However, just keep in mind that the compute emulator (Dev Fabric) runs with Admin privileges locally. The cloud deployment does not have them. If your code requires Admin rights to do something, it might fail while starting up or executing. Incorrect Diagnostics Connection String If your application uses Windows Azure Diagnostics, then for deployment make sure to update the setting to HTTPS credentials pointing to a valid storage account. It is usually a separate setting named \"DiagnosticsConnectionString\". It's easy to forget that, when you usually work with \"MyStorageConnectionString\" or something like this. SSL, HTTPS and Web Roles In order to run a site under HTTPS, you must pass the SSL certificate to Windows Azure, while making sure that the private key is exported and the PFX file format is used. By the way, if you applied to the \"Powered by Windows Azure\" logo program, then make sure not to display the logo on the HTTPS version of your site. That's because the script is not HTTPS-aware and will retrieve resources using a non-SSL channel. This will cause browsers to display warnings like the one below, which will be scary for the visitors. NB: as I recall, site owners are not allowed to modify this script and fix the issue themselves. So we would probably need to wait a few more months of constant email pinging till the 10-line HTML tracking snippet is updated to use HTTPS when located within HTTPS, just like GA does. I know it's a tough task. What Did I Miss? That's it, for now. Some of these less common issues cost us quite a bit of time to figure out and debug. Hopefully this post will save you some time in .NET cloud computing with Windows Azure. I'll try to keep this post updated. Did I miss some common deployment problems that you've encountered and were able to work around?"}
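A minimal worker-role skeleton for the "Recycling Forever" item above. The post itself recommends TPL over Thread.Sleep; this sketch shows only the simplest safe shape.

using System;
using System.Diagnostics;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    // Run must block forever: the role recycles as soon as it returns.
    public override void Run()
    {
        while (true)
        {
            try
            {
                // do one unit of work here
            }
            catch (Exception ex)
            {
                // an exception escaping Run would recycle the role - log it instead
                Trace.TraceError(ex.ToString());
            }
            Thread.Sleep(TimeSpan.FromSeconds(10));
        }
    }
}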
{"url":"/post/types-of-messages-and-events-in-reactive-cqrs-world/","title":"Types of Messages and Events in Reactive CQRS World","text":"Types of Messages and Events in Reactive CQRS World In the previous article I've started pondering about CQRS from the perspective of Reactive Extensions and almost-infinitely scalable systems. Let's continue by formalizing types of messages and events available in this environment. These building blocks are affected by the Reactive world, Domain Driven Design with Event Sourcing and Partitioning for the scalability. NB: dear reader, if you are reading this article far in the future (i.e.: after September, 2010), keep in mind that views expressed here might already be outdated. In this case CQRS section will have something more late and solid to point you towards. Command Message (D for Do) has following features: It is a message that instructs specific entity (identified by it's identity) to perform certain actions. Messaging infrastructure is aware of this command's identity and uses it to route command to the specific partition. Command message is an entity itself - it has an identity. This allows the recipient to stay idempotent and filter out seemingly similar commands that were duplicated by the messaging infrastructure. This message contains contains one or more entity commands. All entity commands in the same message are directed to the entity. They are expected to fail or succeed together (essentially this is just a construct to avoid generating composite commands). Entity commands are named like DoSomething, they instruct the target entity to do something that might result in different outcomes or fail. In the DDD world command message determines unit of work, while specific entity commands map to the methods called upon the target aggregate root. Essentially command signature could be represented as: public interface IEntityCommand {} public sealed class CommandMessage { public readonly Guid EntityId; public readonly IDomainCommand[] Commands; public readonly long Version; // constructor } where IEntityCommand could be: public sealed class RenameProject : IEntityCommand { public readonly string Name; // constructor } Time tick is merely an event produced by the scheduling infrastructure within the partition. It does not have an identity, but is used to trigger processes, track timeouts and fuel temporal queries (i.e.: buffer, throttle, sample etc). Event is described by: Event is something that happened in the past. It is the history that can't be changed. This intent is expressed in its naming: SomethingChanged. Event is associated with the sender's identity and it's version. These two properties together uniquely identify each event and allow subscriber to handle duplicates possibly generated by the scalable infrastructure. Event does not have a recipient. It is merely published to the messaging infrastructure. The latter will be responsible for routing the event to it's subscribers (there could be zero or more). It will do so using event's identity and type (first allows routing event to a specific partition, second filters out only the events that the subscriber is interested in). Failure is the generic unexpected exception that is the result of an attempt to process command. It indicates that something could not happen in the past and is equivalent of Reactive OnNext(Exception) translated to the domain environment. Just like in the case of observations, failure is the last item in the sequence (it terminates the reality). 
A failure includes information about the exception and the command(s) that caused it. Since a failure includes command information, it logically includes the entity identifier and the unique message identifier. One command could logically fail one or more times (i.e. when the message bus tries to handle a transient error). This means that in order to uniquely identify a failure (and ensure idempotent operation, where it is needed), each failure should have its own identity."}
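Mirroring the command contract shown earlier, an event contract matching this description could be sketched as follows. The names are illustrative, not from the original post.

using System;

// Marker for events raised by entities (mirrors IEntityCommand above).
public interface IEntityEvent {}

public sealed class ProjectRenamed : IEntityEvent
{
    public readonly string Name;
    public ProjectRenamed(string name) { Name = name; }
}

public sealed class EventMessage
{
    public readonly Guid EntityId;   // sender's identity, used by the infrastructure for routing
    public readonly long Version;    // identity + version uniquely identify the event for duplicate detection
    public readonly IEntityEvent[] Events;

    public EventMessage(Guid entityId, long version, IEntityEvent[] events)
    {
        EntityId = entityId;
        Version = version;
        Events = events;
    }
}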
{"url":"/post/update-on-cqrs-tour-around-xussr-ukraine/","title":"Update on CQRS Tour Around xUSSR - Ukraine","text":"Update on CQRS Tour Around xUSSR - Ukraine Апдейт про поездку с Грегом по xUSSR. I apologize to all my readers that don't speak Russian. I promise to keep non-English posts to minimum :) С CQRS поездкой по России на эту осень пока не складывается по времени (отдельное спасибо особенностям визового режима для иностранцев). Поэтому сейчас планируем Украину на вторую половину октября (2 недели после 17 октября). Самая главная часть - презентации и курсы в Киеве, которые хотим сделать максимально полными (несколько дней). Плюс решили сделать их полностью бесплатными и открытыми (как в слове CQRSBeers). Любая помощь в организации мероприятий (включая спонсорскую), естественно, приветствуется. К примеру о стоимостях. 3-х дневный курс Грега в Европе стоит 1700 фунтов, а 5-дневный курс Уди Дахана - 3000 фунтов. Так что для россиян, которые заинтересованы в тематике, могу порекомендовать приехать на это время в Киев. Оно будет стоить того. Потихоньку формулируется более четкая программа о чем бы хотелось поговорить: у Грега - Event-centric системы, CQRS/DDD (от проектирования и организации команд разработчиков и вплоть до тестирования и интеграции; как серверные системы, так и occasionally connected clients). у Рината - Practical CQRS in Cloud (начиная с опыта и специфики использования CQRS на полную катушку с облаком и вплоть до организации эффективного процесса разработки сабжа и выживания стартапов в наших условиях). Что скажете?"}
{"url":"/post/updates-on-lokadcqrs-iddd-branch/","title":"Updates on Lokad.CQRS - IDDD Branch","text":"Updates on Lokad.CQRS - IDDD Branch I've just finished pushing some code to Lokad.CQRS (iddd) branch . Some of the changes are listed below. Last stable version of embedded event store . As you know, we've been working on one for our purposes. Lokad IDDD sample included beta version, which was polished and cleaned a bit since then by our production experience. That's what went to Lokad.CQRS This event store (and underlying infrastructure) are based on bashio bitcask model in a simplified way (full in-memory caching, immediate persistence commits, SHA1 checks and crash-tolerance). New event store is more maintenance-friendly for cases when you have hundreds of thousands of event streams (they are no longer stored in individual files, but are rather interleaved within the same transaction log). This store has two implementations out-of-the box: file system and azure blob storage (the latter uses efficient blob range posts to append to log files). IDDD sample also has SQL-based implementations. This event store is multi-thread friendly (and tested to be so). Simplified Lokad.CQRS core , which is used in new projects. I dropped a lot of things to simplify infrastructure and support future scenarios. Among these changes, message envelopes got really simplified and got proper SHA1 checks for their content. Startup projection rebuilder got a little bit smarter. This core is also a better fit for rapid DDD prototyping of persistence-ignorant systems. Improved performance (a little bit), getting rid of two-phased-commit (aggregate events are appended to the store and get published in batches by a separate process, which no longer needs to access router). This isn't the end of the work on this branch)since I have to get rid of complexity in \"SimpleTesting\" (reusing simpler one-file approach from IDDD sample) and drop a lot of complexity from SaaS sample (turning it rather into the project template). However, I wanted to bring these changes to the attention of people that were asking for latest changes in Lokad.CQRS, compatible with the concepts from IDDD book. So, here you go ."}
{"url":"/post/use-cqrs-when-constrained-on-developers-and-time/","title":"Use CQRS When Constrained on Developers and Time","text":"Use CQRS When Constrained on Developers and Time Udi Dahan wrote a funny article recently, talking about When to avoid CQRS . So, when should you avoid CQRS? The answer is most of the time.. I deeply respect Udi's immense experience. Yet, based on my limited experience, potential surface for applying CQRS Architectural patterns and practices is much bigger, than outlined is his article. In essence, synergies within CQRS work, whenever you need to: tackle complexity; distribute teams; scale out under massive loads. All this could happen, when you have roughly 1.2 people (that's less than one-and-a-half-developer) per project in a fast-paced development environment with limited time and resources. Of course, this kind of \"scaling\" is a somewhat extreme case (although, I believe, we should be able to do even better at Lokad ). Non-startup organizations don't need to fit these constraints. However wild my guess is: if CQRS approaches worked greatly for us in such situation (where N-tier, ORM, relational, DTC stuff and all their friends failed badly for scalability and complexity reasons), then they would work in less extreme situations. Another important thing is that the CQRS architecture and development approaches seem to work consistently under a diverse variety of conditions, such us: Simplifying and speeding up complex entangled legacy systems. Developing complex integration platforms reaching out dozens of unreliable systems around the globe. Rapidly implementing simple tactical solutions with teams distributed globally. In these very conditions classical approaches failed for us. The latter could be attributed to the fact that we (I mean \"I\" here) didn't have knowledge and experience required to master SOA, N-Tier, ESB and all the other things. This was complex stuff, potentially requiring years of learning, expensive courses and large teams. Yet, for some strange reason, the mental model of CQRS provided much friendlier and faster implementation route here (despite the fact that there is not a single book published on the subject, yet). Diverse solutions being delivered to production, share similar architecture, development principles, reduced complexity levels and teams. They just work. No dedicated training courses or expensive consultants are needed for us to handle various scalability challenges, because all of them already have clear solution paths. The fact that we develop for the elastic cloud environment (which is less stable and predictable than on-premises systems), does not help to save the situation and make it less boring. Having said all that, if you are new to the CQRS, you have two options to take: Sign up for Udi's course on distributed systems for mere 2500 EUR. Check out CQRS Starting point and referenced learning materials and articles. For those who are familiar with CQRS and my work, here are some good news. I've \"accidentally\" decoupled core functionality of Open-Source Lokad.CQRS project from Windows Azure. After the v2 release, it should theoretically be possible to run it under Linux for embedded and cloud solutions (version .NET 3.5, if using stand-alone TPL library)"}
{"url":"/post/use-software-design-to-aid-your-development-process/","title":"Use Software Design to Aid Your Development Process","text":"Use Software Design to Aid Your Development Process Each project is unique and requires different development methodology (a repeatable process for delivering software aka \"whom to blame if deployment breaks and how to avoid it\"). At Lokad we adopted a number of different methodologies for different types of projects. In each specific case the choice was an attempt to balance between: complexity of adding new features; cost of failure and number of people that can be allocated for testing; number of developers working together on the project; required speed of development; It was interesting to see different methodologies (basically just a collection of rules) emerge and evolve, as our understanding of the process grew and project specifics changed. On one extreme we have relatively rigorous process , where almost all changes are tested as they go through the scheduled regular releases to Windows Azure. Development teams have multiple developers (even though they can be moving between projects on an hourly basis) and some testing people. This happens for projects where a missed bug could seriously hurt the business. On the other extreme we have some projects with almost no testing and continuous deployment to Windows Azure , where each commit to the repository is immediately pushed to the production (web sites are updated and backend services are redeployed). This can be applied to supporting projects, where a bug would be a mere inconvenience. There are some projects at Lokad which are located somewhere between these extremes (e.g.: somewhat important). During their lifetime, projects can change their location on this scale as well (e.g.: prototype turning into a commercial product or commercial product being discontinued). In each case, software design and methodology serve as a powerful enabling factor , which reduces risks and costs at multiple levels. Here are some tech-specific examples : append-only storage (as in event sourcing) significantly reduces risk of loosing data in case of faulty deployment. There is still some risk of getting corrupt data in, but this problem is usually limited by the short time window. append-only storage does not mean that data grows indefinitely. It just means that during the normal operation data can only be appended and never erased. by applying basic techniques from object-oriented software programming, messaging and domain-driven design, one can build rather decoupled software that is relatively simple to reason about. Low coupling will lead: reduced risk of cascading failures; lower cost of development - developers don't step on each others toes while changing code or merging changes; less expensive teams - if certain portions of the code are rather simple and decoupled, then you don't need a brilliant and expensive development team to handle them; usage of domain-driven design and event sourcing can simplify data persistence, especially managing it between software upgrades. In the very extreme, software would have either event streams (transaction logs) schema of which rarely changes or persisted views (cached read models), which can be cleared upon the deployment. This can simplify deployment process to the point of reliable continuous delivery. No fear of irreversibly breaking your database with an SQL upgrade script. 
changes in tech requirements that would otherwise be relatively expensive are now taken for granted (e.g. moving software from on-premises to the cloud or back, scaling out client UI or backend processing). This allows software to evolve through such turning points without noticeable spikes in complexity and development effort (which would otherwise require a change of development methodology). Having said that, good software design is merely one of the necessary factors required for successful delivery. Good team collaboration, strategic analysis and an adequately disciplined development methodology are also good to have."}
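To make the persistence point concrete, here is a minimal, illustrative C# sketch of the split described above. The interfaces are hypothetical (not taken from Lokad.CQRS or any specific framework): the event log is append-only and survives deployments untouched, while a cached read model can simply be cleared and rebuilt by replay after an upgrade.

```csharp
using System.Collections.Generic;

public interface IEvent {}

public interface IEventStore
{
    // During normal operation data is only ever appended, never erased.
    void Append(string stream, IEvent e);
    // Full replay source for rebuilding views.
    IEnumerable<IEvent> ReadAll();
}

public interface IReadModel
{
    void Clear();          // safe to call upon deployment
    void Apply(IEvent e);
}

public static class ProjectionRebuild
{
    // After a deployment changes the view schema: wipe the cached view and
    // replay the unchanged event log. No SQL upgrade script to fear.
    public static void Run(IEventStore store, IReadModel view)
    {
        view.Clear();
        foreach (var e in store.ReadAll())
            view.Apply(e);
    }
}
```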
{"url":"/post/using-redis-for-event-sourcing-and-much-more/","title":"Using Redis for Event Sourcing and much more","text":"Using Redis for Event Sourcing and much more Over the last week I've been thinking about high-scale production setups for event-centric architectures. Something that can handle retail networks in realtime while providing cost-effective solution to deal with business amnesia . Obviously there is Greg's event store (to be released tomorrow), however having multiple deployment options is even better. Here's a quick overview of implementing event store with Redis . Redis is an Erlang C key-value store with configurable reliability guarantees, master-slave replication and a diverse set of server-side storage primitives. ServiceStack developers use Redis extensively for caching. They have even developed ServiceStack.Redis for C# Using immediate persistence (fsync after each change) and eventual replication you can easily get thousands of commits per second on a simple machine. This is way less than specialized event store implementations, but could be good enough for a low-cost production deployment. Besides, you can speed things up by doing fsync after each second. See more benchmarks or check out series of articles on ES with Redis and scala . Event Storage Primitives We can use following primitives for event storage persistence: Hash - provides fast O(1) get/set retrieval operations for individual events List - can store associations of events to the individual streams (fast to add) Store individual events in hash structure (allows O(1)) operations: > HSET EventStore e1 Event1 Where: EventStore - name of the hash to use for storing events (might as well be one store per riak DB) e1 - sequentially incrementing commit id Event1 - event data You can get number of events in the store by > HLEN EventStore (integer) 8 In order to enumerate all events in a store, you simply ask Redis to return all hashes given their IDs, for example: > HMGET EventStore e1 e2 e3 e4 1) \"Event1\" 2) \"Event2\" 3) \"Event3\" 4) \"Event4\" Individual event streams are just lists which contain references to individual commit IDs. You can add event(s) to a stream by RPUSH . For instance, here we add events e2, e4, e7 to list customer-42 > RPUSH customer-42 e2 e4 e7 Version of an individual event stream is a length of corresponding list: > LLEN customer-42 (integer) 3 In order to get list of commits that are associated with a given list: > LRANGE customer-42 0 3 1) \"e2\" 2) \"e4\" 3) \"e7\" In order to achieve fast performance and transactional guarantees, we can run each commit operation as server-side LUA script, which will: Provide concurrent conflict detection Push event data to hash Associate event with a stream Publishing and replays Redis provides basic primitive for PUB/SUB. This means, that we can push event notification to zero or more subscribers immediately (in the same tx) or eventually: > PUBLISH EventStore e1 e2 This means that in order for the projection host (or any event listener) to have the latest events we: Get current version of event store: HLEN Enumerate all events from 0 to length by HMGET Subscribe to new events, if there were new events since we started replaying (or read the new batch otherwise): SUBSCRIBE Additional side effects First, since Redis is a key-value store, we can also persist within the same setup: Aggregate snapshots Projected views Second, capability for message queues can be handy for load-balancing work commands between multiple servers. 
Third, the server-side capability for associating events with event streams (an individual event stream is just a collection of pointers to event IDs) can be handy for event-sourced business processes."}
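For illustration, here is a hedged C# sketch of the append flow described above, using the ServiceStack.Redis client mentioned earlier (method names such as GetListCount, GetHashCount, SetEntryInHash and AddItemToList come from its IRedisClient interface). The check-then-write below is deliberately simplified and not atomic; wrapping it in a server-side Lua script, as suggested above, is what would make it transactional.

```csharp
using System;
using ServiceStack.Redis;

public static class RedisEventStore
{
    // Appends one event to the global hash and links it to a stream.
    // NOTE: this check-then-write is not atomic; run it as a server-side
    // Lua script (EVAL) for real concurrency conflict detection.
    public static void Append(IRedisClient redis, string stream,
        long expectedVersion, string eventData)
    {
        // The stream version is the length of its list (LLEN).
        var actual = redis.GetListCount(stream);
        if (actual != expectedVersion)
            throw new InvalidOperationException(
                "Concurrency conflict: expected version " + expectedVersion +
                ", actual " + actual);

        // Next sequential commit id: "e" + (HLEN + 1).
        var commitId = "e" + (redis.GetHashCount("EventStore") + 1);
        redis.SetEntryInHash("EventStore", commitId, eventData); // HSET

        // Associate the commit with the stream (RPUSH).
        redis.AddItemToList(stream, commitId);
    }
}
```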
{"url":"/post/utility-classes-are-lesser-evils/","title":"Utility Classes are Lesser Evils","text":"Utility Classes are Lesser Evils Generally, code reusability (and Don't Repeat Yourself principle) are a big evil, if used without constraint. They can couple together really distant concepts, while increasing overall complexity of the code. Did you ever have an utility class, which was introduced for a single purpose, but ended doing up everything at once? That's why at Lokad we try to avoid utility classes. Especially if they have only a few uses spread across different parts of the project. Such code can be inlined instead. However, sometimes, utility method is relatively complex and big. Plus, it can be used in too many places inside a single project. In such case we agree to introduce utility class (which is internal for this project) and put it into well-defined location (folder and/or namespace) called \"LesserEvils\", while adding \"Evil\" suffix to the class names. Name somehow allows us to stay more conscious about abuse of common utility classes and potential coupling they can introduce into a project. Lesser of evils is still an evil."}
{"url":"/post/vienna-cqrs-notes/","title":"Vienna CQRS Notes","text":"Vienna CQRS Notes Vienna Pro.net conference was a lot of fun. I want to share my deepest respects for the organizers and community around for making this such an interesting event. I enjoyed both talking and having some good CQRS Beers afterwards. I was making 2x45min talk on Practical CQRS after 2x45min talks of Greg Young on CQRS/DDD. Coupled with some really nice latenight CQRS Beer discussions, this created really exciting and fruitful atmosphere at the event. Here are just some notes and lessons learned. During the Copenhagen conference we came out with a nice Decide-Report-Act (manufacturing plant) analogy for drawing parallels between real world processes and CQRS designes. It was accepted quite well for the explanation in Vienna. More than that, we figured out the extension of this logical model to explain sagas (even stateless ones). Nothing really new (just a reference to document-based sagas in Greg's talks), but I didn't realize this before. During slightly more geeky discussion there was a parallel drawn between biological neural networks and complex distributed systems (with sagas as well). It gives slighly different viewpoint on the problem. Creating presentations works nicely when you use KeyNote for styling and Visio for the diagrams. Here's what I made for the talk . I remain really impressed by my Mac Book Air. It has to go through one more test before being considered \"production worthy\" - full disk encryption. Specifications (behavioral tests for aggregates with event sourcing ) turn out to be a really nice linguistic tool for capturing behaviors in code in a non-fragile way. I can almost see how CRISP specifications can make domain development almost as boring as writing view projections. Sagas still remain more complicated topic (some joint calculus is needed). Political aspect of CQRS was quite interesting to people as well (politics here is just about leveraging cost-efficiency and flexibility of CQRS/Event-centric systems in organization in order to have more options for delivering project successfully or achieving other goals of the company). I wish I had time to actually show code and really down-to-earth aspects of running CQRS systems. Further improvements - plan talks better, don't loose eye contact with audience by checking up on the presentation plan, more slides and may be even photos from Bali :) Apparently this phrase of mine caused some funny looks from .NET developers: \" This project was designed to run on Windows Azure, was developed locally without Azure SDK and deployed to staging in Rackspace Cloud. \" If a get a spare bit of time, I'll try to push some work in Lokad.CQRS just to be able to continue the phrase some time later: \" oh, and the last deployment went to Linux \". Once again, Vienna was awesome (even though I didn't get to see the city in the daylight :) Thanks to everybody who helped to make it this way."}
{"url":"/post/was-your-software-career-like-this/","title":"Was Your Software Career Like This?","text":"Was Your Software Career Like This? A week ago I've published CQRS \"Research tree\" post. It had a PDF that was mapping together benefits of the methodologies that compose together CQRS-based development and Cloud computing (in my personal perspective) along with the directions that you can drive your software project through, once you get inside this territory. After getting feedback from various sources (thank you, folks!), I've decided to add a few more steps to this road-map. These are the things one might encounter before stalking into the CQRS territory. NB: Just a side note. I understand that CQRS is a buzz-word. Yet as a buzz-word it conveniently packs multiple time-proven concepts under a single mental model , that is easy to work with (and share the knowledge further). As such, it works for me. Below you will find image of how the draft looks like right now. It all starts with Personal Motivation , Unit Testing and Source Control . Download Hi-Res PNG The Roadmap is based on my personal experience coupled with the insights I've been getting from the development communities. This overview might be different from yours, since I didn't have any formal CS education (administrative economics instead). NB: Later on I'm planning to add detailed descriptions with the \"enables\" section, just like the first roadmap had. Does it resemble your software development career? Are there any important steps that I've missed?"}
{"url":"/post/what-is-your-project-like-titanic-or-a-street-dog/","title":"What is your project like - Titanic or a street dog?","text":"What is your project like - Titanic or a street dog? One of the things I've learned from the production experience was the extreme value of getting feedback for your project as soon as possible. This might save you wasting a lot of effort, time and resources later on (or even save the project). Actually getting feedback for the project should start even before the design and architecture were started. However talks about proving business value and verifying customer expectations are slightly outside of this blog, so we'll skip it (there are a lot of excellent materials on the subject, starting with the classical Project Management), assuming that this was done and project either solves some current problem or the one that is bound to show up shortly. It's impossible to build some perfect design that will last and stay unchanged through the life-time of your project (at least I'm not capable of doing so in the rapidly changing business context I'm working, which is one of the most enjoyable parts of the job). Life always has some unexpected discoveries, challenges and opportunities waiting for you down the road. If you miss challenge or fail to manage a risk - they will result in potential losses and wasted resources. Likewise, failure to leverage an opportunity results in some potential benefits that you failed to reap (which could count as a failure if your competition was not that sloppy). Ok, so the world is ever changing and unpredictable place. Some of it's areas are even more volatile and potentially rewarding than the others. Obviously this leads that any software project in the area will be the subject of rapidly changing requirements. Project vision, roadmaps and actionable items could change within months, weeks and even days (especially given peculiarities of human mind to miss some bits of information, have lags in communications and decision making). How can you deliver software in such environment and be successful? Here are just some ideas. Continuously learn about the environment - technological and business conditions constantly change, evolving from potential probabilities to something that has already happened. There is not much literature (I haven't seen any so far and would love the pointer) about embedding knowledge about probable threats and potential opportunities into your project. However just keeping the big picture of the environment (and potential future) in mind while making decisions - helps a lot. Besides it brings this wonderful feeling, when a plan comes together. Continuously learn about your project - as project environment changes along with the usage patterns, previously known and stable components might exhibit strange and unpredictable behaviors. This becomes even worse in the modern world of OOP, elaborated versions of DLL hell and ever-present issues of concurrency and coordination. Things that might help: real-time indicators, profiling, warning notifications, data-mining etc. If you push this far enough, you might be able to fix unpredictable issues before anybody really notices them. Have extremely low friction in your systems - fixing bug or adding a quick feature, updating a stable branch and deploying - should ideally take less than an 15-40 minutes. If it takes more - friction will be created, reducing actual rate of production deployments. 
This will result in an ever-growing disparity between the real world and the ideal world that the current version of the code targets. From my experience, high-friction projects tend to be like the Titanic - they have ideal architecture, passing unit tests, rich lists of completed features and resolved issues. Releases happen once every few months (or even less frequently) and are feared, because there will certainly be a large number of critical issue reports afterwards. There is another type of project that I've recently had the pleasure of encountering. Such projects rarely have a pure and straightforward design or long releases. They start as a hack for solving some immediate business need and go into production right away. There is no real need for tests - the codebase never gets really stale and is usually covered by the best unit tests you could ever dream about - real-world users. Make everything as automated and smart as possible (every bit of reduced friction counts). Be smart with the automation and don't try to handle 100% of the cases. Automate the 80% that requires 20% of the effort. Leave the remaining cases to the real people. Don't you hate this? Real-world project feedback does not necessarily include only some sort of pre-alpha deployments with real users (although such experience is extremely valuable as well). You can also test and verify: technological ideas and concepts - by building prototypes and spikes early to justify decisions and verify them; architectural logic - by building a system-wide prototype that spans multiple components/layers and has almost all real"}
{"url":"/post/when-not-to-use-cqrs/","title":"When NOT to use CQRS?","text":"When NOT to use CQRS? CQRS is an approach to structure, design and deliver a small and focused part of a system. It is a set of patterns and techniques that help to deal with the entire vertical stack of such focused part in a relatively simple way, starting from persistence and up to the continuous UIs. While doing that we can also opt-in for some practical benefits like cloud-capabilities, scaling of development effort and low friction. I wrote a Lokad.CQRS framework in the past. If you are interested in the painful details, there is a Lokad.CQRS Retrospective . However, if you don't have your system decoupled into small and focused parts, then don't rush into CQRS right away. Instead of helping you, it is likely to increase complexity without any noticeable benefits. My recommendation is to start by finding clear boundaries along which a problem can be broken into small and manageable pieces. In the old times this divide et impera strategy was practically applied with some success by people like like Caesar and Napoleon. In my opinion, the best approach to find these boundaries and manage them is called Domain Driven Design (that's the blue book by Eric Evans). CQRS patterns evolved in the world of DDD practitioners and produce the best results when applied to problem space that is explored and explained with DDD methodology. There is simply too much synergy. If we continue with military analogy: Strategy explains how to wage a war. DDD is one of the viable strategies in software war and it is particularly good in dividing and conquering complex domains. Tactics deals with winning a battle (there could be a lot of these in a war). CQRS is a flexible software tactics that is particularly fit for the DDD strategy. Obviously, you can try to take CQRS patterns and apply mechanically to any system . Sometimes this might work and sometimes this might be as awkward as using SQL Stored Procedures for business logic. Either way you have a better chance of success if this system is decoupled into separate focused pieces, that can be handled individually on case-by-case basis. So if you have a large monolithic application that can't be clearly divided into small focused slices, then trying to apply CQRS would have lower probability of getting benefits (and higher chance of actually hurting badly). So. Don't use CQRS in your system, if you have more important things you can do first . Like breaking it apart into small and focused slices that can be dealt with separately. Likewise, if you are using CQRS to determine architecture of your entire system (and if the system is complex enough), then you are heading into some problems. CQRS is a tactical implementation detail, don't confuse it with strategy."}
{"url":"/post/why-even-bother-trying-linux/","title":"Why Even Bother Trying Linux","text":"Why Even Bother Trying Linux I've got really great question in comments from Alex , when he replied to Linux Setup Tweaks of .NET Developer with SSD . Before I knew, I started typing a reply that actually deserved a short post. Thank you very much for it! So here's the question: I cannot really understand what you're trying to achieve here. What is the point of installing Linux? Just something new play with? Something to remind you of days when you spent a lot of time making a computer work, as oppose to use it for work? :) Because all I can see here is a lot of troubles to go through - install Linux, figure out how all the little tasks are done there. You still use Windows for your main tasks - development. You still use VS and .NET. But now you need to create a VM for that. Why not just have your laptop running Win7 with everything setup? Alex, it is just about forcing my own mind to learn new concepts and ideas in their native environments. Microsoft stack is powerful, and I will be sticking to it for the years to come. However, at the same time, this stack might have it's own weaknesses that the other environments do not have (and vice versa). More than that, a lot of ideas, we benefit from on everyday basis in Win7 and .NET, originally come from the other stacks, where they originated. For example, distributed version control systems, IoC containers, unit testing, non-blocking web servers etc. There could also be the other ideas that didn't make this transition, yet. By being exposed to non-Win on everyday basis I'm forced to learn things that I previously could have neglected due to my ignorance. For example, I'm forced to use terminal, stick to simplicity and rethink my everyday workflows. There are new opportunities as well. This approach also simplifies learning some new things by immensely reducing friction of trying them out. For example Marc Gravell currently mentions a lot using Redis with ProtoBuf-net for key value store in StackExchange (which we all know for really efficient StackOverflow). Should I want to give a try to this combination, installing redis is just: yum install redis Not only this affects my take on things, it also helps to think further in terms of CQRS and cloud architectures . Being Technology Leader at Lokad, that's the part of my job, to try to think and learn beyond the current \"research level\" of the company (which is quite bleeding-edge with all this cloud computing stuff in production). Of course, only a few percent of what's learned will get into the production in our projects, but it will be good stuff to further save resources, reduce risks or increase development efficiency: ideas, tricks, architecture simplifications etc. Please, don't get me wrong. Mixing operating systems is not a simple thing and I would not advise it to a developer who just wants to be efficient with his everyday tasks. I still have crashes of VS from time to time. However, this is just like with building cloud and distributed systems - you are forced to build reliable and efficient things out of potentially failing interactions and systems. Surprising enough, this Linux + Win7 combination, does not slow me down. More than that, aside from providing new ideas, it actually speeds up my everyday work. 
Partially this could be attributed to: VM flexibility; a native terminal; Linux workspaces; the plethora of packages available; the sheer speed of Linux, which does not slow down even when there are a few VS2010 instances with ReSharper running. And I'm really tempted to some day give a try to a CQRS system built in the cloud for the cost of peanuts with RabbitMQ + Redis + RoR + a .NET server on Mono. So for me, running Linux as the primary OS with Win7 in a VM is just about forcing myself to break down the fence of ignorance I might have built during my previous years of focusing on one stack. It's learning, as if I were still in the university. PS: BTW, if my laptop faces a horrible death, I'll be able to start working on a new one in less than an hour. That's the time needed to install Linux, grab my primary dev VM from backup and fire it up. Or, it could be Mac OS on a new MacBook Pro with its 7+ hours of battery life :)"}
{"url":"/post/why-is-cloud-computing-important-for-us/","title":"Why is Cloud Computing important for us?","text":"Why is Cloud Computing important for us? This post is a reply to an extensive comment by Sean Kearon to the previous Cloud Computing post . I agree that the availability of decent distributed computing resources is not crucial to success. Yet, having opportunity of leveraging cloud computing could save a few resources for the customer. And sometimes this is a success factor. For example, one of my past projects had to be developed in a situation where customer’s resources were extremely constrained. That was actually one of the reasons why I was interested in that project back then – scarce resources can give you plenty of ideas for the highly-efficient project development and delivery. It was a simple implementation of xLim approach with modular desktop Smart Client and a web dashboard. Production version was living fine on a plain web hosting account. We didn’t even need to scale out to multiple hosting accounts (although this option was embedded in the architectural design as a cheap way of expanding capacity). In the end everyday costs for the customer were quite low. Yet, eventually there was a need for a customer to go to the dedicated server for two reasons: documents data store grew beyond 4Gbs of size (we hit a hosting limitation) and an automation engine had to be introduced. This has obviously raised monthly bill for the customer. Windows Azure (if it existed in a mature form at the moment) could’ve saved from these unnecessary expenses simply by providing a scalable storage and a worker role on the pay-as-you-go basis. This kind of expenses optimization is obviously possible in more complex and costly projects (at a larger scale). Although there always will be specific situations when it is cheaper and more efficient to invest into owned and hosted computing resources, they do not make it into a high percentage. One of the points of the cloud computing as a business model is optimization of the expenses. This happens from two sides of the story: Consumers of the cloud computing services don’t have to invest into the resources up-front (where resources could be from a dedicated server to a data center) in order to have the capacities and functionality meeting their demands. Neither do they have to over-invest in order to have the capacities meeting their peak demands (i.e.: having these extra few servers in a rack just to handle holiday sales spikes). Consumers are also saved from the maintenance-related expenses and gain the flexibility of getting rid of the resources, should such a business need arise. Getting rid of a data center is a bit harder. Providers of the cloud computing (any large hosting provider with his data own datacenter is a possible candidate for that) could utilize their existing computing resources more efficiently by shifting them into the cloud and linking with flexible pricing strategies. For example, say, we have a lot of idle CPU cycles scattered around the data center on weekend nights in USA. Would not it be more efficient to sell them for the computing purposes (at a flexible rate) instead of just letting the resource sit around? This ability to optimize usage of resources at various levels of the picture is one of the primary factors why the cloud computing hype is more than a temporary buzzword ( energy bills represent the second factor). 
Basically, one could say that cloud computing is a development of the virtualization story that works at a finer level of granularity. Utility computing , which allows (or would allow in the near future) efficient on-demand distributed computing for CPU-intensive or data-intensive operations, is just one of the extreme examples of the opportunities provided by the idea of cloud computing. It has its benefits, but certainly 90% of companies don't even need it at the moment. The high entry barrier in front of this technology (especially for .NET scenarios) further lowers its usability. Hosted development environments could obviously also be considered a form of cloud computing that helps to optimize developers' expenses. This market niche is evolving quite fast. For $7-$15 per month one could get the following pre-configured setup: an issue tracker; a version control system; a wiki and documentation management. For $60+ per month one could get a brand new Windows Server 2008 virtual machine that could host all of the above plus an integration server. Comments, thoughts? If you are interested in the development of cloud computing solutions on the .NET stack - check out xLim 4 . Related Posts: Microsoft is Reinventing CQRS for Windows Azure, but without DDD DDDD, CQRS and Other Enterprise Development Buzz-words"}
{"url":"/post/windows-azure-storage-can-be-confusing/","title":"Windows Azure Storage Can Be Confusing","text":"Windows Azure Storage Can Be Confusing I'm currently working on some blob storage abstractions needed for the Lokad.CQRS project . This involves writing some unit tests, which happen to be producing really strange results , while using conditional headers for the BLOB operations . Conditional headers are part of HTTP RFC: if-match if-modified-since if-none-match if-unmodified-since These headers are really important for implementing efficient storage operations (i.e.: caching large blobs locally) and performing reliable atomic updates (when write operation checks, if record was updated since the read). Azure Blob Storage Rest API supports conditional headers. .NET Storage Client supports them as well in form of BlobRequestOptions that could be passed to methods. That's the theory. In practice things get really confusing and tend to waste your day. Let's examine a single method OpenRead , which opens a stream for reading blob's contents. Look at this snippet: var options = new BlobRequestOptions() { AccessCondition = AccessCondition.IfMatch(cachedTag) }; using (var stream = _blob.OpenRead(options)) { read(stream); } What would you expect the outcome to be? Documentation does not say anything special about the behavior of the BlobRequestOptions passed to the OpenRead. Here's how it works on my machine: If item is stored properly in Azure Blob, then: IfUnmodifiedSince results in exception, which makes sense. IfModifiedSince is ignored (which might be step away from what is defined in RFC ) if blob (or the container) does not exist, then: IfNoneMatch with non-existent ETag results in 404 (Not Found) IfMatch with non-existent ETag results in 412 (precondition failed), which is a step away from the RFC . Now, since .NET documentation does not help us much, we could do some debugging and figure out the actual REST operations being performed underneath. This leads us to understanding that OpenRead , among the many other things, calls Get Block List method . Documentation says: This operation also supports the use of conditional headers to read the blob only if a specified condition is met. For more information, see Specifying Conditional Headers for Blob Service Operations. However, if we look at the Operations Supporting Conditional Headers , then GetBlockList operation is not even listed there. So we've got a few potential problems here: something could be completely wrong with my machine, producing constantly misleading results; REST API documentation for Windows Azure Blob Storage might be a bit outdated and confusing; Azure Dev Storage might produce really weird results depending on the type of the header passed; .NET documentation for the StorageClient does not say a word about how methods are in fact supposed to work. And that's just a single method; there are more. I've started creating questions on MSDN forum, but quickly gave up, since the next step should've been debugging into the server-side API implementation)). Update: it gets even more fun. Here's how a simple unit test suite for a single method (wrapper around blob reading) looks on the development fabric: Now if we switch the credentials to use real Windows Azure Fabric: As you can see, Windows Azure Dev Storage and Production Storage have behavior that differs . 
This should be accounted for while developing and deploying applications (ensuring that proper retry policies and delays are applied to give production storage some time for processing operations like recreating a container with the same name). Hopefully Microsoft will clear up the situation. Meanwhile, it's recommended to debug and double-check every single method . Or, as L. M. Bujold said: Check your assumptions. In fact, check your assumptions at the door."}
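Building on the snippet above, here is a hedged sketch of the defensive pattern this implies. It assumes the classic StorageClient's StorageClientException with its StatusCode property, and reuses _blob, read and cachedTag from the earlier snippet; the concrete status codes reflect the observations reported in this post, not documented guarantees.

```csharp
using System.Net;
using Microsoft.WindowsAzure.StorageClient;

// ...

try
{
    var options = new BlobRequestOptions
    {
        AccessCondition = AccessCondition.IfMatch(cachedTag)
    };
    using (var stream = _blob.OpenRead(options))
    {
        read(stream);
    }
}
catch (StorageClientException ex)
{
    if (ex.StatusCode == HttpStatusCode.PreconditionFailed)
    {
        // 412: ETag mismatch - or, per the observations above, a missing
        // blob/container. Drop the cached copy and re-read unconditionally.
    }
    else if (ex.StatusCode == HttpStatusCode.NotFound)
    {
        // 404: the blob or container does not exist.
    }
    else
    {
        throw; // anything else is a genuine failure; retry policies apply
    }
}
```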
{"url":"/post/you-can-build-a-house-with-big-lego-blocks-can-you-add-a-nic/","title":"You can build a house with big LEGO blocks. Can you add a nice porch?","text":"You can build a house with big LEGO blocks. Can you add a nice porch? Company buys some business framework that seems to match their needs in the short-term perspective. It is easy and fun to use it initially - you just need to work with some big building blocks and the framework will do the rest for you. As the time goes, company gets to met new requirements that steer to them from the probable future that they've never considered (this always happens). Every encounter will be like a gamble. You've got 80% that you can deliver solution that meets this requirement using just the provided building blocks. And there is 20% chance that you will just will not have LEGOs for a trivial task of simply adding a nice porch to the house (and if you keep on gambling, you will always lose, unless you control the game). Company could start asking the LEGO manufacturer for the \"porch building block\" accidentally spiking the community discussion about it. It will turn out that a lot of other players also want to add the porch to their house, but with different decorations and functionality. So they will gather and try to ask for the \"Flexible porch building block\". Surprisingly enough this discussion would lead to the definition that this building block should be able to transform into different shapes to accommodate for the houses that were already built: starting from the office center in Washington and up to the small cottage down in Siberia. And this all-transforming block has to be big on the outside (otherwise it will not plug into the existing building blocks with large connection joints) and have all the required functionality on the side. Now, let's have a look from the toy manufacturing company's point of view - how much design and effort will it take to deliver this new building block that fits with the bigger ones but is required to take many different shapes. Or they can start working on delivering and supporting ten different porch blocks (one for mountain house in Tibet, another for the flower shop in Sydney etc). And here comes another question how many other house elements can you name? These could need customization, too. So the logical solution is to give everybody smaller building blocks to build porch as they see fit (that's what happens to kid's LEGOs as they grow up). However there is a side effect - the old houses are not really compatible with the new construction sets, since they have the different joint sizes. So you would either need to throw them out or buy special adapter blocks. What are the obvious solutions in this situation, if you know that your kid is going to grow really fast up and he loves LEGO? Either get ready for the ongoing expenses or get him a set that is composed of preassembled building blocks that look like the big ones on the outside. As the time passes, he will learn that he can actually take any of these wall pieces apart and turn it into the porch. Well, he actually might invent some really cool design and share the idea and/or his blocks with his friends."}
{"url":"/pretty-ui-goes-to-trash/","title":"Pretty UI goes to trash","text":"Pretty UI goes to trash I'm just figuring the whole problem-driven design thing out. Maybe I'm going in the wrong direction, but discoveries on the road are quite interesting. So we are tackling a product for the customer (data science department in an international logistics). I've been with that customer for more than 3 years now, they are more like a dear friend now. Our normal line of work is around a mix of consulting and software engineering. We do workshops, get requirements, build a solution, deploy and maintain it. This time things are different. The customer asked us to treat the project as a product. Think about the long-term impact past-the deployment, think about solving the real problems and improving the data science processes throughout the organisation. I wanted to treat it as a product. So we accepted the mission and started doing customer interviews, talking to the data scientists, operations, data ops, heads. We didn't pitch any solution in the interviews simply because there was a mental block and no solution to propose. Just follow the rules outlined in Mom Test : listen, focus on the past experiences and dig in . There were a few surprises. I have worked with some of these people for more than 3 years, but never had time to sit down and actually listen. There were nuances to the known problems. After soaking up enough problems a new solution has surfaced. The one with fewer risks, faster iterations, and a slightly higher chance of success. The presentation of our findings was interesting already: \"Remember the solution that we originally proposed and you liked? Here is why and how it was going to fail. Let's do something completely different instead!\". We had approval from the customer. Today I'm presenting the first tangible artifacts. It is a tiny mess of a few Python scripts, a Flask web server, a potential UX prototype in Figma, and a Roadmap. It is the most minimum valuable product that tries to attack the core problem, while de-risking major integration points . No CSS styling, plain HTML . Only a single UI screen (affordances on that screen are thought-through, though). There are other UI screens possible and potentially important in that project, but I didn't even think about them, so I will not be able to present anything outside of the one single screen. I'm extremely proud of that! The reason is - we focus on solving problems from the long-term perspective. Most of the important Data Science and Machine Learning problems for that customer could be handled by one giant HTML table (sourced via a number of rules from different source code repositories). This is a theory we need to test in the real world as soon as possible. Everything else is irrelevant. If we try to prettify things - this will make the first deliverables nicer, but will not add anything substantial to the list of the problems solved. It will just delay the test results. This is a waste. If we ponder about the secondary screens - that would be a premature optimization and a waste of time. We don't know if the primary screen works well enough. We have to validate it and the core product idea as soon as possible. Everything else is a waste. It could even lead us to solving the wrong problem . This is counter-intuitive to what I've been doing before but feels very liberating in a sense. Focus on delivering long-term value to the customer. Ignore everything else for now. 
I came up with a heuristic for starting new products. It is counter-intuitive. If the Web UI in the first iterations has custom CSS styles, fonts, and UI elements, then I didn't have my priorities straight. I wasn't rigorous enough in identifying the root problems, coming up with a solution, and finding the leanest possible way to validate it. Instead, I probably got distracted and wasted time on the UI. Pretty UI in the first iteration - then to /dev/null the project goes. Pretty UI means that my discipline has slipped and product validation wasn't the focus. Bigger mistakes could be lurking beneath."}
{"url":"/privacy-policy/","title":"Imprint / Privacy Policy","text":"Imprint / Privacy Policy Imprint / Copyright / Privacy Policy Imprint / Legal Notice Company Name : Rinat Abdullin , e . U . Address : 1220 Vienna , Gisela - Legath - Gasse 5 / 1 , Austria Email : biz @abdullin . com Member of the Austrian Chamber of Commerce ( WKO ) This legal notice also applies to my social media profiles. Copyright Notice The content published on https://abdullin.com and its subdomains, including but not limited to articles, graphics, and other materials, is the intellectual property of Rinat Abdullin (further in the document: the Author) and is protected under Austrian copyright law. Users are permitted to repost or share content from this website on social networks or other platforms, provided that the following conditions are met: The content must be clearly attributed to its Author with a direct link back to the original source on the website or the Author’s social media profile The reposted or shared content must retain its original form and must not be altered or used in a way that suggests endorsement or affiliation with any product, service, or entity without explicit written consent from the Author. Reposting or sharing should be done in a manner that respects the integrity of the Author's intellectual property rights and does not infringe upon those rights. The reposted or shared content is originally from a publicly available section, and is neither a part of a password-protected area nor part of a purchased course. Any use of the content from www.abdullin.com beyond the scope of these conditions requires prior written permission from the Author. Unauthorized use, reproduction, or modification of any content on this website is strictly prohibited and may be subject to legal action. This copyright notice also applies to social media or broadcasting platform profiles of the Author. Privacy Policy 1. Introduction This Privacy Policy is applicable to the services offered by https://abdullin.com and its subdomains. A strong commitment to privacy protection for clients and website visitors is maintained. This document outlines the practices concerning the handling of data in relation to the use of the website, including the collection of email addresses for account creation, access to certain restricted areas, and purchase history. 2. Data Collection This website allows visitors to browse publicly accessible pages without providing personal information. However, to access certain restricted areas or services, users must create an account, which requires the collection and temporary storage of email addresses. Additionally, server access logs are recorded for technical and security purposes, but they do not contain personally identifiable information (such as IP addresses). Email Address : If you choose to create an account to access restricted areas of the website, your email address will be collected to provide access to these services. This email will also be used to manage your account and communicate important service-related information. Purchase History : In the event of purchases made through the website, a record of your purchase history will be stored under your account to allow you to review past transactions and for business record-keeping purposes. 3. Use of Data The email address and purchase history collected are used solely for the following purposes: Account Creation and Authentication : To allow users to securely access restricted areas of the website that require authentication. 
Purchase Records : Purchase history is stored to provide users with access to their transaction history and for internal record-keeping, ensuring compliance with business and legal obligations. Account Management : For maintaining user accounts, including storing timestamps related to account creation for audit and security purposes. Service-related Communication : Emails may be used to communicate important information related to the account, such as security updates, purchase confirmations, changes in terms of service, or privacy policy updates. No personal data is used for marketing, profiling, or analytics unless explicitly consented to. 4. Data Retention The email address provided for account creation and access to restricted areas, as well as any purchase history , is retained only as long as necessary to manage user accounts and provide the service. If you choose to delete your account, your email address, purchase history, and any related information will be securely erased from our systems unless retention is required for legal or regulatory compliance. To request deletion of your personal data, please contact the author of the website. Server access logs, which lack personal identifiers, are retained for a limited period to ensure the website's security and optimal performance and are then disposed of securely. 5. Data Protection Rights Under GDPR In accordance with GDPR, you have rights regarding your personal data, including access, correction, deletion, and objection to processing. To exercise these rights or make inquiries, please contact rinat@abdullin.com. 6. Third-Party Links The website may contain links to other websites not operated by this business. Visitors are advised to review the Privacy Policy of every site they visit. No responsibility is assumed for the content, privacy policies, or practices of any third-party sites or services. 7. Changes to This Privacy Policy This Privacy Policy may be updated periodically to reflect any changes in data handling practices, including those related to the use of email addresses for account creation, access to restricted areas, and purchase history. Any changes will be communicated by updating the Privacy Policy on this page. Regular review of this Privacy Policy is recommended. 8. Contact For questions, requests, or clarifications about this Privacy Policy, or to request the deletion of your personal data, please reach out to biz-bot@abdullin.com."}
{"url":"/problem-driven-design/","title":"Problem-driven design","text":"Problem-driven design I love software engineering: designing prototypes, writing code, building beautiful solutions. It is a way of thinking and expressing yourself. I spent more than a decade in this area, only to realize that I've been approaching that from a completely wrong angle. There is nothing wrong with building a beautiful software solution. However, the beauty doesn't necessarily have anything to do with the usefulness and practicality of the software. How well does your software solve the problems? How to doom a project? Let's say that I want to build something useful for others (for money, reputation, or karma). I spend time to make it perfect. Code should be clean, CI/CD pipelines should be convenient, APIs - understandable. Plus, I may want to exercise a design pattern or two that I've learned. There is a big reveal of my project after weeks of hard work. Then, the world suddenly gives a cold shoulder. Friends don't use it, and customers don't fully comprehend it. Without positive feedback, there isn't much enthusiasm to continue developing. The project gets abandoned. I failed to make a difference. Perhaps, the world wasn't ready for my solution? Let's not dig into the details, but try something new. What about a link aggregator or a new highly scalable event sourcing library? Ooh, great idea! I'll make it open source and user-friendly. Maybe a large corporation will start using it. Then people will ask for support, and I could offer a commercial offer. There are so many exciting options that I need to start writing them down! I can already tell that this second project is also likely to be doomed . It is already in love with its failure - the solution. Rooting the project in problems Outstanding software engineering doesn't necessarily lead to a successful software product or project. The project will be successful when it is practical and rooted in real problems. I ignored the problems. The solutions were well-engineered but eventually withered and died. Focus on the real problems, and the solution will have a greater chance of survival and growth. What I should've done: study the real world, listen to people's problems, and dig underneath them. Then, find the ways to validate my solution before writing too many lines of code. That, I think, was the main issue with my projects in the past. I focused too much on engineering, while I should've focused on identifying problems and iteratively shaping the solutions. Coding should start after the solution design. Design should start after talking to people and identifying the problems. All three should iterate. There is an immense field of study around \"Product Design\" and \"Customer Development\". By ignoring it for many years, I've learned about the cost of building solutions that aren't rooted in people's problems. That is a good thing. Now there is a lot to catch up :) Update There is a follow-up article: Problem-driven vs Domain-driven"}
{"url":"/problem-vs-domain-driven/","title":"Problem-driven vs Domain-driven","text":"Problem-driven vs Domain-driven Marco Heimeshoff responded to the blog post about Problem-Driven Design . He triggered an interesting discussion in LinkedIn and asked in the end: ... where do you see the difference in Problem Driven Development and DDD? Here is my take. Domain-driven design Domain-Driven Design (DDD) works with established domains. It is driven by the domain, as the name implies. We work with stakeholders and domain experts to reach shared understanding, to clarify a map, manage complexity and deliver solutions. In the words of Marco: The major focus of the Domain-Driven Design community has been on experimenting, developing and adopting methods to understand the purpose-space of the customer and all the stakeholders needs. The map may be uncertain, but the business already exists, and it is a viable one. The exploration happens within these boundaries and is directed inwards. From this perspective, the problem space is defined. DDD is well-equipped with the tools to tackle complexity and deliver business value in such environment. DDD helps to build the best solution possible . Problem-driven design Situation can also be completely different. Tom Janssens from the Belgian DDD community expressed this well: In DDD we tend to assume that the problems the customer share are a given. When you are running a business this turns out to be not true... Imagine that you need to write a new book, build a sustainable product or assist a company in exploring new business opportunities. The domain is uncertain. There is a myriad of options to choose from. Some would be viable and some would not. It is chaos, full of possibilities. The purpose of problem-driven design is to venture out into this completely new \"there be dragons\" territory and establish a new viable domain. \"Viability\" could mean anything: support a family, build a reputation, do social good, create a big business. These are the constraints. Given these constraints, we talk with potential customers, discover their problems, and dig into these discoveries. We identify and pick the problems that could be solved in a viable way . Then we shape a solution and find the shortest path to validate it. The earlier we find flaws in our approach - the better. This means, we save hours and dollars from the wrath of reality and get to work on a new problem-solution. At early stages we could employ methods from the lean startups , iterative development, data-driven product design, and the art of conducting customer interviews. In short: Coding should start after the solution design. Design should start after talking to people and identifying the problems. All three should iterate. If all that works out - we may have just carved a piece of reality into a brand new domain. Domain-driven design could assist to grow it further. Virtual Sales Lab by Tom Janssens would be a very good example here. The product is about building sales and marketing tools that leverage 3D technologies. The domain itself didn't really exist a decade ago. It is growing now. In the next decade, 3D sales might evolve into a well-known and established domain. Why such optimism? Aren't current 3D solutions a bit ugly? Younger generations prefer to perceive the world through 3D. Huge parties in Fortnight, kids of all ages playing Minecraft, tech advancements are tell-tale signs within the problem space that show the inevitability. 
The solutions aren't defined, but deep currents within the problem space will lead to something. Update There is a follow-up article: Solving the wrong problem ."}
{"url":"/robotics/","title":"🦾 Robotics","text":"🦾 Robotics At some point I got interested in reimplementing low-cost high-precision robotic arm from Haddington Dynamics. That product was great because: Open-source everything (aside from the FPGA code blob) Much cheaper than existing commercial robot hands, but comes with a unique precision and repeatability. You could buy most of the components yourself. The company was acquired by Ocado Group in 2020, putting their open-source efforts on hold. Still, the remaining digital footprint is big enough to try reproducing this project. Spoiler alert! I didn't get to build the robot hand. With the help from my friends, I got quite far on the learning path, though 😃"}
{"url":"/ruby/rails/","title":"How did learning Ruby on Rails go?","text":"How did learning Ruby on Rails go? 7 months ago I've started an [[L31 ruby-why-learn-it|experiment in learning Ruby and Ruby-on-Rails]]. It was a challenging and interesting experience. The challenge was because I wanted to learn the ruby \"the right way\". This is completely subjective, but I decided to set up a remote Ruby dev environment on a dedicated server. The idea was to get as far as possible from my comfortable setup: a rich visual IDE on a desktop with intelligent completion and smart refactorings. An alternative approach was to run vim on a remote server inside tmux (the screen multiplexer) while using mosh for a low-latency connection from any device. A passable knowledge of tmux and vim, along with a more stable dotfiles collection were the first ongoing benefit of running the experiment. While going through the \"Eloquent Ruby\" by Russ Olsen, I've discovered another side benefit: the pleasure of using iPad to actively read books by annotating them with a stylus. I had a setup that alternated between iPad in screen mode with a mechanical keyboard and iPad in hand. It was more convenient and distraction-free than using a laptop for the studies. The biggest disillusionment of the experiment - Ruby didn't \"click\" as that beautiful dynamic language that lends itself to a prose. It is nice and Lispy, but I felt no magic between us. Ruby-on-Rails (RoR) wasn't as magical as I've pictured it, as well. Perhaps, I was just unlucky to start learning it at version 6 (not as lean as the earlier versions). Using a freshly introduced ActionText module - didn't exactly help to fall into \"the pit of success\", either. Yet, despite the rough start, Rails journey was a success. Somehow I was finally able to build a personal information system that I always needed (in essence, just a glorified private blog with a few outlining features from the org-mode). I attribute that success to the fact that Rails are optimised for the developer productivity and fast iterations . I was able to apply dozens of tiny tweaks so quickly that I stayed in the creative flow all along. To be more specific, here are a few things that helped to move forward: webpack is configured and bundled in; scaffolding generates the project structure from the start; Rails is so old that there are dozens of StackOverflow answers for pretty much any question; there are various flavours of rails generators can setup MVC components, individual models and tests; runtime dynamically reloads all changes from the files, there is no need to restart; rails console - is an interactive REPL that lets you try snippets of code, explore the database and modify it; it also picks up any changes to the code; ActiveRecord module is extremely powerful, albeit limited to CRUD; ActionText module is a great rich-text web editor that just works, supports attachments and mentions; Rails has a lot of things already solved and packed. This makes it more bloated and opinionated, but it saves time and preserves focus during the development. There were a few things that I had to make peace with: Relative performance is slow (compared to .NET or golang). Rails 6 is a complex thing, one step in the wrong direction and you are debugging a webpack config or looking into the sources. With a CRUD system like ActiveRecord you have to solve puzzles in order to evolve your schema (unlike an event-driven solution). In Ruby nothing is what it seems. 
Modules can augment and change classes on-the-fly. Methods can be created on demand. The development experience is very different from a classical strongly typed language where the object is mostly frozen at compile-time. It worked out nicely in the end. I've dipped my toes into Ruby on Rails. It helped me to build software that I've been using daily for the last few months. The resulting codebase is tiny compared to what I would get in go, Python or .NET. I don't see any use of Ruby or Rails for the projects at work. There isn't enough synergy to justify it. However, I'll likely be using it on a weekly basis to tweak and evolve my journaling solution. It is so easy."}
{"url":"/ruby/why/","title":"Why Learn Ruby?","text":"Why Learn Ruby? Yesterday I mentioned to Aigiz that I was taking a slow dive into Ruby and Ruby-on-Rails on my own time. \"Do you think it is worth it?\" - he asked. That is a very good question, warranting a proper response. So here we go. Short answer would've been \" Fernweh \" To provide context to the question: Aigiz is my colleague at Softwarepark. However, we had been working together long before that - probably for 15 years or so, across different projects. We've been through the beginning of ASP.NET at idocuments. We worked with early Azure, Lokad.CQRS and big data in retail at Lokad. We've scaled together Lokad.CQRS at SkuVault. For the last two years we were knee deep in Python, golang and data science ecosystem at Softwarepark. At this level of shared experience, we have a similar perception of how things work and how peculiar software decisions work out in development and hurt in production. Among the other things, together we figured out applicability rules for general-purpose languages that work for us: C# and .NET Core - powerful, flexible and verbose. It can scale to large codebases, but requires smart tooling and IDEs to maintain. Good fit for good old boring business applications. Compared to other languages, it requires so much ceremony to start writing things, that we rarely touch it these days. golang - performant language with outstanding concurrency primitives. It is simple to start with, but becomes verbose if you try to capture a complex business domain in it. Golang worked great for us in the infrastructure, middleware and small command-line utilities. JavaScript (with dialects) - reigns supreme in the browser. Outside of the browser, there are better options. Python - perfect language for prototyping with a rich ecosystem for data science and pretty much everything else. It reads and writes like poetry when applied to scripting and configuration. Python has a performance of interpreted language. Its parallelism is limited by the Global Interpreter Lock. We love Python. I'm leaving out of this page: lisp dialects (Clojure, Scheme, CL), beautiful Erlang, Rust, F# and Haskell. They are good on their own, but we've never worked in the niches where these languages were a compelling choice. Given all that, why bother with a niche language like Ruby ? It isn't going to make things run faster. Being an interpreted language, it performs on par with Python. It has a similar global interpreter lock (GVL), so parallelism isn't feasible within a single process (not until guilds are released, at least). It also appears that market demand for Python is at least 5 times greater than for Ruby. The answer is about searching for comfort, happiness and a new experience. I want to find a language that is even more fun to develop applications in. Application can mean anything these days. In order to simplify the problem, let us define the goal: deliver web applications efficiently . Efficiently here means saying no to unfit technologies, no to waste of time or slow iterations. Companies with teams and budgets can afford inefficiencies in software design and development. Personal projects are far more sensitive - a few dumb decisions and the project grinds down to a halt because of all the technical debt. Let's break this down to a few specific constraints: Applications need to have a web UI to interact with people regardless of their OS. Application scalability isn't a big problem; hardware is cheap these days. 
No single-page applications, client-side rendering or smearing the logic between backend and front-end. No to microservices or serverless architectures. At this point we are searching for a language (and ecosystem) that will make a silly me productive and happy while building web apps in a very opinionated way. Given the narrow constraints, that could be a very sharp tool, were it to be found. That's how we come to Ruby, the language. Ruby seems to be one of those niche languages that chose to focus on developer productivity and happiness instead of performance. It tends to run slower than Python on benchmarks. Ruby-on-Rails (the web framework on Ruby) is considered by some to be the most productive web dev environment. The folks behind RoR are the same people who built Basecamp, wrote \"ReWork\" and have strong opinions in favor of software monoliths. At this point, I would feel uncomfortable if I didn't try learning Ruby and RoR. If I'm lucky, I'll also find a tool that is a pleasure to use for these specific tasks. Update 7 months later: How did learning Ruby on Rails go?"}
{"url":"/running-on-a-real-fpga/","title":"Robotic R&D: FPGA and ML Pipelines","text":"Robotic R&D: FPGA and ML Pipelines We've made progress in replicating a high-precision robotic arm (from Haddington Dynamics) since last April. Before moving forward, let me recap the functional components of our robotic joint (with feedback and high precision): processor sends controlling impulses to a motor driver; motor driver modulates high-voltage electrical current to rotate a stepper motor ; motor spins a speed reducer gear ; an optical encoder (a wheel with slits that pass through a pair of IR light beams) translates angular position of a motor into a pair of voltage-modulated signals; optical interface board captures voltage levels, drops some noise and feeds it to the analog-to-digital converter from which the processor could read. In April we had the following goals in mind: switch from 3D-printed cycloidal speed reducer to a 3D-printed planetary gear; design and print an optical encoder to match the new gear; print and solder an optical interface board, replacing a tangling mess of wires on the breadboard; switch processing from Arduino Uno to a MiniZed Dev board (ARM+FPGA); switch from L293D -based stepper motor driver to A4988 . All these goals are complete now . In addition to that, there is an emerging environment and tooling for moving forward with machine learning in this project. To clarify a question frequently asked: this isn't a work project , just something we are doing on our free time. This is a practice in running R&D projects within a new space with a small team, limited budged and even more limited time . All while learning new skills and trying to build something interesting with a smart software that compensates for the hardware shortcomings. Optical Interface Initially our optical interface was just a mess of wires, capacitors and resistors on a bread-board. Based on the original schematics from the Haddington Dynamics, I eventually managed to finish the routing and get a PCB design that looked correct. JLCPCB has a good reputation for printing out circuit boards, so that's where we placed the order: 5 boards and one SMT stencil (helps to apply solder paste). We could've just reused Gerber files from Haddington Dynamics but this approach would miss a lot of experience, critical for building the intuition, essential for this project. Within the next couple of months we received the necessary parts and \"good enough\" equipment, then destroyed a practice board while learning SMT soldering with a heat gun. As it turns out, soldering, per se, is easy . The tricky part is in placing solder paste and sticking little chips without accidentally scattering everything across the table and the floor. Planetary Speed Reducer Aigiz printed a planetary speed reducer (from an existing design) that used airsoft balls as bearings. We designed an optical encoder in Fusion 360 to fit this speed reducer. It was modular to allow rapid iterations without reprinting everything all the time. Modular design paid off quickly. As it turns out, component manufactures don't always stick to precise dimensions, so a lot of fiddling and gluing was necessary. After connecting everything together we got an optical encoder with good repeatability but low precision. Arduino Uno controlled it at this point. That's when the data started flowing in. ML Pipelines Getting the design to work involved fiddling with different elements: capacitors, resistors and micro-stepping. 
Somewhere along the way a collection of data samples started getting out of control . The project is running in small iterations of 30-45 minutes. There isn't much time for \"getting into the flow\" and any cognitive overhead kills productivity. At this point, it made sense to pause and invest effort in setting up the necessary software tooling. Just like in most software projects, having the right tools could reduce toil and enable a steady iteration rhythm . By coincidence, two peculiar things happened at this time, helping to shape this tooling: Zhamak Dehghani published an article on How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh . Among other things, she explored the notion of domain datasets and collaboration in data-driven organizations. I stumbled upon an incredibly rich lightning talk by Stephen Pimentel on machine learning pipelines at Apple . So after throwing a bunch of Post-It notes on the wall, I hacked together a minimally usable environment for organizing datasets in Jupyter : As it turns out, Event Storming works nicely with User Story Mapping . Event Storming helps to capture concepts, transitions and abstract details within the business logic. User Story Mapping helps to organize and prioritize potential features, while aligning them with the bigger picture. The biggest benefit was in enabling rapid and focused iterations (as in 2-4 hours per iteration ) while deferring everything \"not-so-important\" to a later stage. This environment stores and structures datasets, models and notebooks in an opinionated way , associating metadata with every bit of information and state change. This makes dependencies and relations explicit, helping with versioning, documentation and collaboration (see the sketch at the end of this post). MiniZed FPGA Somewhere along the way, a MiniZed FPGA finally arrived. It took some time to make sense of the toolchain and board details. Eventually (after multiple failed attempts and banging my head on the desk), the ARM Cortex-A9, Artix-7 programmable logic, Xilinx analog-to-digital converter and our own mess of wires started working together (in a naive way). Here is how our design looks at this stage: This upgrade allowed us to replace the Arduino Uno, while increasing sampling precision (11 bits) and frequency (10k-100k samples per second). Next Steps For the next month, I'll be traveling with my family, away from the hardware. I will take advantage of it by focusing on the software : migrate the ML Pipelines environment to an event-sourced design (to evolve the design without worrying about data schema upgrades); explore programmable logic in a simulated environment for absolute positioning (with Verilator); explore models for compensating jitter in the speed reducer (perhaps even give a try to a hashed perceptron in PL). Meanwhile in Ufa, Aigiz will be working on a new design of the speed reducer gear."}
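The post above describes the dataset environment only in prose. Here is a minimal sketch of the core idea - every stored dataset carries an explicit metadata record for versioning and lineage. All names are hypothetical; this is an illustration, not the actual tooling:

```python
# Minimal sketch: persist a dataset together with explicit metadata.
# All names here are hypothetical, not the actual environment.
import hashlib, json, time
from pathlib import Path

def save_dataset(df, name: str, root: Path, parents: list, note: str) -> dict:
    """Persist a pandas DataFrame together with lineage metadata."""
    folder = root / name
    folder.mkdir(parents=True, exist_ok=True)
    payload = df.to_csv(index=False).encode("utf-8")
    (folder / "data.csv").write_bytes(payload)
    meta = {
        "name": name,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "sha1": hashlib.sha1(payload).hexdigest(),  # content version
        "parents": parents,  # names of upstream datasets (explicit lineage)
        "note": note,        # human context for future sessions
    }
    (folder / "meta.json").write_text(json.dumps(meta, indent=2))
    return meta
```

With a record like this next to every dataset, dependencies and relations stay explicit, which is what makes the short, focused iterations described above possible.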
{"url":"/schema-guided-reasoning/","title":"Schema-Guided Reasoning (SGR)","text":"Schema-Guided Reasoning (SGR) Schema-Guided Reasoning (SGR) is a technique that guides large language models (LLMs) to produce structured, clear, and predictable outputs by enforcing reasoning through predefined steps . By creating a specific schema (or structured template), you explicitly define: What steps the model must go through (preventing skipped or missed reasoning) In which order it must reason (ensuring logical flow) Where it should explicitly focus attention (improving depth and accuracy) Instead of allowing free-form text completion (which can be inconsistent or ambiguous), the schema acts as a strict guideline. This guideline will be enforced upon the LLM via Constrained Decoding ( Structured Output ). You can think of it as giving the model a clear “checklist” or “structured script” to follow. Here is one example of SGR in action from a project in compliance/FinTech domain. This is a pydantic data structure that enforces LLM to perform an analysis of a clause from internal company procedure in a very specific order. We translated domain expert’s mental checklist into a structured reasoning schema for LLM. See also SGR Patterns such as Cascade , Routing , and Cycle . By enforcing strict schema structures, we ensure predictable and auditable reasoning, gain fine-grained control over inference quality, and easily validate intermediate results against test data. In other words, via the structure we can control the layout of the response. This allows us to break tasks into smaller steps, while ensuring mandatory checkpoints. Here are some benefits: Reproducible reasoning - we guarantee more consistent inference across repeated runs. Auditable - SGR makes every reasoning step explicit and inspectable. Debuggable & Testable - intermediate outputs can be directly evaluated and improved (they are linkable to test datasets with evals) We can translate expert knowledge into executable prompts. DDD works really well here. Enhances both reasoning transparency and output reliability. Accuracy boost of 5-10% is not uncommon. This improves reasoning capabilities of weaker local models, making them more applicable in various workloads. Note, that we are not replacing the entire prompt with structured output . We just don't rely only on prompt in order to force LLM to follow a certain reasoning process precisely. Deep Dive To dive deeper: Read through the SGR Patterns : Cascade , Routing , and Cycle . Go through a few SGR Examples that illustrate application of SGR: simple math task text-to-sql document classification advanced reasoning in compliance Business Assistant demonstrates how to build a reasoning business assistant with tool use in 160 lines of Python. Adaptive Planning further explains how and why this simple agent demo is capable of adapting its plans to new circumstances on-the-fly. Production Uses Schema-Guided Reasoning (SGR) is the single most widely applied LLM pattern in AI cases that I've observed. 
It was used: in manufacturing and construction - to extract and normalise information from purchase orders, data sheets and invoices in multiple languages (when used together with a Visual LLM); in business automation products - to automatically create tickets, issues and calendar entries from the calendar input; in EU logistics - to normalise and extract information from diverse tax declaration forms; in fintech - to accurately parse regulations for further ingestion into compliance assistants, then - to run compliance gap analysis according to the defined checklist process; in sales - to power lead generation systems that run web research powered by custom workflows. Schema-Guided Reasoning (SGR) becomes even more important for locally-capable models (models that can run on private servers offline). Such models have much less cognitive capacity than what we can get by querying OpenAI or Anthropic APIs. In other words, local models are generally not as smart as the cloud ones. SGR helps to work around this limitation. Support Schema-Guided Reasoning (SGR) works with modern cloud providers that support Structured Output via constrained decoding. It doesn't require reasoning models, but it works well with models that were distilled from reasoning models. OpenAI - supported via Structured Outputs (including OpenAI on Azure). GPT-5 uses JSON Schema via llguidance . Mistral - supported via Custom Structured Output Google/Gemini - Structured Outputs supported properly since November 2, 2025 via JSON Schema (Pydantic and Zod are supported) Grok - supported for multiple models: Structured Outputs . Fireworks AI - via JSON Schema . Cerebras - via Structured Outputs OpenRouter - depends on the downstream provider, maps to JSON Schema . Most modern inference engines support the necessary capability: ollama - via Structured Outputs vllm - via xgrammar or guidance backends TensorRT-LLM - e.g. via GuidedDecoding SGLang - via Outlines , XGrammar or llguidance Citation @misc{abdullin2025sgr, author = {Abdullin, Rinat}, title = {Schema-Guided Reasoning ({SGR})}, year = {2025}, month = jul, url = {https://abdullin.com/schema-guided-reasoning/}} References Video with more background on text-to-sql: NODES 2024 - LLM Query Benchmarks: Cypher vs SQL Talk by Andrej Karpathy from MSBuild 2023: State of GPT"}
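The compliance schema referenced earlier in this post appears only as an image in the original. A hypothetical reconstruction, illustrating the idea rather than the actual production schema, could look like this:

```python
# Hypothetical reconstruction of a compliance checklist schema - an
# illustration of the idea, not the actual production structure.
from typing import List, Literal
from pydantic import BaseModel, Field

class ClauseComplianceCheck(BaseModel):
    # Step 1: restate the clause, putting it into the model's attention
    clause_summary: str
    # Step 2: which regulations could apply to this clause, and why
    applicable_regulations: List[str]
    # Step 3: concrete gaps between the clause and the regulations
    identified_gaps: List[str]
    # Step 4: only now commit to a verdict
    compliance_status: Literal["compliant", "partially_compliant", "non_compliant"]
    # Step 5: cite identifiers of supporting snippets for auditability
    supporting_evidence: List[str] = Field(
        ..., description="IDs of clauses or snippets backing the verdict")
```

The field order is the point: constrained decoding makes the model fill the checklist top to bottom, so the verdict can only be emitted after the preparatory reasoning fields are populated.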
{"url":"/schema-guided-reasoning/adaptive-planning","title":"SGR Adaptive Planning","text":"SGR Adaptive Planning How can we enable AI agents to navigate uncertainty and adapt their plans to new circumstances? Let me illustrate with a few more examples how the reasoning logic in SGR Demo adjusts its plans based on new information. This is a response to a community comment about my SGR Demo: \"But true agent behavior in production is when the agent doesn't know the entire sequence of steps beforehand and decides what the next step is during runtime.\" Let's take the SGR demo and add two additional tasks to a sample list. The first task will create a memory (fact) that we should never sell SkyNet the online practicum on creating AGI. Just in case. Add rule for skynet @y . com - politely reject all requests to buy SKU - 220 This results in the following execution: For the second task, we'll inform our agent that Elon Musk and SkyNet both want to buy the online practicum on building AGI. If you recall from the source code, each task is executed in a fresh context. Thus, when the agent starts executing the second task, it initially won't remember that SkyNet should be refused an invoice for SKU-220. It discovers this fact only after loading additional information about SkyNet and surfacing this memory. Let's see how this works out in practice: As you can see, the final execution summary looks right: Issued invoice INV-4 for elon@x.com Emailed invoice INV-4 to finance@x.com Politely rejected skynet@y.com request Why did this work out? How did this demo agent adjust the plan on the fly? The trick is in the existing SGR schema! It forces LLM to plan multiple steps ahead via plan_remaining_steps_brief field (this helps create a coherent plan), but then takes only the immediate next step and executes it as a tool via function...= Field(..., description=\"execute first remaining step\") . The rest of the plan is discarded! class NextStep ( BaseModel ): # we'll give some thinking space here current_state : str # Cycle to think about what remains to be done. at least 1 at most 5 steps # we'll use only the first step, discarding all the rest. plan_remaining_steps_brief : Annotated [ List [ str ], MinLen ( 1 ), MaxLen ( 5 )] # now let's continue the cascade and check with LLM if the task is done task_completed : bool # Routing to one of the tools to execute the first remaining step # if task is completed, model will pick ReportTaskCompletion function : Union [ ReportTaskCompletion , SendEmail , GetCustomerData , IssueInvoice , VoidInvoice , CreateRule , ] = Field ( ... , description = \"execute first remaining step\" ) After the tool finishes execution, we append its output to the conversation context of the current task and run the next step, prompting the agent to plan again. At this point, the new plan considers all prior data, adapting to the changing circumstances. This way, we don't need to adjust existing plans because we never keep stale plans around; instead, we create an entirely new plan at every step . This might seem counter-intuitive since planning typically involves considerable effort for humans. However, LLMs don't care - planning and adapting at each individual step is a fixed cost for them. Meanwhile SGR helps orient the entire process toward the concrete goal. Check out full Gist to see the entire sample in action: Github Gist . Keep in mind that Schema-Guided Reasoning is not about agents or planning. 
SGR is about guiding large language models (LLMs) through predefined reasoning steps to produce structured, clear, and predictable outputs. We do that by encoding desired reasoning pathways in schemas and enforcing these schemas mechanically with constrained decoding (also known as Structured Output or Response Schema )."}
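To make the plan-then-act cycle above concrete, here is a condensed sketch of the loop, assuming the NextStep schema, ReportTaskCompletion, dispatch(), client, system_prompt and task from the SGR Demo are already defined. It is simplified relative to the demo: tool results are fed back as plain messages rather than native tool-call entries.

```python
# Condensed sketch of the replanning loop from the SGR demo (simplified).
log = [{"role": "system", "content": system_prompt},
       {"role": "user", "content": task}]
for _ in range(20):  # safety cap on reasoning steps
    completion = client.beta.chat.completions.parse(
        model="gpt-4o", response_format=NextStep, messages=log)
    job = completion.choices[0].message.parsed
    if isinstance(job.function, ReportTaskCompletion):
        break  # the model decided the task is done
    # A fresh plan is produced on every iteration; only its first
    # step is executed, the rest of the plan is thrown away.
    result = dispatch(job.function)
    log.append({"role": "assistant", "content": job.model_dump_json()})
    log.append({"role": "user", "content": str(result)})
```

Because the plan is regenerated on every pass over the growing log, any new fact surfaced by a tool (such as the SkyNet rule above) is automatically reflected in the next plan.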
{"url":"/schema-guided-reasoning/demo","title":"SGR Demo","text":"SGR Demo Let's build a demo business assistant. It will demonstrate the foundations of using Schema-Guided Reasoning (SGR) with OpenAI API. It should: implement a business assistant capable of planning and reasoning implement tool calling with SGR and simple dispatch agent should be able to create additional rules/memories for itself use a simple (inexpensive) non-reasoning model for that To give this AI assistant something to work with, we are going to ask it to help with running a small business - selling courses to help to achieve AGI faster. Ultimately the entire codebase should be ~160 lines of Python code in a single file, include only openai , pydantic and rich (for pretty console output). It should be able to run workflows like this: This demo uses the NextStep planner, which plans one action at a time and continuously adapts to changing circumstances during execution. While this is one approach to building agents using Schema-Guided Reasoning (SGR), it's not the only one. SGR itself does not dictate any specific agent architecture ; instead, it illustrates how structured reasoning can be arranged and executed within individual steps. Customer Management System Let's start by implementing our customer management system. LLM will be working with it according to our instructions. For the sake of simplicity it will live in memory and have a very simple DB structure: DB = { \"rules\" : [], \"invoices\" : {}, \"emails\" : [], \"products\" : { \"SKU-205\" : { \"name\" : \"AGI 101 Course Personal\" , \"price\" : 258 }, \"SKU-210\" : { \"name\" : \"AGI 101 Course Team (5 seats)\" , \"price\" : 1290 }, \"SKU-220\" : { \"name\" : \"Building AGI - online exercises\" , \"price\" : 315 }, }, } Tool definitions Now, let's define a few tools which could be used by LLM to do something useful with this customer management system. We need tools to issue invoices, cancel invoices, send emails, and memorize new rules. To be precise, each tool will be a command (as in CQRS/DDD world), phrased as an instruction and coming with a list of valid arguments. from typing import List , Union , Literal , Annotated from annotated_types import MaxLen , Le , MinLen from pydantic import BaseModel , Field # Tool: Sends an email with subject, message, attachments to a recipient class SendEmail ( BaseModel ): tool : Literal [ \"send_email\" ] subject : str message : str files : List [ str ] recipient_email : str Note the special tool field. It is needed to support discriminated unions allowing pydantic and constrained decoding to implement Routing from SGR Patterns . Pydantic will rely on it to pick and instantiate the correct class when loading back JSON that was returned by LLM. This SendEmail command is equivalent to a function declaration that looks like: def SendMail ( subject : str , message : str , files : List [ str ], recipient_email : str ): \"\"\" Send an email with given subject, message and files to the recipient. 
\"\"\" pass Now, let's add more tool definitions: # Tool: Retrieves customer data such as rules, invoices, and emails from DB class GetCustomerData ( BaseModel ): tool : Literal [ \"get_customer_data\" ] email : str # Tool: Issues an invoice to a customer, with up to a 50% discount class IssueInvoice ( BaseModel ): tool : Literal [ \"issue_invoice\" ] email : str skus : List [ str ] discount_percent : Annotated [ int , Le ( 50 )] # never more than 50% discount Here we are using Le annotation with \"LessOrEqual\" for discount_percent , it will be included into JSON schema and then enforced in constrained decoding schema. There is no need to explain anything in prompt, LLM will not be able to emit 51. # Tool: Cancels (voids) an existing invoice and records the reason class VoidInvoice ( BaseModel ): tool : Literal [ \"void_invoice\" ] invoice_id : str reason : str # Tool: Saves a custom rule for interacting with a specific customer class CreateRule ( BaseModel ): tool : Literal [ \"remember\" ] email : str rule : str Dispatch implementation Now we are going to add a big method which will handle any of these commands and modify the system accordingly. It could be implemented as multi-dispatch, but for the sake of the demo, a giant if statement will do just fine: # This function handles executing commands issued by the agent. It simulates # operations like sending emails, managing invoices, and updating customer # rules within the in-memory database. def dispatch ( cmd : BaseModel ): # this is a simple command dispatch to execute tools # in a real system we would: # (1) call real external systems instead of simulating them # (2) build up changes until the entire plan worked out; afterward show # all accumulated changes to user (or another agent run) for review and # only then apply transactionally to the DB # command handlers go below Let's add first handler. This is how we can handle SendEmail : def dispatch ( cmd : BaseModel ): # here is how we can simulate email sending # just append to the DB (for future reading), return composed email # and pretend that we sent something if isinstance ( cmd , SendEmail ): email = { \"to\" : cmd . recipient_email , \"subject\" : cmd . subject , \"message\" : cmd . message , } DB [ \"emails\" ] . append ( email ) return email # more handlers... Rule creation works similarly - it just stores rule associated with the customer in DB, for future reference: if isinstance ( cmd , CreateRule ): rule = { \"email\" : cmd . email , \"rule\" : cmd . rule , } DB [ \"rules\" ] . append ( rule ) return rule GetCustomerData queries DB for all records associated with the specified email. if isinstance ( cmd , GetCustomerData ): addr = cmd . email return { \"rules\" : [ r for r in DB [ \"rules\" ] if r [ \"email\" ] == addr ], \"invoices\" : [ t for t in DB [ \"invoices\" ] . items () if t [ 1 ][ \"email\" ] == addr ], \"emails\" : [ e for e in DB [ \"emails\" ] if e . get ( \"to\" ) == addr ], } Invoice generation will be more tricky, though. It will demonstrate discount calculation (we know that LLMs shouldn't be trusted with math). It also shows how to report problems back to LLM - by returning an error message that will be attached back to the conversation context. Ultimately, IssueInvoice computes a new invoice number and stores it in the DB. We also pretend to save it in a file (so that SendEmail could have something to attach). if isinstance ( cmd , IssueInvoice ): total = 0.0 for sku in cmd . skus : product = DB [ \"products\" ] . 
get ( sku ) if not product : return f \"Product { sku } not found\" total += product [ \"price\" ] discount = round ( total * 1.0 * cmd . discount_percent / 100.0 , 2 ) invoice_id = f \"INV- { len ( DB [ 'invoices' ]) + 1 } \" invoice = { \"id\" : invoice_id , \"email\" : cmd . email , \"file\" : \"/invoices/\" + invoice_id + \".pdf\" , \"skus\" : cmd . skus , \"discount_amount\" : discount , \"discount_percent\" : cmd . discount_percent , \"total\" : total , \"void\" : False , } DB [ \"invoices\" ][ invoice_id ] = invoice return invoice Invoice cancellation marks a specific invoice as void, returning an error for non-existent invoices: if isinstance ( cmd , VoidInvoice ): invoice = DB [ \"invoices\" ] . get ( cmd . invoice_id ) if not invoice : return f \"Invoice { cmd . invoice_id } not found\" invoice [ \"void\" ] = True return invoice Test tasks Now, having such DB and tools, we could come up with a list of tasks that we can carry out sequentially. TASKS = [ # 1. this one should create a new rule for sama \"Rule: address sama@openai.com as 'The SAMA', always give him 5 % d iscount\" , # 2. this should create a rule for elon \"Rule for elon@x.com: Email his invoices to finance@x.com\" , # 3. now, this task should create an invoice for sama that includes one of each # product. But it should also remember to give discount and address him # properly \"sama@openai.com wants one of each product. Email him the invoice\" , # 4. Even more tricky - we need to create the invoice for Musk based on the # invoice of sama, but twice. Plus LLM needs to remember to use the proper # email address for invoices - finance@x.com \"elon@x.com wants 2x of what sama@openai.com got. Send invoice\" , # 5. even more tricky. Need to cancel old invoice (we never told LLMs how) # and issue the new invoice. BUT it should pull the discount from sama and # triple it. Obviously the model should also remember to send invoice # not to elon@x.com but to finance@x.com \"redo last elon@x.com invoice: use 3x discount of sama@openai.com\" , ] Task termination Let's define one more special command. LLM can use it whenever it thinks that its task is completed. It will report results with that. This command also follows Cascade pattern. class ReportTaskCompletion ( BaseModel ): tool : Literal [ \"report_completion\" ] completed_steps_laconic : List [ str ] code : Literal [ \"completed\" , \"failed\" ] Prompt engineering Now we have all sub-schemas in place, let's define the core SGR schema for this AI assistant: class NextStep ( BaseModel ): # we'll give some thinking space here current_state : str # Cycle to think about what remains to be done. at least 1 at most 5 steps # we'll use only the first step, discarding all the rest. plan_remaining_steps_brief : Annotated [ List [ str ], MinLen ( 1 ), MaxLen ( 5 )] # now let's continue the cascade and check with LLM if the task is done task_completed : bool # Routing to one of the tools to execute the first remaining step # if task is completed, model will pick ReportTaskCompletion function : Union [ ReportTaskCompletion , SendEmail , GetCustomerData , IssueInvoice , VoidInvoice , CreateRule , ] = Field ( ... , description = \"execute first remaining step\" ) Here is the system prompt to accompany the schema. Since the list of products is small, we can merge it with prompt. In a bigger system, could add a tool to load things conditionally. system_prompt = f \"\"\" You are a business assistant helping Rinat Abdullin with customer interactions. - Clearly report when tasks are done. 
- Always send customers emails after issuing invoices (with invoice attached). - Be laconic. Especially in emails - No need to wait for payment confirmation before proceeding. - Always check customer data before issuing invoices or making changes. Products: { DB [ \"products\" ] } \"\"\" . strip () Task Processing Now we just need to implement the method to bring it all together. We will run all tasks sequentially. The AI assistant will use reasoning to determine which steps are required to complete each task, executing tools as needed. # use just openai SDK import json from openai import OpenAI # and rich for pretty printing in the console from rich.console import Console from rich.panel import Panel from rich.rule import Rule client = OpenAI () console = Console () print = console . print def execute_tasks (): # we'll execute all tasks sequentially. You can add your tasks # or prompt the user to write their own for task in TASKS : # task processing logic pass if __name__ == \"__main__\" : execute_tasks () Now, let's go through the task processing logic. First, pretty printing: print ( \" \\n\\n \" ) print ( Panel ( task , title = \"Launch agent with task\" , title_align = \"left\" )) Then, set up an array that will keep our growing conversation context. This log will be created with each agent run: # log will contain conversation context within task log = [ { \"role\" : \"system\" , \"content\" : system_prompt }, { \"role\" : \"user\" , \"content\" : task } ] We are going to run up to 20 reasoning steps for each task (to be safe): for i in range ( 20 ): step = f \"step_ { i + 1 } \" print ( f \"Planning { step } ... \" , end = \"\" ) Each reasoning step begins by sending a request to the OpenAI API and asking the question - what should we do next at this point? completion = client . beta . chat . completions . parse ( model = \"gpt-4o\" , response_format = NextStep , messages = log , max_completion_tokens = 10000 , ) job = completion . choices [ 0 ] . message . parsed Note that this sample relies on the OpenAI API. We specifically use gpt-4o to demonstrate that even a simple and fairly old LLM can be made to run complex reasoning workflows. Let's continue with the code. If the LLM decides to finish, then let's complete the task, print the status and exit the loop. The assistant will switch to the next task: if isinstance ( job . function , ReportTaskCompletion ): print ( f \"[blue]agent { job . function . code } [/blue].\" ) print ( Rule ( \"Summary\" )) for s in job . function . completed_steps_laconic : print ( f \"- { s } \" ) print ( Rule ()) break Otherwise - let's print out the next planned step to the console, along with the chosen tool: print ( job . plan_remaining_steps_brief [ 0 ], f \" \\n { job . function } \" ) And also add the tool request to our conversation log. We will do it as if it were created natively by the OpenAI infrastructure: log . append ({ \"role\" : \"assistant\" , \"content\" : job . plan_remaining_steps_brief [ 0 ], \"tool_calls\" : [{ \"type\" : \"function\" , \"id\" : step , \"function\" : { \"name\" : job . function . tool , \"arguments\" : job . function . model_dump_json (), }}] }) A shorter and less precise equivalent would be: log . append ({ \"role\" : \"assistant\" , \"content\" : job . model_dump_json (), }) We have only 3 lines of code remaining: execute the tool, and add the results back to the conversation log: result = dispatch ( job . function ) txt = result if isinstance ( result , str ) else json . dumps ( result ) #print(\"OUTPUT\", result) # and now we add results back to the conversation history, so that the agent # will be able to act on the results in the next reasoning step. log . append ({ \"role\" : \"tool\" , \"content\" : txt , \"tool_call_id\" : step }) This will be the end of the reasoning step and our codebase. Running tasks Now, let's see how this actually works out on our tasks. They are going to be executed in a sequence, making the system more complex over the course of a run. Tasks 1 and 2: memorize new rules The first two tasks are simply about creating rules, so they look fine: and: Although one thing I don't like - in the first case the agent didn't bother to load existing customer data to double-check if a similar rule already exists. In a real production scenario with test-driven development, this would be added to a test suite to verify that in such cases SGR always starts by loading relevant customer data. We can verify that by capturing a prompt and ensuring that the first tool to be invoked is GetCustomerData . Task 3: Sama wants one of each product The third task was more complex: \"sama@openai.com wants one of each product. Email him the invoice\" Execution looks correct: it pulls customer data; then it issues the invoice with all 3 products and a discount of 5%; then it sends the email: mentioning \"SAMA\" and the 5% discount, attaching the invoice. Task 4: Elon wants 2x of what Sama got The fourth task requires the agent to first look into the account of Sama and figure out what he has ordered, then issue the invoice to Elon with 2x everything. The model has done that. It has also correctly figured out that the email should be sent to another email account, as specified earlier in the rules: Although I don't like that the model decided to give Elon a 5% discount. It should've done nothing, in my opinion. This is something that could be fixed via prompt hardening and test-driven development. Task 5: Void and reissue invoice The fifth task was even more complicated: \"redo last elon@x.com invoice: use 3x discount of sama@openai.com\" The model had to: Find out the discount rate of Sama Find the last incorrect invoice of Elon - for the number and contents Void that last invoice Issue a new invoice with the same contents but a 15% discount Remember to email the new invoice after any changes Remember to email the invoice not to elon@x.com but to finance@x.com Planning steps and the actual summary correspond to these expectations: Get full code Reach out to me if you port the sample to another stack or add a nice visualisation! Original version : Python + openai + pydantic by Rinat Abdullin - gist Port to TypeScript : Bun + openai + zod by Anton Kuzmin - gist Python with nice UI : Python + openai + pydantic by Vitalii Ratyshnyi - gist Hardening the code Obviously, this code is nowhere near production ready or complete. Its purpose is to be as minimal as possible. It aims to illustrate: how to use Schema-Guided Reasoning (SGR) that one doesn't need an advanced framework to implement SGR-driven tool calling; in fact, it can be done with very little code. If we were to make it production-ready, a few more steps would be needed. 1. Start by adding test datasets Create deterministic test scenarios to verify the system behavior for various edge cases, especially around discount calculations, invoice issuance, cancellations, and rule management. Test scenarios could validate correctness using strongly typed fields defined by the SGR schema. 2.
Split the code by responsibilities Currently the code is flattened in a single file for clarity and compactness. In a production case, it will need to be rearchitected to support codebase growth. Replace the large if statement with multi-dispatch or the Command design pattern (see the sketch at the end of this post). Write unit tests for each tool handler. Separate business logic from command dispatching and database manipulation. Write integration tests simulating the full workflow for tasks, verifying state consistency after each step. 3. Make DB real and durable An in-memory DB doesn't survive restarts very well, so this will have to be changed: Move from in-memory DB to a persistent storage solution (e.g., PostgreSQL). Ensure all writes are atomic and transactional to maintain data consistency. 4. Harden error cases Currently the code is optimistic. It expects that things don't go wrong. However, in practice things will be different. The assistant should be able to recover or fail gracefully in such cases. To support that: Ensure that tool handlers report errors explicitly in a structured format (e.g., exceptions or error response schemas). Test how LLMs react to such failures. 5. Operational concerns First of all, we'll need to maintain audit logs for every DB change, API call, and decision made by the agent. This will help in debugging problems and turning failures into test cases. Ideally, a human in the loop would also be included. E.g. we can build a UI or API interface to review and approve agent-generated invoices, emails, and rules before committing them to the system. On the UI side we can also improve things further by providing visibility into agent reasoning (planned steps, decision points) to build trust and enable auditability. Plus, experts could flag bad reasoning flows for debugging right there. Conclusion In this demo, we've seen how Schema-Guided Reasoning (SGR) can power a business assistant - nothing special, just 160 lines of Python and the OpenAI SDK. The beauty of SGR is that even simple and affordable models become surprisingly capable of complex reasoning , planning, and precise tool usage. It's minimal yet powerful. Of course, this example is intentionally simplified. Taking something like this to production would mean adding robust tests, reliable data storage, thorough error handling, and operational elements such as audit trails and human reviews. But the core remains straightforward. By the way, this assistant is capable of Adaptive Planning . Read more about how it works ."}
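For step 2 of the hardening list above, one possible shape of the dispatch refactoring - a handler registry instead of the giant if statement - is sketched below. This is an illustration under the demo's assumptions (it reuses the DB dict and the SendEmail model defined earlier), not the actual production code:

```python
# Sketch only: a handler registry keyed by the `tool` discriminator,
# replacing the giant `if` statement from the demo above.
from typing import Callable, Dict
from pydantic import BaseModel

HANDLERS: Dict[str, Callable[[BaseModel], object]] = {}

def handles(tool_name: str):
    """Register a handler for the given tool literal."""
    def decorator(fn):
        HANDLERS[tool_name] = fn
        return fn
    return decorator

@handles("send_email")
def handle_send_email(cmd: "SendEmail"):
    # same behavior as the original branch: store and return the email
    email = {"to": cmd.recipient_email, "subject": cmd.subject,
             "message": cmd.message}
    DB["emails"].append(email)
    return email

def dispatch(cmd: BaseModel):
    handler = HANDLERS.get(cmd.tool)
    return handler(cmd) if handler else f"Unknown tool: {cmd.tool}"
```

Each remaining branch becomes its own registered handler, which is what makes the per-handler unit tests from the list above straightforward.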
{"url":"/schema-guided-reasoning/examples","title":"SGR Examples","text":"SGR Examples Let's go through a few examples that illustrate application of Schema-Guided Reasoning (SGR) : simple math task text-to-sql document classification advanced reasoning in compliance 1. Simple Math Task Here is an easy example with small reasoning customisation. Let's say we have the following Python classes: class Step ( BaseModel ): explanation : str output : str class MathReasoning ( BaseModel ): steps : list [ Step ] final_answer : str Note, that BaseModel comes from pydantic typing library of Python, frequently used to define response formats. Other languages might leverage different libraries and tools for this purpose. Given this MathReasoning response format and a prompt Solve the equation 8x + 7 = -23 step by step. LLM can respond like this: { \"steps\" : [ { \"explanation\" : \"Start with the equation 8x + 7 = -23.\" , \"output\" : \"8x + 7 = -23\" }, { \"explanation\" : \"Subtract 7 from both sides to isolate the term with the variable.\" , \"output\" : \"8x = -23 - 7\" }, { \"explanation\" : \"Simplify the right side of the equation.\" , \"output\" : \"8x = -30\" }, { \"explanation\" : \"Divide both sides by 8 to solve for x.\" , \"output\" : \"x = -30 / 8\" }, { \"explanation\" : \"Simplify the fraction.\" , \"output\" : \"x = -15 / 4\" } ], \"final_answer\" : \"x = -15 / 4\" } Note, that it thinks in a predefined way before providing an answer. This uses more tokens by investing them to think through the problem aloud, but improves model accuracy. 2. Text-to-SQL example In the real world, we would want to have more precise reasoning schema to boost answer accuracy. For example, when prompting LLM to perform query expansion from human request to a precise SQL query over a predefined schema, adding Schema-Guided Reasoning (SGR) increases the accuracy by 6% out of the box. On the image below this was done by adding strategy field before the sql_query field. It forced LLM to perform analysis according to a custom checklist. In essence, we programmed LLM to reason in a predefined way without writing any executable code. 3. Document classification example Here is an example of a Schema-Guided Reasoning (SGR) from a system for classifying business documents in a RAG: DOCUMENT_TYPES = [ \"invoice\" , \"contract\" , \"receipt\" , \"email\" , ... ] ENTITY_TYPES = [ \"payment\" , \"risk\" , \"regulator\" , \"employee\" , ... ] class DocumentClassification ( BaseModel ): document_type : Literal [ tuple ( DOCUMENT_TYPES )] brief_summary : str key_entities_mentioned : List [ Literal [ tuple ( ENTITY_TYPES )]] keywords : List [ str ] = Field ( ... , description = \"Up to 10 keywords describing this document\" ) In this case, LLM is forced to think through the classification challenge in steps: Identify type of the document and pick it. Literal enforces that. Summarise the document Identify key entities mentioned in the document. List[Literal] ensures that the response will be a list from ENTITY_TYPES Come up with 10 unique keywords. List[str] ensures that the response is a list of strings, while description kindly asks LLM to keep the list at 10 items or less. In this specific example, first two fields are discarded from the response. They are used just to force LLM to approach classification from a predefined angle and think a little about it. Ultimately this improved prompt accuracy in this task. 4. 
Advanced Reasoning in Compliance This is an example of a more advanced workflow that is \"packed\" into a single prompt. While executing this schema, the model will be forced to go through it sequentially. First, we are instructing the model to do preliminary analysis, where most of the analysis is encoded in the Applicability reasoning sub-routine (it is implemented as a reusable nested object). The task is phrased explicitly in the field description and field name. The field name will get more attention from the model, because it will be copied to the output by the model just before it starts answering the question. Afterwards, the model has to reason about concrete gaps in the document. These gaps, represented as a list of strings, will be the mental notes that the model gathers before providing a final answer. Note that the description field is passed to the LLM automatically by OpenAI. Other providers might not include that. The answer itself is a fairly straightforward ENUM of three options. However, the reasoning doesn't stop there . Benchmarking has shown that sometimes this reasoning workflow gets too pessimistic and flags too many gaps. To handle that, we are forcing a verification step after the answer: reasonForNoncompliance - the model has to pick a category; gapSeverity - another list of categories. Information from these two fields is useful in 3 ways: it allows us to prioritise important gaps by assigning scores to each category; it allows us to test classification precision with our test evals; the model gets a chance to review all the information again and mark the gap as valid, but less relevant. And the final step is to list the most important supporting evidence for the concrete identified gap. It happens in the same prompt because we already have all the information loaded in the context, so there is no need for a second prompt. Plus, supporting evidence is usually specified exactly by the unique identifiers of text chapters, clauses or snippets. This means that we could also include this part of the reasoning in the test datasets that ensure the quality of the overall system. It would look like this: Q : how to apply that to test pure chat applications that have free text as response? A : split the prompt into an SGR-controlled part and a final text answer. Put most of the complexity and reasoning in the SGR part, keep text interpretation trivial. Include grounding and proofs into the SGR part as well (see ERC). Cover SGR schema fields with an eval dataset. Use LLM-as-a-judge or human oversight for testing the tail. This way Schema-Guided Reasoning helps to establish faster feedback loops that generate valuable test data. This works because with SGR we get more easily-testable parameters for each reasoning process."}
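The text-to-SQL schema from example 2 above is shown only as an image in the original post. A hypothetical reconstruction of the idea (field names are assumptions) could look like this:

```python
# Hypothetical reconstruction of the text-to-SQL schema from example 2
# (shown as an image in the original post; field names are assumptions).
from pydantic import BaseModel, Field

class SqlAnswer(BaseModel):
    # Placing `strategy` first forces the model to analyse the request
    # against the known tables before it is allowed to emit any SQL.
    strategy: str = Field(
        ..., description="Tables, joins and filters needed, and why")
    sql_query: str = Field(..., description="Final SQL query")
```

The ordering is the whole trick: constrained decoding fills `strategy` before `sql_query`, so the query is generated with the analysis already in context.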
{"url":"/schema-guided-reasoning/patterns","title":"SGR Patterns","text":"SGR Patterns Here is a set of minimal Pydantic schemas that demonstrate foundational building blocks for Schema-Guided Reasoning (SGR) . They illustrate how to encode a specific reasoning pattern that will constrain and guide LLM generation. 1. Cascade Cascade ensures that LLM explicitly follows predefined reasoning steps while solving the problem. Each step - allocating thinking budget to take reasoning one step further For example, in a candidate interview evaluation we can enforce the model to: First summarize and review its knowledge of the candidate. This will make it explicit for the LLM (putting it into the attention) and for human reviewers later. Then rate candidate on the applicability from 1 to 10 Finally make a final decision as a choice between hire , reject or hold This is how the corresponding Pydantic schema would look like: from pydantic import BaseModel from typing import Literal , Annotated from annotated_types import Ge , Le class CandidateEvaluation ( BaseModel ): brief_candidate_summary : str rate_skill_match : Annotated [ int , Ge ( 1 ), Le ( 10 )] final_recommendation : Literal [ \"hire\" , \"reject\" , \"hold\" ] The schema explicitly defines and constrains the order of reasoning: first summarize, then rate, and finally recommend. LLM, driven by the constrained decoding, will reason in this predefined logical sequence. Note, that rate_skil_match is bounded to be within the [1,10] range by Python typing annotations. pydantic will be able to handle that and convert to JSON Schema. conint(ge=1, le=10) can achieve the same, but is going to be deprecated soon. Use Annotated instead It can be plugged into OpenAI-compatible library like this: from openai import OpenAI client = OpenAI () user = \"evaluate Sam Altman for DevOps Role at OpenAI\" completion = client . chat . completions . parse ( model = \"gpt-5-mini\" , response_format = CandidateEvaluation , messages = [ { \"role\" : \"user\" , \"content\" : user }, ], ) and the model will be forced by constrained decoding to structure its response accordingly: CandidateEvaluation ( brief_candidate_summary = ( 'Sam Altman is a high-profile technology executive and entrepreneur ' '(co-founder of Loopt, president of Y Combinator, CEO of OpenAI) with ' 'strong leadership, strategy, product and fundraising experience. ' 'Publicly available information highlights executive management and ' 'company-building skills rather than hands-on systems engineering, SRE, ' 'or platform/DevOps work. He would bring strategic vision and ' 'organizational leadership but not the typical deep, day-to-day ' 'operational expertise expected for an individual contributor DevOps ' 'role.' ), rate_skill_match = 2 , final_recommendation = 'reject' ) Note, that we order parameters to gradually focus and refine the information, until we come up with a concrete conclusion. Start by a generic summary of the candidate, narrow down to the skill rating and end up with a concrete decision. If LLM starts misbehaving in some situations, it would be possible to load back full SGR outlines for these cases and review them. 2. Routing Routing forces LLM to explicitly choose one specific reasoning path out of many. 
For example, in support triage we can force the LLM to explicitly choose a path (\"hardware\" or \"software\"), followed by filling in the specific required details: from pydantic import BaseModel from typing import Literal , Union class HardwareIssue ( BaseModel ): kind : Literal [ \"hardware\" ] component : Literal [ \"battery\" , \"display\" , \"keyboard\" ] class SoftwareIssue ( BaseModel ): kind : Literal [ \"software\" ] software_name : str class UnknownIssue ( BaseModel ): kind : Literal [ \"unknown\" ] category : str summary : str class SupportTriage ( BaseModel ): issue : Union [ HardwareIssue , SoftwareIssue , UnknownIssue ] By passing SupportTriage to response_format , we will force the LLM to make a choice and pick one of the branches. completion = client . chat . completions . parse ( model = \"gpt-5-mini\" , response_format = SupportTriage , messages = [ { \"role\" : \"developer\" , \"content\" : \"triage support\" }, { \"role\" : \"user\" , \"content\" : \"My laptop screen keeps flickering and sometimes turns black.\" } ], ) print ( completion . choices [ 0 ] . message . parsed ) The parsed object will be of type HardwareIssue in this case: SupportTriage ( issue = HardwareIssue ( kind = 'hardware' , component = 'display' ) ) Tools can be represented with branches as well . Consider this schema for a personal business assistant that has access to a few tools: from pydantic import BaseModel , Field from typing import Union , Literal class SendEmailTool ( BaseModel ): tool : Literal [ \"send_email\" ] recipient_email : str subject : str message : str class SearchKnowledgeBaseTool ( BaseModel ): tool : Literal [ \"search_knowledge_base\" ] query : str class CreateSupportTicketTool ( BaseModel ): tool : Literal [ \"create_support_ticket\" ] customer_id : int issue_summary : str priority : Literal [ \"low\" , \"medium\" , \"high\" ] class Response ( BaseModel ): action : Union [ SendEmailTool , SearchKnowledgeBaseTool , CreateSupportTicketTool ] summary : str Here is how we can use this in action: system = \"handle request of Rinat - support agent. Don't make things up\" user = \"Email to jessica@example.com, tell that her refund has been processed\" completion = client . chat . completions . parse ( model = \"gpt-5-mini\" , response_format = Response , messages = [ { \"role\" : \"developer\" , \"content\" : system }, { \"role\" : \"user\" , \"content\" : user } ], ) The response can look like: action = SendEmailTool ( tool = 'send_email' , recipient_email = 'jessica@example.com' , subject = 'Your refund has been processed' , message = ( 'Hi Jessica, \\n\\n Your refund has been processed. If you do not see the ' 'refund on your account or have any questions, please reply to this ' 'email and I will investigate. \\n\\n Best, \\n Rinat \\n Customer Support' ) ) summary = 'Email notifying Jessica that her refund has been processed.'
This is how we can wrap this code with actual tool calling: from typing import Callable , Dict # ----- Mock Tool Implementations ----- def send_email ( recipient_email : str , subject : str , message : str ): print ( f \"Sending email to { recipient_email } with subject ' { subject } '\" ) print ( f \"Body: \\n { message } \\n \" ) def search_knowledge_base ( query : str ): print ( f \"Searching KB for: { query } \" ) def create_support_ticket ( customer_id : int , issue_summary : str , priority : str ): print ( f \"Creating { priority } priority ticket for customer { customer_id } \" ) print ( f \"Issue: { issue_summary } \" ) # Map tool type to handler TOOL_DISPATCH : Dict [ str , Callable ] = { \"send_email\" : send_email , \"search_knowledge_base\" : search_knowledge_base , \"create_support_ticket\" : create_support_ticket } # ----- LLM Wrapper ----- def handle_request ( system_prompt : str , user_prompt : str ): completion = client . chat . completions . parse ( model = \"gpt-5-mini\" , response_format = Response , messages = [ { \"role\" : \"developer\" , \"content\" : system_prompt }, { \"role\" : \"user\" , \"content\" : user_prompt } ], ) response = completion . choices [ 0 ] . message . parsed print ( f \"Summary: { response . summary } \" ) tool_type = response . action . tool if tool_type in TOOL_DISPATCH : # pass the tool arguments to the handler, minus the discriminator tag TOOL_DISPATCH [ tool_type ]( ** response . action . model_dump ( exclude = { \"tool\" })) else : print ( f \"Unknown tool: { tool_type } \" ) 3. Cycle Cycle explicitly forces the LLM to repeat reasoning steps. Here we are forcing the LLM to come up with multiple risk factors. At least two, but no more than four: from pydantic import BaseModel from typing import Annotated , List , Literal from annotated_types import MinLen , MaxLen class RiskFactor ( BaseModel ): explanation : str severity : Literal [ \"low\" , \"medium\" , \"high\" ] class RiskAssessment ( BaseModel ): factors : Annotated [ List [ RiskFactor ], MinLen ( 2 ), MaxLen ( 4 )] And the execution: user = \"The server room has poor ventilation and outdated surge protectors.\" completion = client . chat . completions . parse ( model = \"gpt-5-mini\" , response_format = RiskAssessment , messages = [ { \"role\" : \"developer\" , \"content\" : \"be brief\" }, { \"role\" : \"user\" , \"content\" : user } ], ) The response: factors = [ RiskFactor ( explanation = ( \"Poor ventilation leading to elevated temperatures, increased \" \"risk of thermal shutdown, shortened hardware lifespan, and \" \"potential downtime.\" ), severity = \"high\" ), RiskFactor ( explanation = ( \"Outdated surge protectors that may not adequately guard against \" \"voltage spikes or electrical faults, raising risk of hardware \" \"damage and data loss; replace with modern surge/UPS protection.\" ), severity = \"high\" ) ] By the way, we can use Cycle to extend the schema from the tool calling example to enable parallel tool execution like this: class Response ( BaseModel ): action : List [ Union [ SendEmailTool , SearchKnowledgeBaseTool , CreateSupportTicketTool ]] summary : str Now the response will contain a list of different tool calls that we can dispatch in parallel before passing the results back to the LLM for further processing."}
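To close the loop on parallel execution, here is one possible way to fan out such a list of tool calls, reusing the TOOL_DISPATCH table from the wrapper above. dispatch_parallel is a hypothetical helper, not part of any library:

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_parallel(response) -> list:
    """Run every tool call from a parsed Response (the List[Union[...]]
    variant above) concurrently and collect results in order."""
    with ThreadPoolExecutor() as pool:
        futures = [
            # pass the tool arguments, minus the discriminator tag
            pool.submit(TOOL_DISPATCH[call.tool], **call.model_dump(exclude={"tool"}))
            for call in response.action
            if call.tool in TOOL_DISPATCH
        ]
        return [f.result() for f in futures]
```

The collected results can then be appended to the message history before asking the model for the next step.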
{"url":"/sku-vault/","title":"📦 SkuVault","text":"📦 SkuVault At the end of 2014 AgileHarbor invited me to assist in the development of SkuVault : a rapidly growing web-based warehouse management system for eCommerce . I'll be assisting Slav Ivanyuk and his developers on a part-time basis. The role would be one of serving software engineer : suggesting solutions for the problems; sharing mistakes and good decisions from the other projects; offering software design ideas to simplify and scale existing system. This is the story of this journey."}
{"url":"/sku-vault/2015-q1/","title":"Retrospective of 2015 Q1","text":"Retrospective of 2015 Q1 Introduction First quarter of 2015 is behind. We managed to achieve some impressive results at SkuVault. This will allow us to offer new features and a better product to customers in the second quarter of 2015. Catalyst 2015 - ChannelAdvisor’s annual industry conference, one of the most important events for e-commerce. Last week Catalyst 2015 took place, where SkuVault was a bronze sponsor and had an exhibition booth. This conference was a big milestone for us: we had an opportunity to demo a completely new feature there - Interactive WavePicking . Wave picking in warehouse management systems is a process of grouping incoming orders into small batches (picklists), which are assigned to workers for assembly. This way a worker can assemble multiple orders in parallel, increasing the overall throughput. Interactive WavePicking in SkuVault provides ability to create these picklists from pending sales, using various criteria. When this picklist is assigned to a worker, the worker will get guidance from SkuVault, telling where to go, which item to pick and which compartment in the cart to put it into. This guidance is location-aware, thus it minimizes walking around the warehouse and increases the overall throughput of the assembly process. It was hard to implement this feature in SkuVault v1 design (based on CQRS/ES architecture), so a different approach to design and development was required. We aimed at two goals simultaneously: come up with a new software design, tailored for SkuVault project to use in further development and constructed to serve customers better in the upcoming years; implement a new feature using that design. Implementation of the feature was an indicator of success or failure of the new design . If development of the feature works out, then the new design is good enough to move forward with. If it doesn’t, then we fail early and save the effort of a major rewrite and a cost of a big failure. I think, we did well. SkuVault team demonstrated WavePicking at Catalyst 2015 , and soon this feature will be available to all customers. Benefits of the new design will come forth in the next quarters of 2015. Major thanks go to: Slav Ivanyuk - for managing the process and doing all super-helpful code reviews Chris Witt - for picking up ReactJS/Flux and doing all the work in UIv2 Jason Henson - for helping with the demo setup, testing and deployment Feature design was driven by SkuVault experts, who had to do a very complex job of identifying the feature essence and figuring out which parts would need to make it into MVP: Andy Eastes , Slav Ivanyuk and Danny Shaw . In order to prepare path for Interactive WavePicking we had to: Design and implement a new infrastructure for the next UI version and backend with APIs. Provide a way to integrate v2 with v1 in various environments: production, where we have more than 20GBs of events coming at high throughput, QA with auto-deployable backend and demo with in-memory backend hosted inside IIS process. Figure out an approach for gradual migration of all existing features from v1 to v2. Figure out a way to improve testability, scalability in new design, while making it easier to develop and learn. So here is what we achieved so far. New Infrastructure Caveat: this retrospective covers only a fraction of SkuVault's life that I was involved in. If we forget about integration with v1 code, the new infrastructure is quite simple. 
Backend is a collection of .NET modules with an API (JSON over HTTP, provided by NancyFx). Each module can subscribe to events and publish events (async, batched). Modules behaviour can be captured in use cases, which are then applied to verify the correctness of the implementation. This .NET implementation is very similar to reference implementation in golang , which is an open source reference implementation based on the design insights we came up at Happy Pancake. I did almost a 1:1 conversion of golang code to C#. Middleware to pass events between modules and also from v1 to v2 is MessageVault ( available on github ). It is a simple Kafka-inspired event bus, which uses Windows Azure to maintain a highly-available transaction log. Reads are served by Windows Azure storage (via a .NET client library), while writes are handled by a cluster of worker roles (master election is done via Azure blob locks). Front-end in v2 is an absolute pleasure to work with. Previous version is a single project based on ASP.NET MVC with AngularJS and Lokad.CQRS client libraries. New version is a collection of single-page web applications (one web app per major feature), which are statically compiled into JS and CSS bundles. These apps are stateless (web server simply serves the content) and get all data directly from backend API. Stateless web applications are very easy to scale, especially if they are represented by statically compiled content. Besides, any web server can serve them, making development and deployment process very flexible. Just at SkuVault, we can serve UIv2 via: webpack (dev environment), Azure Web Sites (QA and PROD environments), IIS (demo environment). Internally we chose to use ReactJS with Fluxible, since that was the stack we had arrived at during HappyPancake. This software design is backed up by the work done at Facebook and Instagram (it is always good to stand on the shoulders of giants). Chris Witt jumped right into the development process and tackled the UI side of feature development since then. He did a very good job. For example, the UI below allows warehouse managers to create picklists from pending sales. They can apply a dozen of filters to these sales, pick individual sales or batches, reorder sales in the picklist. With ReactJS, we decomposed a relatively complex UI into domain-specific reusable components; FLUX architecture pattern provided a consistent way to capture event-driven UI and client-server interactions in the code. It was easy for multiple developers to work in the resulting codebase, both to continue each other’s work and to develop UI elements in parallel. Software Design Feature decoupling is the most important aspect in our new design. Even though wave picking is quite complex to implement (that’s why it is rarely handled properly in warehouse management), in SkuVault this complexity is isolated from the rest of the system. WavePicking backend API is a separate event-driven module, which can be tested, deployed and scaled independently from the rest of the backend (it can also run in-process with all the rest, for the demo, development and on-premises deployments). Similarly, Wave-picking UI is a separate web application composed from reusable ReactJS components, Flux actions and stores. This application benefits from shared elements (e.g. styles, login/logoff, UI components, build process), but it can be developed and deployed separately. 
This gives us the benefit of controlled system evolution : we can take existing features and transition them to our new design one by one, minimising the risks and avoiding a total rewrite. In parallel, we can add completely new features. Splitting the system into modules and UI features with well-defined boundaries also simplifies team management and resource allocation . It is easier to manage development of several isolated features than to coordinate development within a single tightly coupled product. It is simpler to scale the development process, too. Explicit separation between the UI and backend with an API was an important design decision with a long-lasting impact on the existing product. First, it allows us to divide (and conquer) development into two distinct contexts with very different specifics and challenges: backend and API development focuses on the core domain, scalability and making it very easy for UI folks to build various front-end features. That is pure .NET with low-level optimisations for performance and scalability. UI is going to be THE primary consumer of the API, but not the only one: mobile clients and partners will be using it as well. It cares more about the User Experience, feedback, rapid development iterations and pure HTML/JS/CSS development (for the Web). Another interesting side-effect of the design is that we get cheap UI deployments and a foundation for A/B testing . One can simply copy feature UI files to a new directory on a web server, getting a deployment of a different version. If you point different users to different versions of a feature, you get the ability to do gradual roll-outs, per-user customizations and the grounds for A/B testing. On the backend side, the development process is enhanced by event-driven use-cases introduced to specify and verify API behavior scenarios. These use-cases improve upon existing specifications at SkuVault, making them less fragile and focused more on the public contract rather than internal implementation. Unlike traditional unit tests, API verification with use-cases can pinpoint the problem in case of failure. As a result, this process saves development time and lends itself to a “Getting Things Done” mentality. Use-cases additionally grant us other benefits: API documentation can be generated automatically. It will always stay up-to-date and its quality will be better than what libraries like ServiceStack and NancyFx can provide out-of-the-box. Use-cases align very well with the development process, making it easier to manage. They work especially well with Domain-Driven Design, which SkuVault already employs. Sensible stress-tests can be auto-generated out of the use-cases. They allow running the system through all the scenarios from the specifications, but repeated 1000 times or more. This capability does not replace custom stress-testing scripts, but it comes for free. The build server can run stress tests on each commit, watching for performance regressions and correlating them with changes in the code. Development Process I think we managed to reduce development friction in v2. UI features in v2 are incrementally recompiled on-the-fly whenever a file changes (thanks to webpack). We also leverage the webpack dev server to handle hot reload (compilation happens in memory and changes are pushed to the browser). This speeds up web development dramatically. UIv2 features no longer require ASP.NET MVC, so development is not bound to Visual Studio (or to Microsoft Windows itself).
Developers are free to choose an environment that fits their needs. It seems that Atom and Sublime are the most frequent choices of people doing web development with ReactJS. Tools like webpack, ESLint and ES transpilers work from the command line and are supported by all modern IDEs. We also observed that ReactJS simplifies UI development , especially for user interfaces with complex interactions, when compared to the MV* designs (MVC, MVVM, MVP). That increases productivity and lowers development risks. There obviously is a learning barrier when switching from ASP.NET MVC + AngularJS + jQuery to ReactJS, however the effort pays for itself soon. To make UI development more productive, we use the LESS preprocessor for styles (also managed by webpack), lodash for functional helpers in JS and superagent for AJAX calls. User interactions are all captured in vanilla JavaScript at the moment (later we might enable some ES6 features, which can be transpiled down to ES5 by webpack). JavaScript can be tricky, so to make development more reliable we run a linter (ESLint) with a rather strict set of rules. It forces all code to be written consistently, avoiding code smells and bad practices. These rules are enforced by the build server. Real-time statistics and logs are still evolving at SkuVault. Although there are a few glitches (e.g. Hekad integration is less than perfect on Windows), we are consistently improving the experience. At the moment we have more than 80GB of searchable logs handled by ElasticSearch and visualized by Kibana. Carbon and Graphite take care of capturing and reporting dozens of stats from various cloud services running on Azure (starting from RAM/CPU consumption and down to a specific web request latency). Statistics and logs aren’t a mission-critical piece of infrastructure, however they are extremely helpful in understanding software behaviour under production loads. SkuVault needs to scale a lot to serve new customers better, and this toolkit provides real-time insight for that (aside from helping us to debug any potential issues). Instrumenting the existing code to write to the distributed log or to report a new stat metric is easy. API v2 comes with these capabilities from the start, since we are planning to go for aggressive scalability targets with it. Learning process and complexity had to be factored into the new design as well. The simpler it is for developers to understand the design and become productive in it, the simpler it will be for the company to find new talent and to grow. Here is the list of technologies which v2 aims to discard (for good): ASP.NET MVC and all web development in .NET Angular.JS with jQuery ServiceStack API Lokad.CQRS Instead we introduce: ReactJS/Flux NancyFx So far knowledge transfer for UIv2 development has been rather smooth within the company (thanks to the talented developers of SkuVault). Long-term Impact It is always nice to consider possible long-term benefits that could come either cheap or for free. With the new design we potentially get: Well-used and tested API, which could support various clients. Ability to reuse the experience and product knowledge of our web developers to build native clients for the modern mobile platforms: iOS and Android (thanks to React Native). Path for scaling out the system (API is scaled by modules and then partitioned by tenants; UI is stateless and can scale infinitely). Support for on-premises deployments of SkuVault, along with geo-affinity around the world.
Ability to deploy system to different clouds to provide higher availability guarantees to our customers. What's Next? For SkuVault, scaling and stability is the primary focus for Q2 2015 . We are going to take existing features one by one and migrate them to the new design, while improving test coverage and performance. New event-driven design gains solid APIs in UI (ReactJS) and on the backend (event-driven design), which reduces coupling and fragility, leading to a simpler and smaller codebase. In order to achieve these goals, continuous integration and build process will also have to be enhanced. We are interested in fast builds and development feedback. Time that a developer spends waiting for his computer to run MSBuild or Resharper to finish code indexing is wasted time. Visual Studio and MSBuild are notorious for being extremely slow on large solutions. Fast builds in v2 are already a part of the design: UIv2 doesn't have .NET in the pipeline at all (plus webpack provides continuous builds), while backend gets faster builds due to the smaller solution footprint. Build process itself could get feedback from: Static analysis: ESLint and Flow in JavaScript. Unit-Tests: testing a single component. Integration tests: use-case verification, derived sanity checks and tests for interactions between the components. Automatic UI/UX verification via scripted user interactions (e.g. WebDriver) Performance tests: scenarios derived from use-cases and custom ones. Codebase size and complexity tracking. Ideally, developers would get this feedback within 5-10 minutes after a commit - while they can fix issues most efficiently. As more features migrate from v1 to v2, we will pull more developers into the new environment. The knowledge transfer facilitation has already started, it will have to be managed more explicitly in Q2. All in all, 2015 Q2 is going to be a very interesting time at SkuVault :) Many thanks to Slav, Andy and Ksenia for reviews."}
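To make the use-case idea above concrete, here is a minimal sketch of what such an event-driven use case and its auto-generated stress test could look like. The actual SkuVault implementation is in C#; this is illustrative Python with assumed names (UseCase, api.replay, api.handle):

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    given_events: list                  # events setting up preconditions
    when_request: dict                  # API request under test
    then_response: dict                 # expected API response
    then_events: list = field(default_factory=list)  # expected published events

def verify(api, case: UseCase) -> None:
    # Replay preconditions, fire the request, check the public contract only.
    api.replay(case.given_events)
    response, published = api.handle(case.when_request)
    assert response == case.then_response
    assert published == case.then_events

def stress(api_factory, case: UseCase, repeats: int = 1000) -> None:
    # The "free" stress test: the same scenario repeated many times against
    # fresh instances, watched for performance regressions on the build server.
    for _ in range(repeats):
        verify(api_factory(), case)
```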
{"url":"/sku-vault/2017-01-22-dsl-impression/","title":"Emerging DSL","text":"Emerging DSL This is a quick update on the progress of our migration from Cassandra to LMDB (initial plan outlined in previous blog post ). Fast Tests We started by rewriting one of secondary systems from event-driven Cassandra to event-driven LMDB design. We kept all the tests and API contracts, everything else was rewritten by a team member in a week or two. This system is read-heavy with a short burst of writes on event replay, when it builds local read models. At this point it needs to replay of 100M events taking 50GiB. Cassandra implementation took 10 hours, LMDB does it in 2-3 without any serialization optimizations. Here are immediate observations: Cassandra implementation used to run our tests (they verify behavior of the entire system at the API level) at the speed of 1 test per second . LMDB tests currently run at the speed of 950 per second on the same laptop. Since all ~700 tests now run in a second, we enabled test auto-generation by default . This added 2200 tests, and improved edge case coverage at the cost of 3 more seconds. Cassandra and LMDB data access patterns tend to be similar (despite the obvious differences between the databases). Structure of the keys and values stays roughly the same. This made migration straightforward most of the time. Screenshots are from my own development laptop, which runs Windows in a separate VM and gives it only two hardware threads out of 4 available. Benchmark laptop has 8 of them and runs Windows directly. We achieved high test throughput with two trivial things: switching LMDB to async flushing and creating a database per thread. High availability LMDB is an embedded transactional database (unlike Cassandra which is a replicated eventually consistent data storage). Transactions make life better, but in exchange we need to deal with high availability and fail-over . This will come via: Partitioning nodes by tenants. Running multiple replicas of a partition behind a load balancer (we could pick which partitions are replicated and how). Nginx can easily handle routing and failover in this scenario. Having secondary hot replicas in a different data center. If the primary datacenter goes down, we could switch to the secondary by updating DNS records (good enough for now). This is the hard part. Fortunately, we can make some business-specific trade-offs and simplify the overall design considerably: write-side could survive 1-5 seconds of downtime (especially, if API clients can queue up work) as long as it is consistent; read-side could lag behind up to 5 seconds as long it is always available, transactionally consistent with version numbers and fast. Emerging design While porting system to LMDB we agreed to start by writing everything by hand. FoundationDB client libraries made our life simpler, but still required some manual byte wrangling. public void When ( Tx tx , StatusRenamed e ) { var oldKey = FdbTuple . Create ( ModuleTables . StatusByName , e . RefInfo . TenantId . Id , GetCorrectKey ( e . OriginalName . Name )). ToSlice (). GetBytes (); try { tx . Delete ( oldKey ); } catch ( Exception ) {} var newKey = FdbTuple . Create ( ModuleTables . StatusByName , e . RefInfo . TenantId . Id , GetCorrectKey ( e . NewName . Name )). ToSlice (). GetBytes (); tx . Put ( newKey , BitConverter . GetBytes ( e . Id . Id )); } After a week of work we reviewed the system and identified emerging patterns. 
Instead of writing a data access library to make things simpler (that would've involved some reflection and complex performance optimizations), we just plugged in a simple code generator which replaced 30-40% of the code with generated stuff. We wanted this new code to be boringly simple, and things turned out well. Given these DSL definitions from a single schema file: ( space \"StatusByName\" [ tenant/id status/name ] [ status/id ] ( put \"AddStatusName\" ) ( del \"DeleteStatusName\" ) ( fetch \"GetStatusIdByName\" )) Developers would have boring LMDB access methods generated for them. public void When ( Tx tx , StatusRenamed e ) { var tenantId = e . RefInfo . TenantId . Id ; var statusId = e . Id . Id ; Lmdb . DeleteStatusName ( tx , tenantId , GetCorrectKey ( e . OriginalName . Name )); var dto = Lmdb . NewStatusByNameDto (). SetStatusId ( statusId ); Lmdb . AddStatusName ( tx , tenantId , GetCorrectKey ( e . NewName . Name ), dto ); } These generated methods simply spell out method parameters and pass them in the proper order to byte-managing routines. public static void AddStatusName ( Tx tx , long tenantId , string statusName , StatusByNameDto dto ) { var key = DslLib . CreateKey ( Tables . StatusByName , tenantId , statusName ); DslLib . Put ( tx , key , dto ); } DTOs are also generated automatically, in such a fashion that we could later swap out the Protobuf serialization format for values for something more performant without requiring developers to rewrite business logic. Currently we are evaluating FlatBuffers / Cap'n Proto for storage of complex objects that rarely change. The native multi-value capability of LMDB would work well for frequently changing property groups. Custom bitstream compression is the ultimate goal. Edge cases While doing that we agreed to codegen only the cases which were repeating all over the place. Edge cases will have to be written by hand until we get enough of them to see emerging patterns. The resulting code feels trivial and fast compared to the other versions of the same logic. Like a breath of fresh air on a clear and chilly autumn morning in the Hudson valley. This is the goal we are striving for: boring code designed to run fast and scale well for new features. This should ultimately let us create more value for the customers, while making our own days more enjoyable and less stressful. Linguistic complexity Technically, the language of the system is more complex - we have two languages instead of one (with a codegen step). Practically, it still feels simpler. Probable reasons for that are: our handwritten code uses only a small (good) subset of C#/.NET features. It is just a bunch of static methods that transform data between different representations, while being conscious about performance. The DSL doesn't hold much magic. It is just a bunch of Clojure files doing obvious transformations according to simple rules. Anybody could rewrite it in another language (e.g. Racket or Common Lisp) in a couple of days. Side benefits The DSL design forces us to write out type aliases before referencing them in the LMDB layer. They spell out types of data (properties) within a linguistic (bounded) context, while aliasing them with short namespaced identifiers and descriptions.
( group \"tenant\" \"tenant-related schemas\" ( alias id \"Numeric tenant id\" positive-long ) ( alias name \"Tenant name\" string not-empty )) ( group \"user\" \"User within a tenant\" ( alias id \"Unique user id\" positive-long ) ( alias name \"User name\" string not-empty ) ( alias email \"User email/login\" string not-empty ) ( alias role \"User role, affects permissions\" string not-empty ) ( alias password-hash \"Password hash\" string not-empty ) ( alias enabled \"Set to false to disable access\" bool )) ( group \"product\" \"product details\" ( alias capped-key \"Capped sku/code that fits 200 chars\" string ( max-len 200 )) ( alias is-kit \"Is this a kit or not\" bool ) ( alias id \"Product id\" positive-long ) ( alias key-type \"Product key type constant\" byte )) These aliases can also be associated with data specs, which spell out how should the data look like. For example, if we know that product/capped-key will always comply with (max-len 200) , then we don't even need to define types or write argument checks. These aliases will help in defining consistent APIs with auto-generated documentation. We can also perform consistent argument validation not only in API implementations, but also in client libs in multiple languages. Generating Swagger definitions (to get beautiful API explorer) and JavaScipt client libraries are the next planned steps in this direction. Developer onboarding and knowledge transfer It takes a couple of days to get a first commit from a developer new to LMDB and this DSL (provided that API design and scenarios are already done). It would probably take him 3-4 weeks to get used to writing finely-tuned LMDB code in places, where this matters. We're planning to add continuous performance testing to shorten the feedback loop and improve productivity in this area. Rewriting existing DSL from scratch or porting it to a new language is probably a week of effort for a senior developer familiar with Lisps and AST manipulation. There are just 450 lines of code there (mostly with C# code fragments). Destination - golang In the long run we plan to migrate the core business logic to golang. The language itself is a good fit for the job: simple, fast and runs natively on Linux. We don't care much about the advanced features from C#, since they are handled by the DSL layer anyway. DSL design pursues the goal of streamlining migration from C# to golang: generated code doesn't have to be rewritten by hand, only the edge cases."}
{"url":"/sku-vault/2017-07-15-high-availability-and-performance/","title":"High Availability and Performance","text":"High Availability and Performance This is a summary of interesting things we've achieved at SkuVault since the previous blog post on LMDB and emerging DSL . Current stats are: 1.43B events with the total size of 316.3GB . ~450 event types. System Architecture System architecture didn't change much since the last update. However, we improved our ability to reason and communicate about it. Let's go quickly through. The core building block of the system is something we call a service . It has a public API which can handle read and write requests according to the public contract of the service. We do an event-driven variation of service-oriented architecture . Whenever a service needs to persist a state change, instead of overwriting DB state with new values, we craft change events and append them to the distributed commit log . Apache Kafka could work as such distributed commit log. You can also try rolling your own commit log on top of a distributed transactional key-value store (to act as a buffer and a serialization point) and a replicated blob storage. Any interested service could subscribe to this commit log, replay interesting events and get its own version of the state. This state will stay in sync with the global commit log via the subscription mechanism. This means that you can roll out new service versions, features or even specialized services (e.g. analytics and reporting that have different storage requirements) side-by-side with the existing services. Further in this blog post you will discover the term module - it is a monolithic unit of deployment for a group of services that share similar functional, performance and availability characteristics. In other words, it is convenient to deploy, manage and version them together. You could scale by placing a load balancer in front of a cluster to route different requests to the service instances that are designed to handle it. The same load balancer would allow you to fail over to replica nodes in case of outage or a simple rolling upgrade in the cluster. This approach also allows multiple teams work together on a large system while building software on different platforms. In this environment, the service behavior can be expressed via its interactions with the outside world : events it subscribes to; API requests it handles; API responses it returns events it publishes to the commit log. If we somehow were able to capture all important combinations of these interactions of the service, then we'd probably specify its behavior well enough to prevent future regressions. This is what event-driven specifications are for. These specifications come with a few potential benefits: separate design phase from coding phase (which reduces cognitive burden); help you to decompose complex requirements into simple event-driven tests; break rarely (see aside); let you refactor services by moving around requirements, behaviors and functionality. In my experience these specifications almost never break. They just keep on being added to the codebase when new requirements and edge cases show on the horizon. This comes with a potential downside: code that never breaks - never gets refactored and cleaned up. So you might end up with specifications written in different flavours. Imagine, you have a service with some complex functionality inside. This functionality has evolved over time. 
The service accumulated specifications for the behavior, weird edge cases and strange regressions that slipped through. Over time, it became apparent that this service implementation houses two distinct feature sets. They are currently intertwined but would live better if separated. What you can do in this situation: split specifications for these services into two different groups; discard the implementation completely (or keep it as a reference); iterate on new versions of the services until all tests pass. What if there are important behaviors that weren't covered by the scenarios? This is very good - thanks to this refactoring exercise you've just discovered gaps in your test suite. Throwing out the implementation logic and rewriting services from scratch might be scary. In practice it is a bit simpler and safer than it looks: these event-driven specifications capture all important design decisions and service interactions (including edge cases and error responses); developers can focus on one specification at a time, iterating till all specs pass. These specifications are the reason why the SkuVault V2 backend was able to evolve through multiple storage backends till it arrived at a place we are quite happy with: LMDB. Cassandra to LMDB At this moment we have completely migrated away from Cassandra to embedded LMDB as a data store. Outcomes: better (and more predictable) performance, reduced DevOps load and a simpler scalability model. This also means that: We have to write our own application-specific data manipulation layer , which would've been difficult without the Lisp DSL and code generation. Our reads are served from memory (best case) or local SSD (worst case) . This simplifies scenarios like route generation for the wave picking algorithm or running complex search filters - you have to worry less about performance. This also means that our operations are currently mostly IO-bound, while CPU is under-utilized. Were we to switch the API from JSON to a binary format, while also compressing event chunks, then CPU would become the next bottleneck. For now, event replay speeds on smallish nodes look like this: Replay is the process of downloading relevant event chunks from the commit log, getting interesting events and dispatching them to local handlers in order to rebuild node state. We tend to do that on every new deployment. This is the last step in our journey from one storage engine to another. It looks like we'll stay here for some time. Idempotency didn't work out Exactly-once delivery of messages or events in a distributed system is hard . Initially we tried to work around this problem by enforcing idempotency in the code via auto-generated scenarios. You take an ordinary Given-When-Then scenario and mutate it by duplicating one event at a time. If the scenario produces the same result, then the affected code paths are idempotent. However, as it turned out, it can be hard or plain impossible to implement idempotent event handlers in some cases. Especially when you have to deal with events coming from legacy code. So we are planning to push this deduplication concern into the subscription infrastructure . It could handle deduplication automatically by checking known message IDs before passing messages to handlers. New IDs could then be pushed into LMDB within the same transaction in which handlers are executed. This would also solve an edge case when a node crashes after pushing an event to the commit log but before writing down the acknowledgement from it.
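A minimal sketch of that idea, with illustrative names (begin_tx stands for opening an LMDB write transaction; seen and mark_seen are assumed helpers over a dedicated key range): the subscription loop records message IDs in the same transaction that applies the handlers, so "applied" and "acknowledged" can never diverge.

```python
def consume_batch(begin_tx, handlers, messages) -> None:
    # Deduplicate inside the subscription infrastructure: skip messages
    # whose IDs were already recorded, and record new IDs in the SAME
    # transaction that applies the handlers - both commit or both vanish.
    with begin_tx() as tx:
        for msg in messages:
            if tx.seen(msg.id):
                continue
            for handler in handlers:
                handler(tx, msg)
            tx.mark_seen(msg.id)
        tx.commit()
```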
Edge case looks like this. Imagine that node-1 produces an event-A and pushes it to the distributed commit log. However, before ACK comes back, the node crashes. System will fail over to the replica node-2 and continue working there. Later, when node-1 comes back online, it will receive the same event-A from the commit log, ignoring it - the event is stamped with node-1 identity (so it is supposed to already be in the local state). Cluster Simulation At the moment of writing, V2 part of SkuVault backend has 3 different types of modules: public (serving web UI), internal (serving legacy systems) and export (handles export jobs). Each module has a couple of replicas. Public module also has 4 partitions (each with a replica). These modules collaborate with each other via our cluster infrastructure: distributed key-value store acting as event buffer, tenant version dictionary and a queue; replicated commit log; forwarder process - single writer responsible for pushing events from the buffer into the replicated commit log. Options for such a distributed key-value store are: Google Spanner , VoltDB , CockroachDB and MarkLogic . Existing cluster infrastructure works fine for development, qa and production purposes. It requires some setup, though. We needed another way to run the entire cluster locally with zero deployment . That's when we finally started tapping into the simulation mode: replaced the distributed key-value store with local LMDB implementation, while keeping all cluster-enabling logic intact; replaced the azure-locked version of the commit log with the file-based simulation; launched all modules in one process in this simulated environment that works very similarly to the production setup (there are some caveats, of course). While it is possible to reason about behavior of the distributed system in some edge cases, seeing it happen under the debugger with your own eyes is much better. It helps to reproduce and fix infrastructure-related bugs, while building an intuitive understanding of the system behavior. A first edge case that was reproduced locally - behavior of the system, when cluster fails from a node-A to a node-B that isn't fully up-to-date yet (e.g. it could be in another datacenter). And this happens while a user edits a version-sensitive entity. In that case UI negotiated with the new node to postpone the edit until the node received all the relevant events. Besides, having a simulated local cluster works well for demo deployments , where you regularly wipe everything and restart from a preset state. It is also a prerequisite for on-premises deployments that can sync up with the remote data-center or fail over to it. Note, that this isn't yet a deterministic simulation that can be used to test distributed systems. It is merely the first step in this direction. Missing bits include: bringing all event handling loops to run on a single thread (for determinism and ability to rewind time forward); simulating transaction conflicts in the cluster infrastructure (current LMDB implementation never conflicts - it takes write lock); simulating load balancing, its error handling and fail overs; injecting faults and adding code buggification in the simulation layer and above; running this setup for a couple of months. If you are interested in this kind of setup, check out things like: Discrete event simulation , The Verification of a Distributed System and especially Testing Distributed Systems w/ Deterministic Simulation . Why Does Deterministic Simulation Matter? 
Let's imagine we are building some distributed system that deals with business workflows across multiple machines. It might deal with financial transfers or resource allocation. Let's also assert that bugs in this system are very expensive . In theory, implementing that distributed logic shouldn't be a challenging task . We either have a cluster with distributed transactions or implement a saga (I'm oversimplifying the situation, of course). We can even prepare for the problems by planning compensating actions at every stage (and covering this logic with unit and integration tests). To be even more rigorous, we can have multiple teams implement the same system on different platforms. These implementations would then double-check each other. In practice, how do you know that this implementation would actually work out in the real world? How would it deal with misconfigured routers, failing hardware or CPU bugs? What if network packets arrive out of order, double-bit flips happen or a disk controller corrupts data? Remember, we've said that bugs in this distributed system are very expensive. Studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month . This is where deterministic simulation comes into play. If you design your system to run in a virtual environment that is deterministic, simulates all external communications and runs all concurrent processes on a single thread (pseudo-concurrency), then you can get a fair chance of: fuzz-testing random problems and the reactions of your system to them; injecting additional faults and timeouts for extra robustness; being able to reproduce any discovered failure, debug and fix it. In essence, this approach is about upfront design of the distributed system for testability and determinism. It allows us to throw more machines at the problem of exploring the problem space for costly edge cases , giving us a chance to build a system that is better prepared for the random failures the universe might throw at us in real life. Zero-downtime Deployments Not much to say here. You throw a couple of load-balancers in front of your nodes, telling them how to route traffic and fail over. With that, a deployment turns into a process: spin up new node instances and let them warm up local state (if needed); switch the LB forward to the new version for a portion of the users (ssh, swap config and do nginx reload); if all goes well - switch more users forward; if things go bad - switch users back. The switch takes only a few seconds, since we don't have long-running requests that need to be drained. We are currently using nginx as the load balancer. However, I see lyft/envoy as a promising replacement with a better high-availability story. Single-Writer Caveat The system is designed to have a single writer per process . This design decision lets us benefit from LMDB and its ability to have unlimited isolated reads in parallel. Read throughput is usually limited by the network IO in our case - services process incoming requests faster than they can arrive at the network card. There are some exceptions where CPU, memory and disk IO are the bottleneck; for example: routing and complex filtering which require deep graph-like traversal of persisted state. The caveat is: write requests have to be optimized for low latency (taking more than 15-50ms per write is too much).
At the same time, we have a lot to do within a write: check local state to ensure that the write request is valid; generate a new event (or a couple of them); apply this event to the local state; commit events, jobs and any other messages to the cluster; generate a response to the user (e.g. include newly generated IDs or totals). As it turns out, it isn't very hard to profile performance and improve it. For that we hijacked Google Chrome tracing tools. We simply collect method call timings in memory (it helps that most of the code is auto-generated) and render them in the Trace Event format, which can be opened with chrome://tracing . We've also added passive sampling to the servers - they will try to capture and store traces of potentially interesting requests. Note : for some important requests we also check with the cluster to ensure that the node has the latest version of the relevant event stream. This adds a network round-trip to the commit cost. If we find the need to increase write throughput of the system, there are a few options: split the system into more partitions (since the single-writer limitation is per partition); a single machine could handle multiple partitions; move heavy reads to read-only replicas; trade off some latency for throughput by batching multiple writes together; migrate from Windows to Linux/Unix with optimized network and IO; switch from virtual machines to dedicated hardware; move from .NET C# to golang (or even lower). The only memorable lesson from these optimizations was: don't use services provided by some cloud vendors if you value consistently low latency. These services tend to value overall throughput for all customers over consistently low latency for one customer (which makes sense). Next Problems Below is the list of the problems that lie ahead for the project. You might find these problems a little bit less exciting than simulating a cluster or running hundreds of full-stack tests per second. This is probably because the technical design is established for this project and the important challenges are mapped. Further scaling would mostly take place in the realm of people, teams and organization. Development Processes and Training Initial development processes were quite flexible. This was possible thanks to the small size of the team: we just went to the office kitchen and drew on the whiteboard till consensus emerged. As new developers come to the project (including developers from remote locations), it has to switch to a more formal development model. The training process, methodology and materials also have to be established. Client libraries and DSL Code generation of client libraries (along with the API) will speed up development inside the company, while simplifying the integration story for partners and customers. It will save API consumers from doing double work and re-implementing half of the domain validation logic in their own language. Besides, shipping client libraries with a nice interface would save us the effort of precisely documenting our API along with the intricacies of the transport layer, fail-over, back-off, authorization and version management. A client library would let us switch from the JSON API to something more performant (e.g. gRPC with HTTP/2) without customers investing any development effort (aside from updating dependencies). Cluster Management and Visualization Imagine a clustered system with 3-4 types of modules, where: each module has a different deployment profile (e.g.
one scales by partitioning, another - by running multiple replicas, a third - by adding more queue consumers); each module can have multiple versions running on the cluster (some versions could be enabled, others disabled, some more - partially rolled out); some modules could be in a warm-up state (replaying the commit log to build their local state); there could be additional deployments of modules used to test or demo new features (currently incompatible with the rest of the cluster); in order to save resources, multiple module instances could be crammed into a single VM (as long as doing that doesn't hurt high availability). This situation might look like a mess. It definitely can confuse project managers, stakeholders, testers and even developers. Confusion can lead to mistakes, anger and the dark side. We don't want that. One way to remedy the situation is to visualize cluster state for people and provide tooling to manage this state, while enforcing domain-specific rules . Fortunately, all this is simple to achieve with some diagnostic endpoints, shell scripts and d3.js . For example (see the sketch at the end of this post): never remove the last active node from the load balancer; never add a stale node to the load balancer; keep partition replicas on different machines; ensure a replication factor of 3 for VIP replicas. Summary This design isn't doing anything particularly novel. We are essentially building an application-specific database engine tuned for the challenges and constraints of the domain. Event sourcing and Domain-Driven Design are used as higher-level design techniques that let us align \"what people want to achieve\" with \"what computers are capable of\" in an efficient way. In more concrete terms, the current solution: solves existing scalability problems and provides a foundation for solving future challenges, should the business experience rapid growth; lays groundwork for bringing in new developers, training them and organizing multi-team collaboration around the core product, services and add-ons; maps a way to tackle complexity in the software, while adding new features (including some that are unique in this industry) and maintaining them through implementation rewrites."}
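As promised above, a sketch of such rule checks. The cluster-state shape here is made up for illustration; the point is that each domain-specific rule is a few lines over a diagnostic snapshot:

```python
def check_cluster(state: dict) -> list[str]:
    """Validate cluster invariants; `state` maps a partition name to a list
    of replica dicts like {"node": str, "machine": str, "active": bool,
    "stale": bool}. Returns human-readable violations."""
    problems = []
    for partition, replicas in state.items():
        active = [r for r in replicas if r["active"]]
        if not active:
            problems.append(f"{partition}: no active node behind the load balancer")
        if any(r["stale"] for r in active):
            problems.append(f"{partition}: stale node is serving traffic")
        machines = [r["machine"] for r in replicas]
        if len(set(machines)) < len(machines):
            problems.append(f"{partition}: replicas share a machine")
    return problems
```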
{"url":"/sku-vault/2017-12-17-scaled-out/","title":"Black Friday Boring","text":"Black Friday Boring 3 years ago I was invited in SkuVault to help with a set of business and technical challenges: short-term : figure out logging and telemetry for a distributed system at scale; mid-term : scale out the software in response to the business growth, while also improving the performance; long-term : simplify the design of SkuVault, making it easier to maintain and add new features. For me it all boiled down to a single goal: ensure that SkuVault can handle Black Friday peak load without any major issues (meaning that an ordinary busy day would just be a breeze). The journey took 3 years. Finally this year SkuVault had a very boring and uneventful Black Friday . During the same period some of the competitors have experienced customer-impacting performance issues. Previous two peak seasons were successful but quite stressful for everybody (emergency calls in the middle of the night aren't what we usually sign up for as software engineers). We organized shifts to maintain round-the-clock watch during the peak season of 2017 as well. Fortunately, no intervention was required during these boring hours. This achievement was a result of work done by multiple teams: V2 team drove the migration , continuously learning and fearlessly adopting new technologies and development practices; V1 team covered the backs by supporting the existing product and switching parts of it to the new infrastructure; QA and Customer Support helped to ensure the best customer experience despite all the changes, they went great lengths at that, too; DevOps/SysAdmin team tirelessly tended to the ever-changing and growing garden of services as the infrastructure evolved to adapt to the new requirements. Let's see what was achieved in the past 3 years. SkuVault V2 3 years ago SkuVault was based on a Lokad.CQRS framework - designed to assist in building distributed message-based systems that could be developed locally and deployed to Azure. This implementation, later called \"SkuVault V1\", was good enough to build a product. However, it wouldn't scale behind a certain point . As the product grew in complexity, adding new features also became difficult . V2 implementation allowed to scale out critical parts of the system , taking the load off the existing V1 components during the last Black Friday. V2 is also the place where the new and complex features are being implemented . SkuVault V2 backend is based on the event-driven services design which was covered in a few conference talks and blog posts (e.g. see the design overview ). Frontend is a classical single page application in React.js and Redux. It would still take some time to migrate the remaining V1 parts to V2. However, the process is well-understood, supported by the experience and some tooling. Wave Picking Wave Picking is an example of a feature that was possible to develop only on V2 design (if we wanted to support big customers with huge warehouses and diverse inventories, that is). This feature initially evolved with the understanding of \"V2 design\", being used as a litmus test of its maturity and usability. It was later extended with the new functionality like \"carts\", \"zone picking\" and \"partial orders\" (which warranted a UI/UX overhaul). At this moment Wave Picking is a stable feature that has been in use for a long time. 
A dedicated team is currently working on enhancing it with capabilities which would let small retailers keep up in the game with Amazon . Telemetry Extensive telemetry infrastructure , which was non-existent just 3 years ago, is an essential part of the SkuVault system these days . There are multiple Elasticsearch and InfluxDB clusters ingesting gigabytes of logs and millions of metrics on a continuous basis. Grafana and Kibana offer real-time insight into this data , providing tools for building visualizations, exploring the data and building up intuition about the system behavior. New Office SkuVault always had some remote developers. However, 3 years ago I was the only one working from Ufa (Russia). Nowadays, there is an office with 11 people - a team with the experience of working together . The experience and the roles include: doing business analysis and writing specs for V2 systems; designing and developing the V2 frontend; developing, optimizing and maintaining V2 backend systems; testing and ensuring quality of the software; managing and organizing the development process; doing systems administration for the entire SkuVault and managing V2 infrastructure. It was an interesting experience to design the software system for simplicity of use and training . It was even more captivating to facilitate the initial knowledge transfer, adjusting the model based on the lessons learned, and seeing how ideas and practices spread further across the team. Epilogue My personal goal at SkuVault is complete: the last Black Friday was boring. There also is a shared understanding of how the software system could evolve over the next years to handle more features and meet new scalability targets. For instance, next year the teams at SkuVault will be quite busy completing the V1 migration. However, this completion would only signify the start of the next phase of research and development towards V3 . But that would be another story for another scribe to document. Some of the long-term challenges include : completely decoupling the V2 implementation from Windows Azure, once the V1 migration is complete (this involves developing a new version of Message Vault); building up a vision and tooling to develop a consistent user experience on multiple form factors (tablet, desktop and mobile) while running on different platforms (Android, iOS, Web and Win CE) and continuously delivering new features; bringing platform experience to the partners, allowing them to extend the user experience in SkuVault or even deliver completely new features. If you have experience that could complement this work and are interested in relocating to Louisville, KY, then SkuVault might have a compelling job offer for you. This company is growing (which often happens when the software doesn't get in the way of the business or even empowers it a little), so they are on the lookout for capable talent ."}
{"url":"/sku-vault/event-driven-verification/","title":"Verification of Event-Driven Systems","text":"Verification of Event-Driven Systems I'd like to share an interesting experience with event-driven design from SkuVault (the company is a warehouse management system provider). We developed a new feature called wave picking while building it on a new infrastructure, better suited for high loads and scalability ( more about this feature ). Here is the story. Starting with a prototype We started working on the feature with a prototype : a stateless web application (HTML/ReactJS/Fluxible) that talks to a backend via an API (JSON over HTTP). The Web UI focused on convenient user experience, while the backend handled persistence, business rules and domain complexity. This entire system was an add-on to the existing software, communicating with it via an event bus (with the ability to replay all events from the very start). To speed up the initial development, we chose in-memory persistence for views in the backend . This meant that the system would maintain all views (read models) in process, while rebuilding them upon the restart. This also meant no failover or scale-out. This approach allowed us to quickly capture important domain concepts in code : starting from flexible sales search and up to route generation inside the warehouse. It was fairly simple, since we only had to deal with in-memory structures and a little bit of concurrency. Once the first features were implemented in the backend, UI developers could pull the backend to their machines, launch it (there were no dependencies, except for the .NET runtime) and just concentrate on the UI. Further development took place in parallel. Given user stories, interface (HTML/CSS) and proposed API implementation, backend and frontend work could start at the same time. Backend developers would translate their requirements into explicit scenarios and in-memory model. In the meantime frontend developers would start building ReactJS/Flux parts. A few days later (or hours, depending on the complexity) first drafts of new in-memory implementation would be passed to them to give a spin. Sometimes frontend developers would point fingers and say “This method doesn’t work as expected, here’s the curl expression” or “We could use HasMore field on the collection to simplify paging”. These requests would be incorporated into the scenarios and implemented in in-memory. Scenarios A scenario in this project is a data structure that nails down behavior that we want to last . It captures this behavior as a transition between two states, where states are expressed via domain events. Given Events - a sequence of events that setup preconditions and describe what happened in the past (e.g. Sale 12345 was created with 1 bike-blue-pro ) When Request - API request that we are testing (e.g. POST “/wavepicking/session” with sale id 1234 ) Then Response - API response that is expected from the server (e.g. HTTP 200 OK { “version”: 2 } ) Then Events - any events that might be published to event bus (e.g. SessionCreated with sale id 1234 ). We don't verify the internal state, since it could change between versions. We are interested only in public contracts of the system: events and API calls. Here is an example of how such use case could be expressed in C# code: public sealed class When_create_cart : UseCaseSyntax { public static UseCase give_good_lowercase_code () { var author = New . Author (); var newDate = DateTime . UtcNow ; var ri = New . 
Note that in scenarios we don’t test the entire API response, but rather specify only the DTO paths that we are interested in. This way we can write a bunch of scenarios testing a single feature in a given API response (e.g. a specific denormalization), while leaving other parts of the API response to other scenarios. Once you have the scenario, development becomes a simple matter of monkey-coding: bang on the keyboard until all scenarios pass (including the new one). In essence, these scenarios became our type system checks for the business domain. They nail down behaviors that we want to last. There is one cool aspect to this approach: scenarios are an essential part of the development process. They are written whenever there is a need to capture a new requirement, fix a bug or debug a complex behavior. This is an example of test-driven development that actually reinforces itself. Tackling Complexity at Scale At some point down the road, we had to make the backend production-ready: durable and scalable. We picked Apache Cassandra as the primary storage, since it is widely adopted, scales elastically and never loses data if used properly. All the complex intricacies of the domain logic were already captured in the in-memory prototype and reinforced by the scenarios. This turned a complete backend rewrite into a rather simple task approachable by a single developer: just make the Cassandra implementation work like the in-memory prototype, while making sure that all scenarios pass. A few weeks down the road, this new implementation was ready in a separate git branch. It brought a lot of technical complexity linked to the way data modeling works in the Cassandra universe, but this complexity could be tackled separately from the intricacies of the domain model. It felt good. We swapped the QA backend to the new version (this swap happens without any downtime if you have a load balancer) and asked people to use the software just like they used to, while paying special attention to edge cases where the behavior changed. There were a few differences, of course. They pinpointed real cases which weren’t covered by specifications, letting the two implementations diverge from each other. Fixing that was quite trivial (since the problematic areas were already identified). More often than not, the root cause was within the implementation of Cassandra’s data model, forcing us to rethink it. Being able to separate domain complexity (captured in scenarios and in-memory) from the technical details (Cassandra) led to an enjoyable experience, fast debugging and development iterations. We loved it. Let's Keep the In-Memory Version At that point we were able to get rid of the in-memory implementation. However, the in-memory version worked so well for us in explaining how things should work that we decided to keep it in the codebase while making it a first-class citizen. The branches were merged, and both implementations of the backend were kept.
Here is how one of the modules looks: public sealed class ProductListModule : IModule { public ModuleInstance Memory(IMemoryDependencies provider) { var store = new Memory.ProductStore(); var reg = new ModuleInstance(); reg.AddProjection(new ProductsProjection(store)); reg.AddProjection(new ProductStatusProjection(store)); reg.AddDataStore(store); reg.AddApi(() => new Memory.ProductsService(store)); return reg; } public ModuleInstance Cloud(ICloudDependencies provider) { var session = provider.GetSession(\"v2_product\"); var store = new Cloud.ProductStore(session); var reg = new ModuleInstance(); reg.AddDataStore(store); reg.AddProjection(new ProductDenormalizer(session, store)); reg.AddApi(() => new Cloud.ProductsService(store)); return reg; } } Keeping the in-memory implementation around raised a few eyebrows outside of the dev team. \"Wouldn’t we waste time on maintaining two backend technologies at once?\" Fortunately, at this point we already understood the benefits well enough to articulate that the in-memory implementation is our domain model. It provides us with a quiet happy place in our minds, where we can forget about Cassandra specifics while focusing on the essence of the business. Once it is captured, further implementation is just an application of the skill and prowess of a developer to write heavily optimized code. Besides, there were a few additional and very specific benefits to keeping the in-memory implementation around: The in-memory implementation is used to run local development versions of the backend for UI developers (we don't want them to bother with installing and maintaining Cassandra nodes locally). The in-memory version is also used as a demo (we can create short-lived transient accounts in memory for customers interested in trying out the product). Thus, we were able to prove the point and preserve that quiet happy place in our minds. Idempotency Testing In a distributed system it is very hard (or rather expensive) to guarantee that a given event will be delivered to subscribers only once. It is far “easier” to ensure that subscribers are idempotent instead - that sending a duplicate event wouldn’t change the state. We can guarantee this to a certain extent by taking all existing scenarios, duplicating events in them on-the-fly and then passing them through the same scenario verification code. Adding an “idempotency tester” mode to the verification code is a simple operation, but it ensures that any existing or new behaviors are idempotent (as long as they are captured by scenarios). Delta Tests Delta testing is another approach which reuses existing scenarios and makes them more useful. We take all existing scenarios and run them against the memory and cloud implementations, comparing the outputs (published events and API responses) with each other. This includes even fields that we never check in actual scenarios. If the outputs differ somehow, then we have a probable issue on our hands. There is a nice psychological benefit to idempotency and delta tests - they make the original scenarios more useful and rewarding to write. It is always awesome when one of these tests discovers a peculiar edge case you never thought about.
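To make the “idempotency tester” mode concrete: duplicating events on-the-fly can be as small as a scenario-to-scenario transform. A minimal sketch, reusing the hypothetical UseCase shape from the verifier sketch above (the real implementation is not shown in the post):

using System.Linq;

public static class IdempotencyTester
{
    // Produces a mutated copy of a scenario where every Given event is
    // delivered twice in a row. A correct (idempotent) backend must
    // still produce exactly the same response and published events.
    public static UseCase DuplicateGivenEvents(UseCase original)
    {
        return new UseCase
        {
            GivenEvents = original.GivenEvents
                .SelectMany(evt => new[] { evt, evt }) // deliver each event twice
                .ToArray(),
            When = original.When,
            ThenResponse = original.ThenResponse, // expectations stay unchanged
            ThenEvents = original.ThenEvents
        };
    }
}

Every existing scenario can then be run twice - once as written and once through DuplicateGivenEvents - without writing a single extra test case.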
Random testing At some point we realised that this application of the in-memory model for testing resembled property-based testing like QuickCheck. In QuickCheck you test software by creating a model for a function, then letting the test suite hammer it with various inputs, while comparing the output of the model to the output of the real system. If these differ - then you have a problem. (The real implementation is more complex than this explanation, since you have to implement test reduction logic, which shrinks the test sequence to the minimal number of inputs required to reproduce the issue.) With this insight we were able to extend our test suite with random tests. Random testing takes the delta testing approach one step further. Its purpose is to explore various state transitions and find spots where the behavior of the memory model differs from that of the cloud implementation. The process looks like this: We take some events from a real system (normally a QA system, which is usually filled with all sorts of crazy data). We feed the event sequence to the cloud/memory implementations to populate their state via replay. Then we pass these event sequences through request generators to create random API requests based on that data. We send the API requests to both implementations to get the outcomes: API responses and any published events. We compare these outcomes with each other, searching for any differences. If there is one - it is a bug. Here is a simplified example of a generator: public static Request gen_search_sales(IList<IEvent> events) { var last = events.OfType<SaleCreated>().LastOrDefault(); if (last == null) { return null; } var author = New.Author(last.RefInfo); return GET(\"/wavepicking/sales\").As(author); } Tests are deterministic - passing the same sequence of events to the API request generators will always create the same requests and lead to the same test outcomes. We can improve the search space (where we compare the memory model to the cloud implementation) by: adding more specific generators; taking different event sequences from the real data. Random tests differ from delta tests: we use subsets of events from a real system (instead of well-defined clean scenarios); we generate random API requests that could make sense for these events. Random testing becomes especially interesting to run if we have different implementations written by different developers. In this case, there is a chance that some implicit behaviors (e.g. behaviors that are required but somehow missed the specification phase) will be implemented differently. Random testing might uncover the behaviors that went “under the radar”, allowing us to design them properly and nail them down with scenarios. Keeping an Eye On Performance Decent performance comes naturally when you have to replay hundreds of millions of events upon each schema change. Any inefficiency in how you access data storage, deal with RAM, GC or CPU-intensive operations - and your deployment will grind to a halt. Obviously, it would be great to catch the most glaring performance problems during development. Fortunately, it is trivial to enhance the scenario verification suite with basic sanity checks and performance profiling. You just inject the proper heuristics into the scenario verification environment. For example, we could measure the execution time of any API call or handler, spot the bottlenecks and print them out. Another approach is to bake sanity checks into the code, like ensuring that all requests to Cassandra run as prepared statements.
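As an illustration of such an injected heuristic, a timing wrapper around the request execution seam is enough to spot slow handlers during verification runs. Another sketch, again assuming the hypothetical IBackend seam from the first sketch above:

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Sketch: wraps a backend and records how long each request takes,
// so the verification run can print the slowest calls at the end.
public sealed class TimingBackend : IBackend
{
    readonly IBackend _inner;
    public readonly List<(string Request, TimeSpan Elapsed)> Timings = new();

    public TimingBackend(IBackend inner) { _inner = inner; }

    public void Replay(IEnumerable<object> events) => _inner.Replay(events);

    public (object Response, IReadOnlyList<object> Published) Execute(object request)
    {
        var watch = Stopwatch.StartNew();
        var outcome = _inner.Execute(request);
        watch.Stop();
        Timings.Add((request.ToString(), watch.Elapsed)); // remember for the report
        return outcome;
    }
}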
Graphs and Analysis Scenarios are just data constructs which nail down behaviors (state transitions) that we want to last. These can be easily parsed and analyzed. Models also capture behaviors, but in an implicit form. We can observe these behaviors by running data through the models (real-world or synthetic) and capturing various data points. Here are a few ideas that we tried: Detect scenarios or behaviors that are untested (e.g. an event is declared in module dependencies but not covered by any scenario). Detect unused code paths (e.g. an event is published by a module, but there is not a single subscriber in the system). Detect specified and tested code paths that aren’t used by the customers in the real world. Besides, you can always render this data as graphs and let human perception spot problems and irregularities. Next Steps This journey has been extremely rewarding so far. It feels great to be able to evolve the domain model while knowing that all the captured domain knowledge will be losslessly translated into the actual implementation. There are still many things to explore in this direction: Using Lisp to express and generate contracts and schemas for events and APIs in multiple languages (currently a custom ANTLR-based DSL is used). Switching the cloud implementation to Kafka/Cassandra/Mesos/Spark/Elastic (currently it is locked to Windows Azure). Adding higher-level model validation based on valid shapes of data and data flows (e.g. when you have multiple data centers in different regions with replication across them via an event bus)."}
{"url":"/sku-vault/foundationdb-layers/","title":"FoundationDB Layers","text":"FoundationDB Layers A few days ago Apple open-sourced FoundationDB - a transactional (ACID) key-value database with crazy testing and ability to scale to peta-byte clusters. I wrote more on why this is so exciting in FoundationDB is Back! . A few companies have been using FoundationDB in the past years: Apple - bought FoundationDB 3 years ago; WaveFront by VMWare - running petabyte clusters; Snowflake - cloud data warehouse. What I didn't tell at the time: SkuVault happily used FoundationDB in the past years as well. This is the story. Commit Log Layer Initially SkuVault V2 was designed to run on top of a MessageVault - a custom highly-available hybrid between Kafka and EventStore , designed to run on top of Windows Azure Storage. We referred to it as the Distributed Commit Log . Distributed Commit Log at SkuVault holds 1.5B events spanning 400GB of data and growing. It is split into 4 partitions and a few thousand event streams (stream per customer). Some streams have 10M events and more. That implementation was good enough to let the development move forward. Further on it became apparent that we wanted better throughput, predictable and low tail latencies . It was hard to achieve while writing to Azure Storage. Commits that took a few seconds were fairly common. We also needed some kind of stream version locking while doing commits. Otherwise nodes had a slim chance of violating aggregate invariants during fail-over. Version Locking is similar to how aggregate invariants are preserved in classical event sourcing with aggregates. When any node attempts to commit an event, we want to ensure that it used the latest version of the view state in decision making (view state is derived by subscribing to the events). FoundationDB felt like the only option that would let us implement these features fairly quickly . It worked as planned. Given the green light, I was able to implement the first version of the new Commit Log Layer for FDB a few days . In comparison, it took a few months to implement the initial Message Vault. Layer implementation was simple. We designated a subspace to act as a highly available buffer to which nodes would commit their events. Key tuples included a stream id and UUID (that was also a lexicographically ordered timestamp when persisted) which ensured that: there would be no transaction conflicts between the different nodes, allowing high commit throughput; events would be ordered within the stream (and somewhat ordered globally). Another subspace acted as a stream version map . Commit Log Layer took care of consulting with it while performing a commit. If node acted on stale data, there would be a conflict, forcing the application to back off, get the latest events and retry the operation again. Both the commit and version check were a part of the same transaction, ensuring the atomicity of the entire operation. There was also a background repartitioner process that was responsible for downloading messages from the buffer in bulk and pushing them to the blob storage. Performance picture improved noticeably as well: Were I challenged to improve this layer, I'd use newly introduced versionstamps instead of UUIDs (for the global order). Write contention within a stream could also be reduced by bringing the concept of conflict ranges into the application code. Queue Layer Given the new Commit Log implementation, server-side logic became fast. Commit latencies were low and predictable. 
Queue Layer Given the new Commit Log implementation, server-side logic became fast. Commit latencies were low and predictable. However, occasionally the server would just block for a long time doing nothing. This often happened when we had to dispatch a message to Azure Queues. The tracing picture looked like this: 20ms delays were common, but we saw occasional queuing operations that blocked everything for a second or two. Given the existing FoundationDB setup, it was trivial to add a Queue Layer that allowed us to push messages in the same transaction with the events. Tenant version checks were improved along the way (switched to async mode), leading to a prettier performance profile: Messages would still be forwarded to Azure Queues, but that became a side process, operating in bulk. Miscellaneous Layers Eventually a few more application-specific layers were implemented: Cluster Jobs - to schedule jobs for the background workers. Settings and CRUD Storage - to persist metadata and documents that don't need event sourcing. Async Request Processing - for the cases when a client wants to process a lot of requests in bulk, caring only about throughput. Application logic can chain changes to all these layers in the same transaction, making sure that all operations either succeed or fail together. In the end we had half a dozen application-specific data models living on the same FoundationDB cluster. All that - without introducing and maintaining a bunch of distinct servers.
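Continuing the same hypothetical sketch, chaining several layers into one atomic unit of work could look like this (the real layer APIs are not shown in the post):

using System;

// Hypothetical layer shapes, mirroring the ones described above.
public sealed class QueueLayer
{
    public void Enqueue(ITx tx, string queue, byte[] message) =>
        tx.Set($"queues/{queue}/{Guid.NewGuid():N}", message);
}

public sealed class JobLayer
{
    public void Schedule(ITx tx, string jobName) =>
        tx.Set($"jobs/{jobName}/{Guid.NewGuid():N}", Array.Empty<byte>());
}

// One atomic unit of work spanning several layers: either the events,
// the outbound message and the background job all commit, or none do.
public static class Example
{
    public static void Handle(ITx tx, CommitLogLayer log, QueueLayer queue, JobLayer jobs,
                              string streamId, long expectedVersion,
                              byte[][] events, byte[] message)
    {
        log.Commit(tx, streamId, expectedVersion, events); // commit log layer
        queue.Enqueue(tx, "outbound", message);            // queue layer
        jobs.Schedule(tx, "drain-to-azure-queues");        // cluster jobs layer
    }
}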
Cluster Simulation The API of FoundationDB is quite similar to that of the Lightning Memory-Mapped Database (LMDB). Both operate with key-value ranges where keys are sorted lexicographically. Both support true ACID transactions. The difference: FoundationDB is a distributed database used to coordinate the cluster; LMDB is an in-process database with a single writer, used to persist node data. This synergy allowed us to leverage FoundationDB layer modeling principles and primitives while working with LMDB databases. That was even before we started using FDB! Later all that experience made us feel right at home while working with FoundationDB itself. We were also able to use LMDB to simulate a FoundationDB cluster for the purposes of local development and testing. The last part was particularly amazing. You could literally launch an entire SkuVault V2 cluster in a single process! Nodes would still use all these FDB Layers to communicate with one another, the background repartitioner would still run, jobs would still be dispatched. Except, in that case all the layers would work against LMDB. This meant that you could bring up the debugger, freezing the entire \"cluster\" at any point, then start poking around the memory and stored state of any node or process. How amazing is that?! This capability was later reused in the SkuVault demo mode, which is a special deployment type that creates new accounts on demand, wiping the data after some time. We packed the entire V2 backend cluster into a single process and instructed it to restart and wipe all data when requested. By the way, if you are interested in this kind of engineering and live in Louisville, KY (or are willing to relocate), SkuVault is hiring. Done Anything Different? Would I have done anything differently at SkuVault? Yes, I'd make better use of FoundationDB: Use FoundationDB to split event blobs into chunks and store the metadata, making it easier to operate event stores at the terabyte scale. This would also simplify event versioning and GDPR compliance. Switch the event pub-sub mechanism from blob polling to watches, reducing event delivery latency. Switch aggregate version locking from a single version number to aggregate conflict ranges (just like FoundationDB does internally), increasing write throughput to large aggregates. Embrace the synergy between LMDB and FoundationDB, making sure that internal tools (debugging, dumps, REPL, code generation DSL) can target both from the start. Double down on deterministic simulation testing, making it easier to learn, debug and develop fault-tolerant application logic. Epilogue All the way through the development process at SkuVault, FoundationDB was an invaluable asset. The SysOps team particularly liked it - they never had to touch the production cluster. The cluster required zero maintenance during the Black Friday load (barely breaking a sweat) or when Microsoft started pushing Meltdown and Spectre patches across the cloud. I'm overjoyed that this outstanding database is now open-source and available to everybody."}
{"url":"/sku-vault/planning-simulation/","title":"Planning Event-Driven Simulation","text":"Planning Event-Driven Simulation In my previous article I shared with you through some of growth pains of an event-driven business system. We also discussed some possible solutions: use API calls and event contracts as the language of the system; capture system behavior with event-driven scenarios; create a simple in-memory implementation with the purpose for explaining the system; make it pass the tests; create robust and scalable implementation that also passes the tests; keep both solutions around, use them to reason about the system and run comparative tests. This approach worked rather well in practice and the system was pushed to production. We kept on adding more features and inviting more users. 6 months later, it faced peak season of 2016 and survived . Grafana and Kibana were an invaluable source of insight into the system under the stress. Coupled with various canaries and real-time alerting, they saved the day more than once. Practice also showed us the pain points of this design : complexity, speed and scale limitations . This blog post will present a short retrospective over the last 6 months: what worked, what didn't and what we are going to change. The Good Parts Here are the things that worked as advertised: Event-driven approach in modeling, design and collaboration. Given-when-then scenarios (these event-driven specifications are the most valuable part of the codebase by now). Statically compiled web app as a front-end. As you can see, the core of our event-driven approach held strong. However, some other implementation parts didn't work out as well. The Fallacy of Two Implementations The idea of running two implementation side by side looked good in the beginning, but couldn't handle complexity growth in the long term: One implementation stores everything in memory, helps to validate API and design, runs demo and development modes. The other is obeys the same specifications but is for production. Code is a liability , maintaining two implementations of the back-end takes more effort. Fixing a bug in memory implementation doesn't necessarily translate to the corresponding fix in the production implementation; models started diverging in subtle ways not described by the scenarios. In the long run, having a separate memory version of the system felt like a dead-end for us. Cloud Scale Cassandra storage back-end didn't work that well for us for a number of project-specific reasons: Once you reach 500 event-driven scenarios, verifying production implementation starts taking 20-25 minutes and hence is bound to happen less frequently. If you think about it, running a Cassandra cluster in the cloud can get rather expensive for the work it actually does. You need to consider: performance overhead of running a replicated DB on top of replicated SAN, DevOps overhead even with Terraform, unpredictable tail latencies that are inherent to any SAN but somehow are worst on Azure Storage. Cassandra is an eventually consistent database (it picks AP from the CAP). Cassandra is very good at what it does, but can't guarantee strong consistency even if you are using something like RF >= 3 with QUORUM on both reads and writes. This is the dark side of deceptively simple CQL that we've been burnt more than once. Terraform and Packer are amazing tools from HashiCorp. With them we were able to benchmark different cluster configurations in the cloud, iterating through them quickly. 
ScyllaDB felt like a worthy contender. This C++ implementation of Cassandra promises a 2-10x improvement, but currently lacks some features we use in Cassandra. In the long run, having a production system running on top of Cassandra felt like a dead-end for us. If I had enough money for the load and a single writer per entity, I'd pick Cassandra in a heartbeat. It makes a perfect view storage. Besides, we wanted transactions. This is similar to what the paper on Google Spanner talked about: We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Smart Scenario Testing Advanced verification modes (idempotency, delta and random testing) didn't work out for us either. They take a lot of time to run and hence were bound to be used even less frequently than basic verification runs. To be precise, idempotency tests were used occasionally, delta tests stopped being used after the models diverged a little, and random tests were never used. This was a pity, since I liked the idea of catching more bugs simply by running a console in an advanced \"hunter\" mode. What Next? We've been thinking about possible solutions to our problems for quite a while. We needed a solution that would make the code simpler and faster: get rid of the model duality, reducing the amount of code that we write and maintain; replace Cassandra with something that allows us to run tests much faster (at least 30-50 times faster on a development machine); be able to handle our ever-increasing amount of data (more than 200GB of events by now), providing good throughput with bounded latency; run the same code in the cloud, on dev machines and on-premises. As we dug deeper, we realized that the only way to have really fast verification runs was to move data storage inside the application (to get rid of the inter-process communication and also emulate the storage reliably). However, that option would come at a cost: we would need to implement our own replication and fail-over; fast low-level embedded databases usually come with a simple key-value interface, where both keys and values are byte arrays, so we'd need to implement our own storage layer on top. FoundationDB Legacy Fortunately, there once was a nice database called FoundationDB that was acquired by Apple. The database itself was unique - distributed key-value storage with strong ACID transaction guarantees. It scaled very well and self-healed. Apple has a policy of wiping out every trace of existence of the companies they buy. This disappointed FDB customers and made everybody else wary of closed-source storage engines. The engineering team behind that database was even more impressive. They tested their database in a simulation, running entire clusters on a single thread using emulated networking and storage. This approach allowed them to: inject faults that would be hard to get in the real world (swapping IPs, flipping network packets, emulating network partitions); reproduce any faults or problems deterministically under the debugger; emulate days of cluster lifetime in minutes. @aphyr said : \"haven't tested foundation in part because their testing appears to be waaaay more rigorous than mine\". So if we managed to implement some fairly simple (but robust enough) replication and fail-over approach while supporting proper system simulation, then we could test the living daylights out of the system under different failure scenarios. We already have the building blocks for simulating everything except the storage, anyway. Here are a few inspiring resources that we could steal ideas from: Consistent Transactions with Inconsistent Replication Cheap replication via ordered multicast on the network Simple Testing Can Prevent Most Critical Failures Use of Formal Methods at Amazon Web Services Combining Model Checking and Testing Automating Failure Testing Research at Internet Scale The Verification of a Distributed System Testing Distributed Systems w/ Deterministic Simulation Lightning Memory-mapped Database Roughly at the same time we discovered a wonderful piece of engineering called LMDB. It is a very opinionated embedded key-value database with predictable performance and strong ACID transactional guarantees. LMDB is based on a B+Tree design and works very well with read-heavy workloads. LMDB also leverages memory-mapped files. This means that you can read an object from the database with zero memory allocations. You just need to encode your data using the FlatBuffers or Cap'n Proto serialization format. Early benchmarks on my laptop for an event-driven projection with indexes achieved 7k/sec (multi-write ACID transactions); in event replay mode - 150k/second (ACI).
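For a feel of how small the LMDB programming surface is, here is a minimal write-path sketch using the LightningDB .NET bindings; exact method shapes vary between versions of that package, so treat this as a sketch rather than a reference:

using System.Text;
using LightningDB; // assumes the LightningDB NuGet package

class LmdbSketch
{
    static void Main()
    {
        using var env = new LightningEnvironment("./data");
        env.Open();

        // One ACID transaction covering several writes - this is what makes
        // multi-key event projections with secondary indexes safe.
        using (var tx = env.BeginTransaction())
        {
            var db = tx.OpenDatabase(configuration: new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create });
            tx.Put(db, Encoding.UTF8.GetBytes("sale/12345"), Encoding.UTF8.GetBytes("{...}"));
            tx.Put(db, Encoding.UTF8.GetBytes("idx/by-date/2017-01-01/12345"), Encoding.UTF8.GetBytes(""));
            tx.Commit();
        }
    }
}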
Luckily for us, FoundationDB also left a lot of legacy in tutorials on how to model various storage layers on top of a key-value database. If you look at things from that perspective, Cassandra storage maps rather cleanly onto an ordered key-value database. The same goes for SQL databases. DSLs for Boring Code We are already using a Lisp-based DSL to generate all message contracts with schema validation. It is a natural next step to move even more of the boring and verbose code into a higher-level DSL: API definitions and DTOs (with the added benefit that we can generate Swagger definitions, client libraries and documentation); the storage layer for the key-value database (hand-coding tuple and byte operations is boring); application-specific consistency controls pulled into the code. Why bother with Lisp? Check out this great article on Lisp and Racket written by a newcomer. Next Steps These insights helped us to plan our next design iteration: keep the event-driven design, all scenarios and APIs; discard the in-memory implementation completely; replace the Cassandra backend with an LMDB implementation using the FDB storage layer approach; use an LMAX-style replication approach (the master replicates events to followers, and we fail over to them via the LB); take ideas from papers like TAPIR to make consistency decisions part of the application logic (which matches what businesses have done for years anyway); simulate the system and beat it with failures, taking inspiration from giants like Netflix, Twitter, FoundationDB, Amazon; save money by replacing a rather expensive Cassandra cluster with smaller and cheaper application node replicas and partitions; also get a simulation cluster (running on buggy hardware); replace repetitive code (or code that tries to work around the limitations of OOP) with meta-programming. In theory all this should work well."}
{"url":"/sku-vault/week-01/","title":"Getting started","text":"Getting started In the beginning we discussed with Slav the most immediate challenges related to software, maintenance and ongoing business growth. Short-term challenges revolve around logging and statistics aggregation. Mid-term challenges are about scaling the software in response to business growth and improving performance . Long-term objectives: simplifying the design of front-end and back-end, making it easier to add some required features. Logging and Statistics SkuVault generates hundreds of megabytes of logs each day , mostly related to integrations with 3rd party systems. At the moment they are just saved to local storage on back-end worker roles (there are many instances running). Slav wants to bring these logs into one central place for searching and convenient audits. Given the sheer size of logs, he was interested in ElasticSearch and Kibana. They are already used in the industry with a great success. ElasticSearch is an open source, distributed, real-time search and analytics engine. Kibana is ElasticSearch’s data visualization engine. The only trouble would be about feeding logs from the numerous Azure Worker Roles to one central location. We want to have an easy interop with .NET code, ability to perform logging even if ElasticSearch server is not available. Ideally, everything should work out-of-the-box, reusing existing solutions. I performed a quick background check for the stack and interoperability with the existing software. For log ingestion we could use Mozilla Heka, launching a daemon on each worker role to ingest log files produced by NLog . Heka is a high-volume logging infrastructure developed by Mozilla. It is written in go and hence requires very little resources. It runs as a statically compiled native code on many Operating Systems and processors. Another alternative for log ingestion is Logstash (the most popular one, part of ELK stack). However Heka seems to be a better fit, because: It runs as a native code and doesn't require any runtime (Logstash is JRuby), making it easier and cheaper to launch inside Azure Worker Roles. Heka can accept StatsD protocol out-of-the box, forwarding performance metrics to something like Graphite. Heka is a drop-in replacement for Logstash. Drawback of running Heka: it is a newer project with less production exposure and fewer plug-ins. However, if Heka doesn't work for some reason, replacing it with logstash is straightforward. Performance Metrics Improving performance and scalability of SkuVault is an important concern. In order to do that in the most effective manner, we need to answer following questions: What are the biggest bottlenecks and how exactly do they affect software? How do our code changes change performance (if they do)? These answers have to be based on real data - performance timers and counters from the software running in production. Fortunately, Heka does support StatsD format over UDP and there is a .NET library for sending metrics this way. Reporting performance metrics via UDP is used frequently in software. It is fast and doesn't slow down the software much. Besides, UDP is fire-and-forget: if the statistics aggregating process goes down, this will not affect actual application. This week I'll spend a few hours to setup a dedicated Windows Azure VM with ElasticSearch, Kibana and Graphite. Then I'll produce a sample Azure Worker Role, configured to send logs and statistics to that server through Heka. 
The team at SkuVault will then take their time to reconfigure the existing servers to publish logs and stats to this new infrastructure. Ideally, that would require minimal changes to the existing code. Then we could use the captured data to understand the behavior of the system, reason about it and plan the next steps."}
{"url":"/sku-vault/week-02/","title":"Improving Azure Ecosystem","text":"Improving Azure Ecosystem The objective of this week was to deal with short-term concerns of SkuVault in 10 hours: setting up infrastructure for log aggregation and performance analysis. I think, we did a good job at that. Log Aggregation and Analysis First of all, we wanted to aggregate existing logs from Windows Azure roles pushing them to a central location for the analysis in real-time. One of the best tools for the job is Elastic Search with Kibana. It took just an hour to have them installed and configured from scratch on Linux. That's how Kibana would look like, once filled with data and configured: It feels like Windows Azure Virtual Machines aren't friendly with Linux. Operations took noticeable more time, compared to the other providers. VM in question wasn't that small - 4 cores and 7GB of RAM. Currently SkuVault uses NLog for logging in .NET. I implemented a small console application demonstrating how to redirect that NLog output to our new log server. Slava's team will need to import these changes into Windows Azure Roles. Mozilla Heka daemon is bundled into the console to act as a gateway. It picks changes in file logs, forwarding them to a Heka daemon on logging server via TCP. Remote daemon pushes them into the Elastic Search immediately. In this setup, log shipping has little impact on existing .NET code: C# code uses same code to write things to NLog which outputs them to files. Heka daemon doesn't use a lot of CPU to parse logs and ship them downstream. If any part of logging infrastucture goes down, logging to files will continue, untill servers come back online. Working with Mozilla Heka was a pure pleasure . It is well documented, works out of the box, easily bridging together Windows and Linux machines and has different plugins available. Dedicated Server Later Slav suggested to move this to a dedicated server. He got one with 2TB of local storage, 2 Intel X5650 processors @ 2.67GHz and 24GB RAM. If memory serves me well, this dedicated server is 2 times less expensive than 7GB RAM Virtual Machine at Windows Azure with 4 virtual cores. While rebuilding Elastic Search + Kibana + Nginx setup on a dedicated server I recorded a screencast covering all steps. This should make process reproducible internally, just in case. Capturing Performance Metrics With Heka insfrastructure in place, adding detailed perofmance metrics capabilities to .NET code was just a matter of: installing Graphite and Carbon to store, aggregate and graph metrics in real-time; including .NET library capable of speaking StatsD into the console sample; telling Heka daemons to aggregate StatsD metrics and push them to our logging backend every 5 seconds. Once I discovered a docker image for Graphite + Carbon, installing (and reinstalling from scratch) turned out to be a 10 minute task. In this setup .NET Library ( StatsdClient , available on Nuget) talks to the local Heka daemon via UDP. This means: There is a little performance overhead from measuring performance, since UDP is \"fire-and-forget\" protocol. Unlike TCP, we don't wait for the recipient to ACK back. If Heka daemon ever slows down or dies, this will not have any impact on production code, it will just continue broadcasting into the void. Introducing detailed performance metrics will require some changes to the existing .NET codebase, since metrics aren't captured at the moment. 
Carbon retention policies are: store time series at 10-second resolution for 8 days, 1-minute resolution for 31 days, 10-minute resolution for 1 year and 1-hour resolution for 5 years. That should be enough for the SkuVault team to have more than 100000 different time series. They will not need to worry about disk space either, since Carbon will aggregate and clean up whisper files automatically. Here's how Graphite dashboards can look once populated with data for engineers: Ideally, SkuVault will eventually get a dedicated monitor somewhere in the office, displaying the most important metrics and graphs in real-time. Summary The final setup is rather simple. We just use existing software that is known to work well out-of-the-box. The software runs on a single dedicated Ubuntu server and can be set up from scratch in hours. If any part of this devops infrastructure fails, it will have no impact on the production code. The existing .NET codebase doesn't change much. We only need to adjust the infrastructure to report data. Here's how the setup looks from a high-level perspective: At the beginning of the next week I'll need to tighten the security of this new server and set up backups. Meanwhile SkuVault developers will probably start tweaking Windows Azure Roles to ship logs and statistics for analysis. Afterwards, we'll switch to a longer-term perspective: improving the performance in hot spots (where it hurts the most) and simplifying the design."}
{"url":"/sku-vault/week-03/","title":"DevOps and Event-Driven Design","text":"DevOps and Event-Driven Design My third week with SkuVault was 10 hours long. It involved both infrastructure concerns and software design. First, I had to secure our new devops server: encrypt TCP traffic with logs and performance stats; serve web dashboards via HTTPS; add authentication to the web UI. Afterwards we started planning evolution of software design at SkuVault. TLS for Heka Fortunately, encryption is easy to setup with Heka. It works same way on Windows and on Linux. All Heka communications now go through TLS connection, using self-signed client-server certificates. That secures both log messages and performance stats. Golang is picky about certificates. You need to generate certificates with TLS Client extension. Heka would refuse to connect otherwise. HTTPS SkuVault already had a wild-card certificate bought for their domain. Securing the web UI was just a metter of copying certificate chains and keys to devops PC, then telling Nginx to accept only HTTPS connections. Nginx is an open-source reverse proxy server with capabilities for load balancing and web caching. It is performant and battle-tested. Besides, it runs on a operating system that can install updates without rebooting. We also perform HTTP redirect to HTTPS for all insecure connections. Along with that I added basic HTTP authentication to devops server. It is good enough, since we encrypt all traffic anyway. Event-Driven Design During the week we started planning design improvements with Slav. We had in mind following requirements in mind: Better scalability to sustain business growth. Higher availability. Simplify existing code. Make all changes in small steps (no big rewrites). Leverage strong points of Windows Azure. Big software rewrite can be appealing, yet it is a risky option. We all know stories where a long-term rewrite project turned out to be a complete waste of time and money. The purpose of a software design process is to find ways to gradually improve existing software. At the current moment SkuVault consists from multiple modules representing bounded contexts in the domain and running as individual azure worker roles . Modules are event-sourced, their interchange contracts shaped by CQRS principles. Each module has its own private event store and a set of projected views, which are publicly accessible. Being the author of Lokad.CQRS library, I'm guilty of creating a .NET framework which introduces design concepts leading to complex code in the long term. They say, there is a special place in hell for people like me. We can make this design more steady and decoupled by shifting from integration by sending commands and querying view state to integrate modules via events and crafted APIs . This shift would mean that: We hide internal state of the servers from the outside world, making it easier to evolve them and change the implementation. We can replace the majority of the commands in existing design with synchronous API calls. This would also simplify client code. APIs would act out as a natural Anti-Corruption Layer, besides we need to start introducing public APIs to the sytem anyway. Events are less fragile than commands (if crafted properly), they work well in pub/sub integration. Asynchronous code can promise better scalability and performance, yet in practice it introduces complexity, unpredictable execution and weird bugs that can drive other developers crazy. 
Asynchronous code can promise better scalability and performance, yet in practice it introduces complexity, unpredictable execution and weird bugs that can drive other developers crazy. This can be especially dangerous if they own weapons and know the home address of the author. It is better to play it safe and limit asynchrony to well-defined places, while keeping the rest of the code simple and synchronous. Ideally, we would introduce all these changes in small steps. The longest iteration would be 2 weeks. One way to start this process is to introduce a small piece of software responsible for gathering events from all the private event stores and publishing them to all interested subscribers. This software will need to be simple, fast and highly available. At the end of the week I had 3 possible implementation options for this piece in mind: a FoundationDB + Golang event storage; an implementation on top of Azure Blob Storage (essentially, the next version of the Lokad.CQRS event store); a thin layer on top of Azure Table Storage. The fourth option was Apache Kafka. This highly scalable distributed commit log would be a good fit for a project with a Java stack. In a .NET company it would look like an Englishman in New York. Each option has its own set of trade-offs. We would need to pick the one that fits the SkuVault project best. That is what the next week is going to be about."}
{"url":"/sku-vault/week-04/","title":"Message Vault","text":"Message Vault This week with SkuVault was 10 hours long, like the previous one. It was purely dedicated to long-term design evolution. After going through a series of scenarios we ended up with the following plan: Introduce Message Vault - a simple pub/sub message streaming systems for Windows Azure. Make all existing modules publish events to Message Vault . Extract some projection code from the existing SkuVault code and move them into simple modules with API on top: Search , Reports , ProductViews etc. Cover these modules with use-cases. One by one, start migrating aggregate logic from existing SkuVault codebase to the new design. Remove commands and simplify aggregates, where appropriate. Implementing Message Vault is the biggest step here. All the other steps are going to be smaller and more incremental. Message Vault Message vault provides publish-subscribe with replay and batching for Windows Azure . It is heavily inspired by Apache Kafka. However, it is going to be a lot more simple than that. We are going to simply add a thin wrapper around Azure Blob Storage, to act as an API and write synchronization point. Windows Azure is going to do all the heavy-lifting. We need to embrace Windows Azure in order to keep the implementation simple, yet scalable. At the moment of writing, SkuVault stores more than 100000000 events with total size above 25GB. Semantics are going to be similar to Apache Kafka (this way we could migrate to Apache Kafka or any other implementation of a distributed commit log, if needed ). Producers push messages to the Vault which serves them to consumers. Consumers can replay events from any point in time or chase the tail. We partition messages by streams. Each stream is an immutable and ordered sequence of messages. All messages in a stream get a unique offset and timestamp. Absolute order between messages in different streams is not guaranteed, but we can still sort by timestamps (within the time drift on Azure). Message Vault is not going to be an Event Store , it is not designed for event sourcing with aggregates\" (you need NEventStore or EventStore for that). Design Trade-offs Message Vault makes following trade-offs: optimize for high throughput over low latency ; optimize for message streams which are gigabytes large; prefer code simplicity over complex performance optimizations ; http protocol instead of binary protocol; rely on Windows Azure to do all the heavy-lifting (this simplifies code, but couples implementation to Azure); high-availability via master-slave setup (uptime limited by Azure uptime, no writes during failover); no channel encryption (if needed, use SSL with Azure Load Balancer or your load balancer); no authorization schemes (if needed, configure your load balancer or add a proxy on top); implemented in imperative C# (.NET runtime is heavy, but Windows Azure optimizes for it); client library is intentionally simple (view projections and even checkpoints are outside the scope); each stream is a sepate page blob (they can grow to 1TB out-of-the-box, having thousands of streams isn't a good idea). Implementation Implementation should be rather straightforward, since we brutally optimize the implementation for the task at hand (SkuVault project itself). We allocated 20 hours for that. The most tricky part is going to be: write and test master-slave failover (we need high availability) and handling writes between the moment the master goes down and moment, when azure lock expires. 
By the way, the awesome folks from AgileHarbor (the company behind SkuVault) agreed to make the project open-source under the New BSD license. The project will be hosted on github. The New BSD License (or 3-clause BSD) allows you almost unlimited freedom with the code as long as you include the copyright notice. You don't need to share your code. You cannot use the names of the original company or its members to endorse derived products. Evolution Once we have MessageVault in place, we can start extracting some view-based logic into new modules. This will help us in several ways: we reduce the complexity of the existing codebase; new modules are designed for the new performance requirements, taking load off the existing codebase; new modules (at least the ones with pure denormalization logic) are going to be brutally simple. Eventually, we are planning to have SkuVault composed of modules like this one: In essence, this is an application of the \"event-driven micro-services\" paradigm to the .NET environment. SkuVault C# modules are probably going to be \"larger\" than their equivalents in Erlang or Golang. That is because the C# .NET ecosystem is shaped by an enterprise mindset. As a side effect, everything tends to be bigger on .NET: libraries, classes, variable names and build times."}
{"url":"/sku-vault/week-07/","title":"Delivering First Bits to QA","text":"Delivering First Bits to QA Deploying MessageVault This week we finally pushed our MessageVault project to QA deployment. It will live there for some time, before going to production. To remind you, MessageVault is a simple message storage server that runs on Windows Azure. It is inspired by Apache Kafka. This server allows saving messages as streams to page blobs on Windows Azure Storage. Streams can reach terabytes and go beyond. Once we persisted messages to Azure Storage, we could read them directly from there, MessageVault server is not needed for that. You can have any number of consumers reading them or even replaying from the start. All events that are recorded in QA deployment of SkuVault, get pushed to dedicated streams on MessageVault. It is easy to consume them from any code at this point. From now we could start migrating some existing logic to dedicated modules to do reporting, exports, serve as API etc. The most important benefit of MessageVault setup is that it would allow to: Implement some of the existing functionality as stand-alone modules with API. Scale out the solution by batch processing messages, and adjusting implementation of each module according to the use-case. Experiment with writing new modules without risking the integrity of the existing solution. Simplify existing solution, by replacing integration via commands and views with integration via crafted domain events . Introduce testability of API modules as one whole. Simplify front-end (no need to query multiple key-value documents in order to display one simple report or a view). What's more important - these changes could happen incrementally without big rewrites to the system . At least that's the plan. We'll see how the things would work out in the next weeks. You might get an impression that we are going to transform a large application into a crowd of small micro-services in .NET. This is not entirely true. In fact, the number of projects could even go down, as we reduce coupling and duplication of view logic. We had to learn limitations, week points and scaling possibilities of MessageVault. So it went through some intensive stress-testing before the QA. The best approach was to deploy the system and test it in different scenarios. For example, these are the throughput numbers (transactions per second and messages per second) and latencies while running it on Large, Medium and Small instances of Windows Azure Role. As you can see, MessageVault doesn't handle long-term load well, if deployed as Small Azure Service. It handles load better (and with lower latencies), if deployed as Large Azure Service. These performance numbers are preliminary, there are things that could see improvement (e.g. improving scheduler for high-throughput scenarios). However, we are going to do an internal code-review of the project before moving forward with it. This approach follows the mantra: first make it work, then make it beautiful; afterwards make it fast. You could check out the sources and performance for yourself. MessageVault is available on github as part of AgileHarbor repository. It includes libraries for the client and server along with examples and an Azure deployment you can use. Hekad Gateway Graphite statistics on the image above are a courtesy of our dedicated devops server. The same server also provides a nice UI to access and search logs in real-time. 
In order to benefit from these capabilities, the code has to know how to push data to this devops server. Integration between .NET and Linux proved to be the trickiest part. However, today we pushed to QA the first deployment of SkuVault with real-time statistics and logging enabled. The SkuVault codebase itself isn't instrumented with performance counters yet. However, it became easy to do so. This will allow us to learn the behavior of the system in real-time, understand the bottlenecks and improve performance for the users. The integration project for pushing logs and statistics from .NET code to Hekad is available as open source. It is probably too specific for reuse, but it could serve as a sample. Breaking the Build Fun fact: in the process of the QA deployment I broke the build half a dozen times. The project didn't compile on my machine due to a private dependency I didn't have. So I had to pretend to be the compiler and MSBuild, verifying my own changes in my head before pushing them to the build server. It all worked in the end, but I probably didn't leave a good impression as a first-time committer to the core codebase :) SkuVault Record Obviously, the development effort at SkuVault didn't focus just on DevOps and messaging middleware. The team was busy improving the user experience and stability of the rapidly growing product. By the way, during the last Black Friday, SkuVault storage broke the record of 100M events stored in the system. At that point in time, there were more than 4.5M sales created in SkuVault in total (throughout its history) and 2.3M sale items picked. Kudos to the SkuVault team for reaching these impressive numbers and exceeding them. Next Steps In the next weeks we'll be working on migrating some of the existing logic to a simpler event-driven design, as well as planning and prototyping new features for the product."}
{"url":"/sku-vault/week-11/","title":"Simplifying Web UI","text":"Simplifying Web UI During my last few weeks at SkuVault I focused on three areas: Find a way to simplify existing application design, scale it and introduce an API. Work on a new SkuVault feature (time-bound) implemented with API and a new UI prototype to back it up. Plan and prepare infrastructure bits needed to start rolling things out to the Production. So far, as it looks like, all design changes seem to come together rather nicely. Most of the credit here goes to Slav and his team for making everything event-driven from the start (the story would’ve been much less interesting, if the product was centered around the database). Plan for the backend is quite simple: gradually replace existing modules with API modules, while covering API with specifications and simplifying the internal implementation. A lot of commands and views will become obsolete. Language and platform of the new code would stay the same - .NET. While working on the new feature, we came to the idea of how to deal with the Web UI as well. SkuVault UI already reflects a lot of complexity from the underlying domain, while ASP.NET MVC, Lokad.CQRS and Angular.JS don’t necessarily make things easier. Fun domain fact: you can’t really prohibit a warehouse worker from doing something in the application. If really needed (as in: “manager just told him so”), he will do whatever is needed. However without the ability to record that change in the app, discrepancies will start accumulating. Solution to that problem couldn’t involve the Big Rewrite , because of the time-frames and the risk. Eventually, we came up with a simple idea: Start implementing new features as tiny single-page Web applications that talk to the HTTP API. Make sure that if you are logged into one feature, you are logged into the entire application. Any JS framework could be used to implement a feature. It all works, as long as it renders into a static JS bundle. Since a feature is implemented as a static single-page application, it could be hosted on any platform that can serve static files (Azure Web Sites, CDN, Amazon S3). Existing app could serve these features as well, and that’s how we will migrate the UI - replace existing code with small features that are versioned and served separately. Although, if React.JS is used to implement a feature, visual widgets from the shared library could be reused. Folks at Facebook use that a lot. This approach actually has a few additional benefits : UI Features could be developed, deployed and tested in parallel, without stepping on the toes of each other. We could have a setup, where a git push of a new version of a feature product-search to a branch smart-auto-complete will actually deploy that feature to a website under the url /product-search-smart-auto-complete/ . That could speed up the development and QA cycle. We have a clean separation of concerns between the stateful API (handles scaling and domain complexity) and the stateless UI (focuses on user interactions). A/B testing and gradual feature roll-out to customers also become quite easy to implement. While working on new functionality, I started implementing UI this way because it was much faster to prototype. However, once we discovered additional benefits of that approach and a simplicity of integration with the existing UI, we might give it a try. 
These changes could take some time, but they should bring the entire product onto a rather technically boring stack: an HTML + JS frontend and .NET services in the backend. Boring, that is, if you forget about ReactJS, the possibility of React Native, event-driven backend services and the scalability challenges :) Within the upcoming weeks we plan to roll out the new MessageVault integration and API bits to work with the events coming from production (we've got to test realistic workloads). I’ll be focusing on the new feature most of the time."}
{"url":"/solving-the-right-problem/","title":"Solving the right problem","text":"Solving the right problem In the Solving the wrong problem , we took a look at what happens when the project applies a very good engineering solution to the wrong problem. This blog was an example. Here is a story of another project that focused on the right problems from the start. Getting priorities straight A few years ago I wrote a small journaling web app to help me work on multiple projects at Trustbit. The purpose was to let me track any piece of information with as few distractions as possible. A lot of effort went into the user experience . I've been ruthlessly killing and rebuilding all features that somehow affected the writing flow. I tried to minimize the friction and make the writing process as rewarding as possible. A little \"pat on the back\" once in a while can do wonders. Throughout the development, I deferred any performance optimizations or tech improvements until it really hurt (it happened only once). Whenever the project could be dumbed down, it would be. I killed these awesome features that were a technical and cognitive burden: full-text search note tagging expandable topic views topic name auto-completion within the editor (to make it a wiki). Plus, I have replaced Ruby-on-rails with a boring Python Flask app, file-based storage and SQLite indexing. What is the measurable outcome? Metrics speak for themselves. I've been using Py-journal for ~70 weeks. It has more than 1700 posts. That is 3 notes on average per day! Granted, only a small fraction of these notes could be shared publicly, but there are quite a few interesting stories there. In the hindsight, what were the most important decisions leading to this outcome? First, it takes only a few steps to write down something that belongs to an existing topic. Everything that is not so relevant is either faded or removed. The writing flow looks like this: Open the web app (it lives in a persistent Safari tab). Type a few letters to find the matching topic via instant fuzzy search (or hit \"add\" to create a new one). Hit \"Enter\" to open the first match. Hit \"A\" to add a new note and start typing right away. Write down the text in a visual editor. Copy-paste or drag images to attach them to the note. Save when finished. Second, it is the job of the application to make content discoverable and rewarding. As soon as the note is saved, the application will parse the HTML and pull any optional headers. These are added to the global table of contents associated with the topic. It is so rewarding to add a new note and see the outline grow! This Topic view also helps to maintain a high-level perspective on the progress . Along the way, it subtly encourages to continue the work : \"Look, you have been working on this thing for so many years, what about digging a little bit more, just to keep the momentum\"? Weekly review shows the work done within a specific week (also rewarding): All topic images are included in the pinboard view (more fun than useful, but encourages to add illustrations): Filtered export creates a nice PDF for all work that was done within a specific time period in some topics. This is how I send work notes to my colleagues twice a week. Lessons learned Core lessons that I've learned while working on that project (and writing this retrospective): Good engineering solutions can be a liability when they solve the wrong problem. Pick the metrics to optimize and focus on them. 
Remove anything that doesn't improve the metrics; this will make room for helpful additions down the road. Given these lessons, I plan to fix the problems with this blog as well. It should be easier and more rewarding for me to publish here. As a first step, I moved the entire blog to a tiny cloud VM (Debian) with the Caddy server running in front. The website is still static, but the editing loop is faster. It is a small relief. It is just the first week of 2022, but I'm already finishing my 3rd blog post. Next possible solutions to explore: introduce comments; speed up the publishing pipeline; add a WYSIWYG editor with an image copy-paste experience; make URL slugs and filenames optional; drop GA and set up privacy-preserving analytics to track the number of visits in Grafana; set up target metrics that I want to achieve: number of weekly visits, number of blog posts, reactions etc.; encourage friends and colleagues to write more frequently as well. We'll see how this works out 😄 Follow-ups I don't like Google Analytics. It is slow and invasive. So I'm removing it completely. Lean web analytics"}
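For illustration, here is a minimal sketch of the header extraction mentioned above (standard library only; the tag choice and API are assumptions, not the actual Py-journal code):

```python
from html.parser import HTMLParser

class HeaderExtractor(HTMLParser):
    """Collect the text of <h1>..<h3> tags to build a topic outline."""
    def __init__(self):
        super().__init__()
        self.headers = []
        self._current = None  # buffer for text inside a header tag

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._current = ""

    def handle_data(self, data):
        if self._current is not None:
            self._current += data

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3") and self._current is not None:
            self.headers.append((tag, self._current.strip()))
            self._current = None

extractor = HeaderExtractor()
extractor.feed("<h2>Results</h2><p>3 notes per day.</p><h3>Caveats</h3>")
print(extractor.headers)  # [('h2', 'Results'), ('h3', 'Caveats')]
```

Appending these headers to a per-topic table of contents is all it takes to make the outline grow with every saved note.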
{"url":"/solving-the-wrong-problem/","title":"Solving the wrong problem","text":"Solving the wrong problem I've blogged about Problem-driven design , then followed up with Problem-driven vs Domain-driven . Let's take a look at a more concrete example. We'll use this blog as an example of a technically correct solution that solved the wrong problem and actually sabotaged my writing efforts. Then we'll check out another project that has achieved a different outcome by focusing on a different problem. The story is split into two blog posts. Good solution This blog has a long history. It started somewhere on Livejournal, Blogger, and WordPress. After getting tired of WordPress maintenance, I moved to Squarespace. Sometime later - to Hugo, then to a static website generator written in node.js, then in Python. The story is captured in Blog Migrated to Python . From the operational perspective, the current setup is perfect . Content and the code are versioned in git. Python code runs a few hundred content files through the templates to generate a static website. Files are uploaded to S3 and served via Cloudflare CDN with edge servers across all major countries. Amazon handles the DNS, domain, and certificate renewal. I'm a software engineer, so I optimized the solution to some degree of perfection. The maintenance overhead and the costs are negligible, the scalability - unlimited. It never goes down. Wrong problem This \"perfect\" solution has an issue - it solves the wrong problem . While chasing the surface problem \"Make it scalable, cheap, and low-maintenance\", I forgot about the original purpose of the blog: to publish stuff and learn through that process . When in doubt, consult the metrics : there were just 3 blog posts published in 2021 and 5 blog posts in 2020. Clearly, this setup doesn't help me to publish blog posts more frequently. This is an example of what happens when you solve the wrong problem: the solution is valid from the engineering perspective, but the overall product is useless . Ugly outcome Let's dig into the writing workflow for the potential clues. I have an idea for a blog post, something to share. I go to the work folder, navigate to content/ and notice that the git state is dirty from the last time. Ok, time to commit previous changes. This will require reviewing the changes and writing an appropriate commit message. Meanwhile, the original idea faints away. Blog posts are stored as markdown files with YAML front-matter. So I need to create a new file. How should it be called? What should be the title? When the editing time comes, I'll need to keep in mind the original idea, the Markdown syntax. Where should I put that image attachment? What was that markdown syntax for the images? Why did I pick the markdown syntax? Because it is popular and makes it easy to edit existing text. The truth is - I rarely edit something that was published. I published something, but the changes don't show up. Ah, I need to invalidate Cloudflare CDN caches to see the changes online and share the URL. Let me log into the AWS console and find my way around. Ouch, there is an embarrassing typo. I probably need to edit the file, publish and then invalidate caches again. Also, don't forget to commit changes to the git with a meaningful name. Little by little, these tiny distractions nibble away at the original blog post idea and inspiration . Until I throw up my hands in exasperation and switch to something else. This also leaves the git repository in a dirty state. 
As a result, I have published just one blog post every 4 months (on average) over the last six years. There are quite a few unfinished posts as well. How could we solve the underlying problem? Let's recap. We want to encourage writing and publishing more frequently , and to optimize the entire flow for my specific constraints: limited brain capacity, little time, plenty of distractions and interruptions. Not a problem : ultra-low maintenance, high scalability, and low-effort editing of existing blog posts. From this perspective, the software stack is less relevant than the user experience . Follow-ups I'm getting rid of Google Analytics on this website: Lean web analytics We take a look at another writing project of mine that focused on different problems and achieved better results: Solving the right problem ."}
{"url":"/stop-user-session-tracking-experiment/","title":"Stop user session tracking experiment","text":"Stop user session tracking experiment For the past months I've been running an experiment on abdullin.com: Removed Google Analytics from this website Replaced analytics with cookie-less user flow tracking: (Over) Designing privacy-first analytics . Used Clickhouse and Python to analyze http access logs from Caddy: Analyze website logs with clickhouse-local . IP addresses were not used. Outcome: not worth it for me . The setup was clever, but the simplicity of \"I don't track your session, even in anonymous and aggregated way\" - wins. I could produce charts like that: or that: But in reality a chart like that is good enough for me: Update list for the website: Remove Hotwired Turbo and stop tracking user sessions. We are back to plain old server-side pages. Upgrade rendering pipeline to Python 3.9 (and a couple of other dependencies) Convert project repository to a proper Python project with setup.py , setup.cfg , pyproject.toml etc. This is consistent with Python standards and data science projects that I help to manage at day work. Less context switching is good."}
{"url":"/structured-output/","title":"Structured Output","text":"Structured Output Summary : Structured Output (constrained decoding based on grammar) forces LLM to respond only according to a predefined schema. Structured Output was popularised by OpenAI, but since then found its way to multiple cloud providers and local inference engines. The best way to illustrate the concept is with a code snippet. Let's say, we want to parse chat messages and extract calendar event data out of them. One way to approach that is by prompting LLM to respond in a specific format. Then, parsing the response with regular expressions to extract required fields. Another approach is to use response schema which will ensure that the output will be structured in a certain way. Like this: Response follows CalendarEvent schema, so we can parse and manipulate is as a type object right away. This saves a lot of development time. The code above prints the list of parsed participants. It will print ['Alice', 'Bob'] When using Python, you can leverage different types of properties to constrain the response: from pydantic import BaseModel from typing import Literal , List class SqlResponse ( BaseModel ): sql_query : str query_type : Literal [ \"read\" , \"write\" , \"delete\" , \"update\" ] tables : List [ str ] class ComponentResponse ( BaseModel ): height_mm : float width_mm : float depth_mm : float number_of_pins : int component_type : Literal [ \"AC/DC\" , \"DC/DC\" ] Different languages will make use of various typing frameworks. Implementation-wise, under the hood everything will most likely be converted to JSON Schema before being passed to LLM inference engine. Structured output is an essential tool for improving LLM accuracy via Schema-Guided Reasoning (SGR) How does this work? Under the hood Structured Output works like a regex for the token generation. LLMs generate probabilities for all tokens at each single token, and constrained decoding simply prohibits certain tokens from happening. We can illustrate this with a simple snippet. The code below prompts a local model (Mistral 7B in this case): \"Write me a mayonnaise recipe. Please answer in Georgian\". By default Mistral 7B is a small model that will not be capable of answering in a lesser-known language, but this specific code will work: from transformers import AutoModelForCausalLM , AutoTokenizer , LogitsProcessor import torch model = AutoModelForCausalLM . from_pretrained ( \"mistralai/Mistral-7B-Instruct-v0.2\" , torch_dtype = torch . float16 , device_map = \"auto\" ) tokenizer = AutoTokenizer . from_pretrained ( \"mistralai/Mistral-7B-Instruct-v0.2\" ) # MAGIC HAPPENS HERE messages = [ { \"role\" : \"user\" , \"content\" : \"Write me a mayonnaise recipe. Please answer in Georgian\" }, ] tokens = tokenizer . apply_chat_template ( messages , return_tensors = \"pt\" ) . to ( device ) generated_ids = model . generate ( tokens , max_new_tokens = 1000 , do_sample = True , num_beams = 5 , renormalize_logits = True , logits_processor = [ Guidance ()]) decoded = tokenizer . batch_decode ( generated_ids ) print ( decoded [ 0 ]) The reason for that is a small class called Guidance which we pass to logit_processor field. This class makes it impossible for LLM to answer in anything but Georgian: import regex alphabet = re . compile ( r '[\\u10A0-\\u10FF]+' ) punctuation = regex . compile ( r '^\\P {L} +$' ) drop_mask = torch . zeros ( 1 , tokenizer . vocab_size , dtype = torch . bool , device = \"cuda:0\" ) for k , v in tokenizer . get_vocab () . 
items (): s = k . lstrip ( '▁' ) if alphabet . match ( s ) or punctuation . match ( s ): continue drop_mask [ 0 ][ v ] = True drop_mask [ 0 ][ tokenizer . eos_token_id ] = False class Guidance ( LogitsProcessor ): def __call__ ( self , input_ids , scores ): return scores . masked_fill ( drop_mask , float ( '-inf' )) Code will work as expected, but will come with a caveat: Mistral 7B will indeed answer only with Georgian letters but would sometimes respond in a complete gibberish. This highlights the major caveat with Structured Output - it forces the model to respond only in a very predefined format, but: this doesn't magically distill the model with corresponding skills this can actually reduce model accuracy, because we constrain not only response but also thinking process. We can leverage Schema-Guided Reasoning (SGR) to use Structured Output while improving accuracy. Caveats Caveat: Description Fields Some LLM APIs will use the response schema twice: To compile and load into the inference engine To silently insert into a prompt Because of this, the following structured request will work as expected on OpenAI: class ResponseFormat ( BaseModel ): say_hi_like_a_royal_person_briefly : str = Field ( ... , description = \"Respond in German!\" ) it will respond in German: { \"say_hi_like_a_royal_person_briefly\" : \"Guten Tag, ich grüße Sie hochachtungsvoll!\" } However, not all APIs and inference engines do that. You can use the ResponseFormat above to test this assumption. Caveat: Accuracy It is very easy to reduce accuracy of a model by introducing constrained decoding. This happens, because we take the ability of a model to think before providing an answer. Schema-Guided Reasoning (SGR) could help to mitigate the problem. Implementations Structured Outputs by OpenAI Mistral Structured Outputs Structured Outputs with Google Gemini Local: XGrammar Local: Outlines"}
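The CalendarEvent example referenced at the top of this entry might look roughly like this (a minimal sketch following OpenAI's structured-outputs API; the model name and message contents are illustrative):

```python
from pydantic import BaseModel
from openai import OpenAI

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

# The response is parsed straight into the CalendarEvent type.
event = completion.choices[0].message.parsed
print(event.participants)  # ['Alice', 'Bob']
```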
{"url":"/talks/2015-01-06-dddbe-facebook-flux-reactjs/","title":"ReactJS and Facebook Flux at DDDBE in Belgium","text":"ReactJS and Facebook Flux at DDDBE in Belgium In January 2015 I was invited by awesome DDDBE community in Belgium to give a talk about building responsive web applications with Facebook Flux and React.js. This was a big honor, since European DDD community is the place where the most of recent insights in Domain-Driven Design originate from. Below you will find links to video, slides and the source code for this presentation. References Source Code ( version from the talk ) Video ~2 hours Slides Trappist Beer Belgiam DDD community kindly introduced me to the Trappist Beer , which is probably among the best beers in the world. Trappist beer is brewed by Trappist breweries. Only ten monasteries — six in Belgium, two in the Netherlands, one in Austria and one in the United States — currently brew beer and sell it as Authentic Trappist Product. Weeks later, these few treasured bottles reminded me about the given promise to commit something cool to the Open Source, marking it with #dddbe_beers tag. Thanks to _Orval Trappist Ale , seq` was delivered. It is a library in Golang to verify assertions against complex objects (e.g. events and API responses). We always wanted to have this library, while working on event-driven specifications at HappyPancake project. You can find it out on github . Feedback Many thanks to Mathias Verraes and Stijn Vannieuwenhuyse for organizing the event. Also my thanks go to the DDDBE community for having me and providing such an inspiring feedback. DDD Belgium Community We liked our last speaker @abdullin so much, that we've booked another Russian. Bram De Moor : Great talk yesterday! A nice mix between tech and conceptual overview, good pace, and comprehensive answers to the questions. Tom Soete : Note to self: Take day off after each @dddbe meeting! Can't wait to start playing with #ReactJS + #FluxJS after the great talk by @abdullin David Cumps : Had a very nice @DDDBE presentation by @abdullin, thanks. Enjoyed it, going to look into react and flux, really resonated. Steven Willems : Thx @abdullin for introducing me tonight to #flux and #reactjs. I really liked what I heard. Actually, it really is CQRS in the browser :) Tom Janssens : Thank you; we had a blast! Kees van den Berg : thanx for the inspiring presentation of Flux with the latest @DDDBE meetup Sven Schelfaut : Really nice @DDDBE session from @abdullin Looking forward to the next session @combell Dieter Provoost : Great talk on ReactJs/Flux by @abdullin for @DDDBE. Thanks!"}
{"url":"/talks/2015-03-04-event-driven-design-workshop-madrid/","title":"Workshop on Event-Driven Design in Madrid","text":"Workshop on Event-Driven Design in Madrid A few days ago I ran a small workshop on software design for the awesome folks at h4ckademy. h4ckademy - 6 weeks of immersive code retreat in Madrid, inspired by HackerSchool. It's designed for programmers that want to become better professionals. Israel Gutiérrez is one of the organizers of the event. Audience was very diverse - heterogeneous backgrounds, different levels of experience (recent college graduates and professionals with 10+ years in software) and different minds (from designers to console hackers). We decided to run a crash course in software design, focusing on the practical ideas that could be explained quickly and in a coherent manner . The plan looked like this: Essence of software design - principles of Domain-Driven Design, fast iterations, importance of feedback, focus on contexts, boundaries and contracts. Divide and conquer. Domain events and API contracts as the core part of interchange contexts. Event Storming Session - collaborative exercise on analyzing a new business domain and building a useful model: contexts, domain events, API contracts. Practical application of event-driven model - split domain implementation into modules; capture behaviors with runnable specifications expressed via domain-events and API contracts; scale design to handle more features, team members and higher loads; high availability. CQRS Beers - an informal discussion, focused on QA, actual code and implementation details. We talked about ReactJS/Flux vs AngularJS and the other MV* frameworks, career paths of a developer, building reputation, working remotely and dealing with burn-outs. The workshop went well, I enjoyed presenting to such a diverse audience and doing exercises together. Many thanks to Israel Gutiérrez for inviting me over for such an event. What could be improved To make this workshop better: allocate more time for the same amount of material (we had to rush through some concepts) or reduce the amount of material for this time-frame; explore more than one problem domain in exercises, switching people between teams; dedicate a block of ~15-20 minutes to front-end problems and designs. Things that worked out very well and should be kept: mixing presentations with QA and collaborative exercises - this allows to keep people more involved in the process, cover more material; CQRS Beers - that format works very well for relaxed discussions (as proven by the years of experience :) ); event-storming - it is one of the best parts of the design process, thanks to the EU DDD community, there is a lot of fun and interactivity; following the reasoning sequence from \"foundational design principles at bird-view\" to \"collaborative design exercise\" and then to \"practical applications\"; maybe it could be extended with hand-on exercises for longer workshops. References These additional materials contain more information on the topics covered during the workshop. They can also provide answers to some questions we didn't have time to address. Download slides (PDF). Slide 3 is the most important one there. HappyPancake Story - story of a HappyPancake project, covering many aspects of event-driven design, \"micro-services\", specifications and event-driven UI. Check this story, the others and simply browse the site. 
Being The Worst - a light-hearted podcast on domain-driven design, implementation patterns and learning how to build a task manager. We are still doing it. Domain-Driven Design - the foundational book by Eric Evans on the DDD methodology and related patterns. Start reading it with the chapter on Context Mapping. Keep in mind that \"Domain Events\" weren't considered to be very important when the book was written (things have changed since then). Implementing DDD - a newer book by Vaughn Vernon, addressing some concerns and questions which appeared since the \"Blue Book\" came out. Don't read the appendix on Event Sourcing; the material there is very outdated. Ziobrando's Lair - blog by Alberto Brandolini, a great guy and an experienced DDD practitioner who coined the term \"event storming\". Sample TODO app in ReactJS/Flux - the front-end sample that I showed during the CQRS Beers. It is work in progress. Sample modular backend for TODO app - work in progress, but it already includes the specification tests I showed during the workshop. The references above are most closely aligned with the material given during the workshop. If you have more specific questions, please don't hesitate to get in touch. Responses h4ckademy : Very funny workshop about event-driven design with @abdullin Enjoying it a lot! h4ckademy : ... A lot of fun and learning, both at the workshop and the beers! Ruben H : Great fun yesterday learning EDD with @abdullin at @h4ckademy Feedback Anonymous: What did you like the most? When you explained how to implement a microservices architecture with a DDD approach. What else would you like to hear about? More in-depth event treatment, techniques and tools. Anonymous: What did you like the most? I liked a lot a general map of how the architecture is made. It was very clear, and interesting. Also i enjoyed play with the architecture over the table splitted in teams of a company, it's great to have a hands-on game. What could make this workshop better for you? The workshop was excellent, but to make it even better, maybe could be possible to make a little application or at least the basic architecture of event driven design, just to have a feeling with the architecture. What else would you like to hear about? I would like to know more use cases and maybe a real stories of implementing this kind of architecture in real projects. Ruben H : What did you like the most? First part, where you explained how to design an app/software using DDD. What could make this workshop better for you? More time on event-storming session. What else would you like to hear about? Compare DDD to other methods that also allow to design and implement app/software. Víctor Pérez : What did you like the most? It gives you a general overview of how to join bussiness and development worlds in a verbose way. What could make this workshop better for you? Maybe longer, but it was already adjusted to our needs, so maybe there's no reason for changes. Rubén Antón : What did you like the most? The explanation was really good and easy to follow for everyone. But the part I liked the most was the event storming one. It showed off how valuable can be in order to understand the domain and the acceptance tests we could write later on. What could make this workshop better for you? I'd love to hear more about the architecture inside the microservices. What else would you like to hear about?
If you were at the workshop and want to share your feedback, please don't hesitate to get in touch with me via Twitter @abdullin or email: rinat@abdullin.com."}
{"url":"/talks/2015-03-18-event-driven-design-barcelona-ebay/","title":"Software Design in Barcelona at eBay Enterprise","text":"Software Design in Barcelona at eBay Enterprise A few days ago I had a chance to give a talk on software design in Barcelona. The event was organized by Barcelona Software Craftsmanship group and hosted by eBay Enterprise International. Photo by Angel Rojo . The audience was quite diverse: folks with PHP, JavaScript, Java, C# background, most of them are programmers, a few technical managers, DevOps. There were people from eBay teams as well as the other members of Barcelona software scene. Plan for the class looked like this: Principles of software design - principles of the Software Design Process, DDD and decomposition, importance of fast feedback loop and iterations, boundaries and contracts. Divide and conquer principle. Domain events and API contracts as the core part of interchange contexts and enforcing long-term stability. Event Storming Overview - audience was too large to run an event-storming session within the conference. Instead we went though the theoretical aspects of event storming and various uses. Practical application of event-driven model - split domain implementation into modules; capture behaviors with verifiable use cases expressed via domain-events and API contracts; scale design to handle more features, team members and higher loads; high availability. CQRS Beers - an informal discussion, focused on QA, actual code and implementation details. We talked about CQRS/DDD principles, command buses, asynchronous UI updates, handling high load and ReactJS/FLUX. I enjoyed the class and the beers afterwards. Many thanks to Alvaro García , Manuel Rivero , Cabre Barrera and Villazala Gordo for the invitation and organization of the event. Thanks to Angel Rojo for the photo. References These additional materials contain more information on the topics covered during the workshop. They can also provide answers to some questions we didn't have time to address. Download presentation (PDF). Slide 3 is the most important one there. Sample modular backend for TODO app - work in progress, but already includes these use cases I showed during the workshop. HappyPancake Story - story of a HappyPancake project, covering many aspects of event-driven design, \"micro-services\", specifications and event-driven UI. Check this story, the others and simply browse the site. Being The Worst - light-hearted podcast on domain-driven design, implementation patterns and learning how to build a task manager. We are still doing it. Domain-driven Design - the foundational book by Eric Evans on DDD methodology and related patterns. Start reading it with the chapter on context Mapping. Keep in mind that \"Domain Events\" weren't considered to be very important when the book was written (things have changed since then). Implementing DDD - newer book by Vaughn Vernon, addressing some concerns and questions which appeared since the \"Blue Book\" came out. Don't read the appendix on Event Sourcing, material there is very outdated. Ziobrando's Lair - blog by Alberto Brandolini, a great guy and an experienced DDD practitioner that coined term \"event storming\". References above are most closely aligned with the material given during the workshop. If you have more specific questions, please don't hesitate to get in touch. Responses Angel Rojo : I enjoyed a lot the talk by Rinat... Interesting round of Q&A after the session and thanks to eBay to provide the snacks and drinks! 
Nacho Cabre : Masterclass on event driven design by @abdullin thanks for all the food for thought! Félix Delval : ...really enjoyed the presentation of @abdullin about event driven software jhvaras : The important is in the white spaces. Great talk. Thanks @abdullin @eBayESP @bcnswcraft Francesc Gil : @abdullin really good talk today at meetup ;) Alvaro Garcia : @abdullin thanks for the very interesting and information-packed conference at @eBayESP about #DDD and #eventDriven Feedback If you were at the workshop and want to share your feedback, please don't hesitate to get in touch with me via Twitter @abdullin or email: rinat@abdullin.com. Nacho Cabre : What did you like the most? I enjoyed a lot the explanation of emergent design from a DDD standpoint, event storming and all the discussion around domain events. What could make this class better? It would be very interesting to explain all these concepts just working on a simple problem and following all the steps you described on the emergent design with the same problem. For instance, the typical TODO list application how it would be implemented following the DDD approach. What else would you like to learn about? I think it would be very useful to do a practical workshop and work on a problem and practice event storming, context mapping, develop some component using tactical design patterns, measure and then, refactor it in a second sprint."}
{"url":"/tech/","title":"🤔 Opinionated Tech","text":"🤔 Opinionated Tech This is a collection of my ever-green notes on software design, languages and technologies. \"Ever-green\" means that these articles are kept up-to-date with my learning and discoveries. Unlike articles from Archive ."}
{"url":"/telemetry/","title":"Telemetry","text":"Telemetry Telemetry originates from Greek roots meaning \"remote measurements\". When applied to software development, it gives a real-time insight into a running software. Metrics can include: gathering logs from different machines; capturing performance metrics in real-time; tracing individual requests running through different machines. Linux world has a few handy (and battle-tested) tools to help with these problems. Installing and configuring them can be a task ranging from \"it takes just an apt-get\" to \"it requires a day of admin work\". Logging My preferred approach to logging can be described as: write structured text logs and ship them to an ElasticSearch cluster with Kibana dashboards . Let's discuss that in more detail. Structured logging libraries so far worked the best (for me) for capturing logs in applications. They allow to: capture meta-data in the events; write these events to rolling text files; ship events in batches to ElasticSearch cluster (with retries and cluster fail-over). Serilog is a logging library by Nicholas Blumhardt (also famous for Autofac.NET). It preserves the structure of events it writes, making it well-fitted for feeding that data to ElasticSearch. Recommendations: Always enable self-diagnostics log of your logging library (e.g. write it to a file that is truncated every time it reaches 1MB). If you have legacy-style log lines (e.g. \"System shuts down after {0} hours\" ), replace them with the annotated metrics ( \"System shuts down after {uptimeHours} hours\" ), which will come in handy for log processing later. Enable rolling logs and constrain maximum size for each file, based on your known storage quota. This is especially important for deployments with limited disk space, like Windows Azure. Elastic Search is an open-source server for searching and analyzing data in real-time. There are a lot of ways to ship the data to it from different machines. For example, you can use Logstash in Java world or Serilog with ElasticSearch sink in .NET. There are a lot of materials written about ElasticSearch. Here is a feature list . Recommendations: Given the choice between a cloud and a dedicated machine to host ElasticSearch, go for the latter. Non-virtualized environments have more performance and are generally cheaper. Individual shards can fail. It is advised to run ElasticSearch in a small cluster. Don't do all the setup yourself, just find a sysadmin to spend a day or to and handle the task. It will be faster and cheaper. Kibana is a nice Web Interface for exploring and visualizing the data. It plugs into the ElasticSearch and just works. Recommendations: Give access to Kibana to the developers and just let them build dashboards for themselves. Setup an automated backup of Kibana dashboards (even if they are stored in ElasticSearch cluster). Metrics Application metrics are usually real-time events that fall into the following categories: timers express duration of a particular method call (e.g. API call or DB request). counters measure number of calls, amount of data that goes through or anything else that can be counted (e.g. a number of requests per second, bytes per minute or exceptions per day). gauges represent a single variable that can change over the time (e.g. amount of used disk space or RAM consumption). These metrics are frequently represented as tiny events with a complex name ( API_v1.PROD.Worker_IN_0.azure.queues.sales.count ) and a numeric value. 
StatsdClient is one of the libraries for capturing metrics in a .NET application. Metrics can be shipped from a running process via UDP to a locally running aggregator daemon. This daemon will then send them to a metrics storage back-end. I like Mozilla Hekad for metrics aggregation, however it might encounter some issues under Windows. Carbon (storage), Graphite (a robust but ancient Web UI) and Grafana (a modern front-end for Graphite) are frequently used to store these performance metrics, explore them and build dashboards for them. Graphite can be a pain to set up, so reusing an existing docker container is recommended. InfluxDB is a more robust alternative to Carbon/Graphite (and Grafana works with it), however sending data to it would require rethinking the metrics and rewriting the dashboards. Automated alerts can be set up once you have application metrics available. For example, you could get emails if: there is no activity in the queues during a certain peak interval; there is an unusually high number of registrations or web requests; the server runs out of disk space or starts swapping; the API takes an unusually long time to respond."}
{"url":"/terms-and-conditions","title":"Terms and Conditions","text":"Terms and Conditions Terms and Conditions - ML Labs 1. Introduction Welcome to abdullin.com , including its subdomain labs.abdullin.com (referred to as the \"Website\"). These Terms and Conditions (T&Cs) govern your access and use of the Website, including any restricted areas that may require account creation. By accessing or using the Website, you agree to be bound by these T&Cs. If you do not agree with these terms, please do not use the Website. 2. Accounts and Access to Restricted Areas Account Creation: To access certain restricted areas of the Website, you may be required to create an account by providing an email address. The Website may send you an access link to verify your email and grant you access. You agree that the information you provide is accurate and up to date. Account Responsibility: You are responsible for maintaining control over access to your account (e.g., access to your email). Please notify us immediately if you suspect any unauthorized access to your account. Account Termination: We reserve the right to suspend or terminate your account at any time if you violate these T&Cs, misuse the Website, or engage in any unlawful or harmful behavior. 3. Use of the Website Informational Purpose: The content provided on this Website is for informational purposes only. While efforts are made to ensure that the content is accurate, the Website does not guarantee its completeness, reliability, or timeliness. Prohibited Activities: You agree not to: Use the Website for any illegal or unauthorized purpose. Interfere with the security of the Website or any restricted areas. Share or transfer your access to the Website to another person. Attempt to access areas of the Website you are not authorized to access. Distribute, reproduce, or share content from the Website without prior written consent and proper credit to the Author. 4. Intellectual Property and Content Ownership The content, design, and layout of the Website, including text, images, graphics, and logos, are protected by intellectual property rights, including copyright, and are owned or licensed by the Website owner. You may not reproduce, modify, distribute, or republish any part of the Website without prior written consent. For further details, please refer to the Copyright section of the Website. 5. Limitation of Liability Use at Your Own Risk: The Website is provided \"as is\" without any warranties, express or implied. The Website owner makes no guarantees as to the accuracy, completeness, or reliability of the content and disclaims any liability for damages arising from your use of the Website. No Liability for External Links: The Website may contain links to third-party websites, which are not under our control. We are not responsible for the content or practices of these external websites. 6. Governing Law These Terms and Conditions shall be governed by and construed in accordance with the laws of Austria. Any disputes arising from the use of the Website shall be subject to the exclusive jurisdiction of the Austrian courts. 7. Changes to the Terms and Conditions The Website reserves the right to update or modify these T&Cs at any time without prior notice. Any changes will be posted on this page, and it is your responsibility to review these T&Cs regularly. By continuing to use the Website after any modifications, you agree to the updated terms. 8. 
Contact Information For any questions or concerns regarding these Terms and Conditions, please contact the Website owner at biz-bot@abdullin.com."}
{"url":"/video-course-privacy","title":"Privacy Policy for Online Courses","text":"Privacy Policy for Online Courses Privacy Policy for Online Courses 1. Introduction This Privacy Policy applies to all personal information collected by Rinat Abdullin, e.U., through the online courses available at labs.abdullin.com. By registering and using the Courses, you consent to the procedures described in this policy. 2. Data Controller Company Name : Rinat Abdullin , e . U . Address : 1220 Vienna , Gisela - Legath - Gasse 5 / 1 , Austria Email : biz @abdullin . com is the data controller responsible for the processing of your personal data as described in this Privacy Policy. 3. Personal Data Collected and Stored The following personal data is collected when you use our services: Email addresses Information about access rights and purchased courses Answers to quiz questions, where applicable Records of watched materials 4. Purpose of Data Collection The data collected is used for the following purposes: To provide access to the digital content of the Course To manage and administer your access rights to the purchased materials To process transactions and maintain a purchase history To communicate with you regarding your account and any relevant product announcements or updates 5. Legal Basis for Processing The processing of your data is necessary for the performance of a contract to which you are a party (i.e., the provision of the Courses as per the Terms and Conditions agreed upon purchase of an individual course or a subscription). 6. Data Retention Your personal data will be retained for as long as necessary to fulfill the purposes outlined in this Privacy Policy unless a longer retention period is required by law. 7. Data Security We implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk, including hashing of your passwords and securing your data against unauthorized access, destruction, or alteration. 8. Sharing and Disclosure The author does not share your personal data with third parties for their marketing purposes. We may disclose your personal data to third-party service providers, who perform services on our behalf under confidentiality agreements, ensuring they adhere to the privacy standards mentioned in this policy. No website analytics services track or collect data about your visit or usage patterns on our site. 9. Your Rights Under the GDPR, you have the following rights regarding your personal data: Right to Access (Art. 15 GDPR): You can request confirmation and details about the personal data we process about you. Right to Rectification (Art. 16 GDPR): You can correct inaccurate or incomplete personal data. Right to Erasure and Restriction (Art. 17 and 18 GDPR): You can request deletion of your data, or restrict processing under certain circumstances. Right to Data Portability (Art. 20 GDPR): You can receive your personal data in a structured format or transfer it to another party. Right to Object (Art. 21 GDPR): You may object to processing of your data based on our legitimate interests, especially for direct marketing. Right to Lodge a Complaint (Art. 77 GDPR): You can lodge a complaint with a supervisory authority if you believe your data is misused. We are also obligated to notify any recipients of your data about your requests for rectification, erasure, or restriction of processing. 10. 
Changes to the Privacy Policy We may update this Privacy Policy from time to time by posting a new version on our website or third-party services. You are advised to review this Privacy Policy periodically for any changes. 11. Contact Us If you have any questions or concerns about our Privacy Policy or the data processing practices outlined herein, please contact us at biz-bot@abdullin.com."}
{"url":"/video-course-terms","title":"Terms and Conditions for Video Courses - Hosted Video","text":"Terms and Conditions for Video Courses - Hosted Video Terms and Conditions for Video Courses - Hosted Video 1. Introduction These Terms and Conditions govern the sale and use of all online courses (hereafter referred to as \"the Courses\") Rinat Abdullin (hereafter referred to as “the Author”), a sole proprietor in Austria, available at labs.abdullin.com (hereafter, “the Website”). By purchasing access to any Course, you agree to be bound by these Terms and Conditions and our Privacy Policy. The contract between the Customer and the Author is concluded at the time of purchase when the Customer completes the payment process. 2. License and Access The Courses are offered as digital products available for viewing on the Website and are strictly for viewing at the Website and not for download. Upon purchase, customers will receive the access link via their registered email, which will serve as their login for accessing the Courses. This license is limited, revocable, non-exclusive, and non-transferable except as explicitly provided herein. 3. Pricing and Payment The pricing for the Courses varies and is displayed on the website. Payment is due at the time of purchase and is a one-time fee unless otherwise specified. Access to the Courses is granted immediately after the payment is processed. If you purchase a multi-seat or team access plan, you must provide the email addresses of the team members who need access. Access for these additional seats will be granted promptly after you provide the required details. The number of seats available for each pricing plan is specified at the Website. 4. Cancellation and Refund Policy In accordance with Austrian law regarding digital content and online services, consumers have the right to withdraw from the purchase contract within 14 days of the contract's conclusion without giving any reason. However, this right of withdrawal expires once the streaming of the digital content has begun with your prior express consent and acknowledgment that you thereby lose your right of withdrawal. Therefore, refunds are not supported once access to the Course has been initiated. 5. User Obligations You agree not to provide your login credentials to others or to allow others to use your account. You agree not to use the Course in any manner that violates applicable laws or regulations. 6. Intellectual Property All intellectual property rights in the Course, its content, and all materials distributed at or in connection with the Course are owned by the Author. Any unauthorized use, reproduction, or distribution is strictly prohibited. 7. Disclaimer of Warranties and Limitation of Liability The Course is provided 'as is' without any guarantees, conditions or warranties as to its accuracy, completeness, or reliability. We do not warrant that the website or the services that make the Course available will be continuous, uninterrupted, secure, or error-free. 8. Miscellaneous These Terms and Conditions constitute the entire agreement between you and the Author, regarding your purchase and use of the Course. 9. Governing Law and Jurisdiction These Terms and Conditions are governed by and construed in accordance with the laws of Austria. Any disputes arising out of or in connection with these Terms and Conditions shall be subject to the exclusive jurisdiction of the Austrian courts. 10. Contact Information For any inquiries or complaints, please contact biz -bot @ abdullin . 
com."}
{"url":"/video/event-sourcing/","title":"Event Sourcing Video","text":"Event Sourcing Video This video explores relations between domain model, event sourcing and traditional fact-based state persistence. Video link DDD Transport Tycoon kata could be found on github . See also There is another video on Event Storming (12 min) (workshop format). You can also check out a few blog posts about a large event-driven project I participated in: Verification of Event-Driven Systems High Availability and Performance"}
{"url":"/video/event-storming/","title":"Event Storming (12 min)","text":"Event Storming (12 min) Video link Event Storming is a workshop where people use colored post-it notes to discover and understand together something complex: business, process or software. This video goes through: prerequisites for running an Event Storming workshop; workflow of a single workshop; what happens after the workshop. You can learn more about Event Storming in a book by Alberto Brandolini . See also Event Sourcing Video ."}
{"url":"/webinar-terms","title":"Webinar Terms & Conditions","text":"Webinar Terms & Conditions"}
{"url":"/write-things-down/","title":"Write things down","text":"Write things down If you have some things on your mind - write them down. Your brain has a limited capacity to keep things alive. Besides, it takes energy to keep things around. So your brain will start shedding away some \"irrelevant\" details anyway. By writing things down, we preserve them on a more durable medium. Paper or files are a cheaper form of storage than our brain (energy-wise). When we write things down, we permit ourselves to forget things. The unburdened mind could then give complete attention to the task at hand . I found out that I could work with 5-6 projects during the day without feeling too stressed (2-3 is the comfortable zone). The only trick was to treat me as a very dumb goldfish and write things down in a journal as soon as possible . This way, at any moment in time during the workday, I'll have most of the brain capacity free for focusing on the task at hand. Notes don't have to be verbose . A short sentence and a chat screenshot can be all it takes to get current context out of my mind and switch to a new task with full dedication."}
