Collegium:Terra: Difference between revisions

From OODA WIKI
AdminIsidore (talk | contribs)
{{DISPLAYTITLE:Imperium System Architecture}}
{{italic title}}
'''Imperium''' is a distributed, multi-node data processing pipeline designed to automate the collection, processing, and publication of data. The system is composed of several specialized hardware nodes, each with a distinct role, orchestrated to work in concert. This document outlines the foundational architecture as of September 23, 2025.


== Core Infrastructure Nodes ==
The Imperium pipeline is built upon five primary local and cloud-based servers, each with a unique specialization.


=== [[Horreum]] ===
Serves as the primary high-performance compute node, specializing in GPU-intensive tasks. It operates as a headless server, receiving jobs from the orchestrator node, Roma.
* '''Role''': Headless GPU Compute Server
* '''Hardware''': HP Z620 Workstation
* '''Operating System''': Ubuntu 24.04.3 LTS
* '''CPU''': 6-Core / 12-Thread Intel Xeon E5-2630 v2 @ 2.60GHz
* '''GPU''': NVIDIA GeForce RTX 5060 Ti with 16 GB VRAM
* '''Memory''': 32 GB
* '''Storage''': 1 TB Micron SSD, configured with a 100 GB LVM partition for the OS and ~850 GB of unallocated space for data volumes.
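Before dispatching a GPU job to Horreum, the orchestrator needs to know whether the card has free VRAM. A minimal sketch of such a check follows; the function names and the headroom threshold are illustrative assumptions, though the `nvidia-smi` query flags themselves are the tool's standard CSV interface.

```python
import subprocess

def parse_gpu_memory(csv_line: str) -> tuple[int, int]:
    """Parse one line of `nvidia-smi --query-gpu=memory.used,memory.total
    --format=csv,noheader,nounits` output into (used_mib, total_mib)."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total

def gpu_has_headroom(csv_line: str, required_mib: int) -> bool:
    """True if the GPU can fit a job needing `required_mib` MiB of VRAM."""
    used, total = parse_gpu_memory(csv_line)
    return total - used >= required_mib

def query_gpu() -> str:
    """Ask the local NVIDIA driver for current VRAM usage (runs on Horreum)."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    ).strip()
```

On the 16 GB card, a line like `"1024, 16311"` would report roughly 15 GB of headroom, so an 8 GB job would be admitted.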


=== [[Roma]] ===
The central orchestrator of the pipeline. Roma is responsible for managing the workflow, scheduling tasks, and dispatching compute-intensive jobs to Horreum.
* '''Role''': Orchestration & CPU Processing
* '''Hardware''': Custom build with AMD A10-7700K APU
* '''Operating System''': Ubuntu 22.04.5 LTS
* '''CPU''': 4-Core AMD A10-7700K @ 3.40GHz
* '''GPU''': Integrated AMD Radeon R7 Graphics
* '''Memory''': 8 GB (6.7 GiB usable)
* '''Storage''': 2 TB Hitachi Ultrastar HDD
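Since Horreum runs headless, the simplest dispatch mechanism from Roma is a remote command over SSH. The sketch below shows one way to build and run such an invocation; the hostname and script path are hypothetical, and the real dispatcher may use a queue rather than a blocking call.

```python
import shlex
import subprocess

def build_dispatch_cmd(host: str, script: str, args: list[str]) -> list[str]:
    """Build an ssh invocation that runs `script` with `args` on a remote node.
    Each argument is shell-quoted so paths with spaces survive the remote shell."""
    remote = " ".join(shlex.quote(part) for part in [script, *args])
    return ["ssh", host, remote]

def dispatch(host: str, script: str, args: list[str]) -> int:
    """Run the job remotely and return its exit status."""
    return subprocess.call(build_dispatch_cmd(host, script, args))
```

For example, `build_dispatch_cmd("horreum", "/opt/imperium/process.py", ["/mnt/aqua/batch 1"])` yields `["ssh", "horreum", "/opt/imperium/process.py '/mnt/aqua/batch 1'"]`.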


=== [[Torta]] ===
A low-power, always-on node that serves as the bastion host and central file hub for the pipeline, managing both raw and processed data.
* '''Role''': Bastion Host & Centralized File Storage
* '''Hardware''': Raspberry Pi 4 Model B
* '''Operating System''': Debian GNU/Linux 12 (bookworm)
* '''CPU''': 4-Core ARM Cortex-A72 @ 1.80GHz
* '''Memory''': 8 GB
* '''Storage''': 32 GB SD card for the OS; two external HDDs (1.8 TB and 698 GB) for data storage


=== [[Latium]] ===
The public-facing cloud node responsible for interacting with external APIs and services. It handles the initial data collection and the final data publication.
* '''Role''': API Scraping & Data Uploading
* '''Hardware''': DigitalOcean Droplet
* '''Operating System''': Ubuntu 22.04.5 LTS
* '''CPU''': 1-Core DO-Regular CPU
* '''Memory''': 2 GB
* '''Storage''': 50 GB SSD


=== [[OodaWiki]] ===
A cloud-based server hosting the MediaWiki instance that serves as the final destination and presentation layer for the processed data.
* '''Role''': Final Data Presentation Layer
* '''Hardware''': DigitalOcean Droplet
* '''Operating System''': Ubuntu 22.04.5 LTS
* '''CPU''': 2-Core DO-Regular CPU
* '''Memory''': 4 GB
* '''Storage''': 80 GB SSD
* '''Services''': MediaWiki running on PHP 8.1, Redis, MySQL, Nginx
 
== Network Architecture ==
The Imperium network is divided into a private local network and a secure cloud-to-local tunnel, establishing the "Pomerium" boundary.
 
=== Local Network (Pomerium) ===
The core local servers operate on a subnet with static IP addresses assigned by the router.
 
File sharing between these nodes will be handled by a Network File System (NFS) hosted on '''Torta'''.
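Because both Roma and Horreum depend on the Torta-hosted NFS shares being mounted before they touch any data, a pre-flight check is useful. The sketch below parses `/proc/mounts` text (fields: source, mount point, filesystem type, options); the export paths and mount points are hypothetical examples, not the deployed configuration.

```python
def nfs_mounts(mounts_text: str) -> dict[str, str]:
    """Map mount point -> remote export for NFS entries in /proc/mounts text."""
    table = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields[2] is the filesystem type; "nfs" and "nfs4" both count.
        if len(fields) >= 3 and fields[2].startswith("nfs"):
            table[fields[1]] = fields[0]
    return table

def assert_mounted(mounts_text: str, mountpoint: str) -> None:
    """Fail loudly if the expected NFS share is absent."""
    if mountpoint not in nfs_mounts(mounts_text):
        raise RuntimeError(f"NFS share not mounted at {mountpoint}")
```

In production the text would come from `open("/proc/mounts").read()` on the worker node.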
 
=== Secure VPN Tunnel (Aquaeductus) ===
A point-to-point WireGuard VPN provides a secure, encrypted tunnel between the public cloud and the private local network.
* '''Purpose''': Allows `aqua_datum` (raw data) to be transferred securely from '''Latium''' to '''Torta'''.
* '''Endpoint''': The tunnel's public endpoint is the home network's public IP, with the WireGuard UDP port forwarded to '''Torta'''.
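The Latium side of such a tunnel boils down to one `[Peer]` section in its `wg0.conf`. The sketch below renders that section from parameters; the key, addresses, and port shown in the usage note are placeholders (203.0.113.0/24 is a documentation range, and 51820 is merely WireGuard's conventional default port), while the config keys themselves (`PublicKey`, `Endpoint`, `AllowedIPs`, `PersistentKeepalive`) are standard WireGuard settings.

```python
def render_wg_peer(public_key: str, endpoint: str, allowed_ips: str,
                   keepalive: int = 25) -> str:
    """Render a [Peer] section for wg0.conf. PersistentKeepalive keeps the
    NAT mapping at the home router open so the peer stays reachable."""
    return (
        "[Peer]\n"
        f"PublicKey = {public_key}\n"
        f"Endpoint = {endpoint}\n"
        f"AllowedIPs = {allowed_ips}\n"
        f"PersistentKeepalive = {keepalive}\n"
    )
```

For example, `render_wg_peer("TORTA_PUBKEY", "203.0.113.10:51820", "10.8.0.1/32")` produces a peer block pointing at the forwarded port on the home network's public IP.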


== Data Pipeline Workflow ==
The pipeline operates in a continuous, automated loop orchestrated primarily by '''Roma'''.


# '''Data Collection (Castra)''': A scheduled script or containerized agent on '''Latium''' queries an external API. The raw data (`aqua_datum`) is collected.
# '''Secure Transport (Aquaeductus)''': The `Salii` system on '''Latium''' transfers the `aqua_datum` through the secure WireGuard tunnel to the first external hard drive on '''Torta'''.
# '''Processing Dispatch''': A script on '''Roma''' continuously monitors the raw data drive on '''Torta'''. When new `aqua_datum` is detected, it initiates the processing phase.
# '''Compute & Processing''': '''Roma''' handles standard data parsing. For tasks requiring significant parallel processing, '''Roma''' dispatches the job to '''Horreum''', which leverages its RTX 5060 Ti GPU. Both nodes work with data stored on '''Torta''' via NFS. Processed data (`grana_datum`) is written to the second external hard drive on '''Torta'''.
# '''Data Publication''': The `Cubile` system, containing the Pywikibot scripts, runs on '''Latium''' inside the secure Pomerium zone. It accesses the `grana_datum` from '''Torta''' and uses it to update the '''OodaWiki''' server.
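The publication step above can be sketched in two parts: a pure formatter that turns `grana_datum` records into wikitext, and a thin upload wrapper. The table layout, page structure, and edit summary are illustrative assumptions, but the Pywikibot calls (`Site`, `Page`, `save`) are the library's standard API; the import is deferred so the formatter can be exercised without a configured bot.

```python
def render_wikitext(title: str, rows: list[dict]) -> str:
    """Format processed records (grana_datum) as a simple wikitable."""
    lines = [f"== {title} ==", '{| class="wikitable"', "! Key !! Value"]
    for row in rows:
        for key, value in row.items():
            lines.append("|-")
            lines.append(f"| {key} || {value}")
    lines.append("|}")
    return "\n".join(lines)

def publish(page_title: str, wikitext: str) -> None:
    """Push rendered wikitext to OodaWiki. Requires a Pywikibot
    user-config.py on Latium; imported lazily to keep the formatter testable."""
    import pywikibot
    site = pywikibot.Site()
    page = pywikibot.Page(site, page_title)
    page.text = wikitext
    page.save(summary="Imperium: automated grana_datum update")
```

Keeping the formatter free of network calls makes it easy to diff the generated wikitext against the live page before saving.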

Revision as of 14:56, 23 September 2025
