Saving GTFS-RT data to Parquet

2026-03-13 00:01:00 +01:00
parent afd195dab9
commit c90be4a981
18 changed files with 930 additions and 6310 deletions

235
README.md

@@ -1,217 +1,70 @@
# Skopje Bus Tracker
# OpenJSP Bus Tracker
Real-time bus tracking for Skopje public transport. Modular system supporting any stop and route.
Real-time Skopje public transport tracking with Bun, GTFS/GTFS-RT ingestion, parquet persistence, and optional S3-compatible segment upload.
## What Is In This Repo
- `bus-tracker-json.ts`: terminal tracker for one stop + one route.
- `background-tracker.ts`: continuous collector for multiple routes/stops.
- `lib/database.ts`: parquet write layer with rolling segments and optional S3 upload.
- `lib/gtfs.ts`: GTFS CSV loading helpers.
- `config.ts`: API base URL, defaults, and tracker timing.
## Requirements
- Bun 1.x+
- Network access to the configured GTFS/JSON upstream APIs
## Quick Start
```bash
npm install
npm run setup-gtfs # Download latest GTFS data
npm run web
bun install
bun run typecheck
```
Open **http://localhost:3000**
Visit **http://localhost:3000/analytics.html** for historical data and performance analytics.
## TimescaleDB Setup
The application uses TimescaleDB for storing time-series data (vehicle positions, arrivals, delays).
### Start the database:
Run single stop/route terminal tracker:
```bash
cd infrastructure
docker compose up -d
bun run tracker
```
### Configure environment:
Create a `.env` file (or use the defaults):
Run with custom stop and route IDs:
```bash
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=iot_data
POSTGRES_USER=postgres
POSTGRES_PASSWORD=example
bun run tracker -- --stop 1571 --route 125
```
The database will automatically:
- Create hypertables for efficient time-series queries
- Set up compression and retention policies (90 days)
- Build continuous aggregates for hourly metrics
- Index data for fast queries
### Analytics Features:
- **Vehicle Position History**: Track individual buses over time
- **Delay Analysis**: On-time performance, average delays, patterns
- **Hourly Patterns**: See when buses are typically late/early
- **Route Statistics**: Reliability scores, service quality metrics
- **Stop Performance**: Compare delays across different stops
### Background Tracker:
For continuous data collection without keeping the web interface open:
Run background collection pipeline:
```bash
npm run track
bun run track
```
This automatically tracks these popular routes every 30 seconds:
- Routes: 2, 4, 5, 7, 15, 21, 22, 24
- Private routes: 12П, 19П, 22П, 45П, 52П, 54П, 61П, 9П
## Environment
Data is stored in TimescaleDB for historical analysis. The tracker runs indefinitely until stopped with Ctrl+C.
Copy `.env.example` to `.env` and adjust values as needed.
## Features
Key variables:
- **Fully Modular Web Interface**: Select any stop and route via UI controls or URL parameters
- **Dynamic Tracking**: Change stops/routes without restarting the server
- Interactive map with live vehicle positions
- Real-time arrivals with delays
- **Time-Series Data Storage**: Historical tracking with TimescaleDB
- **Analytics Dashboard**: Delay statistics, hourly patterns, performance metrics
- 5-second auto-refresh (web), 10-second (terminal)
- CLI arguments for terminal tracker
- Configurable defaults via [config.ts](config.ts)
- Shareable URLs with stop/route parameters
- `PARQUET_DIR`: local output directory for parquet files.
- `PARQUET_ROLL_MINUTES`: segment rotation interval.
- `SAVE_ALL_VEHICLE_SNAPSHOTS`: save full raw vehicle feed snapshots.
- `SAVE_ALL_VEHICLE_POSITIONS`: persist all vehicle positions (not only route-matched).
- `S3_ENABLED`: enable object storage upload.
- `S3_BUCKET`, `S3_REGION`, `S3_ENDPOINT`, `S3_PREFIX`: object storage target.
- `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`: object storage credentials.
- `S3_DELETE_LOCAL_AFTER_UPLOAD`: delete local parquet after successful upload.
- `S3_UPLOAD_RETRIES`, `S3_UPLOAD_RETRY_BASE_MS`: upload retry behavior.
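As a rough illustration of how the variables above could be read into a typed object, here is a minimal sketch. The `StorageConfig` shape, the helper names, the default values, and the accepted boolean spellings are all assumptions made for this example; they are not taken from `lib/database.ts` or `config.ts`.

```typescript
// Hypothetical sketch: names, defaults, and validation rules are assumptions.
type Env = Record<string, string | undefined>;

interface StorageConfig {
  parquetDir: string;
  rollMinutes: number;
  saveAllVehiclePositions: boolean;
  s3Enabled: boolean;
  s3Bucket: string | undefined;
  deleteLocalAfterUpload: boolean;
  uploadRetries: number;
  uploadRetryBaseMs: number;
}

// Treat a few common spellings as true; unset or empty falls back to the default.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === "") return fallback;
  return ["1", "true", "yes", "on"].includes(value.trim().toLowerCase());
}

// Parse an integer and fall back when it is missing, malformed, or below `min`.
function parseIntAtLeast(value: string | undefined, min: number, fallback: number): number {
  const n = Number.parseInt(value ?? "", 10);
  return Number.isNaN(n) || n < min ? fallback : n;
}

function loadStorageConfig(env: Env): StorageConfig {
  const cfg: StorageConfig = {
    parquetDir: env.PARQUET_DIR ?? "./data",                           // assumed default
    rollMinutes: parseIntAtLeast(env.PARQUET_ROLL_MINUTES, 1, 60),     // assumed default
    saveAllVehiclePositions: parseBool(env.SAVE_ALL_VEHICLE_POSITIONS, false),
    s3Enabled: parseBool(env.S3_ENABLED, false),
    s3Bucket: env.S3_BUCKET,
    deleteLocalAfterUpload: parseBool(env.S3_DELETE_LOCAL_AFTER_UPLOAD, false),
    uploadRetries: parseIntAtLeast(env.S3_UPLOAD_RETRIES, 0, 3),       // assumed default
    uploadRetryBaseMs: parseIntAtLeast(env.S3_UPLOAD_RETRY_BASE_MS, 1, 1000), // assumed default
  };
  // Fail early rather than at the first upload attempt.
  if (cfg.s3Enabled && !cfg.s3Bucket) {
    throw new Error("S3_ENABLED is set but S3_BUCKET is missing");
  }
  return cfg;
}
```

Under Bun, something like `loadStorageConfig(process.env)` would pick up the values from `.env`. Validating once at startup keeps a misconfigured upload target from surfacing only after the first segment closes.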
## Commands
## Scripts
```bash
npm run setup-gtfs # Download GTFS data
npm run find -- --stop "american" # Find stop IDs by name
npm run find -- --route "7" # Find route IDs by number/name
npm run web # Web interface at http://localhost:3000
npm run tracker # Terminal interface (default)
npm run tracker -- --stop 1571 --route 125 # Custom stop/route
npm run track # Background tracker for popular routes (30s intervals)
npm start # Same as web
```
- `bun run start`: alias for the terminal tracker.
- `bun run tracker`: terminal tracker.
- `bun run track`: background collector.
- `bun run typecheck`: TypeScript no-emit check.
### Finding Stop and Route IDs
## Notes
Not sure which Stop ID or Route ID to use? Use the find command:
```bash
# Find stops by name (case-insensitive)
npm run find -- --stop "american"
npm run find -- --stop "центар"
# Find routes by number or name
npm run find -- --route "7"
npm run find -- --route "линија"
```
### Web Interface Usage
1. **Default tracking**: Open `http://localhost:3000` (loads default stop/route, can be changed in UI)
2. **Direct URL**: `http://localhost:3000?stopId=1571&routeId=125` (bookmarkable)
3. **Change tracking**: Use the controls at the top to enter different Stop ID and Route ID
4. **Share**: Copy URL after selecting a stop/route to share with others
### CLI Arguments
Terminal tracker supports custom stop and route:
```bash
npm run tracker -- --stop <stopId> --route <routeId>
npm run tracker -- --help
```
### API Endpoints
**This Application's API:**
- Complete docs: **[API-DOCUMENTATION.md](API-DOCUMENTATION.md)**
- Interactive docs: http://localhost:3000/api-docs.html (when server is running)
- OpenAPI spec: **[openapi.yaml](openapi.yaml)**
**Upstream ModeShift GTFS API:**
- Documentation: **[UPSTREAM-API-DOCUMENTATION.md](UPSTREAM-API-DOCUMENTATION.md)**
- Provider: ModeShift (Skopje public transport data)
#### Quick Reference
Query parameters for custom tracking:
```
GET /api/config?stopId=1571&routeId=125
GET /api/arrivals?stopId=1571&routeId=125
GET /api/vehicles?routeId=125
GET /api/stops # All stops
GET /api/routes # All routes
# Historical Data APIs
GET /api/stats/db # Database statistics
GET /api/history/vehicle/:vehicleId?hours=24
GET /api/history/route/:routeId/vehicles?hours=24
GET /api/history/stop/:stopId/arrivals?routeId=125&hours=24
GET /api/stats/route/:routeId/delays?hours=24
GET /api/stats/stop/:stopId/delays?hours=24
GET /api/stats/route/:routeId/hourly?days=7
```
## Configuration
Edit [config.ts](config.ts) to set defaults:
```typescript
export const config: AppConfig = {
defaultStop: {
stopId: '1571',
name: 'АМЕРИКАН КОЛЕЏ-КОН ЦЕНТАР',
lat: 41.98057556152344,
lon: 21.457794189453125,
},
defaultRoute: {
routeId: '125',
shortName: '7',
name: 'ЛИНИЈА 7',
},
server: {
port: 3000,
},
tracking: {
refreshInterval: {
web: 5000, // 5 seconds
terminal: 10000, // 10 seconds
},
minutesAhead: 90,
}, + analytics)
bus-tracker-json.ts # Terminal tracker (CLI args)
lib/
gtfs.ts # GTFS loader
database.ts # TimescaleDB time-series storage
public/
index.html # Live tracker UI
analytics.html # Analytics dashboard
infrastructure/
compose.yml # TimescaleDB Docker setup
gtfs/ ure
```
bus/
├── config.ts # Configuration (stops, routes, timing)
├── setup-gtfs.ts # GTFS data downloader
├── find-stops-routes.ts # Helper to find Stop/Route IDs
├── server.ts # Web server (modular API)
├── bus-tracker-json.ts # Terminal tracker (CLI args)
├── lib/gtfs.ts # GTFS loader
├── public/index.html # Frontend (modular UI)
└─**TimescaleDB (PostgreSQL)** for time-series data
- Leaflet.js + OpenStreetMap
- Chart.js for analytics visualizations
- GTFS + GTFS-RT Protocol Buffers
- Docker Compose for database
## Stack
- Node.js + Express + TypeScript
- Leaflet.js + OpenStreetMap
- GTFS + GTFS-RT Protocol Buffers
## License
MIT
- Generated parquet files are intentionally ignored by git (`data/*.parquet`).
- The background tracker rotates segments and uploads each closed segment when S3 is enabled.
- On process shutdown (`SIGINT`/`SIGTERM`), writers are flushed so the current segment is finalized.
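
The rotation and upload behavior described in the notes above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `lib/database.ts`: it assumes segments are aligned to wall-clock windows of `PARQUET_ROLL_MINUTES` and that `S3_UPLOAD_RETRY_BASE_MS` is doubled on each retry. The README confirms neither detail, and the function names are hypothetical.

```typescript
// Hypothetical sketch: window alignment and exponential backoff are assumptions.

// Start of the segment window that contains `now`, for a roll interval in minutes.
function segmentStart(now: Date, rollMinutes: number): Date {
  const windowMs = rollMinutes * 60_000;
  return new Date(Math.floor(now.getTime() / windowMs) * windowMs);
}

// A writer would rotate once the current time falls into a later window.
function shouldRotate(openedAt: Date, now: Date, rollMinutes: number): boolean {
  return segmentStart(now, rollMinutes).getTime() > segmentStart(openedAt, rollMinutes).getTime();
}

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseMs: number): number {
  return baseMs * 2 ** attempt;
}

// Run `task` once, then up to `retries` more times, sleeping between failures.
async function withRetries<T>(
  task: () => Promise<T>,
  retries: number,
  baseMs: number,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
  throw lastError;
}
```

In this sketch, a closed segment's upload would be wrapped as `withRetries(() => upload(path), retries, baseMs)`, where `upload` stands in for whatever S3 client call the project uses. A shutdown handler for `SIGINT`/`SIGTERM` would close the open segment first so that it goes through the same path.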