# Scrap Immo
This repo is a scraper that keeps us up to date with the prices of new houses in our region.

It is made of two components:

- `vm-scrapper`: a virtual machine running Firefox with the Greasemonkey extension and a script that grabs the full-page HTML of the search result pages
- `scrap-immo`: a web server in Rust that receives the scraped HTML, parses it, organizes it, sends notifications and presents it all as web pages
## VM Scrapper
In my setup, I have a homelab that continuously runs a virtual machine. This step is optional if, for example, you decide to run Firefox on your own local machine.
My setup is:

- homelab "host": Ubuntu Server 24.04
  - guest "vm": Ubuntu 24.04
    - Firefox
      - Greasemonkey
  - scrap-immo server
```shell
sudo apt update
sudo apt install -y qemu-kvm libvirt-daemon-system libvirt-clients virtinst ovmf
sudo systemctl enable --now libvirtd
sudo systemctl status libvirtd
```
I've noticed a problem: libvirtd runs a dnsmasq instance that conflicts with my Pi-hole. To fix this, I've configured Pi-hole to only listen on the "public" interface with:
```
Environment = FTLCONF_dns_listeningMode=BIND
Environment = FTLCONF_dns_interface=enp3s0
```
Now back to the server config:
```shell
# Download ISO
sudo mkdir -p /var/lib/libvirt/isos
cd /var/lib/libvirt/isos
sudo wget https://releases.ubuntu.com/24.04.4/ubuntu-24.04.4-desktop-amd64.iso

# Prepare disk
sudo mkdir -p /var/lib/libvirt/images

# Create VM
sudo virt-install \
  --name scrap-immo \
  --memory 4096 \
  --vcpus 2 \
  --disk path=/var/lib/libvirt/images/scrap-immo.qcow2,size=30,format=qcow2 \
  --os-variant ubuntu24.04 \
  --cdrom /var/lib/libvirt/isos/ubuntu-24.04.4-desktop-amd64.iso \
  --network network=default,model=virtio \
  --graphics vnc,listen=127.0.0.1 \
  --video virtio \
  --cpu host \
  --boot uefi
```
```shell
# On my desktop, create an SSH tunnel
ssh -L 5900:127.0.0.1:5900 sitegui@192.168.1.51
# Then open Remmina and connect via VNC to localhost:5900
# Manually install Firefox in the VM
```
Management:

```shell
# Shut down and start as a normal machine
# Good: no memory and no CPU used when off
# Bad: on start you will need to manually open Firefox
sudo virsh shutdown scrap-immo
sudo virsh start scrap-immo

# Freeze and unfreeze to disk
# Good: no memory and no CPU used when off
# Bad: turning on and off requires reading/writing the memory contents to disk
sudo virsh managedsave scrap-immo
sudo virsh start scrap-immo

# Freeze and unfreeze in memory
# Good: no CPU used when off; turning on and off is very fast
# Bad: memory continues to be used
sudo virsh suspend scrap-immo
sudo virsh resume scrap-immo
```
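The three modes above can be wrapped in a small helper that picks the right `virsh` subcommand from the VM's current state. This is a sketch, not part of the repo; it relies on `virsh domstate` reporting `running`, `paused` (after `suspend`), or `shut off` (after `shutdown` or `managedsave`):

```shell
# Map a `virsh domstate` value to the virsh subcommand that brings the
# VM back up: a shut-off or managed-saved VM needs `start`, a suspended
# (paused) VM needs `resume`, a running VM needs nothing.
action_for_state() {
  case "$1" in
    "shut off") echo start ;;   # covers both shutdown and managedsave
    paused)     echo resume ;;
    running)    echo none ;;
    *)          echo unknown ;;
  esac
}

# Usage:
#   state=$(sudo virsh domstate scrap-immo)
#   action=$(action_for_state "$state")
#   [ "$action" != none ] && sudo virsh "$action" scrap-immo
```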
## Preparing the scraper Firefox
- add Greasemonkey to Firefox
- navigate to Leboncoin in a new tab and run your desired search
- add a new script and copy the source from this script
- change the variables at the top of the file to match your setup
- refresh the page to let the script be injected and start running
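For reference, the script's job boils down to shipping the captured page HTML to the server. A minimal command-line sketch of the equivalent request follows; the `SERVER` address and the `/scrap` path are illustrative assumptions, so check the actual script source for the real endpoint:

```shell
# Assumed server address; adjust to wherever scrap-immo actually runs.
SERVER="http://localhost:8080"

# POST a captured HTML file to the server, mimicking what the
# Greasemonkey script does from the browser. The /scrap path is a
# hypothetical endpoint used for illustration only.
post_page() {
  curl -sS -X POST "$SERVER/scrap" \
    -H 'Content-Type: text/html' \
    --data-binary "@$1"
}
```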
## Run server
- Install Rust
- Run `cargo run --release`
- Open http://localhost:8080
## TODO: OSRM
```shell
# Prepare cartography
wget https://download.geofabrik.de/europe/france/pays-de-la-loire-latest.osm.pbf
podman run -v "${PWD}:/data" docker.io/osrm/osrm-backend osrm-extract -p /opt/bicycle.lua /data/pays-de-la-loire-latest.osm.pbf
podman run -v "${PWD}:/data" docker.io/osrm/osrm-backend osrm-partition /data/pays-de-la-loire-latest.osrm
podman run -v "${PWD}:/data" docker.io/osrm/osrm-backend osrm-customize /data/pays-de-la-loire-latest.osrm

# Run OSRM service
podman run -p 5000:5000 -v "${PWD}:/data" docker.io/osrm/osrm-backend osrm-routed --algorithm mld /data/pays-de-la-loire-latest.osrm
```
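Once the service is up, routes can be queried over HTTP via OSRM's `/route/v1/{profile}/{coordinates}` API. Note that OSRM expects coordinates as `longitude,latitude`; the example coordinates below are arbitrary points near Nantes, used purely for illustration:

```shell
# Build a route-query URL for the local OSRM instance. Coordinate pairs
# are separated by ';', and overview=false skips the route geometry to
# keep the JSON response small.
osrm_route_url() {
  echo "http://localhost:5000/route/v1/bicycle/$1;$2?overview=false"
}

# Usage (requires the osrm-routed container above to be running):
#   curl "$(osrm_route_url '-1.5536,47.2184' '-1.4912,47.2512')"
# The response contains routes[0].duration (seconds) and .distance (meters).
```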
## Develop server locally
```shell
cargo install watchexec-cli
watchexec --restart --socket 8080 --interactive --debounce 1s -- cargo run
```
## Deploy

Set a git alias with:

```shell
git config alias.deploy '!git push && ssh -4 sitegui@ssh.sitegui.dev ./deploy sitegui/scrap-immo'
```

then simply run `git deploy` to push and deploy.
## TODO

- move `rate` and `hidden` into dynamic fields
- save events into the database