JB

Truly Headless Python Scraping

How to configure Selenium to actually work without a desktop environment (DE).

It’s hilarious reading this just two years later and being shocked at how slow the industry was to adopt containerized workloads. Honestly, why was it so hard to configure one of the most prolific uses of Python to run without a desktop?


This one annoyed me a little bit because the information available through Google is so scattered these days, and I really struggled to find answers to my troubles.

Bad Way

A certain person has decided that this isn’t headless. F-you. Now I have to do it properly :(

You need to install python3 and python3-pip, then use pip to install selenium.

You can install a web browser without any kind of DE installed, but there will be some dependencies. Personally, I tested with firefox-esr on Debian 11.

Next you will need Xvfb (the X virtual framebuffer), which is a virtual display server. It performs all graphical operations in memory without showing any screen output; perfect for what we need.

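You’ll notice that this doesn’t require geckodriver being immediately available. I’m not certain why, but one likely explanation is Selenium Manager: Selenium 4.6 and later will fetch a matching driver automatically when none is found on the PATH. Either way, you’re technically just running firefox-esr in a virtual display.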

Putting this all together:

FROM debian:11

RUN apt -y update && apt -y upgrade
RUN apt -y install python3 python3-pip firefox-esr xvfb
RUN pip install selenium

Python code:

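from selenium import webdriver

# With DISPLAY pointing at Xvfb, Firefox starts as a normal (non-headless) browser
driver = webdriver.Firefox()
driver.get("https://j7b.net/jsload")
print(driver.page_source)
# quit() ends the whole session and the driver process; close() only closes the window
driver.quit()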

To run this, Xvfb has to be listening on display :99 first (Xvfb :99 &), then: DISPLAY=:99 python3 test.py
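
If you would rather not remember to start Xvfb by hand, the same thing can be sketched from Python. This is a minimal sketch rather than anything from my original setup: it assumes Xvfb is on the PATH, and the helper names (xvfb_command, virtual_display, fetch_page_source), the 1280x1024x24 screen size and the one-second startup wait are all made up for illustration.

```python
import os
import subprocess
import time
from contextlib import contextmanager


def xvfb_command(display=":99", screen="1280x1024x24"):
    """Build the Xvfb command line for a single virtual screen (hypothetical helper)."""
    return ["Xvfb", display, "-screen", "0", screen]


@contextmanager
def virtual_display(display=":99"):
    """Start Xvfb, point DISPLAY at it, and clean both up afterwards."""
    proc = subprocess.Popen(xvfb_command(display))
    previous = os.environ.get("DISPLAY")
    os.environ["DISPLAY"] = display
    try:
        time.sleep(1)  # crude wait for the server to come up; an assumption, not a guarantee
        yield proc
    finally:
        proc.terminate()
        proc.wait()
        # Restore whatever DISPLAY was set to before
        if previous is None:
            os.environ.pop("DISPLAY", None)
        else:
            os.environ["DISPLAY"] = previous


def fetch_page_source(url, display=":99"):
    """Load a page in Firefox inside the virtual display and return its HTML."""
    # Imported here so the helpers above still work without Selenium installed
    from selenium import webdriver

    with virtual_display(display):
        driver = webdriver.Firefox()
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()
```

Calling print(fetch_page_source("https://j7b.net/jsload")) from test.py would then stand in for the DISPLAY=:99 prefix, since the script sets DISPLAY itself.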

It’s worth noting that this will be quickly detected by web application firewalls (WAFs). Specifically, I noticed that SMH would time out my connection. So this works for things within your control; however, if you’re trying to circumvent a WAF, you’ll run into bad times.


Gud Way

test.py

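Note that the geckodriver path (executable_path) and the Firefox binary location are specified explicitly so that I don’t ever need to Google for them again. Ever. Please, never again.

from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from selenium.webdriver.firefox.service import Service

# geckodriver location
geckodriver_path = "/usr/bin/geckodriver"
# firefox location
firefox_path = "/usr/bin/firefox"

# Set options: run Firefox headless and pin the browser binary
options = FirefoxOptions()
options.add_argument("--headless")
options.binary_location = firefox_path

# Selenium 4 takes the driver path through a Service object; the older
# executable_path and firefox_binary keyword arguments were deprecated and later removed
service = Service(executable_path=geckodriver_path)
browser = webdriver.Firefox(options=options, service=service)
browser.get("https://j7b.net")
print(browser.page_source)
# End the session and stop geckodriver
browser.quit()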
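
For anyone whose paths differ from mine, a small lookup can save the Googling. This is a minimal sketch using only the standard library; the function name find_binary and the candidate names are assumptions for illustration, not anything Selenium provides.

```python
import shutil


def find_binary(candidates):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


# Candidate names are guesses; adjust them for your distribution
geckodriver_path = find_binary(["geckodriver"])
firefox_path = find_binary(["firefox-esr", "firefox"])

print("geckodriver:", geckodriver_path)
print("firefox:", firefox_path)
```

Run inside the container, this prints the paths to paste into test.py, or None for anything that isn’t installed.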

Dockerfile

FROM debian:11

RUN apt -y update && apt -y upgrade
RUN apt -y install wget unzip tar
RUN apt -y install python3 python3-pip firefox-esr
RUN pip install selenium
RUN wget -qO- https://github.com/mozilla/geckodriver/releases/download/v0.32.2/geckodriver-v0.32.2-linux64.tar.gz | tar zxvsf - -C /usr/bin

ENTRYPOINT ["/usr/bin/python3"]
CMD ["/app/test.py"]

Running

docker build -t somename:sometag /path/to/Dockerfile/parent/dir
docker run -v "/path/to/project:/app" somename:sometag