When crawling with Scrapy, you will find that many websites render their content with JavaScript, so fetching the raw page source does not return the content you need. Driving a real browser with Selenium solves this, but it requires a browser installed locally, and Chrome must be run as a non-root user. Running Chrome as a service in Docker and driving it remotely with Selenium is a good way around both constraints.

Running the Chrome Docker Container

A search on Docker Hub turns up the selenium/standalone-chrome image. With Docker already installed locally, you can run the service on port 14444. For security, bind the port to 127.0.0.1 so that only the local machine can reach it.

docker run -itd --name=chrome -p 127.0.0.1:14444:4444 --shm-size="2g" selenium/standalone-chrome

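The parameters are simple: -d (combined here as -itd) runs the container in the background, -p 127.0.0.1:14444:4444 maps the container's port 4444 to local port 14444, and --shm-size="2g" sets the shared memory size.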
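The container takes a few seconds to start, so it can help to wait until the service answers before connecting. The following is a minimal sketch using only the standard library. It assumes the server exposes the /status endpoint defined by the W3C WebDriver specification at the mapped port; the function names is_ready and wait_for_grid and the timeout values are illustrative, not part of Selenium.

```python
import json
import time
import urllib.request


def is_ready(status_payload):
    """Return True if a WebDriver /status payload reports readiness."""
    value = status_payload.get("value")
    return isinstance(value, dict) and value.get("ready") is True


def wait_for_grid(base_url="http://127.0.0.1:14444", timeout=30.0, interval=1.0):
    """Poll <base_url>/status until the server is ready or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/status", timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except (OSError, ValueError):
            # Connection refused/reset or a non-JSON reply: still starting up.
            pass
        time.sleep(interval)
    return False
```

Calling wait_for_grid() after docker run returns True once the service reports ready, or False if it does not come up within the timeout.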

Using Selenium to Call the Remote Service for Web Crawling

Selenium's webdriver module provides a Remote class, whose first argument is the address of the remote WebDriver service.

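from scrapy.selector import Selector
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # example

# Connect to the Chrome service exposed by the Docker container.
driver = webdriver.Remote("http://127.0.0.1:14444/wd/hub", options=options)
try:
    driver.get("https://www.bobobk.com")
    # Parse the rendered HTML and extract the article links.
    hrefs = Selector(text=driver.page_source).xpath("//article/header/h1/a/@href").extract()
finally:
    # Always end the session to free the browser in the container.
    driver.quit()

for url in hrefs:
    print(url)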
# Example output:
# https://www.bobobk.com/833.html
# https://www.bobobk.com/621.html
# https://www.bobobk.com/852.html
# https://www.bobobk.com/731.html
# https://www.bobobk.com/682.html
# https://www.bobobk.com/671.html
# https://www.bobobk.com/523.html
# https://www.bobobk.com/521.html
# https://www.bobobk.com/823.html
# https://www.bobobk.com/512.html

In this example, the target is this website itself. In actual tests, the same method also crawled JavaScript-rendered sites successfully.
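On pages that fill in their content after the initial load, page_source read immediately after get() may not yet contain the elements you want. One common way to handle this is an explicit wait. The sketch below assumes the same remote address as in the example; the function name fetch_rendered_html and the 10-second default timeout are illustrative choices.

```python
def fetch_rendered_html(url, wait_xpath,
                        remote_url="http://127.0.0.1:14444/wd/hub",
                        timeout=10):
    """Load url in the remote Chrome and return the HTML once wait_xpath matches."""
    # Imported here so this module can be loaded even without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    driver = webdriver.Remote(command_executor=remote_url, options=options)
    try:
        driver.get(url)
        # Block until at least one matching element is in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, wait_xpath))
        )
        return driver.page_source
    finally:
        # End the session whether or not the wait succeeded.
        driver.quit()
```

For instance, fetch_rendered_html("https://www.bobobk.com", "//article/header/h1/a") would return the HTML only after at least one article link is present, and the result can be passed to Selector(text=...) as before.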

Summary

Running the browser as a Docker service is an effective way to fetch the content of pages that are rendered dynamically with JavaScript, which a plain source fetch would miss.