Using Docker to Render JavaScript Web Pages via Browser from Command Line
When crawling with scrapy, many websites render their content with JavaScript, so fetching the raw page source does not return the content you need. Driving a real browser with selenium solves this, but it requires a browser installed locally, and the browser must be run as a non-root user. Running Chrome as a service in a Docker container and driving it with selenium avoids both constraints.
Running Chrome Docker Container
A search of Docker Hub turns up the image selenium/standalone-chrome. If Docker is already installed locally, you can run this service on port 14444. For security reasons, bind the port to 127.0.0.1 so that only local access is allowed.
docker run -itd --name=chrome -p 127.0.0.1:14444:4444 --shm-size="2g" selenium/standalone-chrome
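The parameters are simple: -d runs the container in the background, -p 127.0.0.1:14444:4444 maps the container's port 4444 to local port 14444, and --shm-size="2g" enlarges the shared memory, which Chrome needs to avoid crashing on heavier pages.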
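The container takes a few seconds to start, so a script that connects immediately after docker run may fail. The following is a minimal sketch of a readiness check, assuming a Selenium 4 image that serves the standard WebDriver /status endpoint at the root (older images may expose it under /wd/hub/status instead). The helper names is_ready and wait_for_selenium are hypothetical, not part of Selenium.

```python
import json
import time
import urllib.error
import urllib.request


def is_ready(status_payload):
    """Return True if a WebDriver /status response body reports ready."""
    value = status_payload.get("value")
    return isinstance(value, dict) and value.get("ready") is True


def wait_for_selenium(base_url="http://127.0.0.1:14444", timeout=30.0, poll=1.0):
    """Poll the /status endpoint until the server is ready or timeout expires.

    base_url assumes the port mapping from the docker run command above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/status", timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            # Server not accepting connections yet, or the body is not JSON.
            pass
        time.sleep(poll)
    return False
```

Calling wait_for_selenium() once before creating the driver should be enough; if it returns False, docker logs chrome is the first place to look.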
Using Selenium to Call Remote Service for Web Crawling
Selenium's webdriver module provides a Remote class that takes the address of the remote service.
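from selenium import webdriver
from scrapy.selector import Selector

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # example

# Connect to the Chrome service exposed by the container on local port 14444.
driver = webdriver.Remote("http://127.0.0.1:14444/wd/hub", options=options)
try:
    driver.get("https://www.bobobk.com")
    # page_source is the DOM after the browser has run the page's JavaScript.
    hrefs = Selector(text=driver.page_source).xpath("//article/header/h1/a/@href").extract()
    for url in hrefs:
        print(url)
finally:
    driver.quit()  # always release the browser session in the container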
# Example output:
# https://www.bobobk.com/833.html
# https://www.bobobk.com/621.html
# https://www.bobobk.com/852.html
# https://www.bobobk.com/731.html
# https://www.bobobk.com/682.html
# https://www.bobobk.com/671.html
# https://www.bobobk.com/523.html
# https://www.bobobk.com/521.html
# https://www.bobobk.com/823.html
# https://www.bobobk.com/512.html
In this example, the target site is this website itself. In my tests, JavaScript-rendered sites were crawled successfully with this method.
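On pages that load their content asynchronously, driver.page_source may be read before the JavaScript has finished inserting the elements you want. A minimal sketch of a polling wait is shown here; wait_for_elements is a hypothetical helper, and it only assumes the driver has Selenium's find_elements(by, value) method. Selenium's own WebDriverWait serves the same purpose; this version uses only the standard library so the logic is visible.

```python
import time

XPATH = "xpath"  # same string value as selenium's By.XPATH


def wait_for_elements(driver, xpath, timeout=10.0, poll=0.25):
    """Poll until at least one element matches xpath.

    Returns the list of matches, which is empty if timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements(XPATH, xpath)
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(poll)
```

After driver.get(...), a call such as wait_for_elements(driver, "//article/header/h1/a") before reading driver.page_source gives the page time to render. Note that the XPath here selects the link elements themselves rather than their @href attributes, because find_elements returns elements.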
Summary
Providing a browser service through Docker is an effective way to fetch content from pages that are rendered dynamically by JavaScript, where fetching the raw source alone fails.
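- Original author: 春江暮客
- Original link: https://www.bobobk.com/en/525.html
- License: This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For non-commercial reposts, credit the source (author and original link); for commercial reposts, contact the author for permission.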