What I want to achieve is to take a screenshot of any website from Python.
Environment: Linux
Here is a simple solution using WebKit: http://webscraping.com/blog/Webpage-screenshots-with-webkit/
    import sys
    import time
    from PyQt4.QtCore import *
    from PyQt4.QtGui import *
    from PyQt4.QtWebKit import *

    class Screenshot(QWebView):
        def __init__(self):
            # a QApplication must exist before any widget is created
            self.app = QApplication(sys.argv)
            QWebView.__init__(self)
            self._loaded = False
            self.loadFinished.connect(self._loadFinished)

        def capture(self, url, output_file):
            self.load(QUrl(url))
            self.wait_load()
            # set to webpage size
            frame = self.page().mainFrame()
            self.page().setViewportSize(frame.contentsSize())
            # render image
            image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
            painter = QPainter(image)
            frame.render(painter)
            painter.end()
            print('saving %s' % output_file)
            image.save(output_file)

        def wait_load(self, delay=0):
            # process app events until page loaded
            while not self._loaded:
                self.app.processEvents()
                time.sleep(delay)
            # reset the flag so the next capture() waits again
            self._loaded = False

        def _loadFinished(self, result):
            self._loaded = True

    s = Screenshot()
    s.capture('http://webscraping.com', 'website.png')
    s.capture('http://webscraping.com/blog', 'blog.png')
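The `wait_load` loop in the WebKit example keeps processing events until `loadFinished` fires, so it would spin forever if that signal never arrived. A minimal sketch of the same polling pattern with a timeout follows. The names `wait_until`, `is_done` and `pump_events` are hypothetical and not part of PyQt, and the sketch assumes Python 3 for `time.monotonic`.

```python
import time


def wait_until(is_done, pump_events, timeout=30.0, delay=0.01):
    """Call pump_events() until is_done() is true or `timeout` seconds pass.

    Returns True if the condition was met, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while not is_done():
        if time.monotonic() >= deadline:
            return False
        pump_events()      # e.g. QApplication.processEvents
        time.sleep(delay)  # avoid a hot busy-loop
    return True
```

Inside the `Screenshot` class this could be called as `wait_until(lambda: self._loaded, self.app.processEvents)`, with the caller deciding what to do when it returns `False`.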
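Here is my solution, put together with help from various sources. It takes a full web page screen capture, crops it (optional), and generates a thumbnail from the cropped image. The requirements are as follows: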
Requirements:
Install NodeJS
Using Node's package manager, install phantomjs: npm -g install phantomjs
Install selenium (in your virtualenv, if you are using one)
Install ImageMagick
Add phantomjs to the system path (on Windows)
    import os
    from subprocess import Popen, PIPE
    from selenium import webdriver

    abspath = lambda *p: os.path.abspath(os.path.join(*p))
    ROOT = abspath(os.path.dirname(__file__))


    def execute_command(command):
        # any non-blank output on stdout is treated as a failure
        result = Popen(command, shell=True, stdout=PIPE).stdout.read()
        if len(result) > 0 and not result.isspace():
            raise Exception(result)


    def do_screen_capturing(url, screen_path, width, height):
        print("Capturing screen..")
        driver = webdriver.PhantomJS()
        # it saves the service log file in the same directory
        # if you want to have the log file stored elsewhere
        # initialize the webdriver.PhantomJS() as
        # driver = webdriver.PhantomJS(service_log_path='/var/log/phantomjs/ghostdriver.log')
        try:
            driver.set_script_timeout(30)
            if width and height:
                driver.set_window_size(width, height)
            driver.get(url)
            driver.save_screenshot(screen_path)
        finally:
            # shut down the PhantomJS process even if the capture fails
            driver.quit()


    def do_crop(params):
        print("Cropping captured image..")
        command = [
            'convert',
            params['screen_path'],
            '-crop', '%sx%s+0+0' % (params['width'], params['height']),
            params['crop_path']
        ]
        execute_command(' '.join(command))


    def do_thumbnail(params):
        print("Generating thumbnail from cropped captured image..")
        command = [
            'convert',
            params['crop_path'],
            '-filter', 'Lanczos',
            '-thumbnail', '%sx%s' % (params['width'], params['height']),
            params['thumbnail_path']
        ]
        execute_command(' '.join(command))


    def get_screen_shot(**kwargs):
        url = kwargs['url']
        width = int(kwargs.get('width', 1024))  # screen width to capture
        height = int(kwargs.get('height', 768))  # screen height to capture
        filename = kwargs.get('filename', 'screen.png')  # file name e.g. screen.png
        path = kwargs.get('path', ROOT)  # directory path to store screen

        crop = kwargs.get('crop', False)  # crop the captured screen
        crop_width = int(kwargs.get('crop_width', width))  # the width of crop screen
        crop_height = int(kwargs.get('crop_height', height))  # the height of crop screen
        crop_replace = kwargs.get('crop_replace', False)  # does crop image replace original screen capture?

        thumbnail = kwargs.get('thumbnail', False)  # generate thumbnail from screen, requires crop=True
        thumbnail_width = int(kwargs.get('thumbnail_width', width))  # the width of thumbnail
        thumbnail_height = int(kwargs.get('thumbnail_height', height))  # the height of thumbnail
        thumbnail_replace = kwargs.get('thumbnail_replace', False)  # does thumbnail image replace crop image?

        screen_path = abspath(path, filename)
        crop_path = thumbnail_path = screen_path

        if thumbnail and not crop:
            raise Exception('Thumbnail generation requires crop image, set crop=True')

        do_screen_capturing(url, screen_path, width, height)

        if crop:
            if not crop_replace:
                crop_path = abspath(path, 'crop_' + filename)
            params = {
                'width': crop_width, 'height': crop_height,
                'crop_path': crop_path, 'screen_path': screen_path}
            do_crop(params)

            if thumbnail:
                if not thumbnail_replace:
                    thumbnail_path = abspath(path, 'thumbnail_' + filename)
                params = {
                    'width': thumbnail_width, 'height': thumbnail_height,
                    'thumbnail_path': thumbnail_path, 'crop_path': crop_path}
                do_thumbnail(params)
        return screen_path, crop_path, thumbnail_path


    if __name__ == '__main__':
        # Requirements:
        #   Install NodeJS
        #   Using Node's package manager install phantomjs: npm -g install phantomjs
        #   install selenium (in your virtualenv, if you are using that)
        #   install imageMagick
        #   add phantomjs to system path (on windows)

        url = 'http://stackoverflow.com/questions/1197172/how-can-i-take-a-screenshot-image-of-a-website-using-python'
        screen_path, crop_path, thumbnail_path = get_screen_shot(
            url=url, filename='sof.png',
            crop=True, crop_replace=False,
            thumbnail=True, thumbnail_replace=False,
            thumbnail_width=200, thumbnail_height=150,
        )
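The crop and thumbnail steps in the script join the `convert` arguments into one string and run it through a shell, which can break on paths that contain spaces. As a sketch of one alternative, the same ImageMagick arguments can be kept as a list and passed to `subprocess.run` without a shell. The helper names `crop_command`, `thumbnail_command` and `run_convert` are hypothetical. The `convert` options are the ones the script already uses, and Python 3.5+ is assumed for `subprocess.run`.

```python
import subprocess


def crop_command(src, dst, width, height):
    """Argument list that crops `src` to width x height from the top-left corner."""
    return ["convert", src, "-crop", "%dx%d+0+0" % (width, height), dst]


def thumbnail_command(src, dst, width, height):
    """Argument list that writes a Lanczos-filtered thumbnail of `src` to `dst`."""
    return ["convert", src, "-filter", "Lanczos",
            "-thumbnail", "%dx%d" % (width, height), dst]


def run_convert(command):
    """Run the command without a shell; raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

For example, `run_convert(crop_command("sof.png", "crop_sof.png", 1024, 768))` would perform the crop step, assuming ImageMagick's `convert` is on the PATH. Note that this sketch signals failure through the exit status rather than by inspecting stdout as the original `execute_command` does.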
These are the generated images:
Full web page screen
Cropped image from the captured screen
Thumbnail of the cropped image
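On the Mac there is webkit2png, and on Linux + KDE you can use khtml2png. I have tried the former and it works quite well, and I have heard of the latter being put to use.
I recently came across QtWebKit, which claims to be cross-platform (Qt rolled WebKit into their library, I guess). But I have never tried it, so I can't tell you much more.
The QtWebKit link shows how to access it from Python. You should at least be able to use subprocess to do the same with the other tools.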
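As a rough sketch of the subprocess approach, the helper below checks that a command-line tool is on the PATH and then runs it against a URL. The function names and the argument order (options, then the URL, then any trailing arguments such as an output file) are assumptions made for illustration; check each tool's own help output for the arguments it actually expects. Python 3.5+ is assumed for `subprocess.run`.

```python
import shutil
import subprocess


def build_command(tool, url, options=(), trailing=()):
    """Build an argument list: tool, then options, then the URL, then trailing args."""
    return [tool, *options, url, *trailing]


def capture_with_tool(tool, url, options=(), trailing=()):
    """Run an external screenshot tool; raise if it is not installed."""
    if shutil.which(tool) is None:
        raise FileNotFoundError("%s not found on PATH" % tool)
    # check=True raises CalledProcessError if the tool exits with an error
    return subprocess.run(build_command(tool, url, options, trailing), check=True)
```

A call such as `capture_with_tool("webkit2png", "http://example.com")` would then run the tool with only the URL as its argument.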
You can use Selenium:
    from selenium import webdriver

    DRIVER = 'chromedriver'
    driver = webdriver.Chrome(DRIVER)
    driver.get('https://www.spotify.com')
    screenshot = driver.save_screenshot('my_screenshot.png')
    driver.quit()
https://sites.google.com/a/chromium.org/chromedriver/getting-started
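I can't comment on ars's answer, but I actually got Roland Tapken's code running with QtWebKit and it works quite well.
Just wanted to confirm that what Roland posted on his blog works great on Ubuntu. Our production version ended up not using anything he wrote, but we are using the PyQt/QtWebKit bindings with much success.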