Writing Crawlers with Scrapy + Splash

Scrapy together with scrapy-splash makes it practical to crawl JavaScript-rendered pages. scrapy-splash is a Scrapy plugin designed to work with Splash, a JavaScript rendering service.

Environment Setup

  • Install Scrapy and scrapy-splash:
    pip install scrapy scrapy-splash

  • Run the Splash service (deployed here via Docker; a quick smoke test follows below):
    docker run -p 8050:8050 scrapinghub/splash
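
Once the container is up, you can verify the service by asking its render.html endpoint for a rendered page. A minimal sketch, assuming the requests package is installed:

import requests

# Ask Splash to render example.com and return the post-JavaScript HTML.
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://example.com', 'wait': 2},
)
print(resp.status_code)  # 200 means Splash rendered the page successfully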

Scrapy Project Setup

  • Create a project with scrapy startproject projectName.
  • Generate a spider with scrapy genspider spiderName xxx.com.
  • Add the following settings to settings.py to enable the Splash middlewares:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
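
If the project also uses Scrapy's HTTP cache, the scrapy-splash documentation additionally recommends a Splash-aware cache storage:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'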

  • In the downloader middleware class in middlewares.py, configure a User-Agent pool and proxy IPs (if needed):

import random

# Methods of the downloader middleware class in middlewares.py.
def __init__(self) -> None:
    self.ip_times = 0      # requests sent through the current proxy
    self.ip = get_proxy()  # get_proxy() is assumed to be a project helper returning a proxy URL
    # A pool of (dated, purely illustrative) desktop User-Agent strings.
    self.user_agents = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]

def process_request(self, request, spider):
    # Called for each request that goes through the downloader middleware.
    # Rotate to a fresh proxy every 50 requests.
    if self.ip_times % 50 == 49:
        self.ip = get_proxy()
    self.ip_times += 1
    request.meta['proxy'] = self.ip
    request.headers['User-Agent'] = random.choice(self.user_agents)
    request.cookies = cookies  # `cookies` is assumed to be defined elsewhere in the module
    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    return None
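
One caveat worth noting: with Splash in the chain, request.meta['proxy'] applies to the connection between Scrapy and the Splash server itself. To make Splash fetch the target page through a proxy, pass the proxy to the render endpoint instead, e.g. via the args of the SplashRequest introduced in the next section (a hedged sketch; the proxy URL is a placeholder):

yield SplashRequest(url, self.parse_result,
                    args={'wait': 2, 'proxy': 'http://user:pass@host:port'})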

Writing the Spider

  • Use scrapy_splash.SplashRequest for pages that need Splash rendering, e.g. to fetch the rendered page source:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_result, args={'wait': 2})

    def parse_result(self, response):
        # response here is the page content after Splash has rendered it
        pass
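
Because the response body is the post-JavaScript HTML, the usual Scrapy selectors see content that scripts injected at render time. A minimal sketch of a parse method (the CSS selector is hypothetical):

    def parse_result(self, response):
        # Selectors run against the rendered DOM, so JavaScript-injected
        # content is visible here.
        for title in response.css('h1::text').getall():
            yield {'title': title}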

Writing Splash Lua Scripts

Sometimes you need to click a button or fill in a form; Splash's Lua scripting handles such cases. For example, a minimal script that loads a page, waits, and returns the rendered HTML:

lua_script = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(args.wait or 2)
  return splash:html()
end
"""

class MySpider(scrapy.Spider):
    # ... rest of the spider as above

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse_result,
                endpoint='execute',
                args={'lua_source': lua_script, 'wait': 2},
            )
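
For the button clicks and form filling mentioned above, Splash's element API can be used inside the script. A hedged sketch (the selectors and input values are hypothetical, and splash:select requires a reasonably recent Splash version):

lua_form_script = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(1)
  -- Type into a hypothetical search box, then click its submit button.
  local box = splash:select('input[name="q"]')
  box:send_text('scrapy splash')
  local button = splash:select('button[type="submit"]')
  button:mouse_click()
  splash:wait(args.wait or 2)
  return splash:html()
end
"""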

Caveats

  • Request rate: Splash adds overhead to every request because it renders the full page. Keep the request rate to the target site moderate to avoid being banned.
  • Error handling: Splash can run into problems rendering JavaScript-heavy pages. Handle such errors explicitly and consider setting a request timeout (see the sketch after this list).
  • Resource optimization: if you only need certain parts of a page, consider disabling images and other unnecessary resource loads to save bandwidth and time.
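
A hedged sketch tying the last two caveats together: an errback for failed renders, a Splash-side timeout, and image loading disabled (the argument values are illustrative):

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse_result,
                errback=self.handle_error,  # invoked on network or render failures
                args={
                    'wait': 2,
                    'timeout': 60,  # Splash-side render timeout, in seconds
                    'images': 0,    # skip image downloads to save bandwidth
                },
            )

    def handle_error(self, failure):
        self.logger.error('Splash request failed: %r', failure)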