Writing a Crawler with Scrapy + Splash
Scrapy combined with scrapy-splash makes it practical to crawl JavaScript-rendered pages. scrapy-splash is a Scrapy plugin that integrates with Splash, a JavaScript rendering service.
Environment Setup
- Install scrapy and scrapy-splash:
  pip install scrapy scrapy-splash
- Run the Splash service (deployed here via Docker):
  docker run -p 8050:8050 scrapinghub/splash
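To verify Splash is up, you can hit its render.html endpoint directly with a plain HTTP request, independent of Scrapy. A minimal check:

import requests

# Ask Splash to render a page and return the resulting HTML
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://example.com', 'wait': 2},
)
print(resp.text[:200])  # rendered HTML, not the raw source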
Scrapy Project Setup
- Create a project with scrapy startproject projectName.
- Generate a spider with scrapy genspider spiderName xxx.com.
- Add the following settings to settings.py to enable the Splash middlewares:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
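If the project also uses Scrapy's HTTP cache, the scrapy-splash documentation additionally recommends a Splash-aware cache storage:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'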
If you need a rotating User-Agent or proxy IP, configure them in a DownloaderMiddleware in middlewares.py:
import random

class RandomProxyUserAgentMiddleware:  # illustrative name; use your project's middleware class
    def __init__(self) -> None:
        # Proxy rotation state. get_proxy() is an external helper you provide
        # (e.g. a call to your proxy pool); fetch one up front so self.ip is
        # defined for the first request.
        self.ip_times = 0
        self.ip = get_proxy()
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Rotate the proxy every 50 requests; pick a random User-Agent each time.
        if self.ip_times % 50 == 49:
            self.ip = get_proxy()
        self.ip_times += 1
        request.meta['proxy'] = self.ip
        request.headers['User-Agent'] = random.choice(self.user_agents)
        # `cookies` is assumed to be defined elsewhere in this module
        # (e.g. loaded from a file or shared config).
        request.cookies = cookies
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None
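For this middleware to take effect, register it in settings.py alongside the scrapy-splash entries; the module path below assumes the illustrative class name above. One caveat: with SplashRequest the target page is fetched by the Splash service itself, so request.meta['proxy'] only proxies the Scrapy-to-Splash hop; to proxy Splash's own fetch, the proxy is usually passed in the Splash args instead.

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # Hypothetical path: adjust to your project's module and class name
    'projectName.middlewares.RandomProxyUserAgentMiddleware': 543,
}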
Writing the Spider
- Use scrapy_splash.SplashRequest for pages that must be rendered by Splash, e.g. to fetch the rendered page source:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_result, args={'wait': 2})

    def parse_result(self, response):
        # `response` here contains the rendered page
        pass
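The response behaves like any other Scrapy response, except that selectors run against the post-JavaScript DOM. A minimal sketch of parse_result, with a placeholder CSS selector:

    def parse_result(self, response):
        # Selectors see JS-generated elements because the DOM is rendered
        yield {
            'title': response.css('title::text').get(),
        }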
Writing Splash Lua Scripts
Sometimes you need to click buttons or fill in forms; for that, use Splash's Lua scripting through the execute endpoint. The basic pattern is a script that loads the page, waits, and returns the rendered HTML (an interaction example follows the spider below):
lua_script = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait or 2)
    return splash:html()
end
"""
class MySpider(scrapy.Spider):
    # ... other parts unchanged

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse_result,
                endpoint='execute',
                args={'lua_source': lua_script, 'wait': 2},
            )
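For actual interaction, Splash's Lua API exposes splash:select(), element:send_text(), and element:mouse_click(). The sketch below fills a search box and clicks a submit button; the CSS selectors and the search text are placeholders for whatever the target page actually uses:

lua_interaction = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1)
    -- Selectors below are placeholders; adapt them to the real page
    splash:select('input[name=q]'):send_text('scrapy')
    splash:select('button[type=submit]'):mouse_click()
    splash:wait(args.wait or 2)
    return splash:html()
end
"""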
Caveats
- Request rate: Splash adds overhead to every request because it renders the full page. Keep the request rate low enough that you don't get banned by the target site.
- Error handling: Splash can run into problems while rendering JavaScript pages. Handle the errors that may arise and consider setting a request timeout (see the sketch below).
- Resource optimization: if you only care about parts of a page, consider disabling images or other unnecessary resource loads to save bandwidth and time (also shown below).
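A sketch covering the last two points: the Lua script disables image loading via splash.images_enabled, and the request sets a Splash-side timeout plus a Scrapy errback. handle_error is a hypothetical callback name:

lua_lightweight = """
function main(splash, args)
    splash.images_enabled = false  -- skip images to save bandwidth
    splash:go(args.url)
    splash:wait(args.wait or 2)
    return splash:html()
end
"""

class MySpider(scrapy.Spider):
    # ... other parts as before

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse_result,
                endpoint='execute',
                args={'lua_source': lua_lightweight, 'wait': 2, 'timeout': 60},
                errback=self.handle_error,  # hypothetical error callback
            )

    def handle_error(self, failure):
        # Log the failure; `failure` is a Twisted Failure object
        self.logger.error('Splash request failed: %r', failure)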