《Web Scraping with Python》 Chapter 1
A summary of Chapter 1 of 《用PYTHON写网络爬虫》 (Web Scraping with Python), distilled from repeated reading and hands-on practice, with the complete final source code attached at the end.
Background research
robots.txt
Sitemap
Provides links to all of the site's pages
http://www.sitemaps.org/protocol.html
The sitemap standard is defined at the URL above
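To inspect both files for a site, you can simply fetch them. A minimal sketch in Python 2 (matching the book's urllib2 usage); the URLs assume the book's example.webscraping.com site, which may no longer be online:

import urllib2

# robots.txt usually lists User-agent/Disallow rules, an optional Crawl-delay,
# and a Sitemap: line pointing at the site's XML sitemap
print urllib2.urlopen('http://example.webscraping.com/robots.txt').read()

# the sitemap referenced there lists each page URL inside <loc>...</loc> tags
print urllib2.urlopen('http://example.webscraping.com/sitemap.xml').read()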
Google search
WHOIS
Estimating the size of a website
Check how many pages Google's crawler has indexed
http://www.google.com/advanced_search
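For example, a Google query such as site:example.webscraping.com restricts the results to pages from the book's example domain, and the reported result count gives a rough estimate of the site's size; adding a path, e.g. site:example.webscraping.com/view, narrows the estimate to one section of the site.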
Identifying the technology used by a website
pip install builtwith
builtwith.parse(url)
Frameworks
Libraries
If the site is built with AngularJS, its content is probably loaded dynamically
If the site uses ASP.NET, then session management and form submission will be needed when scraping it
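A minimal usage sketch, assuming the builtwith package installed above; parse() returns a dictionary mapping technology categories to the technology names it detected:

import builtwith

tech = builtwith.parse('http://example.webscraping.com')  # the book's example site
for category, names in tech.items():
    # e.g. 'web-frameworks' -> ['Web2py', ...] for the book's example site
    print category, '->', names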
Finding the owner of a website
WHOIS
pip install python-whois
import whois
print whois.whois(url)
Writing your first web crawler
Crawling
Crawling the sitemap
def crawl_sitemap(url)
Uses a simple regular expression to extract the page links (see the sketch below)
Caveat: the Sitemap file cannot be relied on to provide a link to every page of the site
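A sketch along the lines of the book's crawl_sitemap(), assuming the download() helper described later in this chapter; the regular expression pulls the page URLs out of the sitemap's <loc> tags:

import re

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the page links listed in the sitemap
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download (and later scrape) each linked page
    for link in links:
        html = download(link)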
Iterating over each page's database ID
Exploits a weakness in the site's URL structure, e.g. example.webscraping.com
import itertools
for page in itertools.count(1):
Note: guard against records in the middle of the ID range having been deleted (see the sketch below)
max_errors
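A sketch of the ID-iteration loop with the max_errors safeguard, again assuming the download() helper; the URL pattern follows the book's example site, where /view/-<id> resolves by database ID alone:

import itertools

max_errors = 5   # maximum number of consecutive download errors allowed
num_errors = 0   # current run of consecutive errors

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # the record for this ID is missing (possibly deleted)
        num_errors += 1
        if num_errors == max_errors:
            # too many consecutive failures, so assume we are past the last ID
            break
    else:
        # success: reset the error counter and scrape the page here
        num_errors = 0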
def link_crawler(seed_url, link_regex)
get_links(html)
Relative paths -> absolute paths
import urlparse
urlparse.urljoin()
Avoiding duplicate downloads
crawl_queue=[seed_url]
seen = set(crawl_queue)
Following page links
Use a regular expression to select which linked URLs to follow (see the sketch below)
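A sketch of the basic link crawler combining the nodes above (crawl queue, seen set, urljoin, regex filter), roughly as the book builds it before adding the advanced features below; it assumes the download() helper described in the next node:

import re
import urlparse   # Python 2; use urllib.parse in Python 3

def link_crawler(seed_url, link_regex):
    """Crawl from seed_url, following links matched by link_regex."""
    crawl_queue = [seed_url]
    seen = set(crawl_queue)        # avoid downloading the same URL twice
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                # convert the relative path into an absolute URL
                link = urlparse.urljoin(seed_url, link)
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

def get_links(html):
    """Return a list of href values found in the page."""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)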
Downloading a web page
def download(url)
Return None if a URLError is caught
Retrying downloads
num_retries
Setting the user agent (see the download() sketch below)
user_agent=
accept_language=
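A sketch of the download() helper with the retry and user-agent behaviour noted above, in the book's Python 2 urllib2 style; 'wswp' is the book's default agent string, and only 5xx server errors are retried:

import urllib2

def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx server errors, but not 4xx client errors
                return download(url, user_agent, num_retries - 1)
    return html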
Advanced features
Parsing robots.txt
import robotparser
rp = robotparser.RobotFileParser()
The robotparser module first loads the robots.txt file, then the can_fetch() function determines whether a given user agent is allowed to access a page
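A minimal sketch of that workflow; the robots.txt URL assumes the book's example site:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()

url = 'http://example.webscraping.com'
user_agent = 'wswp'
if rp.can_fetch(user_agent, url):
    print 'Allowed to crawl:', url
else:
    print 'Blocked by robots.txt:', url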
Proxy support
proxy = ...
opener = urllib2.build_opener()
proxy_params = {urlparse.urlparse(url).scheme: proxy}
opener.add_handler(urllib2.ProxyHandler(proxy_params))
Integrated into the download() function (see the sketch below)
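A sketch of that integration (error handling and retries omitted here; the complete version appears in the final source at the end): build an opener, register a ProxyHandler for the URL's scheme, and open the request through it:

import urllib2
import urlparse

def download(url, user_agent='wswp', proxy=None):
    print 'Downloading:', url
    request = urllib2.Request(url, headers={'User-agent': user_agent})
    opener = urllib2.build_opener()
    if proxy:
        # route requests for this URL's scheme (http/https) through the proxy
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    return opener.open(request).read()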
Throttling downloads
If we crawl a website too quickly, we risk being blocked or overloading its server
Mitigation: add a delay between consecutive downloads to throttle the crawler
Create a Throttle class with __init__(self, delay) and wait(self, url) methods
domain = urlparse.urlparse(url).netloc
last_accessed = self.domains.get(domain)
Records when each domain was last accessed and decides whether a pause is needed before the next request
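Assembled, the class looks essentially like the Throttle in the final source below: wait() sleeps only when the same domain was last hit less than delay seconds ago:

import time
import urlparse                    # Python 2; urllib.parse in Python 3
from datetime import datetime

class Throttle:
    """Sleep between requests to the same domain to keep the crawl rate down."""
    def __init__(self, delay):
        self.delay = delay         # minimum seconds between downloads per domain
        self.domains = {}          # domain -> timestamp of the last access

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()

Usage: create throttle = Throttle(5) once, then call throttle.wait(url) immediately before each download.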
Avoiding spider traps
Spider trap: some websites generate their page content dynamically, so the number of pages can be effectively unlimited (e.g. a calendar that always links to the next month)
Solution: record how many links were followed to reach the current page, i.e. its depth; once the maximum depth is reached, the crawler no longer adds that page's links to the queue
Change the seen variable: originally it only recorded which page links had been visited; it now becomes a dictionary that also records each page's depth
max_depth=...
seen = {}
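A small sketch of the depth bookkeeping; the helper name queue_new_links is hypothetical, but the logic mirrors the final crawler below, where seen maps each URL to its depth and links are only queued from pages shallower than max_depth:

def queue_new_links(url, links, crawl_queue, seen, max_depth):
    # hypothetical helper illustrating the depth check used in the final crawler
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1       # record the new page's depth
                crawl_queue.append(link)

# usage sketch: seen = {seed_url: 0} replaces the earlier seen = set([seed_url])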
Summary
Introduced web crawlers
Developed a mature web crawler that can be reused in later chapters
Introduced how to use several external tools and modules
For understanding the target website, user agents, sitemaps, crawl delays, and various crawling strategies
Final version
import re
import urlparse
import urllib2
import time
from datetime import datetime
import robotparser
import Queue


def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1, headers=None, user_agent='wswp', proxy=None, num_retries=1):
    """Crawl from the given seed URL following links matched by link_regex
    """
    # the queue of URL's that still need to be crawled
    crawl_queue = Queue.deque([seed_url])
    # the URL's that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []

            depth = seen[url]
            if depth != max_depth:
                # can still crawl further
                if link_regex:
                    # filter for links matching our regular expression
                    links.extend(link for link in get_links(html) if re.match(link_regex, link))

                for link in links:
                    link = normalize(seed_url, link)
                    # check whether already crawled this link
                    if link not in seen:
                        seen[link] = depth + 1
                        # check link is within same domain
                        if same_domain(seed_url, link):
                            # success! add this new link to queue
                            crawl_queue.append(link)

            # check whether have reached downloaded maximum
            num_urls += 1
            if num_urls == max_urls:
                break
        else:
            print 'Blocked by robots.txt:', url


class Throttle:
    """Throttle downloading by sleeping between requests to same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()


def download(url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                return download(url, headers, proxy, num_retries-1, data)
        else:
            code = None
    return html


def normalize(seed_url, link):
    """Normalize this URL by removing hash and adding domain
    """
    link, _ = urlparse.urldefrag(link)  # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)


def same_domain(url1, url2):
    """Return True if both URL's belong to same domain
    """
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc


def get_robots(url):
    """Initialize robots parser for this domain
    """
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp


def get_links(html):
    """Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', '/(index|view)', max_depth=1)