pysitemap

Sitemap generator

Installation

pip install sitemap-generator

Requirements

asyncio
aiofile
aiohttp

Example 1

import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    # Optional: on Windows, pass --iocp to use the IOCP-based proactor event loop.
    if '--iocp' in sys.argv:
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    crawler(root_url, out_file='sitemap.xml')
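Note that Python's logging module suppresses INFO-level messages unless the root logger has been configured, so the 'using iocp' message above is silent by default. To see the example's log output, configure logging before running it:

import logging

# Show INFO-level messages (suppressed by default) before running the example.
logging.basicConfig(level=logging.INFO)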

Example 2

import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    root_url = 'https://mytestsite.com/'
    crawler(
        root_url,
        out_file='sitemap.xml',
        maxtasks=100,
        verifyssl=False,
        findimages=True,
        images_this_domain=True,
        # Regex patterns for URLs that should not be crawled
        exclude_urls=[
            '/git/.*(action|commit|stars|activity|followers|following|\?sort|issues|pulls|milestones|archive|/labels$|/wiki$|/releases$|/forks$|/watchers$)',
            '/git/user/(sign_up|login|forgot_password)',
            '/css',
            '/js',
            'favicon',
            '[a-zA-Z0-9]*\.[a-zA-Z0-9]*$',
            '\?\.php',
        ],
        # Regex patterns for images that should not be listed
        exclude_imgs=[
            'logo\.(png|jpg)',
            'avatars',
            'avatar_default',
            '/symbols/'
        ],
        image_root_urls=[
            'https://mytestsite.com/photos/',
            'https://mytestsite.com/git/',
        ],
        use_lastmodified=False,
        headers={'User-Agent': 'Crawler'},
        # TZ offset in hours
        timezone_offset=3,
        changefreq={
            "/git/": "weekly",
            "/":     "monthly"
        },
        priorities={
            "/git/": 0.7,
            "/metasub/": 0.6,
            "/": 0.5
        }
    )
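The generated sitemap.xml follows the standard sitemaps.org protocol. With the options above (lastmod disabled, changefreq and priority rules applied), a single entry might look roughly like the following sketch; the exact markup pysitemap emits may differ:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://mytestsite.com/git/</loc>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>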

TODO

  • Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database, and write Queue and Done backend classes (see the sketch after this list) based on:

    • lists

    • an SQLite database

    • Redis

  • Write an API so that users can extend the crawler with their own backends
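As an illustration of that TODO item, here is a minimal sketch of what a pluggable backend could look like. The class and method names below are hypothetical and do not exist in pysitemap; they only show the idea of moving the crawl queue from an in-memory list into SQLite:

import sqlite3

class SQLiteTodoBackend:
    """Hypothetical backend that keeps the crawl queue in SQLite."""

    def __init__(self, path='todo.db'):
        self.db = sqlite3.connect(path)
        self.db.execute('CREATE TABLE IF NOT EXISTS todo (url TEXT PRIMARY KEY)')

    def add(self, url):
        # INSERT OR IGNORE keeps the queue free of duplicate URLs.
        self.db.execute('INSERT OR IGNORE INTO todo VALUES (?)', (url,))
        self.db.commit()

    def pop(self):
        # Fetch and remove one pending URL; return None when the queue is empty.
        row = self.db.execute('SELECT url FROM todo LIMIT 1').fetchone()
        if row is None:
            return None
        self.db.execute('DELETE FROM todo WHERE url = ?', (row[0],))
        self.db.commit()
        return row[0]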

Changelog

v. 0.9.3

Added features:

  • Option to enable/disable website SSL certificate verification (True/False)

  • Option to exclude URL patterns (list)

  • Option to provide custom HTTP request headers to web server (dict)

  • Add support for <lastmod> tags (XML)

    • Configurable timezone offset for the lastmod tag (see the sketch after this list)
  • Add support for <changefreq> tags (XML)

    • Input (dict): { url_regex: changefreq_value, url_regex: ... }
  • Add support for <priority> tags (XML)

    • Input (dict): { url_regex: priority_value, url_regex: ... }
  • Reduce default concurrent max tasks from 100 to 10
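The sitemap protocol expects lastmod values in W3C Datetime format. As a sketch of what timezone_offset=3 means for the emitted timestamp (illustrative only, not pysitemap's internal code):

import datetime

# timezone_offset=3 corresponds to UTC+03:00.
tz = datetime.timezone(datetime.timedelta(hours=3))
lastmod = datetime.datetime(2021, 5, 1, 12, 30, tzinfo=tz)
print(lastmod.isoformat())   # 2021-05-01T12:30:00+03:00, as used in <lastmod>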

v. 0.9.2

  • Todo queue and done list backends

  • Created a very slow SQLite backend for the todo queue and done lists (writing 1000 URLs takes about 3 minutes)

  • Tests for the sqlite_todo backend

v. 0.9.1

  • Extended README

  • Added docstrings and code comments

v. 0.9.0

  • Since this version, the package supports only Python >= 3.7

  • All functions were rewritten, but the API is preserved: if you already use this package, just update it, install the requirements, and run it as before

  • All requests run asynchronously