# pysitemap

> Sitemap generator

## Installation

```
pip install sitemap-generator
```

## Requirements

```
asyncio
aiofile
aiohttp
```

## Example 1

```
import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    if '--iocp' in sys.argv:
        # On Windows, switch to the IOCP-based proactor event loop.
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    crawler(root_url, out_file='sitemap.xml')
```

## Example 2

```
import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    root_url = 'https://mytestsite.com/'
    crawler(
        root_url,
        out_file='sitemap.xml',
        maxtasks=100,
        verifyssl=False,
        # Regex patterns for URLs that should not be crawled.
        # Raw strings avoid invalid escape sequences in the patterns.
        exclude_urls=[
            r'/git/.*(action|commit|stars|activity|followers|following|\?sort|issues|pulls|milestones|archive|/labels$|/wiki$|/releases$|/forks$|/watchers$)',
            r'/git/user/(sign_up|login|forgot_password)',
            r'/css',
            r'/js',
            r'favicon',
            r'[a-zA-Z0-9]*\.[a-zA-Z0-9]*$',
            r'\?\.php',
        ],
        headers={'User-Agent': 'Crawler'},
        # Timezone offset in hours, applied to <lastmod> timestamps
        timezone_offset=3,
        # Map URL regex -> <changefreq> value
        changefreq={
            "/git/": "weekly",
            "/": "monthly"
        },
        # Map URL regex -> <priority> value
        priorities={
            "/git/": 0.7,
            "/metasub/": 0.6,
            "/": 0.5
        }
    )
```
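
The `exclude_urls` entries are regular expressions, so you can sanity-check them against sample URLs before starting a long crawl. Below is a minimal sketch that assumes a URL is skipped when any pattern matches via `re.search`; the exact matching rule inside the crawler is an assumption here, not documented behavior.

```
import re

# Patterns copied from Example 2; the matching rule (re.search
# against the full URL) is an assumption for illustration.
exclude_urls = [
    r'/git/user/(sign_up|login|forgot_password)',
    r'/css',
    r'favicon',
]

def is_excluded(url):
    # True if any exclusion pattern matches somewhere in the URL.
    return any(re.search(pattern, url) for pattern in exclude_urls)

print(is_excluded('https://mytestsite.com/git/user/login'))  # True
print(is_excluded('https://mytestsite.com/blog/post-1'))     # False
```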

### TODO

- Sites with more than 100K pages will use more than 100 MB of memory, because the todo queue and done list are kept in memory. Move them into a database by writing Queue and Done backend classes based on
  - Lists
  - SQLite database
  - Redis
- Write an API so users can plug in their own backends (a possible shape is sketched below)
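
No such backend API exists yet; the following is only a hypothetical sketch of what a combined queue/done-list interface could look like. Every name in it (`TodoBackend`, `put`, `pop`, `mark_done`, `is_done`) is invented for illustration and is not part of pysitemap.

```
from abc import ABC, abstractmethod

class TodoBackend(ABC):
    """Hypothetical interface for pluggable todo-queue/done-list storage."""

    @abstractmethod
    async def put(self, url: str) -> None:
        """Add a URL to the todo queue."""

    @abstractmethod
    async def pop(self) -> str:
        """Remove and return the next URL to crawl."""

    @abstractmethod
    async def mark_done(self, url: str) -> None:
        """Record that a URL has been crawled."""

    @abstractmethod
    async def is_done(self, url: str) -> bool:
        """Return True if the URL was already crawled."""

class InMemoryTodoBackend(TodoBackend):
    """List/set-based reference implementation."""

    def __init__(self):
        self._queue = []
        self._done = set()

    async def put(self, url):
        if url not in self._done and url not in self._queue:
            self._queue.append(url)

    async def pop(self):
        return self._queue.pop(0)

    async def mark_done(self, url):
        self._done.add(url)

    async def is_done(self, url):
        return url in self._done
```

An SQLite or Redis backend would then only need to implement the same four storage methods.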

## Changelog

**v. 0.9.3**

Added features:

- Option to enable/disable website SSL certificate verification (True/False)
- Option to exclude URL patterns (`list`)
- Option to provide custom HTTP request headers to the web server (`dict`)
- Support for `<lastmod>` tags (XML)
  - Configurable timezone offset for the `<lastmod>` tag
- Support for `<changefreq>` tags (XML)
  - Input (`dict`): `{ url_regex: changefreq_value, url_regex: ... }`
- Support for `<priority>` tags (XML)
  - Input (`dict`): `{ url_regex: priority_value, url_regex: ... }`
- Reduced the default maximum number of concurrent tasks from `100` to `10`

**v. 0.9.2**

- Todo queue and done list backends
- Added a very slow SQLite backend for the todo queue and done lists (writing 1000 URLs takes about 3 minutes)
- Tests for the sqlite_todo backend

**v. 0.9.1**

- Extended readme
- Docstrings and code comments

**v. 0.9.0**

- From this version on, the package supports only Python `>=3.7`
- All functions were rewritten, but the API is unchanged: if you use this package, just update it, install the requirements, and run as before
- All requests are performed asynchronously