
README: Fix & change formatting

master
Pekka Helenius, 4 years ago
commit 4b8c535e9a
2 changed files with 138 additions and 136 deletions
  1. README.md   +138  -0
  2. README.rst  +0    -136

README.md  +138  -0

@@ -0,0 +1,138 @@
# pysitemap
> Sitemap generator
## Installation
```
pip install sitemap-generator
```
## Requirements
```
asyncio
aiofile
aiohttp
```
## Example 1
```python
import sys
import logging

from pysitemap import crawler

if __name__ == '__main__':
    if '--iocp' in sys.argv:
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    crawler(root_url, out_file='sitemap.xml')
```
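Saved as a normal script (for example `crawl.py`, a filename chosen here only for illustration) and run with `python crawl.py`, this crawls `root_url` and writes the sitemap to `sitemap.xml` in the current working directory. The optional `--iocp` flag switches to the IOCP-based `ProactorEventLoop` on Windows.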
## Example 2
```python
import sys
import logging

from pysitemap import crawler

if __name__ == '__main__':
    root_url = 'https://mytestsite.com/'
    crawler(
        root_url,
        out_file='sitemap.xml',
        maxtasks=100,
        verifyssl=False,
        exclude_urls=[
            '/git/.*(action|commit|stars|activity|followers|following|\?sort|issues|pulls|milestones|archive|/labels$|/wiki$|/releases$|/forks$|/watchers$)',
            '/git/user/(sign_up|login|forgot_password)',
            '/css',
            '/js',
            'favicon',
            '[a-zA-Z0-9]*\.[a-zA-Z0-9]*$',
            '\?\.php',
        ],
        headers={'User-Agent': 'Crawler'},
        # TZ offset in hours
        timezone_offset=3,
        changefreq={
            "/git/": "weekly",
            "/": "monthly"
        },
        priorities={
            "/git/": 0.7,
            "/metasub/": 0.6,
            "/": 0.5
        }
    )
```
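The `exclude_urls`, `changefreq`, and `priorities` options all appear to take regular-expression patterns matched against crawled URLs: URLs matching an `exclude_urls` pattern are skipped, while the values of a matching `changefreq` or `priorities` pattern end up in the `<changefreq>` and `<priority>` tags of the generated sitemap (see the changelog below for the expected input format).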
### TODO
- Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database, and write Queue and Done backend classes (a rough interface sketch follows this list) based on:
  - Lists
  - SQLite database
  - Redis
- Write an API so users can plug in their own backends
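None of these backends exists in the package yet; the sketch below is only a rough illustration of the idea, not part of the pysitemap API. The class names (`TodoBackend`, `SqliteTodoBackend`) and method signatures are assumptions made up for this example.

```python
# Hypothetical sketch only -- pysitemap does not expose these classes today.
import abc
import sqlite3


class TodoBackend(abc.ABC):
    """Interface a user-supplied "todo queue" backend could implement."""

    @abc.abstractmethod
    async def put(self, url: str) -> None:
        """Add a URL to the pending queue."""

    @abc.abstractmethod
    async def get(self) -> str:
        """Pop the next pending URL; raise KeyError when the queue is empty."""


class SqliteTodoBackend(TodoBackend):
    """Keeps the pending-URL queue on disk instead of in memory."""

    def __init__(self, path: str = 'todo.db'):
        self._db = sqlite3.connect(path)
        self._db.execute('CREATE TABLE IF NOT EXISTS todo (url TEXT PRIMARY KEY)')

    async def put(self, url: str) -> None:
        # Blocking sqlite3 calls are used here only to keep the sketch short.
        self._db.execute('INSERT OR IGNORE INTO todo (url) VALUES (?)', (url,))
        self._db.commit()

    async def get(self) -> str:
        row = self._db.execute('SELECT url FROM todo LIMIT 1').fetchone()
        if row is None:
            raise KeyError('todo queue is empty')
        self._db.execute('DELETE FROM todo WHERE url = ?', (row[0],))
        self._db.commit()
        return row[0]
```

A Done-list backend would look similar, with a membership check (for example an `is_done(url)` method) instead of a pop operation.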
## Changelog
**v. 0.9.3**

Added features:

- Option to enable/disable website SSL certificate verification (True/False)
- Option to exclude URL patterns (`list`)
- Option to provide custom HTTP request headers to the web server (`dict`)
- Support for `<lastmod>` tags (XML)
  - Configurable timezone offset for the `<lastmod>` tag
- Support for `<changefreq>` tags (XML)
  - Input (`dict`): `{ url_regex: changefreq_value, url_regex: ... }`
- Support for `<priority>` tags (XML)
  - Input (`dict`): `{ url_regex: priority_value, url_regex: ... }`
- Reduced the default number of concurrent tasks from `100` to `10`

**v. 0.9.2**

- todo queue and done list backends
- created a very slow SQLite backend for the todo queue and done lists (writing 1000 URLs takes about 3 minutes)
- tests for the sqlite_todo backend

**v. 0.9.1**

- extended README
- docstrings and code comments

**v. 0.9.0**

- since this version, the package supports only Python `>=3.7`
- all functions were rewritten, but the API is unchanged: if you already use this package, just update it, install the requirements, and run it as before
- all requests run asynchronously

README.rst  +0  -136

@@ -1,136 +0,0 @@
pysitemap
=========
Sitemap generator
installing
----------

::

    pip install sitemap-generator

requirements
------------

::

    asyncio
    aiofile
    aiohttp
example 1
---------

::

    import sys
    import logging

    from pysitemap import crawler

    if __name__ == '__main__':
        if '--iocp' in sys.argv:
            from asyncio import events, windows_events
            sys.argv.remove('--iocp')
            logging.info('using iocp')
            el = windows_events.ProactorEventLoop()
            events.set_event_loop(el)

        # root_url = sys.argv[1]
        root_url = 'https://www.haikson.com'
        crawler(root_url, out_file='sitemap.xml')
example 2
---------

::

    import sys
    import logging

    from pysitemap import crawler

    if __name__ == '__main__':
        root_url = 'https://mytestsite.com/'
        crawler(
            root_url,
            out_file='sitemap.xml',
            maxtasks=100,
            verifyssl=False,
            exclude_urls=[
                '/git/.*(action|commit|stars|activity|followers|following|\?sort|issues|pulls|milestones|archive|/labels$|/wiki$|/releases$|/forks$|/watchers$)',
                '/git/user/(sign_up|login|forgot_password)',
                '/css',
                '/js',
                'favicon',
                '[a-zA-Z0-9]*\.[a-zA-Z0-9]*$',
                '\?\.php',
            ],
            headers={'User-Agent': 'Crawler'},
            # TZ offset in hours
            timezone_offset=3,
            changefreq={
                "/git/": "weekly",
                "/": "monthly"
            },
            priorities={
                "/git/": 0.7,
                "/metasub/": 0.6,
                "/": 0.5
            }
        )
TODO
-----

- Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database, and write Queue and Done backend classes based on:

  - Lists
  - SQLite database
  - Redis

- Write an API so users can plug in their own backends
changelog
---------
v. 0.9.3
''''''''
Added features:

- Option to enable/disable website SSL certificate verification (True/False)
- Option to exclude URL patterns (list)
- Option to provide custom HTTP request headers to the web server (dict)
- Support for <lastmod> tags (XML)

  - Configurable timezone offset for the <lastmod> tag

- Support for <changefreq> tags (XML)

  - Input (dict): { url_regex: changefreq_value, url_regex: ... }

- Support for <priority> tags (XML)

  - Input (dict): { url_regex: priority_value, url_regex: ... }

- Reduced the default number of concurrent tasks from 100 to 10
v. 0.9.2
''''''''
- todo queue and done list backends
- created a very slow SQLite backend for the todo queue and done lists (writing 1000 URLs takes about 3 minutes)
- tests for the sqlite_todo backend
v. 0.9.1
''''''''
- extended README
- docstrings and code comments
v. 0.9.0
''''''''
- since this version, the package supports only Python >=3.7
- all functions were rewritten, but the API is unchanged: if you already use this package, just update it, install the requirements, and run it as before
- all requests run asynchronously
