# pysitemap

> Sitemap generator

## Installation

```
pip install sitemap-generator
```

## Requirements

```
asyncio
aiofile
aiohttp
```
## Example 1

```
import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    # Optional: pass --iocp to use the Windows proactor (IOCP) event loop
    if '--iocp' in sys.argv:
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    # Crawl the site and write the result to sitemap.xml
    crawler(root_url, out_file='sitemap.xml')
```
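Once the crawl finishes, it can be useful to check what ended up in `sitemap.xml`. The following is a minimal sketch, assuming the generated file follows the standard sitemaps.org schema (a `<urlset>` of `<url><loc>` entries in the `http://www.sitemaps.org/schemas/sitemap/0.9` namespace). The helper `list_sitemap_urls` and the `SAMPLE` document are hypothetical and not part of pysitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def list_sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall('sm:url/sm:loc', SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample standing in for the contents of a generated sitemap.xml
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.haikson.com/</loc></url>
  <url><loc>https://www.haikson.com/about/</loc></url>
</urlset>"""

for url in list_sitemap_urls(SAMPLE):
    print(url)
```

To inspect a real file, read it first (for example `open('sitemap.xml', encoding='utf-8').read()`) and pass the text to the same helper.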
## Example 2

```
from pysitemap import crawler

if __name__ == '__main__':
    root_url = 'https://mytestsite.com/'
    crawler(
        root_url,
        out_file='sitemap.xml',
        maxtasks=100,
        verifyssl=False,
        findimages=True,
        images_this_domain=True,
        # Regex patterns (raw strings) for URLs to leave out of the sitemap
        exclude_urls=[
            r'/git/.*(action|commit|stars|activity|followers|following|\?sort|issues|pulls|milestones|archive|/labels$|/wiki$|/releases$|/forks$|/watchers$)',
            r'/git/user/(sign_up|login|forgot_password)',
            r'/css',
            r'/js',
            r'favicon',
            r'[a-zA-Z0-9]*\.[a-zA-Z0-9]*$',
            r'\?\.php',
        ],
        # Regex patterns for images to leave out of the sitemap
        exclude_imgs=[
            r'logo\.(png|jpg)',
            r'avatars',
            r'avatar_default',
            r'/symbols/',
        ],
        image_root_urls=[
            'https://mytestsite.com/photos/',
            'https://mytestsite.com/git/',
        ],
        use_lastmodified=False,
        headers={'User-Agent': 'Crawler'},
        # TZ offset in hours
        timezone_offset=3,
        # URL regex -> <changefreq> value
        changefreq={
            "/git/": "weekly",
            "/": "monthly",
        },
        # URL regex -> <priority> value
        priorities={
            "/git/": 0.7,
            "/metasub/": 0.6,
            "/": 0.5,
        },
    )
```
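To see how regex-keyed options such as `exclude_urls`, `changefreq`, and `priorities` could be evaluated, here is a minimal sketch. It assumes that exclusion patterns are searched anywhere in the URL and that the mapping options use the first pattern that matches the start of the URL path, in dictionary order. Both rules are assumptions for illustration and may differ from what pysitemap does internally. The helpers `is_excluded` and `first_match` are hypothetical names.

```python
import re


def is_excluded(url, exclude_patterns):
    """Return True if any exclusion regex is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in exclude_patterns)


def first_match(url_path, rules, default=None):
    """Return the value of the first rule whose regex matches the start of the path.

    Dictionaries keep insertion order, so more specific patterns
    should be listed before catch-all ones such as "/".
    """
    for pattern, value in rules.items():
        if re.match(pattern, url_path):
            return value
    return default


exclude_urls = [r'/css', r'/js', r'favicon']
priorities = {'/git/': 0.7, '/metasub/': 0.6, '/': 0.5}

print(is_excluded('https://mytestsite.com/css/site.css', exclude_urls))  # True
print(is_excluded('https://mytestsite.com/photos/', exclude_urls))       # False
print(first_match('/git/project', priorities))                           # 0.7
print(first_match('/blog/post', priorities))                             # 0.5
```

Under these assumptions, the catch-all `"/"` entry belongs last, as in the example above; otherwise it would shadow the more specific patterns.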
### TODO

- Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database. Write Queue and Done backend classes based on:
  - Lists
  - SQLite database
  - Redis
- Write an API for extending the crawler with user-defined backends
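The backend idea above can be sketched as a small common interface with interchangeable implementations. This is only an illustration of the design, not the package's actual backend API: the class names `TodoBackend`, `ListTodo`, and `SQLiteTodo`, their methods, and the table layout are all hypothetical.

```python
import sqlite3
from abc import ABC, abstractmethod


class TodoBackend(ABC):
    """Hypothetical interface for a store of URLs waiting to be crawled."""

    @abstractmethod
    def add(self, url):
        """Queue a URL unless it is already queued."""

    @abstractmethod
    def pop(self):
        """Remove and return the oldest queued URL, or None if empty."""

    @abstractmethod
    def __len__(self):
        """Return the number of queued URLs."""


class ListTodo(TodoBackend):
    """In-memory backend: simple, but memory grows with the site size."""

    def __init__(self):
        self._urls = []

    def add(self, url):
        if url not in self._urls:
            self._urls.append(url)

    def pop(self):
        return self._urls.pop(0) if self._urls else None

    def __len__(self):
        return len(self._urls)


class SQLiteTodo(TodoBackend):
    """SQLite backend: keeps the queue in a database instead of a Python list."""

    def __init__(self, path=':memory:'):
        self._db = sqlite3.connect(path)
        self._db.execute(
            'CREATE TABLE IF NOT EXISTS todo ('
            'id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE)'
        )

    def add(self, url):
        # INSERT OR IGNORE skips URLs that are already queued
        with self._db:
            self._db.execute('INSERT OR IGNORE INTO todo (url) VALUES (?)', (url,))

    def pop(self):
        # The connection context manager commits the delete on success
        with self._db:
            row = self._db.execute(
                'SELECT id, url FROM todo ORDER BY id LIMIT 1'
            ).fetchone()
            if row is None:
                return None
            self._db.execute('DELETE FROM todo WHERE id = ?', (row[0],))
            return row[1]

    def __len__(self):
        return self._db.execute('SELECT COUNT(*) FROM todo').fetchone()[0]


# Both backends behave the same way from the caller's point of view
for backend in (ListTodo(), SQLiteTodo()):
    backend.add('https://mytestsite.com/')
    backend.add('https://mytestsite.com/')  # duplicate, ignored
    backend.add('https://mytestsite.com/git/')
    print(type(backend).__name__, len(backend), backend.pop())
```

A Redis backend or a user-supplied one would implement the same three methods, so the crawler would only depend on the interface. Passing a file path instead of `':memory:'` to `SQLiteTodo` would keep the queue on disk.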
## Changelog

**v. 0.9.3**

Added features:

- Option to enable/disable website SSL certificate verification (True/False)
- Option to exclude URL patterns (`list`)
- Option to provide custom HTTP request headers to web server (`dict`)
- Add support for `<lastmod>` tags (XML)
  - Configurable timezone offset for lastmod tag
- Add support for `<changefreq>` tags (XML)
  - Input (`dict`): `{ url_regex: changefreq_value, url_regex: ... }`
- Add support for `<priority>` tags (XML)
  - Input (`dict`): `{ url_regex: priority_value, url_regex: ... }`
- Reduce default concurrent max tasks from `100` to `10`

**v. 0.9.2**

- Todo queue and done list backends
- Created a very slow SQLite backend for the todo queue and done lists (writing 1000 URLs takes 3 minutes)
- Tests for the sqlite_todo backend

**v. 0.9.1**

- Extended readme
- Docstrings and code comments

**v. 0.9.0**

- Since this version, the package supports only Python `>=3.7`
- All functions were rewritten, but the API is unchanged. If you use this package, just update it, install the requirements, and run it as before
- All requests are made asynchronously