pysitemap
=========

Sitemap generator

installing
----------

::

    pip install sitemap-generator

requirements
------------

::

    asyncio
    aiofile
    aiohttp

example 1
---------

::

    import sys
    import logging
    from pysitemap import crawler

    if __name__ == '__main__':
        if '--iocp' in sys.argv:
            # Optional (Windows only): switch to the IOCP-based proactor event loop
            from asyncio import events, windows_events
            sys.argv.remove('--iocp')
            logging.info('using iocp')
            el = windows_events.ProactorEventLoop()
            events.set_event_loop(el)

        # root_url = sys.argv[1]
        root_url = 'https://www.haikson.com'
        crawler(root_url, out_file='sitemap.xml')

example 2
---------

::

    from pysitemap import crawler

    if __name__ == '__main__':
        root_url = 'https://mytestsite.com/'
        crawler(
            root_url,
            out_file='sitemap.xml',
            maxtasks=100,
            verifyssl=False,
            # Regular expressions; raw strings keep the backslashes intact
            exclude_urls=[
                r'/git/.*(action|commit|stars|activity|followers|following|\?sort|issues|pulls|milestones|archive|/labels$|/wiki$|/releases$|/forks$|/watchers$)',
                r'/git/user/(sign_up|login|forgot_password)',
                r'/css',
                r'/js',
                r'favicon',
                r'[a-zA-Z0-9]*\.[a-zA-Z0-9]*$',
                r'\?\.php',
            ],
            headers={'User-Agent': 'Crawler'},
            # TZ offset in hours
            timezone_offset=3,
            changefreq={
                "/git/": "weekly",
                "/": "monthly"
            },
            priorities={
                "/git/": 0.7,
                "/metasub/": 0.6,
                "/": 0.5
            }
        )

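To get a feel for what ``exclude_urls`` patterns like the ones above would filter out, the sketch below checks a few sample URLs with the standard ``re`` module. It uses a subset of the patterns and assumes each one is applied with ``re.search`` against the full URL; how ``crawler`` applies them internally is not documented here. ``is_excluded`` is a hypothetical helper, not part of pysitemap.

```python
import re

# A subset of the patterns from example 2, written as raw strings.
EXCLUDE_URLS = [
    r'/git/user/(sign_up|login|forgot_password)',
    r'/css',
    r'/js',
    r'favicon',
]


def is_excluded(url, patterns=EXCLUDE_URLS):
    """Return True if any pattern matches anywhere in the URL (assumed semantics)."""
    return any(re.search(pattern, url) for pattern in patterns)


if __name__ == '__main__':
    for url in (
        'https://mytestsite.com/git/user/login',
        'https://mytestsite.com/css/main',
        'https://mytestsite.com/about',
    ):
        print(url, is_excluded(url))
```

Under these assumed semantics a short pattern such as ``/js`` also matches paths like ``/json-api``, so it can be worth testing patterns against real URLs before a long crawl.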
TODO
-----

- Big sites with more than 100K pages will use more than 100 MB of
  memory. Move the queue and done lists into a database. Write Queue and Done
  backend classes based on:

  - Lists
  - SQLite database
  - Redis

- Write an API for extending with user-defined backends

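One possible shape for such pluggable backends is sketched below, using only the standard library. The class and method names (``TodoBackend``, ``ListTodo``, ``SQLiteTodo``, ``add``, ``pop``) are hypothetical and are not taken from the pysitemap code base; the sketch only illustrates the idea of a common interface with a list-based and an SQLite-based implementation.

```python
import sqlite3
from abc import ABC, abstractmethod


class TodoBackend(ABC):
    """Hypothetical interface for the queue of URLs still to be crawled."""

    @abstractmethod
    def add(self, url):
        """Queue a URL unless it is already queued."""

    @abstractmethod
    def pop(self):
        """Remove and return the oldest queued URL, or None if empty."""

    @abstractmethod
    def __len__(self):
        """Return the number of queued URLs."""


class ListTodo(TodoBackend):
    """In-memory backend: simple, but memory grows with the site."""

    def __init__(self):
        self._urls = []

    def add(self, url):
        if url not in self._urls:
            self._urls.append(url)

    def pop(self):
        return self._urls.pop(0) if self._urls else None

    def __len__(self):
        return len(self._urls)


class SQLiteTodo(TodoBackend):
    """SQLite backend: pass a file path to keep the queue on disk."""

    def __init__(self, path=':memory:'):
        # ':memory:' is only a convenient default for trying the sketch out.
        self._db = sqlite3.connect(path)
        self._db.execute(
            'CREATE TABLE IF NOT EXISTS todo '
            '(id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE)'
        )

    def add(self, url):
        # INSERT OR IGNORE skips URLs that are already queued.
        with self._db:
            self._db.execute('INSERT OR IGNORE INTO todo (url) VALUES (?)', (url,))

    def pop(self):
        # The connection context manager commits the delete on success.
        with self._db:
            row = self._db.execute(
                'SELECT id, url FROM todo ORDER BY id LIMIT 1'
            ).fetchone()
            if row is None:
                return None
            self._db.execute('DELETE FROM todo WHERE id = ?', (row[0],))
            return row[1]

    def __len__(self):
        return self._db.execute('SELECT COUNT(*) FROM todo').fetchone()[0]
```

Because both classes share the same three methods, a crawler written against ``TodoBackend`` could switch storage without other changes; a Redis variant would follow the same pattern.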
changelog
---------

v. 0.9.3
''''''''

Added features:

- Option to enable/disable website SSL certificate verification (True/False)
- Option to exclude URL patterns (list)
- Option to provide custom HTTP request headers to web server (dict)
- Add support for <lastmod> tags (XML)

  - Configurable timezone offset for lastmod tag

- Add support for <changefreq> tags (XML)

  - Input (dict): { url_regex: changefreq_value, url_regex: ... }

- Add support for <priority> tags (XML)

  - Input (dict): { url_regex: priority_value, url_regex: ... }

- Reduce default concurrent max tasks from 100 to 10

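The sketch below shows one way the ``{ url_regex: value }`` dictionaries and the timezone offset could combine into a single ``<url>`` element. It assumes that the first matching regex wins and that ``<lastmod>`` is written as an ISO 8601 timestamp; both are assumptions, and ``first_match`` and ``url_entry`` are hypothetical helpers rather than pysitemap functions.

```python
import re
from datetime import datetime, timedelta, timezone
from xml.sax.saxutils import escape


def first_match(url, rules):
    """Return the value of the first regex key that matches url, else None."""
    for pattern, value in rules.items():
        if re.search(pattern, url):
            return value
    return None


def url_entry(url, when, timezone_offset=0, changefreq=None, priorities=None):
    """Build one <url> element as a string; `when` must be timezone-aware."""
    tz = timezone(timedelta(hours=timezone_offset))
    lastmod = when.astimezone(tz).isoformat(timespec='seconds')
    parts = ['<loc>%s</loc>' % escape(url), '<lastmod>%s</lastmod>' % lastmod]
    freq = first_match(url, changefreq or {})
    if freq is not None:
        parts.append('<changefreq>%s</changefreq>' % freq)
    prio = first_match(url, priorities or {})
    if prio is not None:
        parts.append('<priority>%s</priority>' % prio)
    return '<url>' + ''.join(parts) + '</url>'


if __name__ == '__main__':
    when = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(url_entry(
        'https://mytestsite.com/git/repo',
        when,
        timezone_offset=3,
        changefreq={"/git/": "weekly", "/": "monthly"},
        priorities={"/git/": 0.7, "/": 0.5},
    ))
```

With first-match semantics the order of the dictionary matters: a catch-all pattern such as ``/`` matches every URL, so it belongs last.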
v. 0.9.2
''''''''

- todo queue and done list backends
- created a very slow sqlite backend for the todo queue and done lists (writing 1000 URLs takes 3 minutes)
- tests for sqlite_todo backend

v. 0.9.1
''''''''

- extended readme
- docstrings and code comments

v. 0.9.0
''''''''

- starting with this version, the package supports only Python >= 3.7
- all functions were rewritten, but the API is preserved. If you use this
  package, just update it, install the requirements and run your process
- all requests work asynchronously

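As a rough illustration of asynchronous requests with a cap on concurrent tasks, the sketch below uses only ``asyncio``. It is not the package's implementation: ``fetch`` is a stand-in that sleeps instead of calling the network, and ``crawl`` with its ``maxtasks`` argument is a hypothetical helper that merely borrows the name of the ``crawler`` option.

```python
import asyncio


async def fetch(url):
    """Stand-in for an HTTP request; sleeps instead of touching the network."""
    await asyncio.sleep(0.01)
    return url, 200


async def crawl(urls, maxtasks=10):
    """Fetch all urls concurrently, with at most maxtasks requests in flight."""
    semaphore = asyncio.Semaphore(maxtasks)

    async def bounded_fetch(url):
        # The semaphore blocks here once maxtasks fetches are running.
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


if __name__ == '__main__':
    pages = ['https://mytestsite.com/page%d' % i for i in range(25)]
    results = asyncio.run(crawl(pages))
    print(len(results))
```

A lower ``maxtasks`` value puts less load on the target server at the cost of a longer crawl.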