- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# URL domain registrar variation analysis\n",
- "\n",
- "Author: Pekka Helenius, 2021\n",
- "\n",
- "- Analyzes given URLs and stores results into a new JSON data file\n",
- "- Outputs associated domain registrars for each input URL as a plot\n",
- " - \"Phishing campaigns register domains of websites from the same registrar (than the legitimate URL)\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "URL data: https://hoxhunt.com/\n",
- "URL data: https://hs.fi\n",
- "URL data: https://ts.fi\n",
- "URL data: https://facebook.com\n",
- "Generate statistics: https://hoxhunt.com/\n"
- ]
- },
- {
- "data": {
- "image/png": "(base64-encoded PNG plot data omitted)",
- "text/plain": [
- "<Figure size 432x288 with 1 Axes>"
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Generate statistics: https://hs.fi\n"
- ]
- },
- {
- "data": {
- "image/png": "(base64-encoded PNG plot data omitted)",
- "text/plain": [
- "<Figure size 648x576 with 1 Axes>"
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Generate statistics: https://ts.fi\n"
- ]
- },
- {
- "data": {
- "image/png": "(base64-encoded PNG plot data omitted)",
- "text/plain": [
- "<Figure size 648x576 with 1 Axes>"
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Generate statistics: https://facebook.com\n"
- ]
- },
- {
- "data": {
- "image/png": "(base64-encoded PNG plot data omitted)",
- "text/plain": [
- "<Figure size 648x576 with 1 Axes>"
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "#!/usr/bin/env python3\n",
- "\n",
- "\"\"\"\n",
- "URL data extractor\n",
- "\n",
- "Pekka Helenius <pekka [dot] helenius [at] fjordtek [dot] com>\n",
- "\n",
- "Requirements:\n",
- "\n",
- "Python 3\n",
- "Python 3 requests (python-requests)\n",
- "Python 3 BeautifulSoup4 (python-beautifulsoup4)\n",
- "Python 3 whois (python-whois; PyPI)\n",
- "Python 3 JSON Schema (python-jsonschema)\n",
- "Python 3 NumPy (python-numpy)\n",
- "Python 3 matplotlib (python-matplotlib)\n",
- "\n",
- "TODO: URL domain part length comparison analysis\n",
- "TODO: URL non-TLD part length comparison analysis\n",
- "  - in phishing webpages, URLs tend to be much longer than in legitimate webpages;\n",
- "    however, the domains themselves (without TLD) tend to be much shorter\n",
- "  - phishing URLs often contain more dots and subdomains than legitimate URLs\n",
- "  - legitimate: robots.txt redirects bots to a legitimate domain rather than to the original phishing domain\n",
- "\n",
- "TODO: Website visual similarity analysis\n",
- "TODO: consistency of RDN usage in HTML data\n",
- "\"\"\"\n",
- "\n",
- "######################################\n",
- "\n",
- "%matplotlib inline\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "from bs4 import BeautifulSoup as bs\n",
- "from collections import Counter\n",
- "from datetime import date, datetime\n",
- "import json\n",
- "import os\n",
- "import re\n",
- "import requests\n",
- "from time import sleep\n",
- "import urllib\n",
- "from whois import whois\n",
- "\n",
- "# Target URLs\n",
- "urls = [\n",
- "    \"https://hoxhunt.com/\",\n",
- "    \"https://hs.fi\",\n",
- "    \"https://ts.fi\",\n",
- "    \"https://facebook.com\"\n",
- "]\n",
- "\n",
- "# Some web servers may block our request unless we set a widely used, well-known user agent string\n",
- "request_headers = {\n",
- "    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'\n",
- "}\n",
- "\n",
- "# Date format for domain timestamps\n",
- "dateformat = \"%Y/%m/%d\"\n",
- "\n",
- "# Some web servers do not tolerate rapid consecutive requests\n",
- "# Sleep time between requests, in seconds\n",
- "sleep_interval_between_requests = 0.5\n",
- "\n",
- "# Write JSON results to a file?\n",
- "use_file = True\n",
- "# Full file path + name\n",
- "filename = os.path.join(os.getcwd(), \"url_info.json\")\n",
- "\n",
- "# Generate plot from existing JSON data?\n",
- "plot_only = False\n",
- "\n",
- "# Save generated plot images?\n",
- "save_plot_images = True\n",
- "\n",
- "# DPI of plot images\n",
- "plot_images_dpi = 150\n",
- "\n",
- "# Common link attribute references in various HTML elements\n",
- "link_refs = {\n",
- "    'a': 'href',\n",
- "    'img': 'src',\n",
- "    'script': 'src'\n",
- "}\n",
- "\n",
- "############################################################################\n",
- "############################################################################\n",
- "\n",
- "class json_url_data(object):\n",
- "\n",
- "# def __init__(self):\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Set a new HTTP session and get response.\n",
- "\n",
- "    Returns a requests.models.Response object.\n",
- "    \"\"\"\n",
- "    def set_session(self, url, method='get', redirects=True):\n",
- "\n",
- "        # HTTP response status codes 1XX, 2XX and 3XX are OK\n",
- "        # Treat other codes as errors\n",
- "        sc = re.compile(r\"^[123][0-9]{2}$\")\n",
- "\n",
- "        sleep(sleep_interval_between_requests)\n",
- "\n",
- "        try:\n",
- "            session = requests.Session()\n",
- "            response = session.request(method, url, headers=request_headers, allow_redirects=redirects)\n",
- "        except requests.exceptions.RequestException:\n",
- "            raise Exception(\"Error: HTTP session could not be established. URL: '\" + url + \"' (method: \" + method + \")\") from None\n",
- "\n",
- "        # Check the status outside the try block so that this error is not masked by the handler above\n",
- "        if not sc.match(str(response.status_code)):\n",
- "            raise Exception(\"Error: got invalid response status from the web server\")\n",
- "        return response\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Fetch HTML data.\n",
- "\n",
- "    Returns a bs4.BeautifulSoup object.\n",
- "    \"\"\"\n",
- "    def get_html_data(self, url):\n",
- "\n",
- "        try:\n",
- "            data = bs(self.set_session(url).content, 'html.parser')\n",
- "            return data\n",
- "        except Exception:\n",
- "            raise Exception(\"Error: HTML data could not be retrieved\")\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Get URL redirects and related HTTP status codes.\n",
- "\n",
- "    Returns a list object.\n",
- "    \"\"\"\n",
- "    def get_url_redirects(self, url):\n",
- "\n",
- "        response = self.set_session(url)\n",
- "        list_data = []\n",
- "\n",
- "        if response.history:\n",
- "\n",
- "            for r in response.history:\n",
- "                list_data.append({'redirect_url': r.url, 'status': r.status_code})\n",
- "\n",
- "        return list_data\n",
- "\n",
- "######################################\n",
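- "    \"\"\"\n",
- "    Extract title HTML element contents from the HTML data of a given URL.\n",
- "\n",
- "    Returns a string object (empty if the webpage has no title).\n",
- "    \"\"\"\n",
- "    def get_webpage_title(self, url):\n",
- "\n",
- "        html_data = self.get_html_data(url)\n",
- "\n",
- "        # Some webpages have no title element, or an empty one\n",
- "        if html_data.title is None or html_data.title.string is None:\n",
- "            return \"\"\n",
- "\n",
- "        title = html_data.title.string\n",
- "        return title\n",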
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Get WHOIS domain data.\n",
- "\n",
- "    Returns a dict object.\n",
- "    \"\"\"\n",
- "    def get_whois_data(self, url):\n",
- "        dict_data = whois(url)\n",
- "        return dict_data\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Get domain name based on WHOIS domain data.\n",
- "\n",
- "    Returns a string object.\n",
- "    \"\"\"\n",
- "    def get_domain_name(self, url):\n",
- "        domain_name = self.get_whois_data(url).domain_name\n",
- "\n",
- "        # WHOIS data may contain several domain names; use the first one\n",
- "        if isinstance(domain_name, list):\n",
- "            return domain_name[0].lower()\n",
- "        else:\n",
- "            return domain_name.lower()\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Get initial and final URLs\n",
- "\n",
- "    Check whether the final (destination) URL of a request\n",
- "    contains the domain name of the initial URL.\n",
- "\n",
- "    Returns a dict object.\n",
- "    \"\"\"\n",
- "    def get_startfinal_urls(self, url):\n",
- "\n",
- "        response = self.set_session(url)\n",
- "        end_url = response.url\n",
- "\n",
- "        final_match = False\n",
- "\n",
- "        # dr = re.compile(r\"^([a-z]+://)?([^/]+)\")\n",
- "        # dr_group_lastindex = dr.match(url).lastindex\n",
- "        # domain_name = dr.match(url).group(dr_group_lastindex)\n",
- "\n",
- "        domain_name = self.get_domain_name(url)\n",
- "\n",
- "        # Escape the domain name so that its dots are matched literally\n",
- "        if re.search(re.escape(domain_name), end_url, re.IGNORECASE):\n",
- "            final_match = True\n",
- "\n",
- "        dict_data = {\n",
- "            'startfinal_urls': {\n",
- "                'start_url': {\n",
- "                    'url': url\n",
- "                },\n",
- "                'final_url': {\n",
- "                    'url': end_url, 'domain_match': final_match\n",
- "                }\n",
- "            }\n",
- "        }\n",
- "\n",
- "        return dict_data\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Get domain registrar\n",
- "\n",
- "    Returns a dict object.\n",
- "    \"\"\"\n",
- "    def get_domain_registrar(self, url):\n",
- "        dict_data = {'domain_registrar': self.get_whois_data(url).registrar }\n",
- "        return dict_data\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Compare the domain name, extracted from WHOIS domain\n",
- "    data, with the contents of the title HTML element,\n",
- "    extracted from the HTML data of a given URL.\n",
- "\n",
- "    Returns a dict object.\n",
- "    \"\"\"\n",
- "    def get_domain_title_match(self, url):\n",
- "\n",
- "        domain_name = self.get_domain_name(url)\n",
- "        title = self.get_webpage_title(url)\n",
- "\n",
- "        # Domain names are escaped so that their dots are matched literally\n",
- "        # If is string:\n",
- "        if isinstance(domain_name, str):\n",
- "            if re.search(re.escape(domain_name), title, re.IGNORECASE):\n",
- "                match = True\n",
- "            else:\n",
- "                match = False\n",
- "\n",
- "        # If is list:\n",
- "        elif isinstance(domain_name, list):\n",
- "            for d in domain_name:\n",
- "                if re.search(re.escape(d), title, re.IGNORECASE):\n",
- "                    match = True\n",
- "                    break\n",
- "            else:\n",
- "                # No domain name in the list matched the title\n",
- "                match = False\n",
- "        else:\n",
- "            match = False\n",
- "\n",
- "        dict_data = {\n",
- "            'webpage_title': title,\n",
- "            'domain_in_webpage_title': match\n",
- "        }\n",
- "\n",
- "        return dict_data\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Get a single timestamp from given data\n",
- "\n",
- "    Two scenarios are considered: dates argument is either\n",
- "    a list or a single datetime object. If it is a list, then\n",
- "    we need to decide which date value to extract: the oldest\n",
- "    one by default, or the newest one if newest=True.\n",
- "\n",
- "    Returns a datetime object.\n",
- "    \"\"\"\n",
- "    def get_single_date(self, dates, newest=False):\n",
- "\n",
- "        dates_epoch = []\n",
- "\n",
- "        if isinstance(dates, list):\n",
- "            for d in dates:\n",
- "                dates_epoch.append(d.timestamp())\n",
- "        else:\n",
- "            dates_epoch.append(dates.timestamp())\n",
- "\n",
- "        return datetime.fromtimestamp(sorted(dates_epoch, reverse=newest)[0])\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Get domain time information based on WHOIS domain data.\n",
- "\n",
- "    Returns a dict object.\n",
- "    \"\"\"\n",
- "    def get_domain_timeinfo(self, url):\n",
- "\n",
- "        whois_data = self.get_whois_data(url)\n",
- "        domain_creation_date = self.get_single_date(whois_data.creation_date, newest = False)\n",
- "        domain_updated_date = self.get_single_date(whois_data.updated_date, newest = False)\n",
- "        domain_expiration_date = self.get_single_date(whois_data.expiration_date, newest = False)\n",
- "\n",
- "        dict_data = {\n",
- "            'domain_timestamps':\n",
- "            {\n",
- "                'created': domain_creation_date.strftime(dateformat),\n",
- "                'updated': domain_updated_date.strftime(dateformat),\n",
- "                'expires': domain_expiration_date.strftime(dateformat)\n",
- "            }\n",
- "        }\n",
- "\n",
- "        return dict_data\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Get domain time information based on WHOIS domain data,\n",
- "    relative to the current date (UTC time).\n",
- "\n",
- "    Returns a dict object.\n",
- "    \"\"\"\n",
- "    def get_domain_timeinfo_relative(self, url):\n",
- "\n",
- "        date_now = datetime.utcnow()\n",
- "\n",
- "        whois_data = self.get_whois_data(url)\n",
- "        domain_creation_date = self.get_single_date(whois_data.creation_date, newest = False)\n",
- "        domain_updated_date = self.get_single_date(whois_data.updated_date, newest = False)\n",
- "        domain_expiration_date = self.get_single_date(whois_data.expiration_date, newest = False)\n",
- "\n",
- "        dict_data = {\n",
- "            'domain_timestamps_relative':\n",
- "            {\n",
- "                'current_date': (date_now.strftime(dateformat)),\n",
- "                'created_days_ago': (date_now - domain_creation_date).days,\n",
- "                'updated_days_ago': (date_now - domain_updated_date).days,\n",
- "                'expires_days_left': (domain_expiration_date - date_now).days\n",
- "            }\n",
- "        }\n",
- "\n",
- "        return dict_data\n",
- "\n",
- "######################################\n",
- "    \"\"\"\n",
- "    Determine whether URL matches syntaxes such as\n",
- "    '../foo/bar/',\n",
- "    '/foo/../../bar/',\n",
- "    'https://foo.bar/foo/../'\n",
- "\n",
- "    etc.\n",
- "\n",
- "    Returns a boolean object.\n",
- "    \"\"\"\n",
- "    def is_multidot_url(self, url):\n",
- "\n",
- "        multidot = re.compile(r\".*[.]{2}/.*\")\n",
- "\n",
- "        if multidot.match(url):\n",
- "            return True\n",
- "        return False\n",
- "\n",
- "######################################\n",
- " \"\"\"\n",
- " Get HTML element data from HTML data contents.\n",
- " \n",
- " Two fetching methods are supported:\n",
- " - A) use only HTML element/tag name and extract raw contents of\n",
- " these tags\n",
- " - B) use both HTML element/tag name and more fine-grained\n",
- " inner attribute name to determine which HTML elements are extracted\n",
- " \n",
- " Special case - URL link references:\n",
- " - attributes 'href' or 'src' are considered as link referrals and \n",
- " they are handled in a special way\n",
- " - A) link referrals to directly to domain are placed in 'self_refs' list\n",
- " (patterns: '/', '#', '../' and '/<anything>')\n",
- " - B) link referrals to external domains are placed in 'ext_refs' list\n",
- " (patterns such as 'https://foo.bar.dot/fancysite' etc.)\n",
- " \n",
- " - Both A) and B) link categories have 'normal' and 'multidot' subcategories\n",
- " - normal links do not contain pattern '../'\n",
- " - multidot links contain '../' pattern\n",
- " \n",
- " Returns a dict object.\n",
- " \"\"\"\n",
- " \n",
- " def get_tag_data(self, url, tag, attribute=None):\n",
- " \n",
- " html_data = self.get_html_data(url)\n",
- " domain_name = self.get_domain_name(url)\n",
- " data = []\n",
- " \n",
- " if attribute != None:\n",
- " \n",
- " for d in html_data.find_all(tag):\n",
- " \n",
- " # Ignore the HTML tag if it does not contain our attribute\n",
- " if d.get(attribute) != None:\n",
- " data.append(d.get(attribute))\n",
- " \n",
- " if attribute == 'href' or attribute == 'src':\n",
- " \n",
- " self_refs = { 'normal': [], 'multidot': []}\n",
- " ext_refs = { 'normal': [], 'multidot': []}\n",
- " \n",
- " # Syntax: '#<anything>', '/<anything>', '../<anything>'\n",
- " rs = re.compile(r\"^[/#]|^[.]{2}/.*\")\n",
- " \n",
- " # Syntax: '<text>:<text>/'\n",
- " rd = re.compile(r\"^[a-z]+:[a-z]+/\")\n",
- " \n",
- " # Syntax examples:\n",
- " # 'http://foo.bar/', 'https://foo.bar/, 'foo.bar/', 'https://virus.foo.bar/'\n",
- " rl = re.compile(r\"^([a-z]+://)?([^/]*\" + domain_name + \"/)\")\n",
- " \n",
- " for s in data:\n",
- " \n",
- " # Ignore mailto links\n",
- " if re.match(\"^mailto:\", s): continue\n",
- " \n",
- " if rs.match(s) or rl.match(s) or rd.match(s):\n",
- " if self.is_multidot_url(s):\n",
- " self_refs['multidot'].append(s)\n",
- " else:\n",
- " self_refs['normal'].append(s)\n",
- " else:\n",
- " \n",
- " if self.is_multidot_url(s):\n",
- " try:\n",
- " ext_refs['multidot'].append({'url': s, 'registrar': self.get_whois_data(s).registrar })\n",
- " except:\n",
- " # Fallback if WHOIS query fails\n",
- " ext_refs['multidot'].append({'url': s, 'registrar': None })\n",
- " pass\n",
- " else:\n",
- " try:\n",
- " ext_refs['normal'].append({'url': s, 'registrar': self.get_whois_data(s).registrar })\n",
- " except:\n",
- " ext_refs['normal'].append({'url': s, 'registrar': None })\n",
- " pass\n",
- " \n",
- " data = None\n",
- " \n",
- " dict_data = {\n",
- " tag: {\n",
- " attribute + '_ext': (ext_refs),\n",
- " attribute + '_self': (self_refs)\n",
- " }\n",
- " }\n",
- " \n",
- " else:\n",
- " dict_data = {\n",
- " tag: {\n",
- " attribute: (data)\n",
- " }\n",
- " }\n",
- " \n",
- " else:\n",
- " for d in html_data.find_all(tag):\n",
- " data.append(d.prettify())\n",
- " \n",
- " dict_data = {\n",
- " tag: (data)\n",
- " }\n",
- " \n",
- " return dict_data\n",
- "\n",
- "######################################\n",
- " \"\"\"\n",
- " How many external URL links have the same registrar as\n",
- " the webpage itself?\n",
- " \"\"\"\n",
- " def get_registrar_count(self, registrar, urls):\n",
- " \n",
- " i = 0\n",
- " \n",
- " for u in urls:\n",
- " for k,v in u.items():\n",
- " if k == 'registrar' and v == registrar:\n",
- " i += 1\n",
- " \n",
- " o = len(urls) - i\n",
- " \n",
- " dict_data = {\n",
- " 'same_registrar_count': i,\n",
- " 'other_registrar_count': o\n",
- " }\n",
- " \n",
- " return dict_data\n",
- "\n",
- "######################################\n",
- "\n",
- " \"\"\"\n",
- " Get values existing in a dict object,\n",
- " based on a known key string.\n",
- " \n",
- " Returns a list object.\n",
- " \n",
- " TODO: Major re-work for the fetch function\n",
- "\n",
- " TODO: Support for more sophisticated JSON key string filtering\n",
- " (possibility to use multiple keys for filtering)\n",
- " \"\"\"\n",
- " class json_fetcher(object):\n",
- "\n",
- " def __init__(self, dict_data, json_key):\n",
- " self.json_dict = json.loads(json.dumps(dict_data))\n",
- " self.json_key = json_key\n",
- "\n",
- " ##########\n",
- " # Ref: https://www.codespeedy.com/how-to-loop-through-json-with-subkeys-in-python/\n",
- " def fetch(self, jdata):\n",
- "\n",
- " if isinstance(jdata, dict):\n",
- "\n",
- " for k,v in jdata.items():\n",
- " if k == self.json_key:\n",
- " yield v\n",
- " elif isinstance(v, dict):\n",
- " for val in self.fetch(v):\n",
- " yield val\n",
- " elif isinstance(v, list):\n",
- " for l in v:\n",
- " if isinstance(l, dict):\n",
- " for ka,va in l.items():\n",
- " if ka == self.json_key:\n",
- " yield va\n",
- "\n",
- " elif isinstance(jdata, list):\n",
- " for l in jdata:\n",
- " if isinstance(l, dict):\n",
- " for k,v in l.items():\n",
- " if k == self.json_key:\n",
- " yield v\n",
- " elif isinstance(l, list):\n",
- " for lb in l:\n",
- " for ka,va in lb.items():\n",
- " if ka == self.json_key:\n",
- " yield va\n",
- "\n",
- " ##########\n",
- " def get_data(self, flatten=True):\n",
- "\n",
- " data_extract = []\n",
- " flat_data = []\n",
- "\n",
- " for i in self.fetch(self.json_dict):\n",
- " data_extract.append(i)\n",
- "\n",
- " # Flatten possible nested lists\n",
- " # (i.e. JSON data contains multiple keys in\n",
- " # different nested sections)\n",
- " def get_data_extract(ld):\n",
- " for l in ld:\n",
- " if isinstance(l, list):\n",
- " for la in get_data_extract(l):\n",
- " yield la\n",
- " else:\n",
- " yield l\n",
- "\n",
- " if flatten == True:\n",
- " for u in get_data_extract(data_extract):\n",
- " flat_data.append(u)\n",
- " \n",
- " return flat_data\n",
- " else:\n",
- " return data_extract\n",
- "\n",
- "######################################\n",
- " \"\"\"\n",
- " Compile URL related data.\n",
- " \"\"\"\n",
- " def get_url_data(self, url):\n",
- " \n",
- " # Dict object for simple, non-nested data\n",
- " data_simple = {}\n",
- "\n",
- " # Pre-defined dict object for specific data sets\n",
- " webpage_data = {}\n",
- " \n",
- " startfinal_url = self.get_startfinal_urls(url)\n",
- " redirect_url = self.get_url_redirects(url)\n",
- " domain_registrar = self.get_domain_registrar(url)\n",
- " domaintitle_match = self.get_domain_title_match(url)\n",
- " \n",
- " domain_time_relative = self.get_domain_timeinfo_relative(url)\n",
- " domain_time = self.get_domain_timeinfo(url)\n",
- " \n",
- " html_element_iframe = self.get_tag_data(url, 'iframe')\n",
- " html_element_a_href = self.get_tag_data(url, 'a', link_refs['a'])\n",
- " html_element_img_src = self.get_tag_data(url, 'img', link_refs['img'])\n",
- " html_element_script_src = self.get_tag_data(url, 'script', link_refs['script'])\n",
- "\n",
- " iframes_count = {\n",
- " 'iframes_count':\n",
- " len(self.json_fetcher(html_element_iframe, 'iframe').get_data())\n",
- " }\n",
- " \n",
- " multidot_urls_count = {\n",
- " 'multidot_url_count':\n",
- " len(self.json_fetcher(html_element_a_href, 'multidot').get_data()) + len(self.json_fetcher(html_element_img_src, 'multidot').get_data()) + len(self.json_fetcher(html_element_script_src, 'multidot').get_data())\n",
- " }\n",
- " \n",
- " ###################\n",
- " def get_total_registrars():\n",
- "\n",
- " same_registrar_counts = 0\n",
- " other_registrar_counts = 0\n",
- " for k,v in link_refs.items():\n",
- " \n",
- " html_element = self.get_tag_data(url, k, v)\n",
- " \n",
- " same_registrar_counts += self.get_registrar_count(\n",
- " domain_registrar['domain_registrar'],\n",
- " html_element[k][v + '_ext']['normal']\n",
- " )['same_registrar_count']\n",
- " \n",
- " other_registrar_counts += self.get_registrar_count(\n",
- " domain_registrar['domain_registrar'],\n",
- " html_element[k][v + '_ext']['normal']\n",
- " )['other_registrar_count']\n",
- " \n",
- " registrar_counts = {\n",
- " 'same_registrar_count': same_registrar_counts,\n",
- " 'other_registrar_count': other_registrar_counts\n",
- " }\n",
- " return registrar_counts\n",
- " \n",
- " # Avoid unnecessary nesting of the following data\n",
- " data_simple.update(domain_registrar)\n",
- " data_simple.update(domaintitle_match)\n",
- " data_simple.update(iframes_count)\n",
- " data_simple.update(multidot_urls_count)\n",
- " data_simple.update(get_total_registrars())\n",
- " \n",
- " url_data = dict({\n",
- " url: [\n",
- " data_simple,\n",
- " startfinal_url,\n",
- " {'redirects': redirect_url},\n",
- " \n",
- " domain_time_relative,\n",
- " domain_time,\n",
- " \n",
- " {'webpage_data': [\n",
- " html_element_iframe,\n",
- " html_element_a_href,\n",
- " html_element_img_src,\n",
- " html_element_script_src\n",
- " ]\n",
- " }\n",
- " ]\n",
- " })\n",
- " \n",
- " return url_data\n",
- "\n",
- "\n",
- "\n",
- "class write_operations(object):\n",
- "\n",
- " def __init__(self):\n",
- " self.filename = filename\n",
- "\n",
- "######################################\n",
- " \"\"\"\n",
- " Set JSON file name, append number suffix\n",
- " if the file already exists.\n",
- " \n",
- " Returns file name path.\n",
- " \"\"\"\n",
- " def set_filename(self):\n",
- " \n",
- " c = 0\n",
- " while True:\n",
- " if os.path.exists(self.filename):\n",
- " if c == 0:\n",
- " self.filename = self.filename + \".\" + str(c)\n",
- " else:\n",
- " self.filename = re.sub(\"[0-9]+$\", str(c), self.filename)\n",
- " else:\n",
- " break\n",
- " c += 1\n",
- " return self.filename\n",
- "\n",
- "######################################\n",
- " \"\"\"\n",
- " Append to a JSON file.\n",
- " \"\"\"\n",
- " def write_to_file(self, data):\n",
- " \n",
- " try:\n",
- " with open(self.filename, \"a\") as json_file:\n",
- " json_file.write(data)\n",
- " return 0\n",
- " except:\n",
- " return 1\n",
- "\n",
- "######################################\n",
- " \"\"\"\n",
- " Fetch all pre-defined URLs.\n",
- " \"\"\"\n",
- " def fetch_and_store_url_data(self, urls, use_file):\n",
- "\n",
- " data_parts = {}\n",
- " fetch_json_data = json_url_data()\n",
- "\n",
- " for u in urls:\n",
- " print(\"URL data: %s\" % u)\n",
- " try:\n",
- " data_parts.update(fetch_json_data.get_url_data(u))\n",
- " except:\n",
- " print(\"Failed: %s\" % u)\n",
- " pass\n",
- "\n",
- " json_data = json.dumps(data_parts)\n",
- "\n",
- " if use_file == True:\n",
- " self.write_to_file(json_data)\n",
- "\n",
- " return json_data\n",
- "\n",
- "######################################\n",
- "\"\"\"\n",
- "Visualize & summarize data.\n",
- "\"\"\"\n",
- "\n",
- "class data_visualization(object):\n",
- "\n",
- " def __init__(self, url, json_data):\n",
- " self.url = url\n",
- " self.json_data = json_data\n",
- "\n",
- " self.data = json.loads(json.dumps(self.json_data)).get(self.url)\n",
- " self.json_url_obj = json_url_data()\n",
- " self.domain_registrar = self.json_url_obj.get_domain_registrar(self.url)['domain_registrar']\n",
- " self.webpage_data = self.json_url_obj.json_fetcher(self.data, 'webpage_data').get_data()\n",
- "\n",
- " def get_urls_count_summary(self):\n",
- "\n",
- " unique_refs = []\n",
- "\n",
- " for k,v in link_refs.items():\n",
- " if v in unique_refs: continue\n",
- " unique_refs.append(v)\n",
- "\n",
- " def link_count(refs, suffix):\n",
- "\n",
- " urls_cnt = 0\n",
- "\n",
- " for u in self.webpage_data:\n",
- " for l in refs:\n",
- " urls = self.json_url_obj.json_fetcher(u, l + suffix).get_data()\n",
- " for n in urls:\n",
- " urls_cnt += len(n['normal'])\n",
- " urls_cnt += len(n['multidot'])\n",
- " return urls_cnt\n",
- "\n",
- " data = {\n",
- " 'local_urls': link_count(unique_refs, '_self'),\n",
- " 'external_urls': link_count(unique_refs, '_ext')\n",
- " }\n",
- " \n",
- " return data\n",
- "\n",
- " def get_registrars(self):\n",
- "\n",
- " registrars = []\n",
- " #registrars.append(self.domain_registrar)\n",
- "\n",
- " for w in self.webpage_data:\n",
- " webpage_registrars = self.json_url_obj.json_fetcher(w, 'registrar').get_data()\n",
- " for wa in webpage_registrars:\n",
- " if wa != None:\n",
- " registrars.append(wa)\n",
- " return registrars\n",
- "\n",
- " def get_registrar_count_summary(self):\n",
- " \n",
- " domain_counter = dict(Counter(self.get_registrars()))\n",
- " data = {'fetched_domains': domain_counter, 'url_domain_registrar': self.domain_registrar }\n",
- " return data\n",
- "\n",
- "######################################\n",
- "\"\"\"\n",
- "Execute the main program code.\n",
- "\n",
- "TODO: this code must figure out the correct JSON file\n",
- "if multiple generated files are present.\n",
- "\"\"\"\n",
- "if __name__ == '__main__':\n",
- "\n",
- " if plot_only == False:\n",
- " write_obj = write_operations()\n",
- " write_obj.set_filename()\n",
- " data = write_obj.fetch_and_store_url_data(urls, use_file)\n",
- "\n",
- " url_str_pattern = re.compile(r\"(^[a-z]+://)?([^/]*)\")\n",
- "\n",
- " if os.path.exists(filename):\n",
- " with open(filename, \"r\") as json_file:\n",
- " json_data = json.load(json_file)\n",
- " else:\n",
- " # fetch_and_store_url_data() returns a JSON string, so decode it first\n",
- " json_data = json.loads(data)\n",
- "\n",
- " # Get URLs from an available JSON data\n",
- " for key_url in json_data.keys():\n",
- " \n",
- " print(\"Generate statistics: %s\" % key_url)\n",
- "\n",
- " fig = plt.figure()\n",
- " fig_params = {\n",
- " 'xtick.labelsize': 8,\n",
- " 'figure.figsize': [9,8]\n",
- " # 'figure.constrained_layout.use': True\n",
- " }\n",
- " plt.rcParams.update(fig_params)\n",
- " \n",
- " domain_string = url_str_pattern.split(key_url)[2].replace('.','')\n",
- " summary = data_visualization(key_url, json_data)\n",
- " \n",
- " summary_registrars = summary.get_registrar_count_summary()['fetched_domains']\n",
- "\n",
- " x_r = list(summary_registrars.keys())\n",
- " y_r = list(summary_registrars.values())\n",
- " \n",
- " # Show bar values\n",
- " for index,data in enumerate(y_r):\n",
- " plt.text(x=index, y=data+0.5, s=data, fontdict=dict(fontsize=8))\n",
- " \n",
- " title_r = \"Domains associated with HTML URL data (\" + key_url + \")\"\n",
- " xlabel_r = \"Fetched domains\"\n",
- " ylabel_r = \"Domain count\"\n",
- "\n",
- " plt.bar(x_r, y_r, color=\"green\", edgecolor=\"black\")\n",
- " plt.title(title_r)\n",
- " plt.xlabel(xlabel_r)\n",
- " plt.ylabel(ylabel_r)\n",
- " plt.xticks(rotation=45, horizontalalignment=\"right\")\n",
- "\n",
- " if save_plot_images == True:\n",
- " plt.savefig(os.getcwd() + \"/\" + \"domain_figure_\" + domain_string + \".png\", dpi=plot_images_dpi)\n",
- " plt.show()\n",
- "\n",
- " #fig_u = plt.figure()\n",
- " \n",
- " #summary_urls = summary.get_urls_count_summary()\n",
- " \n",
- " #x_u = list(summary_urls.keys())\n",
- " #y_u = list(summary_urls.values())\n",
- " #title_u = \"Local and external URL references (\" + key_url + \")\"\n",
- " #xlabel_u = \"Fetched URLs\"\n",
- " #ylabel_u = \"URL count\"\n",
- " \n",
- " #plt.bar(x_u, y_u, color=\"blue\", edgecolor='black')\n",
- " #plt.title(title_u)\n",
- " #plt.xlabel(xlabel_u)\n",
- " #plt.ylabel(ylabel_u)\n",
- " #plt.show()\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Analysis\n",
- "\n",
- "| Website | Analysis | Top registrars |\n",
- "|--------------|--------------------------------------------------------------------------------|----------------------------------------|\n",
- "| HoxHunt | Great variation of different registrars | `MarkMonitor Inc.`, `CloudFlare Inc.` |\n",
- "| HS.fi | Average variation of different registrars, relies mostly on its own registrar | `Sanoma` |\n",
- "| TS.fi | Great variation of different registrars, uses mostly its own registrar | `TS-Yhtymä Oy` |\n",
- "| Facebook | Very low variation of different registrars, relies on a single registrar | `RegistrarSafe LLC` |"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.5"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
- }
|