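There is a site, https://www.houzz.ru/ideabooks, whose pagination is very odd. The page numbers are listed the same way as anywhere else... but they actually link to pages of the following form:
page 1 - https://www.houzz.ru/ideabooks
page 2 - https://www.houzz.ru/ideabooks/p/11
page 3 - https://www.houzz.ru/ideabooks/p/22
and so on. Please help: what should go in range()? Thanks in advance.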
Here is the code:
```python
import json
from time import sleep

import requests  # was missing from the original imports
from bs4 import BeautifulSoup

url = "https://www.houzz.ru/ideabooks/"
headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}
r = requests.get(url=url, headers=headers)
soup = BeautifulSoup(r.text, "lxml")
# print(soup)

data = []
for p in range(???):  # the open question: which arguments go here?
    print(p)
    url = f"https://www.houzz.ru/ideabooks/p/{p}"
    r = requests.get(url=url, headers=headers)
    sleep(5)  # pause between requests
    soup = BeautifulSoup(r.text, "lxml")
    all_name_links = soup.find_all(class_="gallery-text__title hz-track-me")
    for item in all_name_links:
        item_text = item.text
        item_href = item.get("href")
        data.append([item_text, item_href])

# Write the collected [title, href] pairs once, after the loop
with open("all_name_links.json", "w", encoding="utf-8") as file:
    json.dump(data, file, indent=4, ensure_ascii=False)
```
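Try using all three arguments of range: it takes (start, stop, step), and the last value produced is below stop. Set the step to the page stride. I first suggested 10, but the URLs in the question advance by 11 (/p/11, /p/22), so a step of 11 is what reproduces those offsets. Also allow for the final page being off by 1 when you pick stop. Hope this helps. 🙌🙌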
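
A minimal sketch of that range() idea, assuming the offsets keep growing by 11 as in the three URLs from the question. The names `page_urls` and `last_offset` are hypothetical, and the real last offset has to be read off the site's own pagination links. It only builds the URLs and makes no requests:

```python
def page_urls(base_url: str, last_offset: int, step: int = 11) -> list[str]:
    """Build listing URLs for offsets 0, step, 2*step, ... up to last_offset.

    Assumption: the first page has no /p/ suffix and every later page
    uses /p/<offset>, as in the URLs quoted in the question.
    """
    urls = []
    # range() excludes stop, so add 1 to include last_offset itself
    for offset in range(0, last_offset + 1, step):
        if offset == 0:
            urls.append(base_url)
        else:
            urls.append(f"{base_url}/p/{offset}")
    return urls


# Hypothetical example: the site's last page sits at offset 22
for page_url in page_urls("https://www.houzz.ru/ideabooks", last_offset=22):
    print(page_url)
```

In the scraping loop, iterating over these URLs would replace the `range(???)` line and the f-string that builds `url`. Adding 1 to stop is what covers the off-by-one on the final page.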