RError.com

Accepted
zrx
Asked: 2023-07-16 21:53:25 +0000 UTC

Parsing a website and translating its text

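I have the following task:
I need to parse a website and translate the resulting text into Russian while preserving the structure of the text.

Here is the HTML that needs to be parsed: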

<div>
   <style data-emotion="css gy2lsd">
      .css-gy2lsd{
         margin:0;
         font-family:'DM Sans',sans-serif;
         font-weight:400;
         font-size:1rem;
         line-height:1.5;
         margin:15px 0;
      }
   </style>
   <p class="MuiTypography-root MuiTypography-body1 css-gy2lsd">
      <strong>Nexus </strong>is the pioneering AI-powered navigator <em>designed to assist users</em> in effectively navigating their entire network.
   </p>
   <p class="MuiTypography-root MuiTypography-body1 css-gy2lsd">
      <strong>Key Features:</strong>
   </p>
   <style data-emotion="css 196imzh">
      .css-196imzh{
         list-style:none;
         margin:0;
         padding:0;
         position:relative;
         padding-top:8px;
         padding-bottom:8px;
         list-style-type:disc;
         padding-left:16px;
      }
      
      .css-196imzh .MuiListItem-root{
         display:-webkit-box;
         display:-webkit-list-item;
         display:-ms-list-itembox;
         display:list-item;
         }
      </style>
   <ul class="MuiList-root MuiList-padding css-196imzh">
      <li class="MuiListItem-root MuiListItem-gutters MuiListItem-padding css-1uxc9es">
         <style data-emotion="css 1tsvksn">.css-1tsvksn{-webkit-flex:1 1 auto;-ms-flex:1 1 auto;flex:1 1 auto;min-width:0;margin-top:4px;margin-bottom:4px;}</style>
         <div class="MuiListItemText-root css-1tsvksn">
            <style data-emotion="css iwjgli">.css-iwjgli{margin:0;font-family:'DM Sans',sans-serif;font-weight:400;font-size:1rem;line-height:1.5;display:block;}</style>
            <span class="MuiTypography-root MuiTypography-body1 MuiListItemText-primary css-iwjgli"><strong>AI-powered network navigation:</strong> Leverage cutting-edge AI technology to effectively navigate your entire network.</span>
         </div>
      </li>
      <li class="MuiListItem-root MuiListItem-gutters MuiListItem-padding css-1uxc9es">
         <div class="MuiListItemText-root css-1tsvksn"><span class="MuiTypography-root MuiTypography-body1 MuiListItemText-primary css-iwjgli"><strong>Comprehensive relationship context:</strong> Access contextual knowledge about your connections to facilitate smoother interactions.</span></div>
      </li>
      <li class="MuiListItem-root MuiListItem-gutters MuiListItem-padding css-1uxc9es">
         <div class="MuiListItemText-root css-1tsvksn"><span class="MuiTypography-root MuiTypography-body1 MuiListItemText-primary css-iwjgli"><strong>Assistance with various networking tasks: </strong>Seek help from Nexus for reasons to reconnect, outreach email ideas, and personalized gift recommendations.</span></div>
      </li>
   </ul>
   <p class="MuiTypography-root MuiTypography-body1 css-gy2lsd"><strong>Use Cases:</strong></p>
   <p class="MuiTypography-root MuiTypography-body1 css-gy2lsd"><strong>• Strengthening client relationships:</strong> Deepen your connections with key clients using Nexus's insights and guidance.</p>
   <p class="MuiTypography-root MuiTypography-body1 css-gy2lsd"><strong>• Managing stakeholder relationships:</strong> Effectively handle multiple stakeholders with Nexus's support and recommendations.</p>
   <p class="MuiTypography-root MuiTypography-body1 css-gy2lsd"><strong>• Enhancing networking efficiency: </strong>Save time and improve networking outcomes by leveraging Nexus's comprehensive understanding of your network.</p>
   <p class="MuiTypography-root MuiTypography-body1 css-gy2lsd">Experience the future of networking today with Nexus, the first AI navigator for your entire network.</p>
</div>

Parsing is not the problem; the translation is. main.py:

import requests
from bs4 import BeautifulSoup
from googletrans import Translator

def get_soup(url):
    # Request the page
    q = requests.get(url)
    result = q.content
    # Build the BeautifulSoup object
    soup = BeautifulSoup(result, 'lxml')

    return soup

def translate_string(string) -> str:
    translator = Translator()
    translation = translator.translate(string, dest='ru')
    return translation.text

def main():
    soup = get_soup('https://123')
    response = soup.select('div.MuiBox-root.css-0 div')
    text = ''
    # Translates the full markup of each child, tags included
    for i in response[0]:
        text += translate_string(str(i))

I need to make sure that the tags are not translated when a line is translated. Right now the output looks like this:

<данные стилей -emotion="css 196imzh">.css-196imzh{стиль-списка:нет;маржа:0;отступ:0;позиция:относительная;отступ-сверху:8px;отступ-снизу:8px;тип-стиля-списка:диск ;padding-left:16px;}.css-196imzh .MuiListItem-root{display:-webkit-box;display:-webkit-list-item;display:-ms-list-itembox;display:list-item;}< /style>

In other words, only the text should be translated, not the tags. I tried to implement it like this, but the attempt failed:

    for i in response[0]:
        line = str(i)
        text_in_line = i.get_text()
        translated = translate_string(text_in_line)
        line = line.replace(text_in_line, translated)
        text += line

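Because text_in_line is only the plain text, with tags such as <strong> stripped out, it never appears verbatim in line, so replace changes nothing.
Can you tell me how to implement this?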
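
The mismatch can be reproduced without any translation step. Below is a minimal sketch that uses the first paragraph of the HTML above with its class attribute dropped; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# First paragraph of the question's HTML, class attribute omitted for brevity.
html = (
    "<p><strong>Nexus </strong>is the pioneering AI-powered navigator "
    "<em>designed to assist users</em> in effectively navigating their entire network.</p>"
)

p = BeautifulSoup(html, "html.parser").p

markup = str(p)              # serialized element, tags included
text_in_line = p.get_text()  # concatenated text, tags stripped

# The plain text is not a substring of the markup, because the
# <strong> and <em> tags sit between its pieces.
found = text_in_line in markup

# str.replace therefore hands the markup back unchanged.
unchanged = markup.replace(text_in_line, "TRANSLATED") == markup

print(found, unchanged)
```

Here found is False and unchanged is True, which is why the loop in the question appends the untranslated markup.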

python

1 Answer

  1. Best Answer
    gord1402
    2023-07-17T00:11:21Z

    You can go through every element that contains text and translate each one:

    from bs4 import BeautifulSoup, Comment
    from googletrans import Translator


    # https://stackoverflow.com/questions/1936466/how-to-scrape-only-visible-webpage-text-with-beautifulsoup
    def tag_visible(element):
        # Only bare strings qualify; tags have a name
        if element.name is not None:
            return False
        # Skip text the browser does not render (CSS, scripts, metadata)
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        # Skip HTML comments
        if isinstance(element, Comment):
            return False
        return True


    with open('test.html', 'rb') as f:
        soup = BeautifulSoup(f.read(), "lxml")

    # Every visible text node in the document
    to_translate = [elem for elem in soup.find_all(string=True) if tag_visible(elem)]
    translator = Translator()

    for element in to_translate:
        print(str(element))
        result = translator.translate(str(element), dest="ru").text
        # Swap the string for its translation; the surrounding tags stay untouched
        element.replace_with(BeautifulSoup(result, "html.parser"))

    # prettify('utf-8') returns bytes, hence the binary mode
    with open('test_translated.html', 'wb') as f:
        f.write(soup.prettify('utf-8'))
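
One detail that may be worth checking in this approach is whitespace. A string such as "Nexus " inside <strong> ends with a space, and a translation service may drop it, which would glue the word to the one that follows. The sketch below is one possible variant, not the code from the answer. It takes the translator as a hypothetical `translate` callable, so it can be tried offline with a stand-in, and it re-attaches the leading and trailing whitespace of each string. The names `translate_text_nodes` and `SKIPPED_PARENTS` are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Parents whose text should never be sent to the translator.
SKIPPED_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def translate_text_nodes(html, translate):
    """Return html with every visible text node passed through translate().

    `translate` is any callable str -> str. With googletrans it could be
    lambda s: Translator().translate(s, dest="ru").text
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes inside the loop is safe.
    for node in soup.find_all(string=True):
        if isinstance(node, Comment) or node.parent.name in SKIPPED_PARENTS:
            continue
        core = node.strip()
        if not core:
            continue  # whitespace-only string, nothing to translate
        # Keep the whitespace that surrounded the original text.
        lead = node[: len(node) - len(node.lstrip())]
        trail = node[len(node.rstrip()):]
        node.replace_with(lead + translate(core) + trail)
    return str(soup)


# Offline stand-in for a real translator, for demonstration only.
sample = "<p><strong>Nexus </strong>is <em>new</em></p><style>.a{margin:0}</style>"
result = translate_text_nodes(sample, str.upper)
print(result)
```

With `str.upper` as the stand-in, the text inside <strong>, <p> and <em> is upper-cased and keeps its spacing, while the CSS inside <style> is left alone. Passing the translator in as an argument also makes it easy to swap translation services later.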
    

    For the UTF-8 output to display correctly, the page needs <meta charset="UTF-8">.
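
If the saved page has no charset declaration, it could be added with BeautifulSoup before writing the file. The sketch below is only one way to do it; the helper name `ensure_utf8_meta` is hypothetical, and it creates a <head> when the document lacks one.

```python
from bs4 import BeautifulSoup


def ensure_utf8_meta(soup):
    """Insert <meta charset="UTF-8"> as the first child of <head> if missing."""
    if soup.find("meta", attrs={"charset": True}):
        return soup  # a charset is already declared
    head = soup.head
    if head is None:
        # No <head> yet: create one at the start of <html>, or at the
        # start of the document when there is no <html> element either.
        head = soup.new_tag("head")
        (soup.html or soup).insert(0, head)
    head.insert(0, soup.new_tag("meta", charset="UTF-8"))
    return soup


page = BeautifulSoup("<html><body><p>Привет</p></body></html>", "html.parser")
page = ensure_utf8_meta(page)
print(page)
```

Calling the helper a second time leaves the document unchanged, because the existing meta tag is detected first.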
