ParaCrawl @ParaCrawl
Joined December 2018-
Tweets51
-
Followers244
-
Following7
-
Likes36
HPLT News and Tools!!! If you are interested in filtering your datasets for quality and using them to train MT and LLMs, you are interested in this thread 👇
HPLT News and Tools!!! If you are interested in filtering your datasets for quality and using them to train MT and LLMs, you are interested in this thread 👇
Interested in Open and Community-Driven MT initiatives? CrowdMT is for you! 🎙️Invited speakers from Wikimedia Foundation and Apertium announced. 📜Accepted papers and abstracts announced. Time to register at events.tuni.fi/eamt23/registr… Details: hplt-project.org/events
A new ParaCrawl parallel corpus is available! 🌍 languages: Polish-Czech 🎒 size: 24 million sentences 🗒️ license: CC0 🎯 location: paracrawl.eu bonus section 🧐 more info: paracrawl.eu/moredata
Indeed, this is the first data release of the #Macocu effort. You will find both monolingual and bilingual (with English) corpora on ELRC-Share and CLARIN repositories and the website. Insights coming soon! Most of the code also ready for you to try it out!
Indeed, this is the first data release of the #Macocu effort. You will find both monolingual and bilingual (with English) corpora on ELRC-Share and CLARIN repositories and the website. Insights coming soon! Most of the code also ready for you to try it out!
Check out MultiParacawl 9, including 36 parallel corpora for Ukrainian and a total of 705 bitexts. Thanks OPUS and @TiedemannJoerg to share this great resource! paracrawl.eu/news/item/18-m…
@anas_ant If you have an MT system, try bleualign (github.com/bitextor/bleua…) from @ParaCrawl . Scales to ParaCrawl-sized data.
We're back with more language resources: English-Ukrainian parallel corpus with aprox. 13M sentence pairs has been released. More info and downloads: paracrawl.eu/news/item/17-e… Please, spread the word and use it!
Done! All #ParaCrawl v9 corpora are now available at paracrawl.eu, some also on Corset corset.paracrawl.eu to further inspect or filter them and a new Bitextor is also out github.com/bitextor! Thanks to #CEF and the EU for co-funding this great project!
Summer was for work! Now #ParaCrawl v9 corpora are done and again bigger than the previous ones!🤩 Extrinsic evaluation through MT almost finished and, according to old BLEU and new COMET, the quality of the MT output improves! 🥳 We will share corpora and more results soon!🕑
Very clear TODO from #ParaCrawl's last stakeholder board meeting: we need better language identification, specially for closely-related languages and for under-resourced ones. Such a basic thing! Trying here to improve current results mixing Fastext and Hunspell, take a look👇
Very clear TODO from #ParaCrawl's last stakeholder board meeting: we need better language identification, specially for closely-related languages and for under-resourced ones. Such a basic thing! Trying here to improve current results mixing Fastext and Hunspell, take a look👇
A new version of ParaCrawl is being cooked! We are aiming at not only more bilingual but also monolingual data. And we are applying neural cleaning this time with bicleaner-ai (github.com/bitextor/bicle…). Stay tuned!🧐
Milestone reached! We just published Corset, a data selection portal to get relevant data from massive amounts of parallel data such as ParaCrawl corpora. Thanks #CEFTelecom! Users welcome! Test it here: corset.paracrawl.eu Code & docs here: github.com/paracrawl/cors…
We almost forgot to tell you, ParaCrawl 8 is out! First highlight: wow the size of it! Check yourself at paracrawl.eu/releases #ParaCrawl #crawling #parallelcorpus #CEFTelecom #MT
Bitextor 8 is out! Many improvements and features that will make it into next @ParaCrawl data release, including ones from Snakemake 6 by @johanneskoester. Check all the changes: github.com/bitextor/bitex…
Our corpora were evaluated as part of the great effort at "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets" (🧐 arxiv.org/pdf/2103.12028…). We will keep our efforts in trying to deliver high-quality corpora out of web crawled content. ParaCrawl v8 about to come!
Our corpora were evaluated as part of the great effort at "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets" (🧐 arxiv.org/pdf/2103.12028…). We will keep our efforts in trying to deliver high-quality corpora out of web crawled content. ParaCrawl v8 about to come!
We are exploring neural document alignment and neural parallel data filtering in ParaCrawl. The massive amount of data to be processed might be an issue for neural approaches in our setting. Scale or fail...
Working in ParaCrawl V8, due 31st March 2021, which will include: ➕ parallel data, mainly from Internet Archive --> we are processing 1.1 petabyte of .warc files, about 41% of 77T of compressed text ➖ MACHINE TRANSLATION --> we are focusing on this issue Stay tuned!
Marcin Junczys-Dowmun.. @marian_nmt
2K Followers 396 Following NLP. NMT. Main author of Marian NMT. Research Scientist at Microsoft Translator. Non-NLP silliness and stuff on @emjotdeMikel Artetxe @artetxem
6K Followers 221 Following Co-founder @RekaAILabs and Honorary Researcher @IxaGroup (University of the Basque Country) | Past: Research Scientist @AIatMeta (FAIR)Huda Khayrallah @HudaKhay
1K Followers 911 Following Machine Translation/#NLProc/ML Researcher at Microsoft. Past: @UCBerkeley CS ugrad; @LiltHQ research intern; @jhuCLSP/@jhuCompSci PhDRudy Loock 🌻 @RudyLoock
2K Followers 890 Following University Professor, Linguistics/Translation Studies @univ_lille @STL_ULille @CNRS. @Master_TSM, @affumt_fr vice-president, EMTNet Also @ https://t.co/r1HtIZjexO🌨🌞 @w_uf9
93 Followers 553 FollowingTianqi Zhang @tzhangzh12
416 Followers 905 Following English, Spanish, Catalan to Mandarin Chinese Conference Interpreter | Ph.D in Translation and TerminologyAdam @ka2
10 Followers 502 FollowingAk Esved @mathaehuman
2 Followers 28 FollowingTobias Domhan @tdomhan
564 Followers 873 Following Machine Learning Scientist at Amazon Berlin. let's try this mastodon thing: @[email protected]kekeho @k3k3h0
3K Followers 4K Following Hiroki TAKEMURA / @KeioSFC B4 / {@WIDE_Project, RG} Delight / @IPAjp 未踏2022 スパクリ / ex-NIT,NC / アイコン: @wonderernmn / 興味: Decentralized System / Web3は関与しませんWaldemar Boszko @WaldemarBoszko
23 Followers 398 FollowingJ. C. Segovia @jcsegoviacastel
256 Followers 1K Following Gerente Tecnología Integrada TIC, Alianza APTA, Consultor TIC, padre, esposo y creyente.Micho @mihranmihran
124 Followers 747 Following My grandfather once told me that his dad spoke 11 languages - Thanks for starting a beef, grandpa! 🔥Aiham @aihkas
243 Followers 5K Following I fix bugs for breakfast; Software Engineer. Chess player. I have zero respect for societies with a homelessness problem. Private account.burcu ilkay karaman @aladut
401 Followers 4K Following Professor, Linguist, Translator, Professional Tour Guide, Forensic Expert, Profesör, Dilbilimci, Çeviribilimci, Profesyonel Tur Rehberi, Adli BilirkişiAda Wan @adawan919
157 Followers 1K Following Transdisciplinarian (stats, datasci, ml, lang/socSci, tech, art, science, philosophy). (Use-inspired) fundamental research.Opinions my own. Accidental activist.manzZzari @manzZzari
166 Followers 2K Following game + web dev, musician, artist. 99% retweet account: #gamedev / #illustration / #animation / #sounddesign / #languagelearninggamesYeb Havinga @YebHavinga
101 Followers 505 Following Trying to keep up with capabilities of large language models. For Dutch GPT and T5, check out https://t.co/ZOBPojaoZIBingo @fubinfri
2 Followers 24 Followingstvhuang @_stvhuang
0 Followers 610 FollowingAmulya Ratna Dash @amulya_r
31 Followers 671 Following Senior AI Engineer at @IQVIA_global, PhD scholar at @bitspilaniindiaGözde Büklüm @meyyalhanim
174 Followers 995 Following PhD candidate. Proud @UniBogazici alumni. Research areas: historical TS, Turkish language and literature, machine translation, translation technologies.Maurits van Wijland @mvwijland
129 Followers 317 Following proud dad of 2 extrodonary boys | pondering and constant learning as a techno-humanizer | into Tai JitsuPinzhen "Patrick" Che.. @pinzhen_chen
77 Followers 224 Following Working on LLMs and MT @EdinburghNLP @InfAtEdAntonio Valerio Micel.. @AVMiceliBarone
962 Followers 2K Following ML / NLP School of Informatics, The University of EdinburghYiyi Hu @Elaineable11
122 Followers 645 Following PhD candidate in Translation Studies at Fudan UniversityClarin.si @ClarinSlovenia
256 Followers 276 Following The Slovene national consortium of the European research infrastructure @CLARINERIC, providing language resources and technologies, expertise and knowledge.BaohaoLiao @baohao_liao
145 Followers 262 Following PhD for NLP @UvA_Amsterdam. Previously study @RWTH @sjtu1896渋谷系NLPer @enullper
1K Followers 2K Following Natural Language Processing / 対訳コーパス編纂業 / Speaks: 🇯🇵 🇧🇷 🇺🇸 🇪🇸Matthew Kenney @baykenney
3K Followers 3K Following Co-Founder @aspect_labs | Previously: Senior ML Engineer at @Apple, Asst. Research Prof at @DukeU Duke Data Science | @penn_state '15 | @Cornell '11Ian Bjelovar @IanBjelovar
41 Followers 698 FollowingLilian Bordeau @BordeauLilian
114 Followers 5K FollowingJosé G. C. de Souza @accezz
502 Followers 1K Following PhD, Computer Science, NLP, Machine Translation and Machine Learning.chitreddy sairam @chittiman
324 Followers 4K Following NLP Engineer working on Indic languages | Earlier taught Physics for IIT-JEE | Alumni @iitdelhi | Self-taught programmer | Fascinated by the field of LanguagesDeyi Xiong @DeyiXiong
16 Followers 264 FollowingVictor Sanh @SanhEstPasMoi
9K Followers 2K Following Dog sitter by day, Scientist at @huggingface 🤗 by nightLils @langsupade
41 Followers 5K FollowingOliver Blake @O_b1ake
258 Followers 764 Following Project Officer @LIBEREurope | Opinions are my own.Sneha Kudugunta @snehaark
2K Followers 747 Following addicted to tpus @GoogleDeepMind @uwcse | varying proportions of AI and mediocre jokes (not mutually exclusive) | she/her/hersLPX ☄️ @LPX_404
203 Followers 476 Following Currently: Governance @EvmosDAO --- Previously: @ShapeShift DAO, Matic (pre-Polygon), @ENSdomains, @HackerNoon, D3 Consortium, DAO Transparency IndexGiuseppe Deriard @gderiard
8 Followers 86 FollowingSunit Bhattacharya @official_sunit
153 Followers 969 Following PhD researcher at @ufal_cuni of @CharlesUniPRG. I dabble with multimodal translation in AI systems and a bit of neuro-cognitive-linguistics. He/Him.B1ff Jones @b1ffjones
7 Followers 89 FollowingFrank Facundo @telescientia
27 Followers 253 Following Científico de datos, informático y electrónico Télécom Paris - Sciences Po ParisMariano Rico @marianorico
581 Followers 300 Following Teacher & Researcher @oeg_upm, focused on Linked Data, Semantic Web and ML+NLP techs. Responsible for https://t.co/u3tXdIQnjr and creator of DylanQ & KeyQ.Nitika Mathur @probablyNitika
177 Followers 172 Followingcarmen 🌍🤖😷 @stedomedo
111 Followers 472 Following Dreamer. From Silicon Saxony to Silicon Valley to Isar Valley. Enjoys transitioning ML R&D concepts into production. ❤️ Languages, food, sports, 80's. She/HerMarcin Junczys-Dowmun.. @marian_nmt
2K Followers 396 Following NLP. NMT. Main author of Marian NMT. Research Scientist at Microsoft Translator. Non-NLP silliness and stuff on @emjotdeCINEA🇪🇺 @cinea_eu
36K Followers 2K Following @EU_Commission European Climate, Infrastructure & Environment Executive Agency #CINEA_EU #EUGreenDeal Data protection https://t.co/O3DcSWlGqLOmniscien @omniscientech
1K Followers 4K Following Omniscien Technologies is a leading supplier globally of high-performance, high-quality Machine Translation (MT) and Language Processing technologies.Barry Haddow @bazril
1K Followers 698 Following Researcher in Informatics at University of Edinburgh. Mainly working on machine translation.Prompsit @Prompsit
625 Followers 433 Following We speak Natural Language Processing, Data Analysis and Artificial Intelligence, among many other languages!Jed Sundwall @jedsundwall
3K Followers 5 Following Basic dad. @OurRadiantEarth @TechsOnTexts. “Normally, I’m against big things. I think the world’s going to be solved by millions of small things.” – Pete SeegerTAUS @T21Century
7K Followers 576 Following We generate, collect & annotate language training #data for #AI and #MachineLearning and offer #NLP services and resources. Follow us for the latest data news!Happy to announce our first HPLT model release!
First datasets, then models! Initial HPLT models (LLMs and MT) are out: hplt-project.org/models, some still running 🏃 We explain what we are doing in the deliverables section: hplt-project.org/deliverables Meanwhile, we keep cooking IA peta-data-bytes 🥘, enriching, dashboarding 📊
We just published version 1.2 of HPLT datasets. What's new? - we fixed a bug in monolingual dedup, please redownload! 🛠️ - we filtered out very ugly monolingual documents🤮 - we anonymised the bilingual datasets🕵️♀️ hplt-project.org/datasets/v1.2
HPLT News and Tools!!! If you are interested in filtering your datasets for quality and using them to train MT and LLMs, you are interested in this thread 👇
[1/6] After about 14 months of hard work, together with multiple people we present you with OpusTrainer and OpusCleaner! OpusCleaner is your one stop data fetching/preprocessing/cleaning pipeline, complete with GUI and designed to implicitlyvisualise your data before ...
We really enjoyed the @vardialworkshop at the EACL 2023. We presented two papers on #MaCoCu language variety tools: British-American English variety classifier, models for automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian, and more. More in🧵⬇️
I presented our new British-American English classifier and a genre classifier, and showed that they reveal big differences between the massive #MaCoCu parallel web corpora. All the datasets and classifiers are freely available! Find them in our paper: aclanthology.org/2023.vardial-1…
We really enjoyed the @vardialworkshop at the EACL 2023. We presented two papers on #MaCoCu language variety tools: British-American English variety classifier, models for automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian, and more. More in🧵⬇️
Visiting the @eaclmeeting #eacl2023 conference in Dubrovnik these days 🙂☀️ We will be presenting our work on South Slavic and English language variety identification from the #MaCoCu project at the #VarDial worshop. Hope to see you there on Friday!
After a great 2nd physical meeting of the #hplt project in Oslo❄️ we have some highlights to share: 🤩 OPUS includes now ELRC and mtdata 🧐 1PB of new web data is ready to be processed 🦾 Ready to work on LUMI (AMD) 😻 LLM/MT models completed for fi/no and cs-uk @DataEcoEU
Download for free this great book on Machine translation for everyone: lnkd.in/eP94jD4u
HPLT (aka Hippolyta) is a space that combines petabytes of natural language data with large-scale model training. This Horizon Europe projects gathers top MT and LM researchers, HPC centres and ourselves as partners. 3 years of ambitious goals ahead! hplt-project.org
What are we crawling about at macocu.eu? Very interesting work by @IjsTk presented in our bi-weekly meeting. Lots of promotion multilingual websites in the .tr, .bg and .sl domains. Lots of legal in the .mt one. Stay tuned for more results! #macocucorpora
Massive AND high-quality corpora for Bulgarian, Croatian, Slovene, Macedonian, Icelandic, Maltese and Turkish, collected by the #MaCoCu project, are now available in our repository! Check them out and share the word: ➡️macocu.eu ➡️clarin.si/repository/xml…
Indeed, this is the first data release of the #Macocu effort. You will find both monolingual and bilingual (with English) corpora on ELRC-Share and CLARIN repositories and the website. Insights coming soon! Most of the code also ready for you to try it out!
Massive AND high-quality corpora for Bulgarian, Croatian, Slovene, Macedonian, Icelandic, Maltese and Turkish, collected by the #MaCoCu project, are now available in our repository! Check them out and share the word: ➡️macocu.eu ➡️clarin.si/repository/xml…
Looking for parallel English-Ukranian data? Then you need to check this 👇
We're back with more language resources: English-Ukrainian parallel corpus with aprox. 13M sentence pairs has been released. More info and downloads: paracrawl.eu/news/item/17-e… Please, spread the word and use it!
@anas_ant If you have an MT system, try bleualign (github.com/bitextor/bleua…) from @ParaCrawl . Scales to ParaCrawl-sized data.
Welcome to the #CEFTelecom #EuroPat project with a focus on #eTranslation in the patent domain. ➡️Know more: europa.eu/!hB97MK #ConnectingEurope
We will be "Unleashing European Patent Translations" for a couple of years from now as part of the #EuroPat project, a EU co-funded #CEFTelecom project by @inea_eu . Take a look! ec.europa.eu/inea/en/connec…
We will be "Unleashing European Patent Translations" for a couple of years from now as part of the #EuroPat project, a EU co-funded #CEFTelecom project by @inea_eu . Take a look! ec.europa.eu/inea/en/connec…
I'm excited to share our MT metric evaluation study on a whopping 4380 human-judged MT systems! It isn't a surprise that BLEU ranked suboptimal and pretrained methods rule. arxiv.org/abs/2107.10821
Thank you @Prompsit for presenting neural machine platform MutNMT developed under @MultiNmt project to @tradumatica master's students and lecturers! [BTW, any master interested in presentations, contact @Prompsit !]